Conference Papers Year : 2023

Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification


We present EpidBioBERT, a biosurveillance epidemiological document tagger for disease surveillance on the PADI-Web system. Our model is trained on the PADI-Web corpus, which contains news articles on animal disease outbreaks extracted from the web. We train a classifier to discriminate between relevant and irrelevant documents based on their epidemiological thematic feature content, in preparation for further epidemiological information extraction. Our approach proposes a new way to perform epidemiological document classification by enriching epidemiological thematic features, namely disease, host, location and date, which are used as inputs to our epidemiological document classifier. We adopt a pre-trained biomedical language model with a novel fine-tuning approach that enriches these epidemiological thematic features. We find these thematic features rich enough to improve epidemiological document classification with a smaller data set than the one initially used for the PADI-Web classifier. This improves the classifier's ability to avoid false positive alerts in disease surveillance systems. To further understand the information encoded in EpidBioBERT, we examine the impact of each epidemiological thematic feature on the classifier through ablation studies. We compare our biomedical pre-trained approach with a model based on a general language model, and find that thematic feature embeddings pre-trained on general English documents are not rich enough for the epidemiological classification task. Our model achieves an F1-score of 95.5% on an unseen test set, an improvement of +5.5 points in F1-score over the PADI-Web classifier, with nearly half the training data.
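To make the idea of thematic feature enrichment and ablation more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' implementation: the abstract does not describe how the features are extracted or fed to the model, so the toy lexicons, the date pattern, the bracketed prefix format, and the function names `thematic_features`, `enrich` and `f1_score` are all assumptions made for illustration. The sketch finds disease, host, location and date mentions in a news sentence, prepends them to the text as an enriched input string, and allows a feature type to be dropped, in the spirit of an ablation run. It also shows the standard F1 formula used to report results.

```python
import re

# Hypothetical toy lexicons (assumption): the paper's real feature
# extraction is not described in the abstract.
LEXICONS = {
    "disease": ["avian influenza", "african swine fever"],
    "host": ["poultry", "pigs", "wild boar"],
    "location": ["france", "kenya"],
}

# Hypothetical date pattern (assumption): matches e.g. "3 March 2022".
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_PATTERN = re.compile(r"\b\d{1,2} (?:" + MONTHS + r") \d{4}\b")


def thematic_features(text, ablate=()):
    """Return {feature_type: [mentions]} for text, skipping ablated types."""
    lowered = text.lower()
    found = {}
    for kind, terms in LEXICONS.items():
        if kind in ablate:
            continue
        found[kind] = [
            term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        ]
    if "date" not in ablate:
        found["date"] = DATE_PATTERN.findall(text)
    return found


def enrich(text, ablate=()):
    """Prepend the detected mentions to the text as bracketed tags."""
    feats = thematic_features(text, ablate)
    prefix = " ".join(
        f"[{kind}: {', '.join(mentions)}]"
        for kind, mentions in feats.items()
        if mentions
    )
    return f"{prefix} {text}".strip()


def f1_score(tp, fp, fn):
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    doc = "Avian influenza was confirmed in poultry in France on 3 March 2022."
    print(enrich(doc))                    # all four feature types
    print(enrich(doc, ablate=("date",)))  # ablation: date feature removed
```

In a real pipeline, the enriched string would be tokenized and passed to a fine-tuned language model; comparing scores with and without each feature type is one plausible way to run the kind of ablation the abstract mentions.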
Main file: Menya-2022.pdf (589.57 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04006003, version 1 (27-02-2023)


  • HAL Id: hal-04006003, version 1
  • WOS: 000889371703088


Edmond Odhiambo Menya, Mathieu Roche, Roberto Interdonato, Dickson Odhiambo Owuor. Enriching Epidemiological Thematic Features For Disease Surveillance Corpora Classification. LREC 2022: Thirteenth International Conference on Language Resources and Evaluation, Jun 2022, Marseille, France. pp. 3741. ⟨hal-04006003⟩