Journal article. Expert Systems with Applications. Year: 2024

Explainable epidemiological thematic features for event based disease surveillance

Abstract

Event based disease surveillance (EBS) systems are biosurveillance systems that detect and alert on (re)-emerging infectious diseases by monitoring acute public or animal health event patterns from sources such as blogs, online news reports and curated expert accounts. These information-rich sources, however, are largely unstructured text, requiring novel text mining techniques to achieve EBS goals such as epidemiological text classification. The main objective of this research was to improve epidemiological text classification by proposing a novel technique of enriching thematic features using a weak supervision approach. In our approach, we train and test a mixed-domain language model named EpidBioELECTRA to first enrich thematic features, which are then used to improve epidemiological text classification. We train EpidBioELECTRA on a large dataset that we create, consisting of 70,700 annotated documents that include 70,400 labeled thematic features. We empirically compare EpidBioELECTRA with both general-purpose and domain-specific language models on the task of epidemiological corpus classification. Our findings show that epidemiological classification systems work best with language models pre-trained on both epidemiological and biomedical corpora with a continual pre-training strategy. EpidBioELECTRA improves epidemiological document classification by 19.2 $F_1$ score points compared to its vanilla implementation, BioELECTRA. We observe this by comparing BioELECTRA versus EpidBioELECTRA on our most challenging dataset, PADI-Web, where our approach records a precision of 92.33, a recall of 94.62 and an $F_1$ score of 93.46. We also examine the impact of increasing the context length of training documents in epidemiological document classification and find that this improves the classification task by 7.79 $F_1$ score points, as recorded by EpidBioELECTRA’s performance. We also compute Almost Stochastic Order (ASO) scores to track EpidBioELECTRA’s statistical dominance. In addition, we carry out ablation studies on our proposed thematic feature enrichment approach using explainable AI techniques. We present explanations for the most critical thematic features and how they influence the epidemiological classification task. We find that biomedical features (such as mentions of names of diseases and symptoms) are the most influential, while spatio-temporal features (such as the mention of the date of a given disease outbreak) are the least influential in epidemiological document classification. Our model can easily be extended to fit other domains.
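
As a reading aid for the reported PADI-Web scores, the short sketch below checks that the stated precision and recall are consistent with the stated $F_1$ score under the standard definition of $F_1$ as the harmonic mean of precision and recall. The function name `f1_score` is hypothetical, and the abstract does not say how the authors aggregate their scores (for example, per class or averaged), so this is an illustrative check under that assumption rather than a reproduction of their evaluation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Scores quoted in the abstract for EpidBioELECTRA on PADI-Web (percent scale).
precision, recall = 92.33, 94.62

# Assumption: F1 is computed directly from these two aggregate values.
f1 = f1_score(precision, recall)
print(round(f1, 2))  # 93.46
```

Under this assumption the computed value agrees with the 93.46 $F_1$ score quoted in the abstract.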
Main file
Menya_Expert Systems With Applications_2024.pdf (1.95 MB)
Origin: Publisher files allowed on an open archive

Dates and versions

hal-04687433, version 1 (04-09-2024)

Cite

Edmond Menya, Roberto Interdonato, Dickson Odhiambo Owuor, Mathieu Roche. Explainable epidemiological thematic features for event based disease surveillance. Expert Systems with Applications, 2024, 250, pp.123894. ⟨10.1016/j.eswa.2024.123894⟩. ⟨hal-04687433⟩