PsylVe - A Text-to-Ontology Information Extraction Framework for the Occurrence Distribution of Plant Pathogen Vectors
Abstract
Diseases caused by insect-borne plant pathogens have a large negative effect on the world’s agricultural industry. One effective way to anticipate disease outbreaks is to infer risk maps of vector introduction and spread from known occurrence data. However, compiling this type of data manually is time-consuming and laborious, especially given the recent spike in publicly available data. To address this issue, this work aims to streamline researchers’ workflows by automating the extraction of vector-related information from the literature.
To carry out this automation, we developed PsylVe, a solution initially targeted at psyllid vectors that combines document collection, Natural Language Processing (NLP), and Knowledge Representation (KR) techniques. PsylVe includes a working NLP pipeline and a fully documented methodology. The NLP pipeline is adapted from an existing pipeline, Omnicrobe, which was built for microbial biodiversity, a domain that bears many similarities to epidemic events.
We evaluated the results obtained with PsylVe both quantitatively (precision, recall, and F1-score) and qualitatively (six criteria for evaluating text mining pipelines), comparing them to a manually compiled dataset of observations on Cacopsylla pruni, a vector responsible for the spread of a pathogenic bacterium in fruit tree orchards in Europe. From the outset, we designed the PsylVe framework to be transferable to other plant disease vectors, as well as to human and animal diseases. We also designed an application for extracting text from PDF documents and an original formal ontology that represents the data and knowledge on vector-borne diseases. Various projects in the MaIAGE department of INRAE have already started integrating the PsylVe framework into their workflows, and concrete plans have been made to develop it further and extend its use to new biological domains.
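The quantitative measures named above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each extracted observation is reduced to a (vector, location) pair and that a prediction counts as correct only on an exact match with the manually compiled reference. The function name `precision_recall_f1` and the sample records are hypothetical and are not taken from PsylVe or its evaluation data.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for exact-match set comparison."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of extracted items that appear in the reference.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of reference items that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up placeholder records, NOT real occurrence data.
gold = {
    ("Cacopsylla pruni", "site_A"),
    ("Cacopsylla pruni", "site_B"),
    ("Cacopsylla pruni", "site_C"),
}
extracted = {
    ("Cacopsylla pruni", "site_A"),
    ("Cacopsylla pruni", "site_D"),
}

precision, recall, f1 = precision_recall_f1(extracted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest possible choice. An evaluation of extracted text spans could equally use relaxed matching (for example, overlapping spans or normalized location names), which would change the counts but not the formulas.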
Origin: Files produced by the author(s)