Taec: a Manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature - MaIAGE
Preprint, Working Paper. Year: 2024

Taec: a Manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature

Abstract

Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. Newly desirable wheat variety traits include disease resistance to reduce pesticide use, adaptation to climate change, resistance to heat and drought stresses, and low gluten content of grains. Wheat breeding experiments are documented by a large body of scientific literature and by observational data obtained in the field and under controlled conditions. Cross-referencing complementary information from the literature and observational data is essential to the study of the genotype-phenotype relationship and to the improvement of wheat selection. The scientific literature on genetic marker-assisted selection contains much information about the genotype-phenotype relationship. However, the variety of expressions used to refer to traits and phenotype values in scientific articles hinders finding this information and cross-referencing it. When trained adequately on annotated examples, recent text mining methods perform well in named entity recognition and linking in the scientific domain. While several corpora contain annotations of human and animal phenotypes, no corpus is currently available for training and evaluating named entity recognition and entity-linking methods on plant phenotype literature. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 540 PubMed references fully annotated for trait, phenotype, and species named entities using the Wheat Trait and Phenotype Ontology and the species taxonomy of the National Center for Biotechnology Information. A study of the performance of tools trained on the Triticum aestivum trait Corpus shows that the corpus is suitable for the training and evaluation of named entity recognition and linking.
Main file
Wheat corpus preprint.pdf (430.79 KB)
Origin: Files produced by the author(s)

Dates and versions

hal-04412278, version 1 (10-06-2024)

Cite

Claire Nédellec, Clara Sauvion, Robert Bossy, Mariya Borovikova, Louise Deléger. Taec: a Manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature. 2024. ⟨hal-04412278⟩