Improving methods for normalizing biomedical text entities with concepts from an ontology with (almost) no training data at BLAH5 the CONTES

Entity normalization, called entity linking in the general domain, is an information extraction task that aims to annotate words or multi-word expressions in raw text with semantic references, such as concepts of an ontology. An ontology consists minimally of a formally organized vocabulary or hierarchy of terms that captures knowledge of a domain. At present, machine-learning methods, often coupled with distributional representations, achieve good performance. However, these methods require large training datasets, which are not always available, especially for tasks in specialized domains. CONTES (CONcept-TErm System) is a supervised method that addresses entity normalization with ontology concepts using small training datasets. CONTES has several limitations: it does not scale well to very large ontologies, it tends to overgeneralize predictions, and it lacks valid representations for out-of-vocabulary words. Here, we propose to assess different methods for reducing the dimensionality of the ontology representation. We also propose to calibrate parameters to make the predictions more accurate, and to address the problem of out-of-vocabulary words with a specific method.
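
To make the idea of reducing the dimensionality of an ontology representation concrete, here is a minimal sketch. It rests on assumptions that the abstract does not state: each concept is encoded as a binary vector marking itself and its ancestors, and the reduction is a plain PCA computed with an SVD. The toy ontology `PARENTS` and the helper names are hypothetical, and the methods actually assessed in the paper may differ.

```python
import numpy as np

# Hypothetical toy ontology (concept -> direct parents); not from the paper.
PARENTS = {
    "entity": [],
    "organism": ["entity"],
    "bacterium": ["organism"],
    "habitat": ["entity"],
    "soil": ["habitat"],
}


def ancestors(concept, parents):
    """Return the concept itself plus all of its ancestors."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return seen


def concept_matrix(parents):
    """Build one binary row per concept: 1 for the concept and each ancestor."""
    names = sorted(parents)
    index = {name: i for i, name in enumerate(names)}
    matrix = np.zeros((len(names), len(names)))
    for name in names:
        for anc in ancestors(name, parents):
            matrix[index[name], index[anc]] = 1.0
    return names, matrix


def reduce_pca(matrix, n_components):
    """Project the rows onto the top principal components (assumed technique)."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


names, full = concept_matrix(PARENTS)
reduced = reduce_pca(full, n_components=2)
```

In this encoding the vector length equals the number of concepts, which is one plausible reason why an unreduced representation becomes costly for very large ontologies; `reduced` keeps one row per concept but only two columns.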

Keywords

machine learning; Natural Language Processing; Ontology; text mining; Information Extraction

Domains

Artificial Intelligence [cs.AI]; Machine Learning [cs.LG]

https://hal.inrae.fr/hal-02947689

Submitted on: Thursday, September 24, 2020, 09:29:15

Last modified on: Friday, May 17, 2024, 16:36:03

Dates and versions

hal-02947689, version 1 (24-09-2020)

Identifiers

HAL Id: hal-02947689, version 1
DOI: 10.5808/GI.2019.17.2.e20
PRODINRA: 495152
PUBMED: 31307135
PUBMEDCENTRAL: PMC6808633

Cite

Arnaud Ferré, Mouhamadou Ba, Robert Bossy. Improving methods for normalizing biomedical text entities with concepts from an ontology with (almost) no training data at BLAH5 the CONTES. Genomics & Informatics, 2019, 17 (2), pp.e20. ⟨10.5808/GI.2019.17.2.e20⟩. ⟨hal-02947689⟩

Collections

CNRS INRA LIMSI UNIV-PARIS-SACLAY INRAE ANR LISN GS-ENGINEERING GS-COMPUTER-SCIENCE MAIAGE
