Extracting new spatial entities and relations from short messages
Extraction de nouvelles entités spatiales et relations à partir de messages courts
Abstract
In the past few years, texts have become an important source of spatial data, alongside maps, satellite images and GPS. Electronic texts written in mediated interactions, especially short messages (SMS, tweets, etc.), have given rise to new ways of writing. Such messages are a rich source of information and opinion, which makes extracting information from them valuable, but their brief, unstructured and informal nature makes them difficult to analyze. The work presented in this paper aims to extract spatial information from two authentic corpora of French SMS and tweets, in order to take advantage of the vast amount of geographical knowledge expressed in diverse natural language texts. We propose a three-step process. First, we extract new spatial entities (e.g. the variants Monpelier and Montpel are associated with the place name Montpellier). Second, we identify new spatial relations that precede these spatial entities (e.g. the French prepositions sur, par, etc.). Finally, we propose a general pattern for discovering spatial relations (e.g. SR + Preposition). The task is challenging because short-message language relies on weakly standardized modes of writing (lexical creation, massive use of abbreviations, textual variants, etc.). Experiments carried out on the two corpora, 88milSMS and Tweets, highlight the efficiency of the proposed strategy for identifying new kinds of spatial entities and relations. © 2016 ACM.
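The first two steps can be illustrated with a minimal sketch. It is not the authors' method: it assumes a toy gazetteer, uses generic string similarity from Python's `difflib` as a stand-in for whatever variant matching the paper actually applies, and treats the token immediately before a matched entity as the candidate spatial relation. The names `GAZETTEER`, `match_place` and `extract`, the similarity cutoff and the minimum token length are all hypothetical choices made for the illustration.

```python
import difflib
import re

# Hypothetical toy gazetteer; the paper's actual lexical resources are not shown here.
GAZETTEER = ["Montpellier", "Paris", "Marseille"]

# Assumed heuristics: skip very short tokens (so that "par" is not
# confused with "Paris") and require a fairly high similarity ratio.
MIN_TOKEN_LENGTH = 4
CUTOFF = 0.75


def match_place(token, gazetteer=GAZETTEER, cutoff=CUTOFF):
    """Return the gazetteer entry most similar to token, or None."""
    if len(token) < MIN_TOKEN_LENGTH:
        return None
    by_lower = {name.lower(): name for name in gazetteer}
    hits = difflib.get_close_matches(
        token.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[hits[0]] if hits else None


def extract(message):
    """Return (variant, place name, preceding token) triples for a message."""
    tokens = re.findall(r"\w+", message)
    triples = []
    for i, token in enumerate(tokens):
        place = match_place(token)
        if place is not None:
            # The token just before the entity is the candidate spatial relation.
            preceding = tokens[i - 1].lower() if i > 0 else None
            triples.append((token, place, preceding))
    return triples


if __name__ == "__main__":
    for sms in ["je suis sur Monpelier", "on passe par Montpel demain"]:
        print(sms, "->", extract(sms))
```

On the two sample messages, the sketch links Monpelier and Montpel to Montpellier and proposes sur and par as candidate relations. A realistic system would need a far larger gazetteer and a more careful filter for false matches.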