Quantifying diversity of language phenomena in corpora and system predictions: Case study of multiword expressions

Master internship proposal, 2021-2022

- Domain: natural language processing (NLP)
- Location: Université Paris-Saclay (LISN lab), Gif-sur-Yvette, France; with visits to the University of Tours (LIFAT lab, https://lifat.univ-tours.fr/) and the University of Orléans (LLL lab, https://www.univ-orleans.fr/fr/lll/le-laboratoire)
- Research teams: ILES https://www.limsi.fr/en/research/iles (Written and Sign Language Processing) of the LISN; BdTln https://lifat.univ-tours.fr/lifat-english-version/teams/bdtin (Data Bases and Natural Language Processing) of the LIFAT; and DDL https://lll.cnrs.fr/la-recherche/les-equipes/ddl/ (Language Description and Documentation) of the LLL
- Supervisors:
  - Adam LION-BOUTON (LIFAT)
  - Agata SAVARY (LISN) http://www.info.univ-tours.fr/~savary/
  - Emmanuel SCHANG (LLL) https://sites.google.com/site/emmanuelschang/
  - Jean-Yves ANTOINE (LIFAT) http://www.info.univ-tours.fr/~antoine/
- Funding: Université Paris-Saclay
- Duration: 3-6 months
- Remuneration: around 606 € / month

Motivation and context

Diversity of naturally occurring phenomena is a vital heritage to be preserved in the current progress- and optimization-driven globalization era. Diversity has been quantified in many domains (ecology, economy, information science, etc.) but less so in natural language processing (NLP). We address this aspect with respect to a particular linguistic phenomenon: multiword expressions (MWEs). MWEs are groups of words which exhibit unpredictable properties (Baldwin & Kim, 2010). Most prominently, their meaning does not straightforwardly derive from the meanings of their components.
For instance, the meaning of casser sa pipe 'to die' (literally 'to break one's pipe') or of sortir du lot 'to be better than others' (literally 'to quit the batch') cannot be straightforwardly deduced from the meanings of the individual components. Due to these properties, MWEs are very challenging in applications like machine translation, information retrieval, opinion mining, etc. Language resources dedicated to MWEs include MWE lexicons and MWE-annotated corpora (Savary et al., 2017), while a major computational task is to automatically identify MWEs in running text. The PARSEME network (https://gitlab.com/parseme/corpora/-/wikis/home) has been addressing the MWE identification task via a series of shared tasks on automatic identification of verbal MWEs https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks (Ramisch et al., 2020).

MWEs, like most other phenomena in human language, follow the so-called Zipf's law (Williams et al., 2015): few items are frequent and there is a long tail of rare ones. These few frequent items tend to be less diverse than the numerous items in the "Zipfian tail". Current models, including those for MWE identification, often favour the former and underperform on the latter. As a result, reported quality is overestimated and diversity is poorly accounted for.

To meet this challenge, our recent work (Lion-Bouton, 2021) is explicitly dedicated to quantifying diversity in MWE language resources. We have adapted measures of variety (number of types in a system), balance (evenness of the distribution of items across types) and disparity (differences between types), stemming notably from ecology and information theory (Morales et al., 2021), to MWE lexicons extracted automatically from annotated corpora.

Objectives

The objective of this internship is to apply the aforementioned MWE diversity measures to MWE-annotated corpora and MWE identification tools.
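To make the three dimensions mentioned above more concrete, here is a minimal Python sketch. It does not reproduce the measures of Lion-Bouton (2021): the concrete formulas are assumptions chosen for illustration only (variety as a count of distinct types, balance as Shannon evenness, disparity as the mean pairwise Jaccard distance between the lemma sets of types), and the function names and toy data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def variety(counts):
    """Number of distinct types with at least one occurrence."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts):
    """Shannon evenness: entropy divided by log(variety), in [0, 1]."""
    total = sum(counts.values())
    n_types = variety(counts)
    if n_types <= 1:
        # Convention chosen for this sketch: evenness is set to 0
        # when there are fewer than two types.
        return 0.0
    entropy = -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )
    return entropy / math.log(n_types)


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two non-empty collections of lemmas."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def disparity(types):
    """Mean pairwise Jaccard distance between types (assumed proxy)."""
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Toy data invented for illustration: each MWE occurrence is a tuple of lemmas.
occurrences = [
    ("casser", "son", "pipe"),
    ("casser", "son", "pipe"),
    ("casser", "son", "pipe"),
    ("sortir", "de", "lot"),
]
counts = Counter(occurrences)

print("variety:  ", variety(counts))
print("balance:  ", round(balance(counts), 3))
print("disparity:", disparity(list(counts)))
```

On this toy sample the skewed frequencies push balance below 1, while the two types share no lemma and are therefore maximally distant under the assumed distance. A real study would replace the lemma-set distance with one that also reflects morphological features and syntactic structure.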
More precisely, the following steps are to be undertaken:
- characterizing a corpus (annotated for morpho-syntax and MWEs) for variety, balance and disparity of the vocabulary (casser sa pipe, sortir du lot), morphological features (plural, future) and syntactic structures (verb-object, verb-prepositional-phrase) occurring in the MWEs contained therein
- developing methods of diversity-driven corpus split, over-sampling and augmentation
- designing evaluation scenarios for MWE identifiers so that diversity of the results is treated on par with global precision and recall
- applying these scenarios to the system results of edition 1.2 of the PARSEME shared task https://gitlab.com/parseme/sharedtask-data/-/tree/master/1.2/system-results
- analysing the evaluation outcome and characterizing the MWE identifiers with respect to how well they account for MWE diversity

Candidate's profile

- 2nd-year master student in computational linguistics, computer science or a related field; excellent 1st-year master or 3rd-year bachelor students will also be considered
- Interest in linguistics and familiarity with language technology
- Good programming skills, preferably in Python

Important dates

- Application deadline: 20 November 2021 (or until filled)
- Notification: 30 November 2021
- Position starts: late January 2022 (at earliest)
- Position ends: late July 2022

How to apply

Send your CV, a cover letter and a transcript of your bachelor and master grades to Adam Lion-Bouton (adam.lion-bouton@etu.univ-tours.fr), Agata Savary (first.last@universite-paris-saclay.fr), Emmanuel Schang (first.last@univ-orleans.fr) and Jean-Yves Antoine (jean-yves.antoine@univ-tours.fr).

References

- Baldwin, T. and Kim, S. N. (2010) Multiword Expressions https://people.eng.unimelb.edu.au/tbaldwin/pubs/handbook2009.pdf, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
- Matthieu Constant, Gülsen Eryigit, Johanna Monti, Lonneke van der Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu (2017) Multiword expression processing: A survey https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00302, Computational Linguistics, 43(4):837-892.
- Adam Lion-Bouton (2021) Multi-criterion optimisation for multiword expression lexicon design promoting linguistic diversity, Technical report, University of Tours.
- Morales P. L., Lamarche-Perrin R., Fournier-S'niehotta R., Poulain R., Tabourier L., Tarissan F. (2021) Measuring Diversity in Heterogeneous Information Networks https://pedroramaciotti.github.io/files/publications/2021_TCS.pdf, in Theoretical Computer Science, Elsevier.
- Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Sara Stymne, Abigail Walsh, Renata Ramisch, Hongzhi Xu (2020) Edition 1.2 of the PARSEME Shared Task on Semi-supervised Identification of Verbal Multiword Expressions https://www.aclweb.org/anthology/2020.mwe-1.14/, in the Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons (MWE-LEX 2020), 13 December 2020, Barcelona, Spain (online).
- Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejcek, Fabienne Cap, Slavomir Céplö, Silvio Ricardo Cordeiro, Gülsen Eryigit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner, Jolanta Kovalevskaite, Simon Krek, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, Veronika Vincze (2018) PARSEME multilingual corpus of verbal multiword expressions http://langsci-press.org/catalog/view/204/1344/1319-1, in Stella Markantonatou, Carlos Ramisch, Agata Savary, Veronika Vincze (eds.) Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop, Language Science Press, Berlin, pp. 87-147.
- Williams J. R., Lessard P. R., Desu S., Clark E. M., Bagrow J. P., Danforth C. M., Dodds P. S. (2015) Zipf's law holds for phrases, not words, Scientific Reports, 5.