Location: ATILF, Nancy, France, https://www.atilf.fr
Duration: 5 months
Requirements: Candidates should be Master 2 students in Natural Language Processing, Computational Linguistics, computer science, or applied mathematics (or equivalent).
Application: Candidates should send their applications (cover letter, CV, and Master's grades) to Mathieu Constant (Mathieu.Constant@univ-lorraine.fr) no later than December 9, 2024. Applications will be processed on a rolling basis.

Automatic identification of linguistic properties of Multiword Expressions

Supervisor: Mathieu Constant (ATILF, Univ. Lorraine, France, Mathieu.Constant@univ-lorraine.fr)

Motivation and context

The term "multiword expression" (MWE) refers to a combination of multiple lexical items that displays irregular composition, possibly on different linguistic levels (morphology, syntax, semantics, ...). MWEs cover a large variety of phenomena, such as idioms (run around in circles 'to keep doing or talking about the same thing without achieving anything' [Collins dictionary]), support verb constructions (take a walk), nominal compounds (dry run 'rehearsal'), and complex function units (in spite of). They have been the subject of extensive research in the NLP community over several decades, in particular since the seminal paper of [SBB+02].

The identification of MWEs in texts requires operational criteria characterizing the idiosyncrasy of MWEs, notably to build MWE-annotated corpora or lexical resources. Such linguistic criteria are often based on formal tests [CCR+21], including for instance semantic ones such as *a dry run is a run (in the sense of rehearsal) vs. a dry shirt is a shirt.

Automatic identification of MWEs in texts is nowadays mostly based on sequence classifiers trained on MWE-annotated datasets, on top of pretrained large language models, cf. [RSG+20].
Existing approaches, however, give no indication of why a given occurrence has been identified as an MWE in terms of linguistic criteria.

Goals and Objectives

The goal of this internship is to develop and evaluate methods to jointly identify MWE occurrences in texts and predict the linguistic criteria explaining their MWE-hood. For this purpose, the work will be based on the French corpus Sequoia [CS12], which has been annotated for MWEs, including the linguistic properties used to manually identify them [CCR+21].

The internship has the following objectives:
- read the scientific literature on MWEs and their identification in texts
- develop a baseline model that simply identifies MWEs in texts
- adapt the model to also characterize the linguistic properties of the identified MWEs
- evaluate the results quantitatively and qualitatively

A good knowledge of French is a plus. The duration of the internship is 5 months. The internship will take place at ATILF, Nancy.

References

[CCR+21] Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Bruno Guillaume, Yannick Parmentier, and Silvio Cordeiro. A French corpus annotated for multiword expressions and named entities. Journal of Language Modelling, 8(2):415-479, Feb. 2021.

[CS12] Marie Candito and Djamé Seddah. Effectively long-distance dependencies in French: annotation and parsing evaluation. In TLT 11 - The 11th International Workshop on Treebanks and Linguistic Theories, Lisbon, Portugal, November 2012.

[RSG+20] Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk, Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna Bhatia, Uxoa Inurrieta, Voula Giouli, Tunga Gungor, Menghan Jiang, Timm Lichte, Chaya Liebeskind, Johanna Monti, Renata Ramisch, Sara Stymne, Abigail Walsh, and Hongzhi Xu. Edition 1.2 of the PARSEME shared task on semi-supervised identification of verbal multiword expressions.
In Stella Markantonatou, John McCrae, Jelena Mitrovic, Carole Tiberius, Carlos Ramisch, Ashwini Vaidya, Petya Osenova, and Agata Savary, editors, Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons, pages 107-118, online, December 2020. Association for Computational Linguistics.

[SBB+02] Ivan A. Sag, Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger. Multiword expressions: A pain in the neck for NLP. In Alexander Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 1-15, Berlin, Heidelberg, 2002. Springer Berlin Heidelberg.