Semantic parsing of lexicographic definitions - Application to Trésor de la Langue Française informatisé

Supervisors: Mathieu Constant (ATILF, CNRS/Univ. Lorraine, Mathieu.Constant@univ-lorraine.fr) and Alain Polguère (ATILF, CNRS/Univ. Lorraine, Alain.Polguere@univ-lorraine.fr)

The internship will be carried out in collaboration with Lucie Barque (LLF, Univ. Sorbonne Paris-Nord) and Alexis Nasr (LIS, Aix-Marseille Université).

Motivation and Context

The Trésor de la Langue Française informatisé (TLFi) is the digitized version of the Trésor de la Langue Française (TLF), a sixteen-volume dictionary of the French language of the 19th and 20th centuries, published between 1971 and 1994. The TLFi has been freely accessible on the web since 2002: https://www.atilf.fr/ressources/tlfi/. Its entries are structured into several fields (forming its so-called microstructure): for instance, definitions, synonyms and antonyms, literary examples in which the headwords appear, technical domain indicators, as well as semantic, etymological, historical, grammatical and stylistic features.

The ATILF-funded Definiens project (2009-2010) targeted the conversion of the TLFi's definitions into structured expressions (using XML tags). At that time, there was no widely accessible lexical database for the French language offering an explicit internal structuring of the definition of each individual headword. Definiens set out to fill this gap by indicating, for each definition in the TLFi (around 270,000): 1) its structuring into definitional components; 2) the role played by each component in characterizing the meaning of the headword.

The Definiens project was never completed, owing to the start of the major RELIEF lexicographic project. It ended with approximately 5% of the TLF's definitions manually annotated.

Goals and objectives

The main goal of this internship is to develop and evaluate parsing models for annotating the semantic structure of the TLFi definitions, based on the set of already annotated definitions from the Definiens project. An example of such an annotation is given below for the lexical unit BROUETTE B.1 'wheelbarrow', where CC stands for Composante Centrale 'central component' and PC stands for Composante Périphérique 'peripheral component':

Véhicule à une roue et à deux brancards servant au transport des matériaux
(Translation: vehicle with one wheel and two holding poles used to carry materials)

In a second stage, the feasibility of enriching this basic segmentation with tags indicating the semantic role of each PC should be explored. For instance, the enriched version of the above definition should be:

Véhicule à une roue et à deux brancards servant au transport des matériaux
(véhicule 'vehicle', parties caractéristiques 'characteristic parts', fonction 'function')

The work may be divided into several objectives:
- reading the scientific literature on the topic and related areas
- preparing the data
- developing simple baselines by training off-the-shelf (syntactic) parsers on the provided data
- adapting models to the specific nature of the task and the data
- evaluating the models quantitatively and qualitatively
- producing semantic annotations for all TLFi definitions
- making proposals and testing the feasibility of semantic enrichment

A reading knowledge of French is a plus.

Location: ATILF, Nancy, France, https://www.atilf.fr
Duration: 5 months
Requirements: Candidates should be Master 2 students in Natural Language Processing, Computational Linguistics, computer science or applied mathematics (or equivalent).
Application: Candidates should send their applications (cover letter, CV and Master's grades) to Mathieu Constant (Mathieu.Constant@univ-lorraine.fr) no later than January 8, 2024.
Applications will be processed on a rolling basis.
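
Illustration: as a rough idea of what the data preparation step under "Goals and objectives" might involve, the sketch below encodes the BROUETTE definition as token-level BIO labels, a common input format for sequence-labeling baselines. Everything in it is an assumption made for illustration: the segment boundaries are inferred from the three role labels given above (the actual Definiens XML markup is not reproduced here), attaching "véhicule" to the CC is a guess, and the names `Segment`, `BROUETTE` and `to_bio` are hypothetical.

```python
from typing import List, Tuple

# One segment = (component label, role label, text).
# ASSUMPTION: boundaries and role assignment are inferred from the posting,
# not taken from the Definiens annotation files.
Segment = Tuple[str, str, str]

BROUETTE: List[Segment] = [
    ("CC", "véhicule", "Véhicule"),
    ("PC", "parties caractéristiques", "à une roue et à deux brancards"),
    ("PC", "fonction", "servant au transport des matériaux"),
]


def to_bio(segments: List[Segment], with_roles: bool = False) -> List[Tuple[str, str]]:
    """Turn segments into (token, BIO tag) pairs.

    Tokens are obtained by whitespace splitting (a simplification).
    With with_roles=True, the role label is appended to the component label,
    which corresponds to the enriched, second-stage annotation.
    """
    rows: List[Tuple[str, str]] = []
    for label, role, text in segments:
        tag = f"{label}:{role}" if with_roles else label
        for i, token in enumerate(text.split()):
            prefix = "B" if i == 0 else "I"  # B = first token of a segment
            rows.append((token, f"{prefix}-{tag}"))
    return rows


if __name__ == "__main__":
    for token, tag in to_bio(BROUETTE, with_roles=True):
        print(f"{token}\t{tag}")
```

With `with_roles=False` the output corresponds to the basic CC/PC segmentation; with `with_roles=True` it corresponds to the enriched version. A real pipeline would read the segments from the annotated XML and use a proper French tokenizer instead of whitespace splitting.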