Location: ATILF, Nancy, France, https://www.atilf.fr
Duration: 5 months
Requirements: Candidates should be second-year Master's (M2) students
in Natural Language Processing, Computational Linguistics, Computer
Science or Applied Mathematics (or equivalent).
Application: Candidates should send their application (cover letter,
CV and Master's grades) to Mathieu Constant
(Mathieu.Constant@univ-lorraine.fr) no later than December 9,
2024. Applications will be reviewed on a rolling basis.
Title: Semantic parsing of lexicographic definitions - Application to
Trésor de la Langue Française informatisé
Supervisors: Mathieu Constant (ATILF, CNRS/Univ. Lorraine,
Mathieu.Constant@univ-lorraine.fr) and Alain Polguère (ATILF,
CNRS/Univ. Lorraine, Alain.Polguere@univ-lorraine.fr)
The internship will be in collaboration with Lucie Barque (LLF,
Univ. Sorbonne Paris-Nord) and Alexis Nasr (LIS, Aix-Marseille
Université).
Motivation and Context
The Trésor de la Langue Française informatisé (TLFi) is the digitized
version of the Trésor de la Langue Française (TLF), a sixteen-volume
dictionary of the French language of the 19th and 20th centuries,
published between 1971 and 1994. The TLFi has been freely accessible
on the web since 2002: https://www.atilf.fr/ressources/tlfi/. Its entries are
structured into several fields (forming its so-called microstructure):
for instance, definitions, synonyms and antonyms, literary examples
where the headwords appear, technical domain indicators, as well as
semantic, etymological, historical, grammatical and stylistic
features.
The ATILF-funded Definiens project (2009-2010) targeted the conversion
of TLFi's definitions into structured expressions (using XML tags). At
that time, there was no widely accessible lexical database for the
French language offering an explicit internal structuring of the
definition for each individual headword. Definiens set out to fill
this gap by indicating for each definition in the TLFi (around
270,000): 1) its structuring into definitional components; 2) the role
played by each component in characterizing the meaning of the
headword.
The Definiens project was never completed, owing to the start of the
major RELIEF lexicographic project. It ended with approximately 5%
of the TLF's definitions manually annotated.
Goals and objectives
The main goal of this internship is to develop and evaluate parsing
models for annotating the semantic structure of the TLFi definitions,
based on the set of already annotated definitions from the Definiens
project. An example of such an annotation is given below for the
lexical unit BROUETTE B.1 `wheelbarrow', where CC stands for
Composante Centrale `central component' and PC stands for Composante
Périphérique `peripheral component':
<CC>Véhicule</CC>
<PC>à une roue et à deux brancards</PC>
<PC>servant au transport des matériaux</PC>
(Translation: vehicle with one wheel and two holding
poles used to carry materials)
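One possible way to feed such annotations to a sequence labeller is to
convert each definition into token-level BIO labels. The sketch below
is only illustrative: the exact Definiens markup is not given in this
announcement, so the <CC>/<PC> serialization, the whitespace
tokenization and the function name to_bio are all assumptions.

```python
import re

# Assumed serialization of the BROUETTE B.1 example (hypothetical markup).
TAGGED = (
    "<CC>Véhicule</CC> "
    "<PC>à une roue et à deux brancards</PC> "
    "<PC>servant au transport des matériaux</PC>"
)

# One match per definitional component: group 1 is the tag, group 2 the text.
SEGMENT = re.compile(r"<(CC|PC)>(.*?)</\1>", re.DOTALL)


def to_bio(tagged):
    """Return (token, label) pairs, using naive whitespace tokenization."""
    pairs = []
    for match in SEGMENT.finditer(tagged):
        label, text = match.group(1), match.group(2)
        for i, token in enumerate(text.split()):
            prefix = "B" if i == 0 else "I"  # B- opens a component, I- continues it
            pairs.append((token, f"{prefix}-{label}"))
    return pairs


bio = to_bio(TAGGED)
for token, label in bio:
    print(f"{token}\t{label}")
```

With this encoding, two adjacent PCs remain distinguishable because
each one starts with its own B-PC label.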
In a second stage, the feasibility of enriching such basic
segmentation tagging with tags indicating the semantic role of each PC
should be explored. For instance, the enriched version of the above
definition should be:
Véhicule
à une roue et à deux brancards
servant au transport des matériaux
(véhicule `vehicle', parties caractéristiques `characteristic parts',
fonction `function')
The work may be divided into several objectives:
- reading the scientific literature on the topic and related areas
- preparing the data
- developing simple baselines by training off-the-shelf (syntactic)
parsers on the provided data
- adapting models to the specific nature of the task and the data
- evaluating the models quantitatively and qualitatively
- producing semantic annotations for all TLFi definitions
- making proposals for semantic enrichment and testing their feasibility
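
As an illustration of the quantitative evaluation step, the sketch
below computes exact-match precision, recall and F1 over definitional
components encoded as BIO labels. The announcement does not specify a
metric, so the choice of exact-match span F1, the BIO encoding and the
function names spans and prf are assumptions.

```python
def spans(labels):
    """Extract (start, end, type) spans from a BIO label sequence."""
    found, start, kind = set(), None, None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes the last span
        continues = label.startswith("I-") and label[2:] == kind
        if not continues and kind is not None:
            found.add((start, i, kind))  # close the currently open span
            start, kind = None, None
        if label.startswith("B-") or (label.startswith("I-") and kind is None):
            start, kind = i, label[2:]  # open a new span
    return found


def prf(gold, pred):
    """Exact-match span precision, recall and F1."""
    g, p = spans(gold), spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold segmentation of the BROUETTE B.1 definition: one CC and two PCs.
gold = ["B-CC"] + ["B-PC"] + ["I-PC"] * 6 + ["B-PC"] + ["I-PC"] * 4
# Hypothetical system output that merges the two PCs into one.
pred = ["B-CC"] + ["B-PC"] + ["I-PC"] * 11

precision, recall, f1 = prf(gold, pred)
print(precision, recall, f1)
```

Exact matching is strict: the merged PC in the hypothetical output
counts as one wrong prediction and two missed gold components. A
token-level or boundary-level score could complement it.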
A reading knowledge of French is a plus.