Quantifying diversity of language phenomena in corpora and system
predictions: Case study of multiword expressions

Master internship proposal, 2021-2022


  - Domain : natural language processing (NLP)

  - Location : Université Paris-Saclay (LISN lab),
    Gif-sur-Yvette, France; with visits to the
    University of Tours (LIFAT lab, https://lifat.univ-tours.fr/) and
    the University of Orléans (LLL lab,
    https://www.univ-orleans.fr/fr/lll/le-laboratoire )

  - Research teams : ILES
    https://www.limsi.fr/en/research/iles
    (Written and Sign Language Processing) of the LISN ;
    BdTln
    https://lifat.univ-tours.fr/lifat-english-version/teams/bdtin
    (Data Bases and Natural Language Processing) of the LIFAT and
    DDL
    https://lll.cnrs.fr/la-recherche/les-equipes/ddl/
    (Language Description and Documentation) of the LLL

  - Supervisors:

      - Adam LION-BOUTON (LIFAT)

      - Agata SAVARY (LISN)
        http://www.info.univ-tours.fr/~savary/

      - Emmanuel SCHANG (LLL)
        https://sites.google.com/site/emmanuelschang/

      - Jean-Yves ANTOINE (LIFAT)
        http://www.info.univ-tours.fr/~antoine/


  - Funding : Université Paris-Saclay

  - Duration : 3-6 months

  - Remuneration : around 606 € / month


      Motivation and context

Diversity of naturally occurring phenomena is a vital heritage to be
preserved in the current progress- and optimization-driven
globalization era. Diversity has been quantified in many domains
(ecology, economy, information science, etc.) but less so in natural
language processing (NLP). We address this aspect with respect to a
particular linguistic phenomenon: multiword expressions (MWEs). MWEs
are groups of words which exhibit unpredictable properties (Baldwin &
Kim, 2010). Most prominently, their meaning does not straightforwardly
derive from the meanings of their components. For instance, the
meaning of casser sa pipe 'to die' (literally to break one's pipe) or
of sortir du lot 'to be better than others' (literally to quit the
batch) cannot be deduced from the meanings of the individual words.
Due to these properties, MWEs are very challenging for applications
like machine translation, information retrieval, opinion mining, etc.

Language resources dedicated to MWEs include MWE lexicons and
MWE-annotated corpora (Savary et al., 2018), while a major
computational task is to automatically identify MWEs in running text.
The PARSEME network (https://gitlab.com/parseme/corpora/-/wikis/home)
has been addressing the MWE identification task via a series of shared
tasks on automatic identification of verbal MWEs
https://gitlab.com/parseme/corpora/-/wikis/home#shared-tasks
(Ramisch et al. 2020).

MWEs, like most other phenomena in human language, follow Zipf's law
(Williams et al. 2015): few items are frequent and there is a long
tail of rare ones. These few frequent items tend to be less diverse
than the numerous items in the "Zipfian tail". Current models,
including those for MWE identification, often favour the former and
underperform on the latter. Hence, global scores, which are dominated
by frequent items, overestimate quality, and diversity is weakly
accounted for.
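
The effect of such a distribution can be illustrated with a minimal
Python sketch. The counts below are invented for illustration (they
are not taken from any PARSEME corpus) and the function name
head_and_tail_shares is hypothetical. The sketch computes the share of
occurrences covered by the most frequent types and the share of types
that occur only once.

```python
from collections import Counter


def head_and_tail_shares(type_counts, head_size):
    """Return two ratios for a Counter mapping MWE types to frequencies:
    the share of occurrences covered by the `head_size` most frequent
    types, and the share of types occurring exactly once (hapaxes)."""
    total_occurrences = sum(type_counts.values())
    head = type_counts.most_common(head_size)
    head_share = sum(count for _, count in head) / total_occurrences
    hapaxes = sum(1 for count in type_counts.values() if count == 1)
    hapax_share = hapaxes / len(type_counts)
    return head_share, hapax_share


# Invented toy frequencies (assumption, for illustration only).
toy_counts = Counter({
    "avoir lieu": 50,
    "faire partie": 30,
    "prendre place": 10,
    "sortir du lot": 4,
    "casser sa pipe": 2,
    "prendre la parole": 1,
    "tourner la page": 1,
    "jeter l'éponge": 1,
    "mettre en oeuvre": 1,
})

head_share, hapax_share = head_and_tail_shares(toy_counts, head_size=2)
print(f"share of occurrences covered by the 2 top types: {head_share:.2f}")
print(f"share of types occurring once: {hapax_share:.2f}")
```

With these toy counts, a system that recognizes only the two most
frequent types would still retrieve most occurrences while covering
only two of the nine types.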

To meet this challenge, our recent work (Lion-Bouton, 2021) is
explicitly dedicated to quantifying diversity in MWE language
resources. We have adapted measures of variety (number of types in a
system), balance (evenness of the distribution of items across types)
and disparity (differences between types), stemming notably from
ecology and information theory (Morales et al. 2021), to MWE lexicons
extracted automatically from annotated corpora.
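
As a rough illustration of these three dimensions, the Python sketch
below uses common textbook formulations: variety as the number of
types, balance as Shannon evenness (entropy divided by its maximum)
and disparity as the mean pairwise distance between types. These
formulations, the Jaccard distance over (simplified) component lemmas
and the toy counts are assumptions made for illustration; they are not
necessarily the measures adapted in (Lion-Bouton, 2021).

```python
import math
from itertools import combinations


def variety(counts):
    """Number of types with at least one occurrence."""
    return sum(1 for c in counts.values() if c > 0)


def balance(counts):
    """Shannon evenness in [0, 1]: entropy divided by log(number of types).
    Returned as 0.0 when fewer than two types are present (a convention
    chosen here, since evenness is undefined in that case)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


def jaccard_distance(type_a, type_b):
    """1 minus the overlap ratio of the two types' component lemmas."""
    a, b = set(type_a), set(type_b)
    return 1 - len(a & b) / len(a | b)


def disparity(types):
    """Mean pairwise Jaccard distance between types (0.0 if fewer than two)."""
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Toy lexicon: each type is a tuple of simplified lemmas, mapped to an
# invented frequency (assumption, for illustration only).
toy_counts = {
    ("casser", "son", "pipe"): 8,
    ("sortir", "de", "lot"): 1,
    ("casser", "le", "pied"): 1,
}

print("variety:  ", variety(toy_counts))
print("balance:  ", round(balance(toy_counts), 3))
print("disparity:", round(disparity(list(toy_counts)), 3))
```

Other choices are possible for each dimension, in particular a
disparity based on syntactic or semantic distance rather than on
lexical overlap.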


      Objectives

The objective of this internship is to apply the aforementioned MWE
diversity measures to MWE-annotated corpora and MWE identification
tools. More precisely, the following steps are to be undertaken:

  - characterizing a corpus (annotated for morpho-syntax and MWEs) for
    variety, balance and disparity of the vocabulary (casser sa pipe,
    sortir du lot), morphological features (plural, future) and
    syntactic structures (verb-object, verb-prepositional-phrase)
    occurring in the MWEs contained therein

  - developing methods of diversity-driven corpus split, over-sampling
    and augmentation

  - designing evaluation scenarios for MWE identifiers so that
    diversity of the results is treated on par with global precision
    and recall

  - applying these scenarios to the system results of edition 1.2
    of the PARSEME shared task
    https://gitlab.com/parseme/sharedtask-data/-/tree/master/1.2/system-results

  - analysing the evaluation outcome and characterizing how well the
    MWE identifiers account for MWE diversity
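
One possible shape for an evaluation scenario of this kind is sketched
below in Python. MWE occurrences are represented as (sentence id,
token indices, type label) triples; this representation, the function
name evaluate and the "type recall" score (share of gold types found
at least once) are assumptions made for illustration. They are not the
official PARSEME evaluation metrics, and designing suitable scenarios
is precisely part of the internship.

```python
def evaluate(gold, predicted):
    """Compare two sets of (sentence_id, token_indices, mwe_type) triples.
    Returns occurrence-level precision and recall, plus a type-level
    recall: the share of gold types with at least one correct prediction."""
    true_positives = gold & predicted
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    gold_types = {mwe_type for _, _, mwe_type in gold}
    found_types = {mwe_type for _, _, mwe_type in true_positives}
    type_recall = len(found_types) / len(gold_types) if gold_types else 0.0
    return {"precision": precision, "recall": recall,
            "type_recall": type_recall}


# Invented toy annotations (assumption, for illustration only).
gold = {
    (1, (2, 3), "avoir lieu"),
    (2, (0, 1), "avoir lieu"),
    (3, (4, 5), "avoir lieu"),
    (4, (1, 2, 3), "casser sa pipe"),
    (5, (2, 3, 4), "sortir du lot"),
}
predicted = {
    (1, (2, 3), "avoir lieu"),
    (2, (0, 1), "avoir lieu"),
    (3, (4, 5), "avoir lieu"),
    (6, (0, 1), "avoir lieu"),  # false positive
}

for name, value in evaluate(gold, predicted).items():
    print(f"{name}: {value:.2f}")
```

In this toy case the system finds every occurrence of the frequent
type and none of the rare ones, so its type-level recall is much lower
than its occurrence-level recall.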


    Candidate's profile

  - 2nd-year master student in computational linguistics, computer
    science or a related field; excellent 1st-year master or 3rd-year
    bachelor students will also be considered

  - Interests in linguistics and familiarity with language technology

  - Good programming skills, preferably in Python


    Important dates

  - Application deadline: 20 November 2021 (or until filled)

  - Notification: 30 November 2021

  - Position starts: late January 2022 (at earliest)

  - Position ends: late July 2022


    How to apply

Send your CV, a cover letter and a transcript of your bachelor and
master grades to Adam Lion-Bouton (adam.lion-bouton@etu.univ-tours.fr),
Agata Savary (first.last@universite-paris-saclay.fr),
Emmanuel Schang (first.last@univ-orleans.fr) and
Jean-Yves Antoine (jean-yves.antoine@univ-tours.fr).


    References

  - Baldwin, T. and Kim, S. N. (2010) Multiword Expressions
    https://people.eng.unimelb.edu.au/tbaldwin/pubs/handbook2009.pdf ,
    in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural
    Language Processing, Second Edition, CRC Press, Boca Raton, USA,
    pp. 267-292.

  - Matthieu Constant, Gülsen Eryigit, Johanna Monti, Lonneke van der
    Plas, Carlos Ramisch, Michael Rosner, and Amalia Todirascu. 2017.
    Multiword expression processing: A survey
    https://www.mitpressjournals.org/doi/full/10.1162/COLI_a_00302
    Computational Linguistics, 43(4):837-892.

  - Adam Lion-Bouton (2021) Multi-criterion optimisation for multiword
    expression lexicon design promoting linguistic diversity,
    Technical report, University of Tours.

  - Morales P. L., Lamarche-Perrin R., Fournier-S'niehotta R.,
    Poulain R., Tabourier L., Tarissan F. (2021) Measuring Diversity in
    Heterogeneous Information Networks
    https://pedroramaciotti.github.io/files/publications/2021_TCS.pdf ,
    in Theoretical Computer Science, Elsevier.

  - Carlos Ramisch, Agata Savary, Bruno Guillaume, Jakub Waszczuk,
    Marie Candito, Ashwini Vaidya, Verginica Barbu Mititelu, Archna
    Bhatia, Uxoa Iñurrieta, Voula Giouli, Tunga Güngör, Menghan Jiang,
    Timm Lichte, Chaya Liebeskind, Johanna Monti, Sara Stymne, Abigail
    Walsh, Renata Ramisch, Hongzhi Xu (2020) Edition 1.2 of the PARSEME
    Shared Task on Semi-supervised Identification of Verbal Multiword
    Expressions
    https://www.aclweb.org/anthology/2020.mwe-1.14/ ,
    in the Proceedings of the Joint Workshop on Multiword Expressions
    and Electronic Lexicons (MWE-LEX 2020), 13 December 2020,
    Barcelona, Spain (online).

  - Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard
    Bejcek, Fabienne Cap, Slavomir Céplö, Silvio Ricardo Cordeiro,
    Gülsen Eryigit, Voula Giouli, Maarten van Gompel, Yaakov
    HaCohen-Kerner, Jolanta Kovalevskaite, Simon Krek, Chaya
    Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der
    Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati,
    Ivelina Stoyanova, Veronika Vincze (2018) "PARSEME multilingual
    corpus of verbal multiword expressions"
    http://langsci-press.org/catalog/view/204/1344/1319-1 ,
    in Stella Markantonatou, Carlos Ramisch, Agata Savary, Veronika
    Vincze (Eds.) "Multiword expressions at length and in depth:
    Extended papers from the MWE 2017 workshop", Language
    Science Press, Berlin, pp. 87-147.

  - Williams J. R., Lessard P. R., Desu S., Clark E. M., Bagrow J. P.,
    Danforth C. M., Dodds P. S. (2015). Zipf's law holds for phrases,
    not words. Scientific Reports, 5.