6-month NLP internship in Blois, France:

*Verbal Multiword Expression Discovery in French Based on Seen Data and
 Distributional Semantics*

* Scientific field: Natural Language Processing (NLP)
* Location: University of Tours, LIFAT (Laboratoire d'Informatique
  Fondamentale et Appliquée de Tours), Blois campus (41)
* Duration: 6 months
* Remuneration: 577 € / month
* Detailed description: http://parsemefr.lis-lab.fr/doku.php?id=2018-lifat-m2-1

* Important dates
  - Application deadline: *15 December 2018* (or until filled)
  - Notification: 15 January 2019
  - Position starts: around February-March 2019
  - Position ends: around July-August 2019

* Requested candidate profile
  - 2nd-year master's student in computational linguistics, computer
    science or a related field
  - Interest in linguistics and familiarity with language technology
  - Good knowledge of French
  - Good programming skills, preferably in Python

* Applications:
  Send your CV and a cover letter to Caroline Pasquer
  (first.last@etu.univ-tours.fr) and Agata Savary
  (first.last@univ-tours.fr).

*Motivation and objectives*

The internship will take place in the framework of the PARSEME-FR
project (http://parsemefr.lis-lab.fr), which involves several NLP teams
in France.

The aim is to boost applications in Natural Language Processing (NLP),
by focusing on one of their major challenges: multiword expressions
(MWEs).

MWEs are groups of words which exhibit unpredictable properties
(Baldwin & Kim, 2010). Most prominently, their meaning does not
straightforwardly derive from the meanings of their components, as in
'casser sa pipe' (literally 'to break one's pipe'), which means
'to die'.

The two major MWE-related NLP tasks are MWE discovery and MWE
identification. In the former, the input consists of large quantities
of raw text and the output is a list of potential MWEs. In the latter,
an identifier takes a text as input and automatically annotates (points
at) the occurrences of MWEs in it. MWE identification is a prerequisite
for downstream applications such as machine translation (which may want
to treat MWEs with dedicated procedures).
Automatic identification of MWEs in 19 languages was addressed by the
PARSEME shared task (Ramisch et al., 2018), in which the BdTln team
participated with the VarIDE system (Pasquer et al., 2018a). The results
of the shared task show that identifying unseen MWEs (i.e. those MWEs
which do not occur in the training data) is particularly
challenging. Thus, identification should, ideally, exploit not only
annotated corpora but also MWE lexicons and MWE discovery methods.

This internship investigates how MWE discovery could benefit from
previously seen data, rather than being performed from scratch. The
hypothesis to be tested is that new (unseen) MWEs of certain types can
be discovered thanks to their semantic similarity with known
(previously seen) MWEs. We focus on the domain of distributional
semantics, which is based on the hypothesis that words having a similar
meaning occur in similar contexts. Recent developments in distributional
semantics include the construction of "word embeddings", i.e. mappings
from words or expressions to low-dimensional vectors of real numbers,
which are expected to represent co-occurrence contexts of these
words/expressions in a compact way. Thus, an embedding of a
word/expression can be considered an abstract representation of its
meaning.
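
To make the distributional hypothesis concrete, here is a minimal
sketch in Python. It is only an illustration under simplifying
assumptions: the three-sentence corpus, the window size and the helper
names (context_vector, cosine) are invented for this example, and the
vectors are raw sparse co-occurrence counts rather than the dense,
low-dimensional embeddings mentioned above.

```python
from collections import Counter
from math import sqrt


def context_vector(target, sentences, window=2):
    """Count the words found within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
                )
    return counts


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented): "chat" and "chien" share their contexts.
corpus = [
    "le chat boit du lait",
    "le chien boit du lait",
    "le train quitte la gare",
]
chat = context_vector("chat", corpus)
chien = context_vector("chien", corpus)
train = context_vector("train", corpus)

print(cosine(chat, chien))  # words seen in identical contexts
print(cosine(chat, train))  # words sharing only the article "le"
```

On this toy corpus the first pair scores close to 1.0 and the second
about 0.33, which is the kind of contrast that embeddings are expected
to capture in a more compact form.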

The objectives of this internship are to exploit word embeddings for
discovery of new MWEs based on their semantic proximity to the
previously seen MWEs, contained in a lexicon or in an annotated corpus
(resources of both types belong to the outcomes of the PARSEME-FR
project). The discovery should lead to (semi-)automatic enrichment of
these initial resources.
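
As a rough illustration of this objective, and not a description of
the method to be developed, the Python sketch below ranks candidate
expressions by their cosine similarity to the nearest previously seen
MWE. Everything in it is hypothetical: the three-dimensional vectors
are made up, the function name rank_candidates and the 0.8 threshold
are arbitrary, and it assumes that one embedding per whole expression
is already available.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_candidates(seen, candidates, threshold=0.8):
    """Keep candidates whose nearest seen MWE is at least `threshold` close.

    Returns (candidate, nearest seen MWE, similarity) triples, sorted by
    decreasing similarity.
    """
    kept = []
    for cand, vec in candidates.items():
        nearest, score = max(
            ((mwe, cosine(vec, seen_vec)) for mwe, seen_vec in seen.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            kept.append((cand, nearest, score))
    return sorted(kept, key=lambda triple: triple[2], reverse=True)


# Made-up embeddings, for illustration only.
seen = {
    "casser sa pipe": (0.9, 0.1, 0.2),
    "prendre une décision": (0.1, 0.9, 0.3),
}
candidates = {
    "passer l'arme à gauche": (0.8, 0.2, 0.1),
    "casser une vitre": (0.1, 0.2, 0.9),
}

discovered = rank_candidates(seen, candidates)
for cand, nearest, score in discovered:
    print(f"{cand} ~ {nearest} ({score:.2f})")
```

With these invented vectors only "passer l'arme à gauche" passes the
threshold, with "casser sa pipe" as its nearest seen MWE. In a
semi-automatic setting, such a ranked list would be reviewed by a human
before being added to the lexicon.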