Testing Multilingual Language Representations for Parsing

Language diversity remains a challenge for NLP. Multilingual distributional representations (multilingual word embeddings) are known to be an effective way to address different languages simultaneously, although other approaches, such as language transfer, have also been proposed [1]. These techniques have been applied to many NLP areas, including morphological analysis, named entity recognition, machine translation, and dependency parsing [2, 3, 4].

This internship is centered on a model-transfer dependency parsing approach using multilingual feature representations, previously developed at LATTICE. This system obtained state-of-the-art results in the CoNLL 2017 shared task (see http://universaldependencies.org/conll17/).

The goal of the internship is to gain a better understanding of the analysis process and, more specifically, of the multilingual approach. Neural networks are generally used as black boxes, but it is also worthwhile to examine the representations they rely on and to test, for example, the quality of the multilingual word embeddings (have the different languages been correctly aligned? a minimal sketch of such a check is given after the references). Different parameters (size of the training corpora, dictionary size, etc.) will be varied to see how they affect the results. The exact content of the internship will depend on the interests and skills of the student.

* Requirements
- excellent programming skills
- some knowledge of neural networks
- excellent English (both spoken and written; English will be the main communication language for this internship)

* Duration and conditions
3 to 5 months, Spring 2018. Normal working conditions (internship stipend + transport allowance).

* How to apply?
Send a CV, a recent transcript of grades, and a short email explaining your motivation in a few words to Thierry Poibeau (thierry.poibeau@ens.fr) and Kyungtae Lim (kyungtae.lim@ens.fr). The position is open until filled, but applications received after February 2018 will not be considered.

* References
[1] Stanford CS224n: Natural Language Processing with Deep Learning. Lecture notes: http://web.stanford.edu/class/cs224n/lecture_notes/cs224n-2017-notes2.pdf
[2] Joulin, Armand, et al. "FastText.zip: Compressing text classification models." arXiv preprint arXiv:1612.03651 (2016). https://github.com/facebookresearch/fastText
[3] Artetxe, Mikel, Gorka Labaka, and Eneko Agirre. "Learning bilingual word embeddings with (almost) no bilingual data." Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2017): 451-462.
[4] Lim, KyungTae, and Thierry Poibeau. "A system for multilingual dependency parsing based on bidirectional LSTM feature representations." Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (2017): 63-70.
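
* Example: checking embedding alignment
As a concrete illustration of the alignment check mentioned above, the sketch below measures how often a source word's nearest cross-lingual neighbour in a shared embedding space is its dictionary translation (precision@1 on a bilingual test lexicon). This is only one possible diagnostic, not part of the LATTICE system itself; the file names and the toy test dictionary are hypothetical.

    # Minimal sketch: probe whether two embedding sets mapped to a shared space
    # are aligned, by nearest-neighbour retrieval over a bilingual test lexicon.
    # File names and word pairs below are hypothetical placeholders.
    import numpy as np

    def load_vectors(path):
        """Read a word2vec/fastText-style text file: 'word v1 v2 ... vd' per line."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            next(f)  # skip the "count dim" header line
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    def precision_at_1(src_vecs, tgt_vecs, test_dict):
        """test_dict: list of (source_word, expected_target_word) pairs."""
        tgt_words = list(tgt_vecs)
        tgt_matrix = np.stack([tgt_vecs[w] for w in tgt_words])
        tgt_matrix /= np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
        correct = total = 0
        for src_word, expected in test_dict:
            if src_word not in src_vecs or expected not in tgt_vecs:
                continue  # skip out-of-vocabulary pairs
            v = src_vecs[src_word]
            v = v / np.linalg.norm(v)
            nearest = tgt_words[int(np.argmax(tgt_matrix @ v))]  # cosine nearest neighbour
            correct += nearest == expected
            total += 1
        return correct / total if total else 0.0

    if __name__ == "__main__":
        # Hypothetical inputs: embeddings already mapped to a shared space
        # (e.g. with the method of [3]) and a small gold test dictionary.
        en = load_vectors("en.aligned.vec")
        fr = load_vectors("fr.aligned.vec")
        pairs = [("dog", "chien"), ("house", "maison"), ("eat", "manger")]
        print("precision@1:", precision_at_1(en, fr, pairs))

A high precision@1 on a held-out dictionary suggests the languages have been mapped into a genuinely shared space; a low score points to alignment problems that could in turn explain parsing errors in the model-transfer setting.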