Machine translation has made great progress in recent years thanks to deep neural networks [1,2,3]. A conventional neural machine translation (NMT) system uses a limited vocabulary of "tokens", and its decoder generates one token from this vocabulary at each time step. The tokens of current machine translation systems can be words, characters [4] or subwords such as byte pair encodings (BPEs) [5]. The latter have proved particularly effective at dealing with out-of-vocabulary words and generally lead to state-of-the-art results. However, it is not clear how many units should be kept for a particular MT task, nor what the optimal granularity (characters, subwords, words) is, if any.

The goal of this internship is to investigate approaches that provide models with several views (segmentations) of the text in order to strengthen their robustness. This is particularly important for processing noisy data such as user-generated content (UGC), e.g., user reviews of hotels or restaurants. Such a multiscale neural machine translation model should take these different segmentation granularities into account at both training and decoding time. The proposed method should also be applicable to current state-of-the-art NMT systems based on Transformer networks [3].

Requirements
- Student at Master (research-oriented) or PhD level.
- Knowledge of deep learning as applied to NLP.
- Good coding skills, including experience with at least one of the major deep learning toolkits (preferably PyTorch).

References
[1] Sequence to Sequence Learning with Neural Networks. Ilya Sutskever, Oriol Vinyals, Quoc V. Le. NIPS 2014.
[2] Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. ICLR 2015.
[3] Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. NIPS 2017.
[4] Fully Character-Level Neural Machine Translation without Explicit Segmentation. Jason Lee, Kyunghyun Cho, Thomas Hofmann. TACL 2017.
[5] Neural Machine Translation of Rare Words with Subword Units. Rico Sennrich, Barry Haddow, Alexandra Birch. ACL 2016.
[6] Improving Neural Machine Translation by Incorporating Hierarchical Subword Features. Makoto Morishita, Jun Suzuki, Masaaki Nagata. COLING 2018.
[7] Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. Taku Kudo. ACL 2018.
[8] Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Yonghui Wu, Mike Schuster, et al. arXiv preprint arXiv:1609.08144, 2016.
[9] Neural Lattice-to-Sequence Models for Uncertain Inputs. Matthias Sperber, Graham Neubig, Jan Niehues, Alex Waibel. EMNLP 2017.
[10] On Using Monolingual Corpora in Neural Machine Translation. Caglar Gulcehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loic Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, Yoshua Bengio. arXiv preprint arXiv:1503.03535, 2015.
[11] Optimally Segmenting Inputs for NMT Shows Preference for Character-Level Processing. Julia Kreutzer, Artem Sokolov. arXiv preprint arXiv:1810.01480, 2018.

Start Date: asap

Duration: 5-6 months

Application instructions
To apply, please send a mail and CV to matthias.galle@naverlabs.com, marc.dymetman@naverlabs.com and laurent.besacier@univ-grenoble-alpes.fr.
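
As a concrete illustration of the idea of giving a model several segmentation views of the same text, here is a minimal sketch (not part of the project description) that samples alternative subword segmentations with the SentencePiece library, in the spirit of subword regularization [7]. The model file name, example sentence and sampling parameters are placeholders, and a SentencePiece model is assumed to have been trained beforehand (e.g., with spm_train).

    # Minimal sketch: obtaining several subword segmentations ("views") of the
    # same sentence with SentencePiece, in the spirit of [7]. "spm.model" is a
    # placeholder for a model trained beforehand with spm_train.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="spm.model")

    sentence = "the hotel was surprisingly quiet"

    # Deterministic (best) segmentation: a single view of the input.
    print(sp.encode(sentence, out_type=str))

    # Sampled segmentations: alternative views of the same input that could be
    # fed to the NMT model during training to improve robustness.
    for _ in range(3):
        print(sp.encode(sentence, out_type=str,
                        enable_sampling=True, alpha=0.1, nbest_size=-1))

At decoding time, such alternative segmentations could for instance be combined into a lattice over the input, as in [9].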