- Title: Research internship in NLP and ML: Text generation with disentangled semantic and syntactic representations
- Duration: 5-6 months, during the year 2020
- Location: LIMSI, Orsay (south of Paris)
- Supervisor: Caio Corro http://caio-corro.fr/
- Team: Spoken Language Processing / Traitement Automatique de la Parole
- Contact: caio.corro@limsi.fr

*Context*

This internship will focus on text generation with deep generative models, in particular Variational Auto-Encoders (VAEs) [1,2]. The goal is to study how to build a generative model for text in which the semantic and syntactic representations are disentangled [3]. That is, we aim to generate a sentence through the following process:
- sample z: a latent variable encoding the meaning;
- sample z': a latent variable encoding the surface structure (i.e. how the meaning is expressed);
- sample x from p(x | z, z'): a sentence conditioned on its meaning and syntactic structure.

Such models could be used for sentence simplification, paraphrasing or generating diverse text responses [4,5].

Previous work has explored models where z' is encoded as a discrete combinatorial structure [6,7]. However, these methods require annotated linguistic structures to be available during training, and they may not be suitable for large-scale learning as they are computationally expensive. We therefore aim to focus on techniques closer to those developed in computer vision, where both the semantic and syntactic representations are encoded in a fixed-size continuous latent space and learned in a fully unsupervised setting. To this end, the successful candidate will explore generative losses for disentangled representation learning and propose neural architectures specifically designed for text generation in this setting.

*Missions*

- review the literature on learning disentangled latent spaces with VAEs;
- reproduce the experiments from [3,8] with a transformer architecture instead of recurrent networks;
- explore VAE losses for learning disentangled representations;
- propose transformer architectures that isolate structural information from semantic information (e.g. distance, see Section 3 in [9]).

*References*

[1] "Auto-Encoding Variational Bayes", Diederik P. Kingma and Max Welling.
[2] "Stochastic Backpropagation and Approximate Inference in Deep Generative Models", Danilo Jimenez Rezende et al.
[3] "A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence Representations", Mingda Chen et al.
[4] "Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization", Yizhe Zhang et al.
[5] "Jointly Measuring Diversity and Quality in Text Generation Models", Danial Alihosseini et al.
[6] "StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing", Pengcheng Yin et al.
[7] "Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder", Caio Corro and Ivan Titov.
[8] "Effective Estimation of Deep Generative Language Models", Tom Pelsmaeker and Wilker Aziz.
[9] "Constituency Parsing with a Self-Attentive Encoder", Nikita Kitaev and Dan Klein.
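
*Illustration*

For candidates unfamiliar with the setup, the toy PyTorch sketch below illustrates the generative process described in the Context section: a single encoder produces two separate Gaussian latent codes (one for meaning, one for surface structure) and a decoder generates the sentence conditioned on both. All names, dimensions and architectural choices here (GRU encoder/decoder, shared encoder, teacher forcing, standard normal priors) are placeholders for illustration only; they do not prescribe the architectures or losses to be developed during the internship.

    # Toy sketch of a VAE with separate semantic and syntactic latent variables.
    # Everything below is illustrative; the internship targets transformer-based models.
    import torch
    import torch.nn as nn

    class DisentangledVAE(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hid_dim=512, sem_dim=64, syn_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Sentence encoder (a simple GRU here; to be replaced by a transformer).
            self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            # Two independent posteriors: q(z | x) for meaning, q(z' | x) for surface form.
            self.sem_mu = nn.Linear(hid_dim, sem_dim)
            self.sem_logvar = nn.Linear(hid_dim, sem_dim)
            self.syn_mu = nn.Linear(hid_dim, syn_dim)
            self.syn_logvar = nn.Linear(hid_dim, syn_dim)
            # Decoder p(x | z, z') conditioned on both latent codes.
            self.latent_to_hid = nn.Linear(sem_dim + syn_dim, hid_dim)
            self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        @staticmethod
        def reparameterize(mu, logvar):
            # z = mu + sigma * eps, with eps ~ N(0, I)
            return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

        def forward(self, tokens):
            emb = self.embed(tokens)                      # (batch, len, emb_dim)
            _, h = self.encoder(emb)                      # (1, batch, hid_dim)
            h = h.squeeze(0)
            # Posterior parameters and samples for the two latent variables.
            mu_sem, lv_sem = self.sem_mu(h), self.sem_logvar(h)
            mu_syn, lv_syn = self.syn_mu(h), self.syn_logvar(h)
            z_sem = self.reparameterize(mu_sem, lv_sem)
            z_syn = self.reparameterize(mu_syn, lv_syn)
            # Decode conditioned on both codes (teacher forcing on the input tokens).
            init = torch.tanh(self.latent_to_hid(torch.cat([z_sem, z_syn], dim=-1))).unsqueeze(0)
            dec_out, _ = self.decoder(emb, init)
            logits = self.out(dec_out)                    # (batch, len, vocab_size)
            return logits, (mu_sem, lv_sem), (mu_syn, lv_syn)

    def kl_to_standard_normal(mu, logvar):
        # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
        return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)

    def negative_elbo(logits, targets, sem_params, syn_params, beta=1.0):
        # Reconstruction term plus (optionally weighted) KL terms for both latents.
        rec = nn.functional.cross_entropy(logits.transpose(1, 2), targets, reduction="sum")
        kl = kl_to_standard_normal(*sem_params) + kl_to_standard_normal(*syn_params)
        return rec + beta * kl.sum()

Nothing in this sketch enforces disentanglement by itself: designing losses and architectures that actually separate meaning from surface structure is precisely the object of the internship.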