- Title: Research internship in NLP and ML: Text generation with disentangled semantic and syntactic representations
- Duration: 5-6 months, during the year 2020
- Location: LIMSI, Orsay (south of Paris)
- Supervisor: Caio Corro http://caio-corro.fr/
- Team: Spoken Language Processing / Traitement Automatique de la Parole
- Contact: caio.corro@limsi.fr

*Context*

This internship will focus on text generation with deep generative models, in particular Variational Auto-Encoders (VAEs) [1,2]. The goal is to study how to build a generative model for text in which the semantic and syntactic representations are disentangled [3]. That is, we aim to generate a sentence through the following process:
- sample z: a latent variable encoding the meaning;
- sample z': a latent variable encoding the surface structure (i.e. how the meaning is expressed);
- sample x from p(x | z, z'): a sentence conditioned on its meaning and syntactic structure.

Such models could be used for sentence simplification, paraphrasing or generating diverse text responses [4,5].

Previous work has explored models where z' is encoded as a discrete combinatorial structure [6,7]. However, these methods require annotated linguistic structures to be available during training, and they may not be suitable for large-scale learning as they are computationally expensive. We therefore aim to focus on techniques closer to those developed in computer vision, where both the semantic and syntactic representations are encoded in a fixed-size continuous latent space and learned in a fully unsupervised setting. To this end, the successful candidate will explore generative losses for disentangled representation learning and propose neural architectures specifically designed for text generation in this setting.

*Missions*

- review the literature on learning disentangled latent spaces with VAEs;
- reproduce the experiments from [3,8] with a transformer architecture instead of recurrent networks;
- explore VAE losses for learning disentangled representations;
- propose transformer architectures that isolate structural information from semantic information (e.g. distance, see Section 3 in [9]).

*References*

[1] "Auto-Encoding Variational Bayes", Diederik P. Kingma and Max Welling.
[2] "Stochastic Backpropagation and Approximate Inference in Deep Generative Models", Danilo Jimenez Rezende et al.
[3] "A Multi-Task Approach for Disentangling Syntax and Semantics in Sentence Representations", Mingda Chen et al.
[4] "Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization", Yizhe Zhang et al.
[5] "Jointly Measuring Diversity and Quality in Text Generation Models", Danial Alihosseini et al.
[6] "StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing", Pengcheng Yin et al.
[7] "Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder", Caio Corro and Ivan Titov.
[8] "Effective Estimation of Deep Generative Language Models", Tom Pelsmaeker and Wilker Aziz.
[9] "Constituency Parsing with a Self-Attentive Encoder", Nikita Kitaev and Dan Klein.
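
*Illustration*

For candidates unfamiliar with the setup, the toy PyTorch sketch below illustrates the generative process described in the Context section: a single encoder produces two separate Gaussian latent codes (one for meaning, one for surface structure) and a decoder generates the sentence conditioned on both. All names, dimensions and architectural choices here (GRU encoder/decoder, shared encoder, teacher forcing, standard normal priors) are placeholders for illustration only; they do not prescribe the architectures or losses to be developed during the internship.

    # Toy sketch of a VAE with separate semantic and syntactic latent variables.
    # Everything below is illustrative; the internship targets transformer-based models.
    import torch
    import torch.nn as nn

    class DisentangledVAE(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hid_dim=512, sem_dim=64, syn_dim=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # Sentence encoder (a simple GRU here; to be replaced by a transformer).
            self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            # Two independent posteriors: q(z | x) for meaning, q(z' | x) for surface form.
            self.sem_mu = nn.Linear(hid_dim, sem_dim)
            self.sem_logvar = nn.Linear(hid_dim, sem_dim)
            self.syn_mu = nn.Linear(hid_dim, syn_dim)
            self.syn_logvar = nn.Linear(hid_dim, syn_dim)
            # Decoder p(x | z, z') conditioned on both latent codes.
            self.latent_to_hid = nn.Linear(sem_dim + syn_dim, hid_dim)
            self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        @staticmethod
        def reparameterize(mu, logvar):
            # z = mu + sigma * eps, with eps ~ N(0, I)
            return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

        def forward(self, tokens):
            emb = self.embed(tokens)                      # (batch, len, emb_dim)
            _, h = self.encoder(emb)                      # (1, batch, hid_dim)
            h = h.squeeze(0)
            # Posterior parameters and samples for the two latent variables.
            mu_sem, lv_sem = self.sem_mu(h), self.sem_logvar(h)
            mu_syn, lv_syn = self.syn_mu(h), self.syn_logvar(h)
            z_sem = self.reparameterize(mu_sem, lv_sem)
            z_syn = self.reparameterize(mu_syn, lv_syn)
            # Decode conditioned on both codes (teacher forcing on the input tokens).
            init = torch.tanh(self.latent_to_hid(torch.cat([z_sem, z_syn], dim=-1))).unsqueeze(0)
            dec_out, _ = self.decoder(emb, init)
            logits = self.out(dec_out)                    # (batch, len, vocab_size)
            return logits, (mu_sem, lv_sem), (mu_syn, lv_syn)

    def kl_to_standard_normal(mu, logvar):
        # KL( N(mu, sigma^2) || N(0, I) ), summed over latent dimensions.
        return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)

    def negative_elbo(logits, targets, sem_params, syn_params, beta=1.0):
        # Reconstruction term plus (optionally weighted) KL terms for both latents.
        rec = nn.functional.cross_entropy(logits.transpose(1, 2), targets, reduction="sum")
        kl = kl_to_standard_normal(*sem_params) + kl_to_standard_normal(*syn_params)
        return rec + beta * kl.sum()

Nothing in this sketch enforces disentanglement by itself: designing losses and architectures that actually separate meaning from surface structure is precisely the object of the internship.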