Master Internship offer - Spring 2023
Personalized data-to-text neural generation
Laure Soulier, Christophe Gravier

Introduction

Information
Supervisors: laure.soulier@isir.upmc.fr, christophe.gravier@univ-st-etienne.fr
Localization: Saint-Étienne (LaHC) or Saclay (LISN), France
Duration: 6 months, between February and August 2023.
Stipend: around 573.30 euros / month
Expected profile: Master or engineering degree in Computer Science or Applied Mathematics related to machine learning/natural language processing. The candidate should have a strong scientific background with good technical skills in programming, and be fluent in reading and writing English.

How to apply?
Send a CV, a motivation letter and Master records to laure.soulier@isir.upmc.fr and christophe.gravier@univ-st-etienne.fr. Recommendation letters would be appreciated. Interviews will be conducted as applications arrive and the position will be filled as soon as possible - the latest application date is set to 15th January.

Context

Building on prior work at Jacobs University Bremen in Germany and the University of Montréal [2], a novel neural architecture, the "transformer" (fully based on attention), was devised in 2017 in a key paper from Google Brain [18]. The main idea of the attention mechanism is to alleviate a limitation of training neural architectures for machine translation, namely the need to predict tokens up to the (n-1)-th one in order to predict the n-th word of a sequence (so-called recurrent networks), thereby allowing parallel training on GPUs of (very) large NLP neural models. The attention mechanism removes the recurrent paradigm from the trained predictor and instead learns the weights of surrounding tokens (i.e. words), depending on the token being processed at a given time. This paper is the building block of many NLP contributions nowadays (the "transformer" paper is cited 28,403 times as of September 2021!).
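To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention, the core operation of the transformer [18], in plain NumPy. It is illustrative only: a single head, no masking and no learned projections, all of which the full architecture adds on top.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: weight each value by how well its key matches
    the query, so every token attends to all surrounding tokens at once
    (no recurrence, hence parallelizable training)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n, n) pairwise token similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over surrounding tokens
    return weights @ V                                # (n, d_v) context-aware representations

# Toy example: 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X)           # self-attention: Q = K = V = X
print(out.shape)                                      # (4, 8)
```

In the full transformer, Q, K and V are learned linear projections of the input, and several such heads run in parallel.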
The transformer architecture led to very large language models such as BERT [5] or RoBERTa [6], which are able to solve tasks such as text classification [16], question answering [19], etc. A tremendously exciting task is text generation, that is the ability to leverage such language models to create NLP systems that can generate free text - a long-standing goal in the field of Artificial Intelligence. Among these models, GPT-3 [3] is probably the most impressive and creative. Besides common limitations of such systems [7, 14], a key observation is that the text is generated in a left-to-right fashion - a setting called auto-regressive. It is therefore not trivial to control the generator (i.e. to set constraints such as the presence or absence of a given token). It is even harder to control the way the model expresses itself, that is to say the style in which it should generate text. The main way to achieve such control is to use an existing style-annotated corpus and create generative models that learn to perform style transfer [1, 4, 8] (the problem is therefore cast as a domain transfer issue). A critical issue is how to evaluate style transfer systems for text generation [9, 17].

Objectives

In this internship we are interested in a special case of text generation: data-to-text generation. In this setting, the task is to generate sentences in natural language based on structured or semi-structured data. To give a data-to-text example, a famous academic dataset is made of statistics of baseball games paired with human-written summaries of the games [11], which we ultimately want the system to learn to generate. Beyond this toy example, data-to-text is of the utmost practical interest in many scenarios such as finance, ... This task is a special case of text generation and comes with its own specific challenges. Data-to-text models are prone to hallucinations, that is, generating grammatically correct but irrelevant, out-of-the-blue sentences [12]. Moreover, since the inputs are structured or semi-structured data, alternative encoding solutions are called for compared to standard text inputs made of sequences of tokens arranged as sentences, as the sketch below illustrates.
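One common baseline (not necessarily the approach this internship will adopt) is to linearize the input table into a key-value token sequence that a standard sequence-to-sequence model can consume. A minimal sketch, using a hypothetical movie record mirroring the tabular descriptions mentioned below:

```python
def linearize_record(record: dict) -> str:
    """Flatten a structured record into a flat token sequence by
    interleaving field names and values, a common data-to-text baseline."""
    return " ".join(f"<{field}> {value}" for field, value in record.items())

# Hypothetical movie entry; field names are illustrative, not the dataset's schema.
movie = {"title": "Alien", "year": 1979, "genre": "science fiction", "director": "Ridley Scott"}
print(linearize_record(movie))
# <title> Alien <year> 1979 <genre> science fiction <director> Ridley Scott
```

More elaborate encoders exploit the table structure directly instead of flattening it, e.g. hierarchically [12].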
The objective of the internship is to develop a neural data-to-text system able to personalize the text generation. Based on a previous work, we will first focus on a movie dataset in which we have at our disposal tabular movie descriptions (the data) and reviews. The objective will be to personalize reviews for a given user. Secondly, while there exist studies on how to evaluate data-to-text generators [10, 13], to the best of our knowledge none consider style transfer/text personalization for text generation besides [15]. As such, finding means to perform style transfer evaluation for data-to-text generators is fully part of the internship, on top of finding neural solutions to perform style-transfer-aware data-to-text generation. The evaluation we seek has to be automatic or semi-automatic. For inspiration, a great example of a semi-automatic technique (for the task of summarisation rather than data-to-text) is [20].
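To fix ideas, automatic style transfer evaluation is often decomposed into style accuracy and content preservation [9]. The sketch below assumes a generic pretrained style classifier and uses word overlap as a crude content proxy; both are placeholders for illustration, not the protocol to be developed during the internship.

```python
from typing import Callable, List, Tuple

def evaluate_style_transfer(
    pairs: List[Tuple[str, str]],             # (source text, generated text)
    target_style: str,
    style_classifier: Callable[[str], str],   # assumption: maps any text to a style label
) -> Tuple[float, float]:
    """Return (style accuracy, mean content preservation) over a corpus."""
    hits, overlaps = 0, []
    for source, generated in pairs:
        hits += style_classifier(generated) == target_style
        src, gen = set(source.lower().split()), set(generated.lower().split())
        overlaps.append(len(src & gen) / max(len(src | gen), 1))  # Jaccard word overlap
    return hits / len(pairs), sum(overlaps) / len(overlaps)

# Toy stand-in classifier; a real protocol would use a trained model.
toy_classifier = lambda text: "positive" if "great" in text else "negative"
acc, content = evaluate_style_transfer(
    [("the movie was long", "the movie was long and great")], "positive", toy_classifier
)
print(f"{acc:.2f} {content:.2f}")  # 1.00 0.67
```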
The workplan proposed to the student is as follows:
1. Literature review on data-to-text generation and author style transfer/personalization.
2. Become familiar with the work of the previous trainee, i.e. explore the created dataset and the baselines already investigated.
3. Pursue the work by proposing novel models and enhancing the evaluation protocol.
4. Conduct experiments on the proposed solutions and evaluation schemes with respect to baseline systems.
5. If the internship leads to published work, we will provide support for presenting it at a conference.

Recommendation for applicants

If you want to know more about the direction of this research and this internship, you may consider reading the following articles first:
- On style transfer: Xiang Ao et al. "PENS: A Dataset and Generic Framework for Personalized News Headline Generation". In: The Annual Meeting of the Association for Computational Linguistics (ACL). Aug. 2021. url: https://www.microsoft.com/en-us/research/publication/pens-a-dataset-and-generic-framework-for-personalized-news-headline-generation/
- On evaluating style transfer: Remi Mir et al. "Evaluating Style Transfer for Text". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 495-504. doi: 10.18653/v1/N19-1049. url: https://aclanthology.org/N19-1049
- On semi-automatic evaluation of text generators: Shiyue Zhang and Mohit Bansal. "Finding a Balanced Degree of Automation for Summary Evaluation". In: The 2021 Conference on Empirical Methods in Natural Language Processing. 2021

References

[1] Xiang Ao et al. "PENS: A Dataset and Generic Framework for Personalized News Headline Generation". In: The Annual Meeting of the Association for Computational Linguistics (ACL). Aug. 2021. url: https://www.microsoft.com/en-us/research/publication/pens-a-dataset-and-generic-framework-for-personalized-news-headline-generation/.
[2] Dzmitry Bahdanau et al. "Neural machine translation by jointly learning to align and translate". In: arXiv preprint arXiv:1409.0473 (2014).
[3] Tom B. Brown et al. "Language Models are Few-Shot Learners". In: (2020). arXiv: 2005.14165 [cs.CL].
[4] Kunal Chawla et al. "Semi-supervised Formality Style Transfer using Language Model Discriminator and Mutual Information Maximization". In: Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, Nov. 2020, pp. 2340-2354. doi: 10.18653/v1/2020.findings-emnlp.212. url: https://aclanthology.org/2020.findings-emnlp.212.
[5] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).
[6] Yinhan Liu et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019. arXiv: 1907.11692 [cs.CL].
[7] Li Lucy et al. "Gender and Representation Bias in GPT-3 Generated Stories". In: Proceedings of the Third Workshop on Narrative Understanding. Virtual: Association for Computational Linguistics, June 2021, pp. 48-55. doi: 10.18653/v1/2021.nuse-1.5. url: https://aclanthology.org/2021.nuse-1.5.
[8] Eric Malmi et al. "Unsupervised Text Style Transfer with Padded Masked Language Models". In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, Nov. 2020, pp. 8671-8680. doi: 10.18653/v1/2020.emnlp-main.699. url: https://aclanthology.org/2020.emnlp-main.699.
[9] Remi Mir et al. "Evaluating Style Transfer for Text". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 495-504. doi: 10.18653/v1/N19-1049. url: https://aclanthology.org/N19-1049.
[10] Laura Perez-Beltrachini et al. "Analysing Data-To-Text Generation Benchmarks". In: Proceedings of the 10th International Conference on Natural Language Generation. Santiago de Compostela, Spain: Association for Computational Linguistics, Sept. 2017, pp. 238-242. doi: 10.18653/v1/W17-3537. url: https://aclanthology.org/W17-3537.
[11] Laura Perez-Beltrachini et al. "Bootstrapping Generators from Noisy Data". In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, June 2018, pp. 1516-1527. doi: 10.18653/v1/N18-1137. url: https://aclanthology.org/N18-1137.
[12] Clément Rebuffel et al. "A Hierarchical Model for Data-to-Text Generation". In: Advances in Information Retrieval. Ed. by Joemon M. Jose et al. Cham: Springer International Publishing, 2020, pp. 65-80. isbn: 978-3-030-45439-5.
[13] Clément Rebuffel et al. "Data-QuestEval: A Referenceless Metric for Data to Text Semantic Evaluation". In: arXiv preprint arXiv:2104.07555 (2021).
[14] Timo Schick et al. "It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners". In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, June 2021, pp. 2339-2352. doi: 10.18653/v1/2021.naacl-main.185. url: https://aclanthology.org/2021.naacl-main.185.
[15] Sandeep Subramanian et al. "Multiple-Attribute Text Style Transfer". In: CoRR abs/1811.00552 (2018). arXiv: 1811.00552. url: http://arxiv.org/abs/1811.00552.
[16] Chi Sun et al. "How to fine-tune BERT for text classification?" In: China National Conference on Chinese Computational Linguistics. Springer, 2019, pp. 194-206.
[17] Craig Thomson et al. "A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems". In: Proceedings of the 13th International Conference on Natural Language Generation. Dublin, Ireland: Association for Computational Linguistics, Dec. 2020, pp. 158-168. url: https://aclanthology.org/2020.inlg-1.22.
[18] Ashish Vaswani et al. "Attention is all you need". In: Proc. of NIPS 2017. 2017, pp. 5998-6008.
[19] Wei Yang et al. "End-to-End Open-Domain Question Answering with BERTserini". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 72-77. doi: 10.18653/v1/N19-4013. url: https://aclanthology.org/N19-4013.
[20] Shiyue Zhang et al. "Finding a Balanced Degree of Automation for Summary Evaluation". In: The 2021 Conference on Empirical Methods in Natural Language Processing. 2021.