Research intern position (with the goal of pursuing a PhD) at Inria on the generation of natural language datasets and metrics of representativeness

*Place of work*: Inria center in the Paris area (Paris / Saclay / Rocquencourt)
*Duration*: 6-month internship + 3-year PhD
*Starting date*: Anytime in 2023
*Keywords*: artificial intelligence, natural language processing, natural language generation, story generation, evaluation metrics

*Context*

This internship fits within the roadmap activities of Inria's Defense & Security Department, which develops and maintains a serious game platform that can simulate the activity of a crisis monitoring cell. For instance, the French Ministry of Armed Forces employs geopolitical crisis analysts to better identify emerging or ongoing conflicts throughout the world. These analysts are typically overwhelmed by continuous streams of plain-text information and can be helped by Natural Language Processing (NLP) tooling. Among other uses, the platform makes it possible to experiment with the NLP tools developed by researchers and partners, in order to get practical feedback from players on their usefulness during action.

For these simulations, Inria has created a fictional world, including imaginary cities and an imaginary historical, social and political background. In order to test NLP solutions in a game scenario, it is necessary to create large numbers of text documents, which need to be both realistic in terms of their form and type of contents, and tailored to the context of that imaginary world. To date, this creation remains mostly manual, which is very time-consuming. To scale the scenarios up, the plan is to gradually automate this data creation, and this internship will contribute to that ambition.

The intern will be supervised by Dr Lauriane Aufrant, who is the lead NLP researcher within Inria's Defense & Security Department.
PhD supervision will be done jointly with Dr Frédérique Segond (Inria's Defense & Security Director) or with a researcher from another Inria team, depending on the exact PhD topic chosen (to be discussed, see below).

*Candidate profile*

- Pursuing a master's degree in Natural Language Processing, Computational Linguistics or Computer Science with a specialization in Machine Learning
- Theoretical and practical knowledge of deep learning, as well as traditional machine learning and knowledge-driven AI
- Strong programming skills (at least Python, git, Linux environment, command line and scripting)
- Fluency in English. Knowledge of or interest in the French language. Knowledge of a second foreign language would be appreciated.

*How to apply*

Send a CV and a cover letter to lauriane.aufrant and frederique.segond (both at inria.fr). Names of referees or reference letters would be appreciated but are not mandatory.

*Internship description*

Two configurations for Natural Language Generation (NLG) will be considered throughout this work: topic-based NLG (to produce haystacks) and knowledge base-to-text NLG (to produce needles).

The internship will be devoted to preparing the scientific framework within which the PhD work will be developed, in particular the metrics that will be used to guide and evaluate future contributions on the generation itself.

In the current NLG literature, text generation is usually evaluated at the sentence or document level only, by considering aspects such as fluency (is it good French?) or coherence (does the text make sense?). NLG evaluation at the dataset level is considered much more rarely. Such evaluation has to account for higher-level properties, such as realism, consistency and diversity:

- Diversity, as in generating documents with sufficient variability in form and content, is increasingly considered in the literature, in particular as a result of some NLG models' tendency to repeat themselves.
  However, existing metrics are often rather basic and cover only some aspects of diversity (see for instance https://arxiv.org/pdf/2006.14799.pdf or https://aclanthology.org/2021.eacl-main.25.pdf ), so there is room for improvement.
- Consistency, as in avoiding the generation of one document where a given person is born on April 12 and another where the same person is born on April 13, is sometimes considered at the level of a single document (in the context of story generation), but not really across documents. Still, there is inspiration to be drawn here from the fact-checking literature, for instance by considering settings where each document in turn is fact-checked against all others.
- Realism is a much more complex, multi-faceted property, encompassing at least the adequacy of style (would the fictional author have written that way, in light of their background?), the nature of contents (would a transcript of a political debate cite football match results?), the nature of facts (would a politician be aged 15, or a World Cup football player aged 60?), and possibly many other aspects. On this part, most of the work remains to be done.

The intern's work will be to develop new metrics to quantify dataset quality along those three properties, drawing inspiration both from conceptual investigations and user studies (to build a taxonomy of aspects to evaluate) and from empirical studies conducted with state-of-the-art NLG models, to observe how the generated data behaves under various tentative metrics.

*PhD follow-up*

Based on the scientific methodology developed during the internship to evaluate NLG quality at the dataset level, the work will be pursued as a PhD to propose new generation methods that perform better on those metrics.
For both settings (topic-based and knowledge base-to-text NLG), the goal is to compare pure generation methods (using GPT-3-like models or more specialized models) with approaches based on well-focused Web crawling followed by text substitutions, paraphrasing and other automated transformations applied to the collected documents, to modify their style or their base information. The PhD student will work on designing and implementing new approaches along those lines, and will also use the new metrics to evaluate and compare those approaches with existing models in the NLG literature. As work progresses, the initially proposed metrics will likely be refined or complemented to better match empirical evidence gathered during the PhD.

The exact PhD topic will be written in coordination with the intern, to fit their primary interests within that broad objective. In any case, the work will first focus on plain-text documents written in correct language, and will then tackle the generation of more complex types of data, for which several options can be considered, such as: generating tweets, documents with corrupted language (simulating typos or grammar errors in the most realistic way), multimodal documents that include and discuss tables or pictures, corpora containing divergent views and (purposely) inconsistent facts, etc.

Since PhD application processes take place early in the year (February-April), the intern will be asked to commit early to that PhD follow-up, possibly even before the internship begins, and to be ready to devote some time to writing the application over that period.
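
*Illustration (not part of the project specification)*

To make two of the dataset-level properties named in the internship description more concrete, here is a minimal Python sketch under stated assumptions. It computes distinct-n (the ratio of unique to total n-grams over a whole corpus), a commonly used but basic diversity measure, and runs a naive cross-document consistency check. The function names, the regex tokenizer and the (entity, attribute) -> value fact format are hypothetical choices made for this example only; they do not describe the metrics the internship is expected to produce.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokenizer (a simplifying assumption, not a real NLP tokenizer)."""
    return re.findall(r"\w+", text.lower())


def distinct_n(documents, n=2):
    """Corpus-level distinct-n: number of unique n-grams divided by total n-grams."""
    ngrams = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        # zip over n shifted copies of the token list yields the n-grams of one document
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


def conflicting_facts(facts_per_document):
    """Return the (entity, attribute) keys that receive different values across documents.

    facts_per_document: list of dicts mapping (entity, attribute) -> value,
    assumed to come from a hypothetical upstream information-extraction step.
    """
    seen = defaultdict(set)
    for facts in facts_per_document:
        for key, value in facts.items():
            seen[key].add(value)
    return {key: values for key, values in seen.items() if len(values) > 1}
```

A corpus that repeats the same document scores lower on distinct_n than one made of different texts, and conflicting_facts would flag a person given the birth dates April 12 and April 13 in two different documents. Neither function says anything about realism, for which no simple proxy is suggested here.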