We propose two internship positions at IRIT, Toulouse, in the MELODI team. To apply, please send an email with a CV and a few lines explaining your motivation to: chloe.braud@irit.fr and philippe.muller@irit.fr

Question generation based on discourse relations

Question generation is the task of automatically producing varied questions about a document or a set of documents, to help match real users' questions to the information contained in those documents, for instance in Customer Relationship Management or when producing FAQs. Existing approaches usually focus on simple questions whose answer is a single fact. This internship aims at generating more complex questions involving structured answers (e.g. how to carry out a task, or why something is true). To this end, we plan to rely on discourse structure analysis of texts. Discourse structure corresponds to the semantic links between sentences (e.g. "explanation", "contrast") that organize a document. Integrating explicit knowledge about the structure of the input document makes it possible to generate specific and complex questions, based on templates for different types of questions that are naturally linked to certain discourse relations ("purpose", "explanation", "result", "manner" and similar functions).

The goal of this internship is to build a system that generates complex questions from a text excerpt, using the presence of discourse connectives / relations. The first phase of the internship will be a review of the state of the art in the domain, covering existing datasets with a focus on complex questions (cf. the HotpotQA, ELI5, TyDi QA and FQuAD references below) and methods for complex question generation using pretrained language models. The second step will be to develop an approach for question generation based on data annotated with discourse information. Using the Penn Discourse Treebank, we propose to focus on a few relations (e.g. manner, explanation, ...)
and extract the pairs of linked spans. A complementary approach will rely on texts automatically annotated with discourse connectives, using existing models. A pretrained language model suitable for text generation (e.g. BART) can then be fine-tuned to generate questions. The model would be bootstrapped with templates and with pairs of relevant questions and answers coming from existing QA corpora. An evaluation will be carried out on a corpus provided by the SYNAPSE company, which specializes in the development of chatbots for FAQs, in the context of the ANR project Quantum.

Supervision: Chloé Braud (chloe.braud@irit.fr), Philippe Muller (philippe.muller@irit.fr)
Location: IRIT, University of Toulouse, France, within the MELODI team
Duration: 5-6 months
Compensation: 546 €/month
Requirements: Master 2 or equivalent in Computer Science or Mathematics; good programming skills and knowledge of Machine Learning principles and tools. Some knowledge of NLP would be a plus but is not required.

References:

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2369-2380). https://aclanthology.org/D18-1259/

Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., & Auli, M. (2019). ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3558-3567). https://aclanthology.org/P19-1346/

Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., & Palomaki, J. (2020). TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics. https://aclanthology.org/2020.tacl-1.30/

Zhang, R., Guo, J., Chen, L., Fan, Y., & Cheng, X. (2021). A Review on Question Generation from Natural Language Text. ACM Trans. Inf. Syst. 40(1), Article 14, 43 pages.
https://doi.org/10.1145/3468889

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.703/
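As a concrete illustration of the template-based bootstrapping described above, here is a minimal Python sketch that maps a few discourse relations to question templates and instantiates them with one argument span of the relation. The relation inventory, templates, and example spans are hypothetical, not the actual resources of the project; a real system would also need syntactic rewriting (e.g. auxiliary inversion) rather than raw slot filling.

```python
# Hypothetical mapping from a discourse relation to a question template.
# These relation names and templates are illustrative only.
TEMPLATES = {
    "explanation": "Why {arg1}?",
    "purpose": "What is the purpose of {arg1}?",
    "manner": "How {arg1}?",
    "result": "What happens when {arg1}?",
}

def generate_question(relation: str, arg1: str) -> str:
    """Instantiate a question template for a (relation, argument-span) pair."""
    template = TEMPLATES.get(relation.lower())
    if template is None:
        raise ValueError(f"No template for relation: {relation}")
    # Strip final punctuation and lowercase the first word before insertion.
    clause = arg1.strip().rstrip(".").strip()
    clause = clause[0].lower() + clause[1:]
    return template.format(arg1=clause)

# Example on a PDTB-style pair of linked spans:
#   Arg1: "The plant was closed."  Arg2: "Demand had fallen sharply."
print(generate_question("explanation", "The plant was closed."))
# -> Why the plant was closed?
```

Such template outputs are deliberately crude; their role is only to bootstrap training data for a generation model (e.g. a fine-tuned BART), which can then learn to produce fluent questions.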