Master 2 Internship

Exploring Complexity Factors in Multi-Hop Conversational QA

Frederic Bechet, Elie Antoine
prenom.nom@lis-lab.fr
LIS, Aix Marseille Université

Context and Motivation

Machine Reading Question Answering (MRQA) has emerged as a key task in Information Retrieval (IR), aiming to identify word spans in documents that answer posed questions. This task has gained significant traction with the introduction of large-scale datasets such as SQuAD [6], which consist of (document, question, answer) triplets. While pretrained language models such as BERT have achieved near-perfect performance on these benchmarks, the simplicity of the task - questions typically requiring only single-span answers within the document - has limited its ability to challenge state-of-the-art models. To address this, research has expanded into two key areas of QA complexity: Conversational Question Answering (CQA) and Multi-Hop Question Answering (MHQA) [7, 9, 5, 8]. CQA introduces interrelated sequences of questions and answers, requiring systems to handle linguistic phenomena such as coreference resolution, ellipsis, and implicit references. MHQA, in contrast, requires aggregating evidence from multiple document spans to construct an answer. Both tasks present unique challenges that push QA systems toward deeper reasoning and more sophisticated context modeling. The recently introduced Calor-Dial [2] corpus for French integrates these challenges into a unified framework, focusing on conversational QA over encyclopedic documents. In addition to multi-hop QA, Calor-Dial provides annotations for question rewriting and answer paraphrasing, enabling an unprecedented exploration of linguistic complexity in QA.

Objective

This internship aims to investigate the complexity factors underlying multi-hop conversational QA tasks using the Calor-Dial corpus.
The research will focus on understanding the interplay between linguistic phenomena (e.g., coreference resolution, ellipsis, paraphrasing) and multi-span reasoning. By identifying the challenges faced by current MRQA models based on Large Language Models (LLMs), the project seeks to propose directions for advancing QA systems.

Methodology

1. Literature Review
The first stage will consist of a literature review on MRQA, CQA, and MHQA. Following recent studies on the challenges LLMs face when dealing with complex QA, special attention will be given to work highlighting the limitations of current models in handling complex linguistic structures and multi-hop reasoning [3, 4].

2. Data Exploration
The Calor-Dial corpus will be analyzed to:
- Quantify the prevalence of multi-hop questions and categorize their structural patterns.
- Identify the linguistic phenomena present in questions, including coreference, ellipsis, and paraphrasing.
- Examine the relationship between question complexity and the nature of answers (e.g., single-span vs. multi-span).

3. Model Benchmarking
State-of-the-art MRQA models will be evaluated on the Calor-Dial corpus. Key evaluation metrics will include:
- Accuracy in detecting multi-span answers.
- Impact of question rewriting and answer paraphrasing on performance.
- Effectiveness in addressing specific linguistic phenomena.

4. Complexity Analysis
Metrics will be developed to quantify the complexity of questions and answers, considering factors such as the number of reasoning steps and linguistic features. Model performance will be correlated with these metrics to identify critical bottlenecks. Following recent work done in our research team [1], this study will focus on complexity factors that can affect all models, regardless of their parameter count.

Expected Outcomes

- A detailed characterization of complexity factors in multi-hop conversational QA tasks.
- Benchmarks of MRQA models on the Calor-Dial corpus.
- Insights into the limitations of existing QA systems and recommendations for overcoming them.

Practicalities

The internship will be funded at 500 euros per month for a duration of 6 months. It will take place in Marseille within the TALEP research group at LIS/CNRS on the Luminy campus. The intern will collaborate with other interns, PhD students and researchers from the research group in Marseille, as well as with the Orange Lab research group in Lannion, a partner of LIS in the CALOR project.

How to apply: send an application letter, transcripts and CV to: frederic.bechet@univ-amu.fr
- Application deadline: December 20th, 2024
- Expected start: early spring 2025

References

[1] Elie Antoine, Frederic Bechet, Géraldine Damnati, and Philippe Langlais. A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18376-18392, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[2] Frédéric Béchet, Ludivine Robert, Lina Rojas-Barahona, and Géraldine Damnati. Calor-Dial: a corpus for Conversational Question Answering on French encyclopedic documents. In CIRCLE (Joint Conference of the Information Retrieval Communities in Europe), Samatan, France, July 2022.

[3] Neeladri Bhuiya, Viktor Schlegel, and Stefan Winkler. Seemingly plausible distractors in multi-hop reasoning: Are large language models attentive readers? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2514-2528, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[4] Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping too late: Exploring the limitations of large language models on multi-hop queries.
In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14113-14130, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[5] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174-2184, Brussels, Belgium, October 2018. Association for Computational Linguistics.

[6] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics, 2016.

[7] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 7:249-266, May 2019.

[8] Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[9] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369-2380, 2018.