Master 2 Internship

Exploring Complexity Factors in Multi-Hop Conversational QA

Frederic Bechet, Elie Antoine
prenom.nom@lis-lab.fr
LIS, Aix Marseille Université

Context and Motivation

Machine Reading Question Answering (MRQA) has emerged as a key task in Information Retrieval (IR), aiming to identify word spans in documents that answer posed questions. This task has gained significant traction with the introduction of large-scale datasets such as SQuAD [6], which consist of (document, question, answer) triplets. While pretrained language models such as BERT have achieved near-perfect performance on these benchmarks, the simplicity of the task - questions typically requiring only single-span answers within the document - has limited its ability to challenge state-of-the-art models. To address this, research has expanded into two key areas of QA complexity: Conversational Question Answering (CQA) and Multi-Hop Question Answering (MHQA) [7, 9, 5, 8]. CQA introduces interrelated sequences of questions and answers, requiring systems to handle linguistic phenomena such as coreference resolution, ellipsis, and implicit references. MHQA, in contrast, requires aggregating evidence from multiple document spans to construct an answer. Both tasks present unique challenges that push QA systems toward deeper reasoning and more sophisticated context modeling. The recently introduced Calor-Dial [2] corpus for French integrates these challenges into a unified framework, focusing on conversational QA over encyclopedic documents. In addition to multi-hop QA, Calor-Dial provides annotations for question rewriting and answer paraphrasing, enabling an unprecedented exploration of linguistic complexity in QA.

Objective

This internship aims to investigate the complexity factors underlying multi-hop conversational QA tasks using the Calor-Dial corpus.
The research will focus on understanding the interplay between linguistic phenomena (e.g., coreference resolution, ellipsis, paraphrasing) and multi-span reasoning. By identifying the challenges faced by current MRQA models based on Large Language Models (LLMs), the project seeks to propose directions for advancing QA systems.

Methodology

1. Literature Review
The first stage will consist of a literature review on MRQA, CQA, and MHQA. Following recent studies on the challenges LLMs face when dealing with complex QA, special attention will be given to work highlighting the limitations of current models in handling complex linguistic structures and multi-hop reasoning [3, 4].

2. Data Exploration
The Calor-Dial corpus will be analyzed to:
- Quantify the prevalence of multi-hop questions and categorize their structural patterns.
- Identify the linguistic phenomena present in questions, including coreference, ellipsis, and paraphrasing.
- Examine the relationship between question complexity and the nature of answers (e.g., single-span vs. multi-span).

3. Model Benchmarking
State-of-the-art MRQA models will be evaluated on the Calor-Dial corpus. Key evaluation metrics will include:
- Accuracy in detecting multi-span answers.
- Impact of question rewriting and answer paraphrasing on performance.
- Effectiveness in addressing specific linguistic phenomena.

4. Complexity Analysis
Metrics will be developed to quantify the complexity of questions and answers, considering factors such as the number of reasoning steps and linguistic features. Model performance will be correlated with these metrics to identify critical bottlenecks. Following recent work done in our research team [1], this study will focus on complexity factors that can affect all models, regardless of their parameter count.

Expected Outcomes

- A detailed characterization of complexity factors in multi-hop conversational QA tasks.
- Benchmarks of MRQA models on the Calor-Dial corpus.
- Insights into the limitations of existing QA systems and recommendations for overcoming them.

Practicalities

The internship will be funded at 500 euros per month for a duration of 6 months. It will take place in Marseille within the TALEP research group at LIS/CNRS on the Luminy campus. The intern will collaborate with other interns, PhD students and researchers from the research group in Marseille, as well as with the Orange Lab research group in Lannion, a partner of LIS in the CALOR project.

How to apply: send an application letter, transcripts and CV to: frederic.bechet@univ-amu.fr
- Application deadline: December 20th, 2024
- Expected start: early spring 2025

References

[1] Elie Antoine, Frederic Bechet, Géraldine Damnati, and Philippe Langlais. A linguistically-motivated evaluation methodology for unraveling model's abilities in reading comprehension tasks. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18376-18392, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[2] Frédéric Béchet, Ludivine Robert, Lina Rojas-Barahona, and Géraldine Damnati. Calor-Dial: a corpus for Conversational Question Answering on French encyclopedic documents. In CIRCLE (Joint Conference of the Information Retrieval Communities in Europe), Samatan, France, July 2022.

[3] Neeladri Bhuiya, Viktor Schlegel, and Stefan Winkler. Seemingly plausible distractors in multi-hop reasoning: Are large language models attentive readers? In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2514-2528, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[4] Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, and Amir Globerson. Hopping too late: Exploring the limitations of large language models on multi-hop queries.
In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14113-14130, Miami, Florida, USA, November 2024. Association for Computational Linguistics.

[5] Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question Answering in Context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174-2184, Brussels, Belgium, October 2018. Association for Computational Linguistics.

[6] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392. Association for Computational Linguistics, 2016.

[7] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A Conversational Question Answering Challenge. Transactions of the Association for Computational Linguistics, 7:249-266, May 2019.

[8] Amrita Saha, Vardaan Pahuja, Mitesh M. Khapra, Karthik Sankaranarayanan, and Sarath Chandar. Complex sequential question answering: Towards learning to converse over linked question answer pairs with a knowledge graph. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[9] Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369-2380, 2018.