We propose two internship positions at IRIT, Toulouse, in the MELODI team. To apply, please send an email with a CV and a few lines explaining your motivation to: chloe.braud@irit.fr and philippe.muller@irit.fr

Question generation based on discourse relations

Question generation is the task of automatically producing varied questions about a document or a set of documents, to help match real users' questions to the information contained in those documents, for instance in Customer Relationship Management or when producing FAQs. Existing approaches usually focus on simple questions whose answer is a single fact. This internship aims at generating more complex questions involving structured answers (e.g. how to carry out a task, or why something is true). To this end, we plan to rely on discourse structure analysis of texts. Discourse structure corresponds to the semantic links between sentences (e.g. "explanation", "contrast") that organize a document. Integrating explicit knowledge about the structure of the input document makes it possible to generate specific and complex questions, based on templates for different types of questions that are naturally linked to certain discourse relations ("purpose", "explanation", "result", "manner" and similar functions).

The goal of this internship is to build a system that generates complex questions from a text excerpt, using the presence of discourse connectives / relations. The first phase of the internship will be a review of the state of the art in the domain, covering existing datasets with a focus on complex questions (cf. the HotpotQA, ELI5, TyDi QA and FQuAD references below) and methods for complex question generation using pretrained language models. The second step will be to develop an approach for question generation based on data annotated with discourse information. Using the Penn Discourse Treebank, we propose to focus on a few relations (e.g. manner, explanation, ...)
and extract the pairs of linked spans. A complementary approach will rely on texts automatically annotated with discourse connectives, using existing models. A pretrained language model suitable for text generation (e.g. BART) can then be fine-tuned to generate questions. The model would be bootstrapped with templates and with pairs of relevant questions and answers coming from existing QA corpora. An evaluation will be carried out on a corpus provided by the SYNAPSE company, which specializes in the development of chatbots for FAQs, in the context of the ANR project Quantum.

Supervision: Chloé Braud (chloe.braud@irit.fr), Philippe Muller (philippe.muller@irit.fr)
Location: IRIT, University of Toulouse, France, within the MELODI team
Duration: 5-6 months
Compensation: 546 €/month
Requirements: Master 2 or equivalent in Computer Science or Mathematics; good programming skills and knowledge of Machine Learning principles and tools. Some knowledge of NLP would be a plus but is not required.

References:

Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 2369-2380). https://aclanthology.org/D18-1259/

Fan, A., Jernite, Y., Perez, E., Grangier, D., Weston, J., & Auli, M. (2019). ELI5: Long Form Question Answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 3558-3567). https://aclanthology.org/P19-1346/

Clark, J. H., Choi, E., Collins, M., Garrette, D., Kwiatkowski, T., Nikolaev, V., & Palomaki, J. (2020). TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages. Transactions of the Association for Computational Linguistics. https://aclanthology.org/2020.tacl-1.30/

Zhang, R., Guo, J., Chen, L., Fan, Y., & Cheng, X. (2021). A Review on Question Generation from Natural Language Text. ACM Trans. Inf. Syst. 40(1), Article 14, 43 pages.
https://doi.org/10.1145/3468889

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., & Zettlemoyer, L. (2020). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://aclanthology.org/2020.acl-main.703/
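As a concrete illustration of the template-based bootstrapping described above, here is a minimal Python sketch that maps a few discourse relations to question templates and instantiates them with one argument span of the relation. The relation inventory, templates, and example spans are hypothetical, not the actual resources of the project; a real system would also need syntactic rewriting (e.g. auxiliary inversion) rather than raw slot filling.

```python
# Hypothetical mapping from a discourse relation to a question template.
# These relation names and templates are illustrative only.
TEMPLATES = {
    "explanation": "Why {arg1}?",
    "purpose": "What is the purpose of {arg1}?",
    "manner": "How {arg1}?",
    "result": "What happens when {arg1}?",
}

def generate_question(relation: str, arg1: str) -> str:
    """Instantiate a question template for a (relation, argument-span) pair."""
    template = TEMPLATES.get(relation.lower())
    if template is None:
        raise ValueError(f"No template for relation: {relation}")
    # Strip final punctuation and lowercase the first word before insertion.
    clause = arg1.strip().rstrip(".").strip()
    clause = clause[0].lower() + clause[1:]
    return template.format(arg1=clause)

# Example on a PDTB-style pair of linked spans:
#   Arg1: "The plant was closed."  Arg2: "Demand had fallen sharply."
print(generate_question("explanation", "The plant was closed."))
# -> Why the plant was closed?
```

Such template outputs are deliberately crude; their role is only to bootstrap training data for a generation model (e.g. a fine-tuned BART), which can then learn to produce fluent questions.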