Title: Text-image QA corpus based on educational content

Supervisors: Thomas Gerald, Sahar Ghannay, and Anne Vilnat

Keywords: Natural Language Processing, Question Answering, Visual Question Answering, Corpus Annotation, Deep Learning

Project Description:
Deep learning architectures for embedding text and images have matured considerably, especially in visual question answering. However, while these approaches work well on photographs, few of them perform well on complex images or schematics (such as diagrams or maps) combined with textual information. The objectives of this project are twofold: (a) create a multi-modal corpus containing both images (in particular schematics) and text drawn from educational documents such as school books or encyclopedic resources; (b) provide insightful experiments on embedding text and images for question and answer generation.
The first step will be to build on previous work on question answering over schoolbook content in order to propose alignment methods between existing questions and complex images (maps, schematics, diagrams and others), leveraging deep architectures for text-image alignment (such as the CLIP family of models [2]). As the previously collected annotated data rely on textual information only, we will augment the dataset using multi-modal generative approaches [1]. To this end, we plan to use or develop an automatic labelling framework to annotate content in context. To validate the annotations, human experts will evaluate them to ensure the consistency of the data.
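As an illustration of the text-image alignment step mentioned above, the minimal sketch below scores a question against candidate images with an off-the-shelf CLIP checkpoint via the Hugging Face transformers library. The checkpoint name, file names and example question are placeholders, and this is only one possible way to approach the alignment, not a prescribed method for the project.

```python
# Illustrative sketch only: scoring candidate images against a question with a
# CLIP-family model. Checkpoint, image paths and question are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # any CLIP checkpoint could be used
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

question = "Which river is shown crossing the region on the map?"
images = [Image.open(p) for p in ["map.png", "diagram.png"]]  # placeholder files

inputs = processor(text=[question], images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_text has shape (n_texts, n_images); a higher score suggests a
# stronger question-image alignment.
scores = outputs.logits_per_text.softmax(dim=-1)
best_image = scores.argmax(dim=-1).item()
print(f"Best-aligned image index: {best_image}")
```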
Internship objectives:
The primary objective of the internship is to develop methods for the automatic annotation of complex documents, in particular approaches for generating questions and answers from visual and textual contexts. In parallel, the intern and the laboratory team will define annotation schemes, for instance a typology of questions and answers [3]: some answers may rely on a different part of the document (as in multi-hop question answering [4]), require deduction or reasoning, or rely on different types of media (maps, diagrams or others).
The intern will start from a collection of documents based on Wikipedia content related to educational topics (mainly high-school topics, beginning with history, geography and biology), containing the text of articles and images (open-source images only). A filtering approach will be developed to focus on infographic content, leveraging different information from the resources (visual features, metadata, Wikipedia categories or others). Other sources, such as university courses and content, may be considered depending on progress.
In the last part of the internship, the intern will focus on evaluation metrics and approaches to assess the relevance of the corpus. An evaluation of LLMs adapted to the content could be envisioned if time allows.

Practicalities:
The internship will be funded at 659.76 € per month for a duration of 5 or 6 months (starting in March or April 2025) and will take place at LISN within the SEME and LIPS teams.

Candidate Profile:
We are looking for a highly motivated candidate with the following qualifications:
- Education: Master's degree (M2) in Computer Science/NLP, with a preference for candidates experienced in Natural Language Processing (NLP) or Artificial Intelligence (AI).
- Technical Skills:
  - Proficiency in Python and familiarity with deep learning libraries such as TensorFlow, PyTorch, or Keras.
  - Experience with data analysis and information extraction tools.
- Soft Skills: Strong analytical abilities and the ability to work independently and collaboratively in a research environment.

Contact:
To apply, please send your resume and a cover letter to:
- thomas.gerald@universite-paris-saclay.fr
- sahar.ghannay@universite-paris-saclay.fr
- anne.vilnat@universite-paris-saclay.fr

Bibliography:
- [1] Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual instruction tuning. In NeurIPS.
- [2] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML.
- [3] Gerald, T., Vilnat, A., Ettayeb, S., Tamames, L., and Paroubek, P. (2024). Introducing CQuAE: A new French contextualised question-answering corpus for the education domain. In LREC-COLING, pages 9234-9244. ELRA and ICCL.
- [4] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pages 2369-2380. Association for Computational Linguistics.