Title: Text-image QA corpus based on educational content

Supervisors: Thomas Gerald, Sahar Ghannay, and Anne Vilnat

Keywords: Natural Language Processing, Question Answering, Visual Question Answering, Corpus Annotation, Deep Learning

Project Description:
Deep learning architectures for embedding text and images have matured considerably, especially in visual question answering. However, while these approaches work well on photographs, few of them perform well on complex images or schematics (such as diagrams or maps) combined with textual information. The objectives of this project are twofold: (a) create a multi-modal corpus containing both images (in particular schematics) and text drawn from educational documents such as school books or encyclopedic resources; (b) provide insightful experiments on embedding text and images for question and answer generation.
The first step will be to build on previous work on question answering over schoolbook content in order to propose alignment methods between existing questions and complex images (maps, schematics, diagrams and others), leveraging deep architectures for text-image alignment (such as the CLIP family of models [2]). As the previously collected annotated data rely on textual information only, we will augment the dataset using multi-modal generative approaches [1]. To this end, we plan to use or develop an automatic labelling framework to annotate content in context. To validate the annotations, human experts will evaluate them to ensure the consistency of the data.
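As an illustration of the text-image alignment step mentioned above, the minimal sketch below scores a question against candidate images with an off-the-shelf CLIP checkpoint via the Hugging Face transformers library. The checkpoint name, file names and example question are placeholders, and this is only one possible way to approach the alignment, not a prescribed method for the project.

```python
# Illustrative sketch only: scoring candidate images against a question with a
# CLIP-family model. Checkpoint, image paths and question are placeholders.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # any CLIP checkpoint could be used
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

question = "Which river is shown crossing the region on the map?"
images = [Image.open(p) for p in ["map.png", "diagram.png"]]  # placeholder files

inputs = processor(text=[question], images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_text has shape (n_texts, n_images); a higher score suggests a
# stronger question-image alignment.
scores = outputs.logits_per_text.softmax(dim=-1)
best_image = scores.argmax(dim=-1).item()
print(f"Best-aligned image index: {best_image}")
```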
Internship objectives:
The primary objective of the internship is to develop methods for the automatic annotation of complex documents, in particular approaches for generating questions and answers from visual and textual contexts. In parallel, the intern and the laboratory team will define annotation schemes, for instance a typology of questions and answers [3]: some answers may rely on a different part of the document (as in multi-hop question answering [4]), require deduction or reasoning, or rely on different types of media (maps, diagrams or others).
The intern will start from a collection of documents based on Wikipedia content related to educational topics (mainly high-school topics, beginning with history, geography and biology), containing the text of articles and images (open-source images only). A filtering approach will be developed to focus on infographic content, leveraging different information from the resources (visual features, metadata, Wikipedia categories or others). Other sources, such as university courses and content, may be considered depending on progress.
In the last part of the internship, the intern will focus on evaluation metrics and approaches to assess the relevance of the corpus. An evaluation of LLMs adapted to the content could be envisioned if time allows.

Practicalities:
The internship will be funded at 659.76 € per month for a duration of 5 or 6 months (starting in March or April 2025) and will take place at LISN within the SEME and LIPS teams.

Candidate Profile:
We are looking for a highly motivated candidate with the following qualifications:
- Education: Master's degree (M2) in Computer Science/NLP, with a preference for candidates experienced in Natural Language Processing (NLP) or Artificial Intelligence (AI).
- Technical Skills:
  - Proficiency in Python and familiarity with deep learning libraries such as TensorFlow, PyTorch, or Keras.
  - Experience with data analysis and information extraction tools.
- Soft Skills: Strong analytical abilities and the ability to work independently and collaboratively in a research environment.

Contact:
To apply, please send your resume and a cover letter to:
- thomas.gerald@universite-paris-saclay.fr
- sahar.ghannay@universite-paris-saclay.fr
- anne.vilnat@universite-paris-saclay.fr

Bibliography:
- [1] Liu, H., Li, C., Wu, Q., and Lee, Y. J. (2023). Visual instruction tuning. In NeurIPS.
- [2] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML.
- [3] Gerald, T., Vilnat, A., Ettayeb, S., Tamames, L., and Paroubek, P. (2024). Introducing CQuAE: A new French contextualised question-answering corpus for the education domain. In LREC-COLING, pages 9234-9244. ELRA and ICCL.
- [4] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W. W., Salakhutdinov, R., and Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP, pages 2369-2380. Association for Computational Linguistics.