Title: Justify your answer! Disentangle internal and contextual knowledge within multi-modal question answering
Internship location: LISN, Paris-Saclay
Start: February/March 2025
Duration: 5 to 6 months
Link to the description: https://www.lisn.upsaclay.fr/job-offers/justify-your-answer/

Supervisors
Thomas Gerald, Université Paris-Saclay, LISN
Sahar Ghannay, Université Paris-Saclay, LISN

Keywords
Natural Language Processing, Question Answering, Visual Question Answering, Corpus Annotation, Deep Learning

Project Description
A fundamental question in generative text approaches is to identify which parts of internal knowledge and of contextual knowledge are used during generation. To illustrate our research question, let us consider the figure below and the question "What are the sources of atmospheric sulfur?". A short answer could be "Atmospheric sulfur comes from soil decomposition, human emissions and, less frequently, volcanic eruptions". To provide this answer, the model should gather specific information from the figure and the text. Nonetheless, some prior knowledge is also necessary, such as linking an arrow to a flow or process described by the text included in the image.

Figure 1: Sulfur dioxide from the atmosphere becomes available to terrestrial and marine ecosystems when it is dissolved in precipitation as weak sulfurous acid or when it falls directly to the Earth as fallout. Weathering of rocks also makes sulfates available to terrestrial ecosystems. Decomposition of living organisms returns sulfates to the ocean, soil, and atmosphere (credit: modification of work by John M. Evans and Howard Perlman, USGS).

This example illustrates our two objectives: identifying which part of the context the model leverages to answer the question, and determining which part of the prior knowledge (from the pre-trained model) is used to answer it. Deep learning generative architectures that leverage both text and images have led to substantial improvements on complex question-answering tasks [Radford et al., 2021]. While such approaches work well on photographs, few remain practical when dealing with complex images or schematics (such as diagrams or maps) in conjunction with textual information. Most importantly, disentangling which knowledge serves the generation remains difficult: does the model use pre-trained knowledge or a specific part of the context?

In the literature, previous studies have begun to explore the topic of explainability along different lines: observing internal signals in LLM activations (e.g. attention weights) [Bibal et al., 2022], enforcing the generation of an explanation or justification alongside the answer [Wei et al., 2022, DeepSeek-AI et al., 2025], or disentangling memory from reasoning [Jin et al., 2025]. While these methods may support the generated content, they are primarily designed to improve performance rather than to provide an explanation.
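To make the first family of approaches concrete, the short sketch below shows how attention weights can be read off a Hugging Face causal language model and aggregated into a rough "attention mass on the context" score. It is only an illustration, not part of the project: the "gpt2" checkpoint, the prompt format, and the averaging over layers and heads are placeholder choices.

# Minimal sketch: how much do answer-predicting positions attend to the context?
# Assumptions: a small Hugging Face causal LM ("gpt2" is a placeholder for the
# language backbone of the VLLM under study) and a simple Context/Question prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

context = "Sulfur dioxide becomes available to ecosystems through precipitation and direct fallout."
question = "What are the sources of atmospheric sulfur?"
prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt")
# Approximate number of prompt tokens that belong to the context segment.
ctx_len = len(tokenizer(f"Context: {context}", add_special_tokens=False)["input_ids"])

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one tensor per layer, each of shape (batch, heads, seq, seq).
# Average over layers and heads, then read how much the last prompt position
# (the one that predicts the first answer token) attends to context positions.
att = torch.stack(out.attentions).mean(dim=(0, 2))[0]   # (seq, seq)
context_mass = att[-1, :ctx_len].sum().item()
print(f"Attention mass on context tokens: {context_mass:.3f}")

As Bibal et al. [2022] discuss, such attention scores are at best a weak proxy for explanation, which is one reason the internship also considers perturbation-based evidence.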
The goal of the internship is thus twofold: to evaluate grounding and explanation methods, estimating how faithfully these approaches depict the internal behaviour of the model; and to propose a new pipeline for discovering which supporting context is used for generation and which parts rely on internal model capacities or knowledge.

We have already gathered a dataset that combines schematics and text passages collected from a schoolbook corpus (on biology and history). First experiments on generating questions and answers from multimodal documents, together with an evaluation based on domain-specific criteria, will serve as a starting point for the subject.

To assess the implication of the context in models (versus internal knowledge), the intern could start with a simple pipeline measuring the impact of the different contexts on generation, for instance (a code sketch of such a loop follows the list):
- Modify/noise specific parts of the context (blurring images, replacing named entities, ...)
- Measure the difference in the generated answer with and without the noisy context
- Identify signals in activations (or attention mechanisms) correlating with the type of noise
- Depending on progress, conduct experiments to control internal/contextual knowledge in generation
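The following is a minimal sketch of such a perturbation loop for the textual part of the context, assuming a generic Hugging Face causal LM. The "gpt2" checkpoint, the naive string-replacement noising rule, and the token-level F1 drift measure are illustrative assumptions, not the method prescribed by the project.

# Minimal sketch of the perturbation pipeline outlined above: noise one part of
# the context, regenerate the answer, and score how much the answer drifts.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; swap in the VLLM/LLM actually under study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def generate_answer(context: str, question: str, max_new_tokens: int = 40) -> str:
    """Greedy answer generation from a context+question prompt."""
    prompt = f"Context: {context}\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True).strip()

def noise_entities(context: str, replacements: dict) -> str:
    """Crude entity-swap noising; a real pipeline would rely on an NER model."""
    for entity, dummy in replacements.items():
        context = context.replace(entity, dummy)
    return context

def token_f1(pred: str, ref: str) -> float:
    """Token-level F1 overlap between two answers (as in extractive QA)."""
    p, r = pred.lower().split(), ref.lower().split()
    common = sum((Counter(p) & Counter(r)).values())
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

context = ("Weathering of rocks and decomposition of living organisms return "
           "sulfates to the soil, the ocean and the atmosphere.")
question = "What returns sulfates to the atmosphere?"

clean_answer = generate_answer(context, question)
noisy_answer = generate_answer(noise_entities(context, {"sulfates": "nitrates"}), question)
drift = 1.0 - token_f1(noisy_answer, clean_answer)
print(f"clean: {clean_answer}\nnoisy: {noisy_answer}\ndrift: {drift:.2f}")

A drift close to 0 suggests the perturbed span mattered little for the answer (pointing to internal knowledge), while a large drift indicates contextual grounding; repeating the loop per noise type (entity swaps, image blurring, passage deletion) and logging the corresponding activations provides the material needed for the third bullet above.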
Internship objectives
The objective of the internship is to develop/propose new explainability methods for the generation of question-answer pairs with VLLM approaches. The first month of the internship will be dedicated to producing a bibliography on explainability in LLM and VLLM architectures. To build explainability modules or methods, a deep understanding of VLLM-like architectures is required. In a second step, the intern will work on a collection of mid-sized documents (~4000 tokens) and test the different explainability approaches on an education QA corpus. In the final step, the student will propose a new pipeline for explainability, focusing on the contextual content involved in generation, and, with the aid of their supervisors, develop a method to efficiently evaluate the quality of the designed methods. If necessary, a human evaluation campaign could be envisioned to strengthen the study. A successful internship could lead to a PhD.

Practicalities
The internship will be funded at 659.76 € per month for a duration of 6 months (starting in February or March 2025) and will take place at the LISN laboratory (CNRS, Paris-Saclay).

Candidate Profile
We are looking for highly motivated candidates with a strong interest in academic research and the will to pursue a PhD. We also require the following qualifications:
Education: Master's degree (M2) in Computer Science, with a preference for candidates experienced in Natural Language Processing (NLP), Computer Vision (CV), or Artificial Intelligence (AI).
Technical Skills:
- Proficiency in Python and familiarity with deep learning libraries such as TensorFlow, PyTorch, or Keras.
- Experience with data analysis and information extraction tools.
Soft Skills: Strong analytical abilities, an interest in accessibility and human-centric AI, and the ability to work independently and collaboratively in a research environment.

To apply, please send your resume and a cover letter to:
thomas.gerald@universite-paris-saclay.fr
sahar.ghannay@universite-paris-saclay.fr

References
[Bibal et al., 2022] Bibal, A., Cardon, R., Alfter, D., Wilkens, R., Wang, X., Francois, T., and Watrin, P. (2022). Is attention explanation? An introduction to the debate. In Muresan, S., Nakov, P., and Villavicencio, A., editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3889-3900, Dublin, Ireland. Association for Computational Linguistics.
[DeepSeek-AI et al., 2025] DeepSeek-AI, Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., Zhang, X., Yu, X., Wu, Y., et al. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.
[Jin et al., 2025] Jin, M., Luo, W., Cheng, S., Wang, X., Hua, W., Tang, R., Wang, W. Y., and Zhang, Y. (2025). Disentangling memory and reasoning ability in large language models.
[Radford et al., 2021] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In ICML, volume 139 of Proceedings of Machine Learning Research, pages 8748-8763. PMLR.
[Wei et al., 2022] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824-24837.
Contact: thomas.gerald@universite-paris-saclay.fr