Master 2 Internship Proposal: Using interpretability methods to explain Vision-Language models for medical applications

Advisors: Emmanuelle Salin, Stephane Ayache, Benoit Favre

October 4, 2021

Context

Recent advances in Deep Learning have led to the growth of interpretable Machine Learning, which seeks to help understand the decisions of a model. Indeed, in various fields such as Medicine, Finance and Security, it is important for models to be trustworthy and reliable. As part of this internship, we want to develop a method to explain the decisions of medical Vision-Language models.

State-of-the-Art

Vision-Language models such as UNITER [3] are built on the transformer architecture to extract representations from texts and images. These representations are then used in multimodal applications such as visual question answering [1] and image captioning [7]. However, due to the complex architecture of these models, explaining them remains a challenge. Applying interpretability methods to them can be a way to make them more reliable. In the context of medical data, we want to be able to explain why a radiology report does or does not fit an X-ray. To this end, we will rely on medical datasets such as MIMIC-CXR [4]. This internship is focused on the development of interpretability methods for Vision-Language models.

Problem Statement

The goal of this internship is to explain transformer-based Vision-Language models such as UNITER. Current explainability methods for transformer-based models mostly rely on attention weights. However, studies show that attention weights by themselves are a limited tool for transformer model interpretability [2], and additional tools are necessary to explain model predictions. We therefore focus on model-agnostic methods, which do not use model internals such as attention weights. We will study how local model-agnostic interpretability methods such as LIME [6] can explain Vision-Language models by attributing the model decision to parts of the input. In particular, the intern will focus on explaining the predictions of the model on the Image-Text Matching task. The goal is to explain why the model predicts that an image-caption pair matches or not, using text tokens and image superpixels (see the sketch at the end of this section). The interpretability method should help:

- Highlight matching textual and visual information such as objects
- Show whether concepts such as color, number, position and size are understood by the model at a multimodal level
- Establish how the image and text contradict each other when they do not match
- Determine the importance of the language and vision modalities in the model prediction
- Study how dataset bias, and in particular textual bias, impacts the model prediction
- Study how the model reacts to perturbations (e.g. textual descriptions that are similar to, yet distinct from, the visual information)
- Show whether simple logical operations (or, and, ...) are understood by the model

To that end, the intern will use a dataset based on CLEVR [5] to evaluate the interpretability method on true and adversarial examples. The work will first be evaluated on a carefully designed synthetic dataset before being tested on real-world data such as chest X-rays and their reports.
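For illustration only, the following is a minimal sketch of the kind of perturbation-based, LIME-style analysis described above. The itm_score function is assumed to wrap the (PyTorch) Vision-Language model and return a matching probability for an image-caption pair; the superpixel segmentation, masking strategy and kernel width are placeholder choices for the example, not a prescribed implementation:

# Minimal LIME-style sketch for explaining an Image-Text Matching (ITM) prediction.
# `itm_score(image, tokens)` is a hypothetical wrapper around a Vision-Language
# model (e.g. UNITER) returning the probability that the pair matches.
import numpy as np
from sklearn.linear_model import Ridge
from skimage.segmentation import slic

def explain_itm(image, caption_tokens, itm_score,
                n_samples=1000, n_segments=30, kernel_width=0.25):
    """Attribute an ITM score to text tokens and image superpixels."""
    segments = slic(image, n_segments=n_segments)        # superpixel map (H x W)
    seg_ids = np.unique(segments)
    n_feats = len(seg_ids) + len(caption_tokens)          # one binary feature per superpixel / token

    # Random binary masks over the interpretable features; keep the unperturbed pair as sample 0
    masks = np.random.randint(0, 2, size=(n_samples, n_feats))
    masks[0, :] = 1

    scores = []
    for m in masks:
        # Grey out the superpixels whose feature is switched off
        img = image.copy()
        for k, sid in enumerate(seg_ids):
            if m[k] == 0:
                img[segments == sid] = image.mean()
        # Replace the tokens whose feature is switched off by a placeholder
        toks = [t if m[len(seg_ids) + j] == 1 else "[MASK]"
                for j, t in enumerate(caption_tokens)]
        scores.append(itm_score(img, toks))

    # Weight perturbed samples by their proximity to the original input
    distances = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # Fit a weighted linear surrogate; its coefficients are the attributions
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, np.array(scores), sample_weight=weights)

    image_attr = list(zip(seg_ids.tolist(), surrogate.coef_[:len(seg_ids)]))
    token_attr = list(zip(caption_tokens, surrogate.coef_[len(seg_ids):]))
    return image_attr, token_attr

The surrogate coefficients could then be visualized as a heatmap over superpixels and as per-token weights, which is one possible starting point for the analyses listed above (matching objects, modality importance, reaction to perturbations).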
Profile

The intern will propose, implement and analyse interpretability methods for Vision-Language models. The work will be implemented using PyTorch.

The candidate is expected to have the following qualities:

- Excellent knowledge of deep learning methods
- Extensive experience implementing PyTorch models
- Strong scientific writing skills
- An appreciation of the challenges of doing research

The internship is a six-month position at LIS/CNRS in Marseille during spring 2022. It will be held in the context of Emmanuelle Salin's thesis on understanding the generality of multimodal representations. Pointers on interpretable Machine Learning are available.

Contact

Please send a CV and a letter of application to benoit.favre@lis-lab.fr, emmanuelle.salin@lis-lab.fr, and stephane.ayache@lis-lab.fr before 05/11/21. Do not hesitate to contact us if you have any questions.

References

[1] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425-2433, 2015.

[2] Gino Brunner, Yang Liu, Damian Pascual, Oliver Richter, Massimiliano Ciaramita, and Roger Wattenhofer. On identifiability in transformers. arXiv preprint arXiv:1908.04211, 2019.

[3] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. UNITER: Learning universal image-text representations. 2019.

[4] Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, Roger G. Mark, and Steven Horng. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Scientific Data, 6(1):1-8, 2019.

[5] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, and Ross Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901-2910, 2017.

[6] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135-1144, 2016.

[7] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):652-663, 2016.