Title: Automatic Generation of Alternative Text for Images: Toward Enhanced Visual Accessibility

*Supervisors*: Camille Guinaudeau, Frédéric Dufaux

*Project description*

In a world where visual content is increasingly central, visually impaired individuals face significant barriers in accessing the information contained in images. Alternative text, designed to provide a textual description of visual elements, is essential for these users: it replaces the image, whereas a caption merely accompanies it. However, the descriptions currently available online, often embedded as alt-text tags, are frequently incomplete or imprecise and often indistinguishable from standard, uninformative captions. This shortcoming significantly limits visual accessibility and the digital autonomy of blind and visually impaired individuals. Automatically generating relevant and informative alternative text has thus become an urgent necessity to enhance digital accessibility and ensure an inclusive experience for all users, regardless of their visual abilities. Moreover, generating alternative text amounts to producing a global semantic description of an image, which is also valuable beyond accessibility, for instance for detecting semantic alterations in multimedia content or for question answering on multimodal datasets.

The generation of alternative text is a challenging task that remains insufficiently explored, apart from a few studies on datasets scraped from the web (Twitter, Wikipedia, etc.) or from videos (Kreiss, 2022; Srivatsan, 2024; Han, 2023). This area deserves particular attention from the multimodal community. A first step in this direction was the construction of an initial dataset, AD2AT (Audio Description to Alternative Text), built from expert annotations. This dataset, specifically designed for alternative text generation, enabled preliminary experiments with some of the most advanced vision-language models, such as LLaVA (Liu, 2024) and InstructBLIP (Dai, 2024). However, these initial experiments revealed the limitations of these models, which often produce overly detailed descriptions or include unwarranted assumptions (Lincker, 2025).

Previous research has highlighted the critical importance of considering the context in which an image appears in order to generate high-quality alternative text. A primary research direction will therefore explore how to integrate this context into the generation process: which information contained in the image is already known to the user, and which elements need to be described to fill informational gaps? To deepen our understanding of alternative text, the intern will build on the work of Muehlbradt and Kane (2022), who examined the strategies users employ to produce alternative text, as well as on our own dataset, to analyze the image elements actually included in descriptions. In addition, using our annotated dataset, the intern can leverage tools such as the InfoMetIC metric (Hu, 2023), developed to evaluate image captions, to identify the parts of the image that appear in the annotated alternative texts and better pinpoint the key visual information to include in alternative text. In parallel, we will consider image saliency detection methods (Ullah, 2020) to identify important visual elements that are not explicitly described within the context of the image.
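For illustration, the sketch below shows how a context-aware baseline of this kind could be set up with an off-the-shelf vision-language model: the image and its surrounding text are passed together, and the model is asked for a short description of what is visible but not already stated. The checkpoint name, prompt wording, and generation settings are illustrative assumptions, not the configuration used in the preliminary AD2AT experiments.

# Minimal sketch (assumed setup): context-conditioned alt-text generation
# with an off-the-shelf LLaVA checkpoint from the Hugging Face hub.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint choice
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def generate_alt_text(image_path: str, page_context: str) -> str:
    """Ask the model for a short alt text, conditioned on the surrounding context."""
    image = Image.open(image_path).convert("RGB")
    # LLaVA-1.5 chat format: the <image> token marks where the visual features go.
    prompt = (
        "USER: <image>\n"
        f'The image appears alongside this text: "{page_context}"\n'
        "Write a concise alternative text (one or two sentences) for a blind user, "
        "describing only what is visible and not already stated in the surrounding text. "
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    decoded = processor.decode(output[0], skip_special_tokens=True)
    return decoded.split("ASSISTANT:")[-1].strip()

# Hypothetical usage (file name and context are made up):
# print(generate_alt_text("frame.jpg", "The detective enters a dimly lit warehouse."))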
*Practicalities*

The internship will be funded at 659.76 € per month, plus reimbursement of transport costs (75% of a monthly or annual Navigo pass), for a duration of 5 or 6 months (starting in March or April 2025), and will take place at LISN within the LIPS team.

Candidate Profile: We are seeking highly motivated candidates with the following qualifications:
- Education: Master's degree (M2) in Computer Science, with a preference for candidates experienced in Natural Language Processing (NLP), Computer Vision (CV), or Artificial Intelligence (AI).
- Technical Skills:
  - Proficiency in Python and familiarity with deep learning libraries such as TensorFlow, PyTorch, or Keras.
  - Experience in data analysis and handling multimodal datasets is a plus.
- Soft Skills: Strong analytical abilities, an interest in accessibility and human-centric AI, and the ability to work both independently and collaboratively in a research environment.

To apply, please send your CV, a cover letter, and your M1 and M2 transcripts (if available) by email to Camille Guinaudeau (camille.guinaudeau@universite-paris-saclay.fr) and Frédéric Dufaux (frederic.dufaux@centralesupelec.fr).