Title: Automatic Generation of Alternative Text for Images: Toward Enhanced Visual Accessibility

*Supervisors*: Camille Guinaudeau, Frédéric Dufaux

*Project description*

In a world where visual content is increasingly central, visually impaired individuals face significant barriers in accessing the information contained in images. Alternative text, designed to provide a textual description of visual elements, is essential for these users: it replaces the image, whereas a caption merely accompanies it. However, the descriptions currently available online, often embedded as alt-text tags, are frequently incomplete or imprecise and often indistinguishable from standard, uninformative captions. This shortcoming significantly limits visual accessibility and the digital autonomy of blind and visually impaired individuals. Automatically generating relevant and informative alternative text has thus become an urgent necessity to enhance digital accessibility and ensure an inclusive experience for all users, regardless of their visual abilities. Moreover, generating alternative text amounts to producing a global semantic description of an image, which is also valuable beyond accessibility, for instance for detecting semantic alterations in multimedia content or for question answering on multimodal datasets.

The generation of alternative text is a challenging task that remains insufficiently explored, apart from a few studies on datasets scraped from the web (Twitter, Wikipedia, etc.) or from videos (Kreiss, 2022; Srivatsan, 2024; Han, 2023). This area deserves particular attention from the multimodal community. A first step in this direction was the construction of an initial dataset, AD2AT (Audio Description to Alternative Text), built from expert annotations. This dataset, specifically designed for alternative text generation, enabled preliminary experiments with some of the most advanced vision-language models, such as LLaVA (Liu, 2024) and InstructBLIP (Dai, 2024). However, these initial experiments revealed the limitations of these models, which often produce overly detailed descriptions or include unwarranted assumptions (Lincker, 2025).

Previous research has highlighted the critical importance of considering the context in which an image appears in order to generate high-quality alternative text. A primary research direction will therefore explore how to integrate this context into the generation process: which information contained in the image is already known to the user, and which elements need to be described to fill informational gaps? To deepen our understanding of alternative text, the intern will build on the work of Muehlbradt and Kane (2022), who examined the strategies users employ to produce alternative text, as well as on our own dataset, to analyze the image elements actually included in descriptions. In addition, using our annotated dataset, the intern can leverage tools such as the InfoMetIC metric (Hu, 2023), developed to evaluate image captions, to identify the parts of the image that appear in the annotated alternative texts and better pinpoint the key visual information to include in alternative text. In parallel, we will consider image saliency detection methods (Ullah, 2020) to identify important visual elements that are not explicitly described within the context of the image.
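For illustration, the sketch below shows how a context-aware baseline of this kind could be set up with an off-the-shelf vision-language model: the image and its surrounding text are passed together, and the model is asked for a short description of what is visible but not already stated. The checkpoint name, prompt wording, and generation settings are illustrative assumptions, not the configuration used in the preliminary AD2AT experiments.

# Minimal sketch (assumed setup): context-conditioned alt-text generation
# with an off-the-shelf LLaVA checkpoint from the Hugging Face hub.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint choice
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

def generate_alt_text(image_path: str, page_context: str) -> str:
    """Ask the model for a short alt text, conditioned on the surrounding context."""
    image = Image.open(image_path).convert("RGB")
    # LLaVA-1.5 chat format: the <image> token marks where the visual features go.
    prompt = (
        "USER: <image>\n"
        f'The image appears alongside this text: "{page_context}"\n'
        "Write a concise alternative text (one or two sentences) for a blind user, "
        "describing only what is visible and not already stated in the surrounding text. "
        "ASSISTANT:"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=80, do_sample=False)
    decoded = processor.decode(output[0], skip_special_tokens=True)
    return decoded.split("ASSISTANT:")[-1].strip()

# Hypothetical usage (file name and context are made up):
# print(generate_alt_text("frame.jpg", "The detective enters a dimly lit warehouse."))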
*Practicalities*

The internship will be funded at 659.76 € per month, plus reimbursement of transport costs (75% of a monthly or annual Navigo pass), for a duration of 5 or 6 months (starting in March or April 2025), and will take place at LISN within the LIPS team.

Candidate Profile: We are seeking highly motivated candidates with the following qualifications:
- Education: Master's degree (M2) in Computer Science, with a preference for candidates experienced in Natural Language Processing (NLP), Computer Vision (CV), or Artificial Intelligence (AI).
- Technical Skills:
  - Proficiency in Python and familiarity with deep learning libraries such as TensorFlow, PyTorch, or Keras.
  - Experience in data analysis and handling multimodal datasets is a plus.
- Soft Skills: Strong analytical abilities, an interest in accessibility and human-centric AI, and the ability to work both independently and collaboratively in a research environment.

To apply, please send your CV, a cover letter, and your M1 and M2 transcripts (if available) by email to Camille Guinaudeau (camille.guinaudeau@universite-paris-saclay.fr) and Frédéric Dufaux (frederic.dufaux@centralesupelec.fr).