Internship: multimodal conversation script generation

Context

Face-to-face conversation remains the most natural means of communication between humans. It conveys much richer messages than typical text representations such as meeting minutes, thanks to the diversity and range of signals that can be carried through multimodal channels: emotions, visual grounding, etc. Handling such features remains elusive for machines because of fundamental limitations in how multimodal understanding is implemented. Yet, recent developments in AI show promise in handling audio-video captures of participants and reasoning about the content of such inputs.

To address this gap in the multimodal understanding of conversations, the MINERAL ANR project aims at generating enriched representations of multiparty conversations in the form of a conversation script, similar to a movie or play script. These representations will include transcripts of the uttered speech annotated with addressee, goal and communicative act, as well as textual descriptions of the activities and stances of each speaker and of their interactions with real-world objects. An imaginative goal is that actors should be able to replay the conversation from the script, as they do with movies. Latent representations uncovered while performing this task are expected to enhance the understanding capabilities of AI models and enable novel applications, such as the generation of audio-visual summaries of face-to-face meetings.

This internship is framed within the MINERAL ANR project. It aims at building an evaluation framework for assessing the quality of generated scripts based on existing movie and episode script datasets, and at constructing baselines for script generation with appropriate specialized models for underlying tasks, such as scene description or transcript generation, plugged into large language models.

Objectives

The goal of this internship is twofold:
1) Propose an evaluation methodology for assessing the quality of a generated script
2) Build and assess baselines leveraging disjoint building blocks such as speech transcription and automated description of video scenes

Work resulting from the internship will be published in appropriate conferences and journals.

Work plan

The intern will first review methods for generating and evaluating textual representations from videos in the subfield of conversation analysis and summarization. The goal is to gain a good understanding of current research problems and potential solutions.

The next step consists in preparing existing datasets for the script generation task. Targeted datasets include minutes from debates at the Assemblée nationale (including transcripts, reactions and stance, derived from audio and video), as well as the Bazinga TV series corpus, which includes original scripts from episodes of The Big Bang Theory and manually structured transcripts.

Finally, the intern will implement baseline systems built from existing feature extraction models, such as OpenFace for face dynamics (expression, gaze...), Whisper for transcripts, and Pyannote for speaker diarization, whose outputs will be fed in textual form to finetuned LLMs; a sketch of such a pipeline is given below. If time permits, the intern will look into audio and video tokenizers, such as SpeechTokenizer and Cosmos-Tokenizer, in order to adapt the LLMs to raw features.
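As an illustration of the kind of baseline described above, the following minimal sketch combines Pyannote speaker diarization with Whisper transcription into a speaker-attributed transcript that could then be passed to an LLM for enrichment. The audio file name, model sizes and the overlap heuristic are illustrative assumptions, not choices prescribed by the project.

```python
# Minimal sketch: speaker-attributed transcript from diarization + ASR.
# Assumes pyannote.audio (requires a Hugging Face access token) and
# openai-whisper are installed; "meeting.wav" is a placeholder file.
import whisper
from pyannote.audio import Pipeline

AUDIO = "meeting.wav"

# 1) Who speaks when (speaker diarization).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN")
diarization = diarizer(AUDIO)
turns = [(turn.start, turn.end, speaker)
         for turn, _, speaker in diarization.itertracks(yield_label=True)]

# 2) What is said (speech transcription with segment timestamps).
asr = whisper.load_model("small")
segments = asr.transcribe(AUDIO)["segments"]

# 3) Attribute each ASR segment to the speaker with maximal time overlap.
def overlap(a_start, a_end, b_start, b_end):
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

script_lines = []
for seg in segments:
    speaker = max(turns,
                  key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
                  default=(0.0, 0.0, "UNKNOWN"))[2]
    script_lines.append(f"{speaker}: {seg['text'].strip()}")

# The result is a very rough conversation script; a finetuned LLM could
# enrich it with addressee, communicative act, stance, scene descriptions, etc.
print("\n".join(script_lines))
```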
Practicalities

The internship is funded at ~500 euros per month for a duration of 6 months. It will take place in Marseille within the TALEP research group at LIS/CNRS, on the Luminy campus. The intern will collaborate with other interns from the ANR project (at LISN and Orange Labs), as well as with PhD students and researchers from the research group. PhD funding on the same topic is also potentially available within the project.

How to apply: send an application letter, transcripts and a CV to benoit.favre@univ-amu.fr
- Application deadline: December 15th, 2024
- Expected start: early spring 2025

References

Soldan, M., Pardo, A., Alcázar, J. L., Caba, F., Zhao, C., Giancola, S., & Ghanem, B. (2022). MAD: A scalable dataset for language grounding in videos from movie audio descriptions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5026-5035). https://github.com/Soldelli/MAD

Banchs, R. E. (2012). Movie-DiC: A movie dialogue corpus for research and development. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 203-207).

Lerner, P., Bergoënd, J., Guinaudeau, C., Bredin, H., Maurice, B., Lefevre, S., ... & Barras, C. (2022). Bazinga! A dataset for multi-party dialogues structuring. In 13th Conference on Language Resources and Evaluation (LREC 2022) (pp. 3434-3441).

Baltrušaitis, T., Robinson, P., & Morency, L. P. (2016). OpenFace: An open source facial behavior analysis toolkit. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 1-10). IEEE.

Bredin, H., Yin, R., Coria, J. M., Gelly, G., Korshunov, P., Lavechin, M., ... & Gill, M. P. (2020). pyannote.audio: Neural building blocks for speaker diarization. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7124-7128). IEEE.

Huang, B., Wang, X., Chen, H., Song, Z., & Zhu, W. (2024). VTimeLLM: Empower LLM to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14271-14280).

Wu, S., Fei, H., Qu, L., Ji, W., & Chua, T. S. (2023). NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519.

Ryoo, M., Piergiovanni, A. J., Arnab, A., Dehghani, M., & Angelova, A. (2021). TokenLearner: Adaptive space-time tokenization for videos. Advances in Neural Information Processing Systems, 34, 12786-12797.

Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-accurate speech transcription of long-form audio. arXiv preprint arXiv:2303.00747.