Postdoc: Enhancing frugality and robustness during the training of an LLM for French medical texts

Supervision:
- Gaël Varoquaux, Soda, Inria (gael.varoquaux@inria.fr)
- Judith Abécassis, Soda, Inria (judith.abecassis@inria.fr)

Keywords: Generative AI, Language models, Data augmentation, Electronic Health Records, Robust AI

Starting date: April 1st, 2025 - Duration: 20 months

Context

Generative AI and large language models have recently opened up new perspectives on using AI in various fields. To realize this potential, the collaborative national project "PARTAGES" addresses a crucial need: the development of an open-source French language model specialized in health data. Such a tool would considerably accelerate and democratize the use of AI for health, with substantial benefits for the organization of the health system, the working conditions of caregivers, and, ultimately, public health. A primary concern when working with health text data is patient privacy, so PARTAGES will proceed in two steps: (i) generating synthetic medical notes, combined with the biomedical scientific literature, to fine-tune an open-source model; (ii) refining the model on actual medical notes within each of the 18 partner care facilities.

Goals

Building on the Soda team's expertise in robust and frugal AI [4, 2, 3, 6, 5] and in handling text in Electronic Health Records (EHRs) [1], we are involved in the creation of a foundation model that is both robust to new vocabulary (new medical concepts appearing in the future, as well as misspellings and abbreviations used in medical texts) and frugal. These two aspects are key to practical deployment: hallucinations or over-confidence could have harmful consequences for individuals' health, and one objective of the project is for the model to run within hospitals' secure computing infrastructure (rather than the cloud, for privacy reasons), where computing resources, in particular GPUs, are limited. Additionally, since this open-source model is a digital common intended to be reused many times, any energy saving will be amplified.
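To make the robustness-to-vocabulary goal concrete: LOVE [4] trains a lightweight character-level encoder to mimic pretrained embeddings, so that misspellings, abbreviations, and newly coined medical terms still receive usable vectors. The sketch below is a deliberately simplified, illustrative stand-in for that idea, imputing an out-of-vocabulary vector from character n-gram overlap with in-vocabulary words; the function names and toy vocabulary are hypothetical and not project code.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of `word`, with boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

def impute_oov_vector(oov_word, vocab_vectors):
    """Approximate an embedding for an unseen word as a weighted average
    of in-vocabulary vectors, weighted by shared character n-grams
    (a crude proxy for the mimicking encoder trained in LOVE [4])."""
    target = char_ngrams(oov_word)
    dim = len(next(iter(vocab_vectors.values())))
    acc, total = np.zeros(dim), 0
    for word, vec in vocab_vectors.items():
        overlap = len(target & char_ngrams(word))
        if overlap:
            acc += overlap * vec
            total += overlap
    return acc / total if total else acc

# Toy demo: a misspelling such as "diabetse" should land closest to "diabetes".
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8)
         for w in ["diabetes", "diabetic", "insulin", "fracture"]}
vec = impute_oov_vector("diabetse", vocab)
```

In the project itself, this surface-form robustness would of course be built into the model and its tokenizer rather than bolted on afterwards; the sketch only conveys the intuition that shared subword structure can anchor unseen medical vocabulary.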
Environment

This postdoc will take place at Inria Saclay, in the Soda team. Soda conducts computational and statistical research, both fundamental and applied, to harness large databases on health and society. Soda also develops core software tools such as scikit-learn. The work will be done in the context of the "PARTAGES" project, funded by BPI France with a 10-million-euro budget and involving 32 partners.

Requirements

We are seeking a highly motivated candidate with a PhD in a related domain (deep learning, NLP). Factors of success include:
- previous experience with high-performance computing
- good theoretical and practical knowledge of large-model training
- reasonable proficiency in French
- a curious mindset.

Possible collaboration:
- Gaël Dias, CNRS GREYC (gael.dias@unicaen.fr)

References

[1] Judith Abécassis, Théo Jolivet, Audrey Bergès, Elise Liu, Jean-Baptiste Julla, Yawa Abouleka, Julie Alberge, Isabel Bonnetier, Thomas Petit-Jean, Romain Bey, et al. Operational challenges of building a million-patient cohort from EHRs: The cohort of diabetic patients (CODIA) on the AP-HP EDS. In Journée de l'Atelier TIDS (Traitement Informatique des Données de Santé) du GdR MaDICS, 2024.
[2] Lihu Chen, Alexandre Perez-Lebel, Fabian Suchanek, and Gaël Varoquaux. Reconfidencing LLM uncertainty from the grouping loss perspective. In The 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
[3] Lihu Chen and Gaël Varoquaux. What is the role of small models in the LLM era: A survey. arXiv preprint arXiv:2409.06857, 2024.
[4] Lihu Chen, Gaël Varoquaux, and Fabian M. Suchanek. Imputing out-of-vocabulary embeddings with LOVE makes language models robust with little cost. arXiv preprint arXiv:2203.07860, 2022.
[5] Léo Grinsztajn, Edouard Oyallon, Myung Jun Kim, and Gaël Varoquaux. Vectorizing string entries for data processing on tables: when are larger language models better? arXiv preprint arXiv:2312.09634, 2023.
[6] Gaël Varoquaux, Alexandra Sasha Luccioni, and Meredith Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. arXiv preprint arXiv:2409.14160, 2024.