Postdoc: Enhancing frugality and robustness during the training of an LLM for French medical texts

Supervision:
- Gaël Varoquaux, Soda, Inria (gael.varoquaux@inria.fr)
- Judith Abécassis, Soda, Inria (judith.abecassis@inria.fr)

Keywords: Generative AI, Language models, Data augmentation, Electronic Health Records, Robust AI

Starting date: April 1st, 2025 - Duration: 20 months

Context

Generative AI and large language models have recently opened up new perspectives on using AI in various fields. To realize this potential, the collaborative national project "PARTAGES" addresses a crucial need: the development of an open-source French language model specialized in health data. Such a tool would considerably accelerate and democratize the use of AI for health, with substantial benefits for the organization of the health system, the working conditions of caregivers, and, ultimately, public health. A primary concern when working with health text data is patient privacy, so PARTAGES will proceed in two steps: (i) generating synthetic medical notes, combined with the biomedical scientific literature, to fine-tune an open-source model; (ii) refining the model on actual medical notes within each of the 18 partner care facilities.

Goals

Building on the Soda team's expertise in robust and frugal AI [4, 2, 3, 6, 5] and in handling text in Electronic Health Records (EHRs) [1], we are involved in the creation of a foundation model that is both robust to new vocabulary (new medical concepts appearing in the future, as well as misspellings and abbreviations used in medical texts) and frugal. These two aspects are key to practical deployment: hallucinations or over-confidence could have harmful consequences for individuals' health, and one objective of the project is for the model to run within hospitals' secure computing infrastructure (rather than the cloud, for privacy reasons), where computing resources, in particular GPUs, are limited. Additionally, since this open-source model is a digital common intended to be reused many times, any energy saving will be amplified.
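To make the robustness-to-vocabulary goal concrete: LOVE [4] trains a lightweight character-level encoder to mimic pretrained embeddings, so that misspellings, abbreviations, and newly coined medical terms still receive usable vectors. The sketch below is a deliberately simplified, illustrative stand-in for that idea, imputing an out-of-vocabulary vector from character n-gram overlap with in-vocabulary words; the function names and toy vocabulary are hypothetical and not project code.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of `word`, with boundary markers."""
    w = f"<{word}>"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

def impute_oov_vector(oov_word, vocab_vectors):
    """Approximate an embedding for an unseen word as a weighted average
    of in-vocabulary vectors, weighted by shared character n-grams
    (a crude proxy for the mimicking encoder trained in LOVE [4])."""
    target = char_ngrams(oov_word)
    dim = len(next(iter(vocab_vectors.values())))
    acc, total = np.zeros(dim), 0
    for word, vec in vocab_vectors.items():
        overlap = len(target & char_ngrams(word))
        if overlap:
            acc += overlap * vec
            total += overlap
    return acc / total if total else acc

# Toy demo: a misspelling such as "diabetse" should land closest to "diabetes".
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=8)
         for w in ["diabetes", "diabetic", "insulin", "fracture"]}
vec = impute_oov_vector("diabetse", vocab)
```

In the project itself, this surface-form robustness would of course be built into the model and its tokenizer rather than bolted on afterwards; the sketch only conveys the intuition that shared subword structure can anchor unseen medical vocabulary.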
Environment

This postdoc will take place at Inria Saclay, in the Soda team. Soda conducts computational and statistical research, both fundamental and applied, to harness large databases on health and society. Soda also develops core software tools such as scikit-learn. The work will be done in the context of the "PARTAGES" project, funded by BPI France with a 10-million-euro budget and involving 32 partners.

Requirements

We are seeking a highly motivated candidate with a PhD in a related domain (deep learning, NLP). Factors of success include:
- previous experience with high-performance computing
- good theoretical and practical knowledge of large-model training
- reasonable proficiency in French
- a curious mindset.

Possible collaboration:
- Gaël Dias, CNRS GREYC (gael.dias@unicaen.fr)

References

[1] Judith Abécassis, Théo Jolivet, Audrey Bergès, Elise Liu, Jean-Baptiste Julla, Yawa Abouleka, Julie Alberge, Isabel Bonnetier, Thomas Petit-Jean, Romain Bey, et al. Operational challenges of building a million-patient cohort from EHRs: The cohort of diabetic patients (CODIA) on the AP-HP EDS. In Journée de l'Atelier TIDS (Traitement Informatique des Données de Santé) du GdR MaDICS, 2024.
[2] Lihu Chen, Alexandre Perez-Lebel, Fabian Suchanek, and Gaël Varoquaux. Reconfidencing LLM uncertainty from the grouping loss perspective. In The 2024 Conference on Empirical Methods in Natural Language Processing, 2024.
[3] Lihu Chen and Gaël Varoquaux. What is the role of small models in the LLM era: A survey. arXiv preprint arXiv:2409.06857, 2024.
[4] Lihu Chen, Gaël Varoquaux, and Fabian M. Suchanek. Imputing out-of-vocabulary embeddings with LOVE makes language models robust with little cost. arXiv preprint arXiv:2203.07860, 2022.
[5] Léo Grinsztajn, Edouard Oyallon, Myung Jun Kim, and Gaël Varoquaux. Vectorizing string entries for data processing on tables: when are larger language models better? arXiv preprint arXiv:2312.09634, 2023.
[6] Gaël Varoquaux, Alexandra Sasha Luccioni, and Meredith Whittaker. Hype, sustainability, and the price of the bigger-is-better paradigm in AI. arXiv preprint arXiv:2409.14160, 2024.