Postdoc: Generating synthetic text data for the training of an open-source language model in French specialized in health data

Supervision:
- Gaël Dias, CNRS GREYC (gael.dias@unicaen.fr)

Keywords: Generative AI, Language models, Data augmentation, Clinical notes, Patient privacy

Starting date: April 1st 2025 - Duration: 20 months.

Context:
Generative AI and large language models have recently opened up new perspectives for the use of AI in many fields. To realize this potential, the collaborative national project PARTAGES addresses a crucial need: the development of an open-source language model in French, specialized in health data. Such a tool will considerably accelerate and democratize the use of AI for health, with major benefits for the organization of the health system, the working conditions of caregivers, and, ultimately, public health by targeting 6P medicine. A primary concern when working with health text data is patient privacy, so PARTAGES will proceed in two steps: (i) generating synthetic clinical notes and combining them with the biomedical scientific literature to fine-tune an open-source model, (ii) refining the model on actual clinical notes within each of the 18 partner care facilities.

Goals:
Building on the GREYC expertise in AI for health [1, 3, 4, 5] and in handling text collections in the medical domain [2], this research aims to develop a model for generating synthetic clinical reports (CRs) based on human-written artificial clinical reports and progressive clinical case files from the UNESS database. The focus is on leveraging parameter-efficient fine-tuning (PEFT) techniques and reinforcement learning from human feedback (RLHF) strategies for LLMs to ensure that the generated reports meet critical criteria, including diversity, coherence, structure, and accuracy. Another objective is to design models capable of generating annotated synthetic clinical reports.
By employing multitask architectures and PEFT approaches, the model will take annotated CRs as input and produce annotated synthetic CRs tailored to support a variety of tasks, defined in collaboration with other use cases of the PARTAGES project. This research will advance AI-driven data generation methods, with significant implications for clinical and medical applications.

Environment:
This position will be based at CNRS GREYC, located in the historic city of Caen, Normandy (France). The research team has strong experience in natural language processing, text generation and representation learning applied to mental health. The work will be carried out in the context of the PARTAGES project, which has 10 million euros of funding from BPI France and involves 32 partners. The position may involve collaborating with four research laboratories of the project, namely SODA, LISN, LIMICS and LIG.

Requirements:
We are seeking a highly motivated candidate with a PhD in a related domain (deep learning, NLP, generative models). Success factors include:
- Good theoretical and practical knowledge of large language model training
- Previous experience working with high-performance computing
- Reasonable proficiency in French
- Curious mindset

Possible collaboration:
- Gaël Varoquaux, Soda, Inria (gael.varoquaux@inria.fr)
- Judith Abécassis, Soda, Inria (judith.abecassis@inria.fr)

References
[1] Milintsevich K., Sirts K., and Dias G. Towards automatic text-based estimation of depression through symptom prediction. Brain Informatics, 2022.
[2] Milintsevich K., Sirts K., and Dias G. Your model is not predicting depression well and that is why: A case study of PRIMATE dataset. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPSYCH) associated with the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2024.
[3] Agarwal N., Dias G., and Dollfus S. Analysing relevance of discourse structure for improved mental health estimation. In Proceedings of the 9th Workshop on Computational Linguistics and Clinical Psychology (CLPSYCH) associated with the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2024.
[4] Agarwal N., Dias G., and Dollfus S. Multi-view graph-based interview representation to improve depression level estimation. Brain Informatics, 2024. ISSN: 2198-4018.
[5] Agarwal N., Milintsevich K., Métivier L., Rotharmel M., Dias G., and Dollfus S. Analyzing symptom-based depression level estimation through the prism of psychiatric expertise. In Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (COLING-LREC), 2024.