Proposal for a master internship 2024

Title: Study of the combination of large language models and knowledge structures for guiding the French translation of SNOMED CT

Supervision

Co-supervisor: Maël Le Gall
Direction Expertise, Innovation et International, Agence du Numérique en Santé (ANS)
mael.le-gall@esante.gouv.fr

Co-supervisor: Elisabeth Serrot-Damatte
Direction Expertise, Innovation et International, ANS
elisabeth.serrot-damatte@esante.gouv.fr

Co-supervisor: Adrien Coulet
HeKA research team (Inria, Inserm, Université Paris Cité)
adrien.coulet@inria.fr

Context

The internship will be co-supervised by the ANS and the HeKA team (Inria, Inserm, Univ. Paris Cité). It will be located at PariSanté Campus, 2-10 rue d'Oradour-sur-Glane, 75015 Paris, where both groups have their offices. If successful, this internship will lead to the funding of a 3-year PhD project.

This internship takes place in the context of France recently joining the organization SNOMED International [1, 2]. This membership allows French health stakeholders to use and implement SNOMED CT (e.g., for interoperability between healthcare applications or for data extraction from Electronic Health Records). However, it also creates the need for an accurate French translation of SNOMED CT terms by French-speaking clinical experts. Such a translation could be guided by Natural Language Processing (NLP) and Knowledge Engineering (KE) approaches.

The ANS has been mandated by the French Ministry of Health and Safety to become the French National Release Center (NRC) for SNOMED CT. Among other responsibilities, the NRC is in charge of creating the national edition, which contains the translation of SNOMED CT. In this context, the ANS is working within the French Translation Collaboration Group [3], together with Belgium, Canada, Luxembourg and Switzerland, to create a Common French translation. Each country uses this translation as the basis for its own national edition.
Motivations

The deployment of SNOMED CT in France will be driven by the availability of a high-quality French translation. However, because the Common French translation is produced jointly by several French-speaking countries, it is not always relevant for France-based stakeholders. The translation is produced by a human-driven process, conducted by French-speaking medical experts. It covers the translation of (i) a preferred term, (ii) a set of synonyms and, optionally, (iii) a textual definition in natural language.

This manual process ensures the quality of the SNOMED CT translations. It is, however, time-consuming, and we believe that it could be accelerated with the help of NLP and KE tools. In particular, the generative approaches used by Large Language Models (LLMs) suggest that high-quality candidate translations could be proposed to the experts. In addition, we hypothesize that translations could be improved by two complementary resources: on the one hand, French clinical texts, which illustrate how terms are used in practice; on the other hand, the available structure of SNOMED CT (hierarchical and non-hierarchical relationships).

Objectives

The aim of the internship is to develop one (or several) LLM-based translation approaches that suggest:
(i) a principal and extensive translation of preferred terms,
(ii) a set of valid synonyms,
(iii) a full-text, unambiguous description of the term.

Note that (i), (ii) and (iii) follow strict editorial rules. Formalizing these rules and exploring how they can be provided to the LLM through prompting is part of the objectives of the internship. The first two LLMs to consider are GPT-4 and Mixtral [4-7], but this list remains open and may change during the internship.

We would like to evaluate the quality of the suggestions empirically, both to compare our translations with those provided by an existing solution and to improve the proposed methods iteratively.
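
As one possible starting point for such a comparison, the sketch below scores a suggested French term against one or more expert-validated reference terms using a character trigram overlap (F1). This is only a minimal baseline under stated assumptions: the function names (normalize, ngram_f1, best_reference_score), the choice of trigrams and the example terms are illustrative and hypothetical. They are not taken from SNOMED CT release data, and they do not represent the evaluation strategy of the internship, which remains to be defined.

```python
import unicodedata
from collections import Counter


def normalize(term: str) -> str:
    """Unicode-normalize (NFC), lowercase and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", term).lower().split())


def char_ngrams(text: str, n: int) -> Counter:
    """Multiset of the character n-grams of `text` (empty if text is shorter than n)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_f1(candidate: str, reference: str, n: int = 3) -> float:
    """Character n-gram F1 between a suggested term and a reference term."""
    cand_norm, ref_norm = normalize(candidate), normalize(reference)
    cand, ref = char_ngrams(cand_norm, n), char_ngrams(ref_norm, n)
    if not cand or not ref:
        # Strings too short to produce n-grams: fall back to exact match.
        return float(cand_norm == ref_norm)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_reference_score(candidate: str, references: list[str], n: int = 3) -> float:
    """Score against the closest reference (e.g., preferred term or any synonym)."""
    return max(ngram_f1(candidate, ref, n) for ref in references)


if __name__ == "__main__":
    # Illustrative terms only (assumed expert references for one concept).
    references = ["infarctus du myocarde", "crise cardiaque"]
    for suggestion in ["infarctus myocardique", "fracture du fémur"]:
        print(suggestion, round(best_reference_score(suggestion, references), 3))
```

A surface-overlap score of this kind only measures closeness to existing reference terms. It cannot judge clinical correctness or compliance with the editorial rules, which is why expert review and more elaborate strategies remain necessary.
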
Evaluation is therefore a principal objective of the internship: the intern will propose and motivate strategies to compare translations objectively (objective 1) [8, 9].

From a methodological point of view, we would like to explore several directions. First, we would like to leverage French clinical texts to improve the suggested translations, using corpora such as [10-12] (objective 2). The following work may provide initial research directions: [13]. Second, we would like to leverage the structure of the SNOMED CT ontology, taking advantage of the availability of descriptions, synonyms and, potentially, translations of sibling terms and, to a lesser extent, of semantically close terms (objective 3). For this direction, we will refer to the following works: [14, 15].

This subject is deliberately ambitious, as it is intended to lead to a PhD contract. The priorities of the master internship are to focus first on the translation of (i), then of (ii), with objectives 1 and 2.

Working plan

Current working plan and deliverables:

March-April 2024 (please shift the agenda if the internship does not start in March):
- Bibliographic work on the evaluation of translations.
- Hands-on familiarization with SNOMED CT, the manual translation process, and existing tools for translation support.
- Hands-on familiarization with LLMs and prompting for translations (i) and (ii).

May 2024:
- Proposition of a first strategy to evaluate/compare translations.
- Development of first translation pipelines.
- Bibliographic work on the improvement of translation using French clinical text.

June 2024:
- Adaptation of the training with French clinical text.
- Experiments: comparison of performance.
- Bibliographic work on how LLMs can take knowledge graphs or ontologies into account.

July & August 2024:
- Thesis writing and defense preparation.
- Experiment tuning and paper writing.

References

[1] SNOMED International website. https://www.snomed.org/, 2024.
[2] SNOMED International welcomes France as a new member of the global SNOMED CT community. https://www.snomed.org/news/snomed-international-welcomes-france-as-a-new-member-of-the-global-snomed-ct-community, 2023.
[3] https://confluence.ihtsdotools.org/display/FTCG/French+Translation+Collaboration+Group, 2024.
[4] The OpenAI developer platform: GPT-4 and GPT-4 Turbo. https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo, 2023.
[5] Mixtral-8x7B: A high quality Sparse Mixture-of-Experts. https://mistral.ai/news/mixtral-of-experts/, 2023.
[6] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral of experts, 2024.
[7] Announcing Tower: An Open Multilingual LLM for Translation-Related Tasks. https://unbabel.com/announcing-tower-an-open-multilingual-llm-for-translation-related-tasks/, 2024.
[8] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis, 2023.
[9] Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. A paradigm shift in machine translation: Boosting translation performance of large language models, 2023.
[10] Aurélie Névéol, Cyril Grouin, Jeremy Leixa, Sophie Rosset, and Pierre Zweigenbaum. The QUAERO French medical corpus: A ressource for medical entity recognition and normalization. In Proc of BioTextMining Work, pages 24-30, 2014.
[11] Natalia Grabar, Clément Dalloux, and Vincent Claveau. CAS: corpus of clinical cases in French. Journal of Biomedical Semantics, 11(1):1-10, 2020.
[12] Bernardo Magnini, Begona Altuna, Alberto Lavelli, Manuela Speranza, and Roberto Zanoli. The E3C project: European clinical case corpus. Language, 1(L2):L3, 2021.
[13] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Briefings in Bioinformatics, 23(6):bbac409, 09 2022. ISSN 1477-4054. doi: 10.1093/bib/bbac409. URL https://doi.org/10.1093/bib/bbac409.
[14] Genet Asefa Gesese, Russa Biswas, Mehwish Alam, and Harald Sack. A survey on knowledge graph embeddings with literals: Which model links better literal-ly? Semantic Web, 12(4):617-647, 2021.
[15] Russa Biswas, Yiyi Chen, Heiko Paulheim, Harald Sack, and Mehwish Alam. It's all in the name: Entity typing using multilingual language models. In European Semantic Web Conference, pages 36-41. Springer, 2022.