Proposal for a master internship 2024

Title: Study of the combination of large language models and knowledge
structures for guiding the French translation of SNOMED CT

Supervision

Co-supervisor:
Maël Le Gall
Dir. Expertise, Innovation et Internat., Agence du Numérique en Santé (ANS)
mael.le-gall@esante.gouv.fr

Co-supervisor:
Elisabeth Serrot-Damatte
Direction Expertise, Innovation et International, ANS
elisabeth.serrot-damatte@esante.gouv.fr

Co-supervisor:
Adrien Coulet
HeKA research team (Inria, Inserm, Université Paris Cité)
adrien.coulet@inria.fr


Context

The internship will be co-supervised by the ANS and the HeKA team
(Inria, Inserm, Univ. Paris Cité). It will be located at PariSanté
Campus, 2-10 rue d'Oradour-sur-Glane, 75015 Paris, where both groups
have their offices.

If successful, this internship will lead to the funding of a 3-year
PhD project.

This internship takes place in the context of the recent accession of
France [2] to the organization SNOMED International [1]. This
membership allows French health stakeholders to use and implement
SNOMED CT (e.g., for interoperability between healthcare applications
or data extraction from Electronic Health Records). However, this
membership comes with the need for an accurate French translation of
its terms by French-speaking clinical experts. Such translation could
be guided by Natural Language Processing (NLP) and Knowledge
Engineering (KE) approaches.

The ANS has been mandated by the French Ministry of Health and Safety
to become the French National Release Center (NRC) for SNOMED
CT. Among other responsibilities, the NRC is in charge of creating the
national edition, which contains the translation of SNOMED CT. In this
context, the ANS is working within the French Translation
Collaboration Group [3] with Belgium, Canada, Luxembourg and
Switzerland to create a Common French translation. Each country uses
this translation as the basis for its own national edition.


Motivations

The deployment of SNOMED CT in France will depend on the availability
of a high-quality French translation. However, as the Common French
translation is the joint product of several French-speaking countries,
it is not always relevant for France-based stakeholders. This
translation is produced by a human-driven process conducted by
French-speaking medical experts. It includes the translation of (i) a
preferred term, (ii) a set of synonyms and, optionally, (iii) a
textual definition in natural language. This manual process ensures
the quality of the SNOMED CT translations; however, it is
time-consuming, and we believe that it could be accelerated with the
help of NLP and KE tools.

In particular, the generative approaches used by Large Language
Models (LLMs) suggest that high-quality translations could be proposed
to experts. In addition, we hypothesize that translations could be
improved by, on the one hand, French clinical texts, which illustrate
how terms are used in practice, and, on the other hand, the available
structure (hierarchical and non-hierarchical relationships) of SNOMED
CT.

Objectives

The aim of the internship is to develop one (or several) translation
approaches, based on LLMs, that suggest:
(i) a principal and extensive translation of the preferred term,
(ii) a set of valid synonyms,
(iii) a full-text, unambiguous description of the term.

Note that (i), (ii) and (iii) follow strict editorial rules. Part of
the objectives of the internship is to formalize these rules and
explore how they can be provided to the LLM through prompting. The
first two LLMs to consider are GPT-4 and Mixtral [4-7], but this list
remains open and may change during the internship. We would like to
evaluate the quality of the suggestions empirically, in particular to
compare our translations to those provided by an existing solution,
but also to improve the proposed methods iteratively. To this end, a
principal objective of the internship is to design strategies for
evaluating our translations. Accordingly, the intern will propose and
motivate strategies to objectively compare translations
(objective 1) [8, 9].
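
As an illustration of what an automatic comparison for objective 1
could look like, the sketch below scores suggested French terms
against reference translations with an exact-match rate and a
simplified character n-gram F1 (in the spirit of chrF-style metrics,
not an official implementation). The function names and the example
pairs are assumptions made for this illustration; they do not come
from a SNOMED CT release, and the actual evaluation strategy remains
to be defined by the intern.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of a lowercased, whitespace-normalized string."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def ngram_f1(candidate: str, reference: str, n: int = 3) -> float:
    """F1 overlap of character n-grams (a simplified, chrF-like score)."""
    cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def evaluate(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Aggregate scores over a non-empty list of (candidate, reference) pairs."""
    exact = sum(c.strip().lower() == r.strip().lower() for c, r in pairs)
    mean_f1 = sum(ngram_f1(c, r) for c, r in pairs) / len(pairs)
    return {"exact_match": exact / len(pairs), "mean_ngram_f1": mean_f1}


# Hypothetical example data, for illustration only.
pairs = [
    ("infarctus du myocarde", "infarctus du myocarde"),
    ("fracture du fémur", "fracture fémorale"),
]
scores = evaluate(pairs)
print(scores)
```

Surface-overlap scores of this kind only approximate translation
quality; they would likely need to be complemented by expert review,
since two valid synonyms can share few characters.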

From a methodological point of view, we would like to explore several
directions. First, we would like to leverage French clinical texts to
improve suggested translations, using corpora such as [10-12]
(objective 2). To this end, the following work may provide initial
research directions: [13]. Moreover, we would like to leverage the
structure of the SNOMED CT ontology, taking advantage of the
availability of descriptions, synonyms and potentially translations of
sibling terms and, to a lesser extent, of semantically close terms
(objective 3). To this end, we will refer to the following works:
[14, 15].
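
To make these directions concrete, the sketch below assembles a
translation prompt that combines editorial rules, already translated
sibling terms (objective 3) and French clinical sentences
(objective 2). It is only one possible design under stated
assumptions: the `Concept` class, the `build_prompt` function, the
concept identifier, the rules and the example sentence are all
hypothetical placeholders, not the actual SNOMED CT editorial rules or
data, and no LLM API is called.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Minimal, hypothetical view of a SNOMED CT concept for prompting."""
    concept_id: str
    english_term: str
    # Sibling terms already translated: (English term, French term) pairs.
    translated_siblings: list[tuple[str, str]] = field(default_factory=list)


def build_prompt(concept: Concept, rules: list[str],
                 usage_examples: list[str]) -> str:
    """Assemble a plain-text prompt from rules, sibling terms and usage examples."""
    lines = ["Translate the following SNOMED CT preferred term into French.", ""]
    lines.append("Editorial rules to follow:")
    lines += [f"- {rule}" for rule in rules]
    if concept.translated_siblings:
        lines += ["", "Sibling terms that are already translated:"]
        lines += [f"- {en} -> {fr}" for en, fr in concept.translated_siblings]
    if usage_examples:
        lines += ["", "French clinical sentences illustrating usage:"]
        lines += [f"- {sentence}" for sentence in usage_examples]
    lines += ["", f"Term to translate: {concept.english_term}",
              "French preferred term:"]
    return "\n".join(lines)


# Hypothetical example; the identifier and the rules are placeholders.
concept = Concept(
    concept_id="PLACEHOLDER-001",
    english_term="Fracture of femur",
    translated_siblings=[("Fracture of tibia", "fracture du tibia")],
)
prompt = build_prompt(
    concept,
    rules=["Use lowercase except for proper nouns.", "Do not use abbreviations."],
    usage_examples=["Le patient présente une fracture du fémur droit."],
)
print(prompt)
```

The resulting string could then be sent to any of the candidate LLMs;
keeping prompt construction separate from the model call makes it
easier to compare models on identical inputs.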

This subject is deliberately ambitious, as it is intended to lead to
a PhD contract. The priorities of the master internship are to focus
first on the translation of (i), then of (ii), with objectives 1 and 2.

Working plan

The current working plan and deliverables are:

March-April 2024 (please shift the agenda if the internship does not
start in March):

- Bibliographic work on the evaluation of translations.

- Hands-on familiarization with SNOMED CT, the manual translation
process, and existing translation-support tools.

- Hands-on familiarization with LLMs and prompting for translations
(i) and (ii).

May 2024:

- Proposition of a first strategy to evaluate/compare translations.

- Development of first translation pipelines.

- Bibliographic work on the improvement of translation using French
clinical text.

June 2024:

- Adaptation of the training with French clinical text.

- Experiments: comparison of performance.

- Bibliographic work on how LLMs can take knowledge graphs or
ontologies into account.

July & August 2024:

- Thesis writing and defense preparation

- Experiment tuning and paper writing

References

[1] SNOMED International website. https://www.snomed.org/, 2024.

[2] SNOMED International welcomes France as a new member of the global
SNOMED CT community.
https://www.snomed.org/news/snomed-international-welcomes-france-as-a-new-member-of-the-global-snomed-ct-community,
2023.

[3] French Translation Collaboration Group.
https://confluence.ihtsdotools.org/display/FTCG/French+Translation+Collaboration+Group,
2024.

[4] The OpenAI developer platform: GPT-4 and GPT-4 Turbo.
https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo, 2023.

[5] Mixtral-8x7B: A high quality Sparse Mixture-of-Experts.
https://mistral.ai/news/mixtral-of-experts/, 2023.

[6] Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur
Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego
de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel,
Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile
Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian,
Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut
Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mixtral
of experts, 2024.

[7] Announcing Tower: An Open Multilingual LLM for Translation-Related
Tasks.
https://unbabel.com/announcing-tower-an-open-multilingual-llm-for-translation-related-tasks/,
2024.

[8] Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang,
Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine
translation with large language models: Empirical results and
analysis, 2023.

[9] Haoran Xu, Young Jin Kim, Amr Sharaf, and Hany Hassan Awadalla. A
paradigm shift in machine translation: Boosting translation
performance of large language models, 2023.

[10] Aurélie Névéol, Cyril Grouin, Jeremy Leixa, Sophie Rosset, and
Pierre Zweigenbaum. The QUAERO French medical corpus: A ressource for
medical entity recognition and normalization. In Proc of
BioTextMining Work, pages 24-30, 2014.

[11] Natalia Grabar, Clément Dalloux, and Vincent Claveau. CAS:
corpus of clinical cases in French. Journal of Biomedical Semantics,
11(1):1-10, 2020.

[12] Bernardo Magnini, Begona Altuna, Alberto Lavelli, Manuela
Speranza, and Roberto Zanoli. The E3C project: European clinical
case corpus. Language, 1(L2):L3, 2021.

[13] Renqian Luo, Liai Sun, Yingce Xia, Tao Qin, Sheng Zhang, Hoifung
Poon, and Tie-Yan Liu. BioGPT: generative pre-trained transformer for
biomedical text generation and mining. Briefings in Bioinformatics,
23(6):bbac409, 09 2022. ISSN 1477-4054. doi:
10.1093/bib/bbac409. URL https://doi.org/10.1093/bib/bbac409.

[14] Genet Asefa Gesese, Russa Biswas, Mehwish Alam, and Harald
Sack. A survey on knowledge graph embeddings with literals: Which
model links better literal-ly? Semantic Web, 12(4):617-647, 2021.

[15] Russa Biswas, Yiyi Chen, Heiko Paulheim, Harald Sack, and Mehwish
Alam. It's all in the name: Entity typing using multilingual language
models. In European Semantic Web Conference, pages 36-41. Springer,
2022.