Graduate internship (stage de master)

Does translation lead to better recognition of biomedical named entities from original French Electronic Health Records?

Keywords: natural language processing (NLP), named entity recognition (NER), biomedical concepts
Ideal starting date: March/April 2022
Duration: 4-6 months
Advisors:
- Xavier Tannier (LIMICS, Professor), xavier.tannier@sorbonne-universite.fr
- Fabrice Carrat (IPLESP, Professor)
- Christel Gérardin (IPLESP, MD, PhD student), christel.ducroz-gerardin@iplesp.upmc.fr
Location: LIMICS / IPLESP

Context (general presentation of the topic)

Named entity recognition (NER) is an important step in natural language processing (NLP), especially for processing specialized textual documents such as medical reports in order to extract key information. Major improvements have been made in this area, especially in English, for which a very large amount of data is accessible. Modern NLP makes extensive use of pre-trained language models, which allow efficient semantic representation of texts. The development of architectures such as transformers [Vaswani2017, Devlin2018] has brought significant improvements in this field, and these models are now used in a vast range of NLP applications: question answering, neural machine translation, named entity recognition and sequence classification. Transformer language models need to be trained on very large amounts of data (in the biomedical field, this led to models such as BioBERT [Lee2020] and ClinicalBERT [Huang2019]), enabling significant improvements in medical information extraction. Two main types of data are available in the biomedical field to train such language models: public articles (e.g. PubMed) and clinical Electronic Health Record databases (e.g. MIMIC-III). In many languages other than English, efforts still need to be made to obtain comparable results, in particular because much less data is accessible [Neveol2018].

At the same time, machine translation has also gained in performance thanks to the same type of transformer-based language models, and the last few years have seen the emergence of high-quality automatic translation. These two observations have led several research teams to add a translation step in order to analyze medical texts, for instance to extract relevant mentions from ultrasonography reports [Campos2017, Suarez2021] or to perform medical concept normalization [Wajsbürt2021].

Objective of the internship

In this work, we want to address the question of whether adding an English translation stage improves medical concept extraction from French electronic health records. Hence, we will compare two approaches:
1. Translation-oriented: the French hospital record is first translated to English, then a NER stage is performed in English (with arguably better language models and resources, which may compensate for the loss of quality introduced by the automatic translation).
2. A classical, monolingual French NER performed directly on the hospital records.

In order to compare the two pipelines, both systems will include a final mapping step from the extracted terms to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs), and the reference corpus will be annotated directly with UMLS CUIs. The UMLS gathers several knowledge sources, including a metathesaurus in which all medical terms (drugs, symptoms, acts, etc.) are mapped to a Concept Unique Identifier (CUI).
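As a purely illustrative starting point, the sketch below shows how the two pipelines could be wired together with the Hugging Face transformers library. Only Helsinki-NLP/opus-mt-fr-en is a real, publicly available translation model; the NER model names and the map_term_to_cui helper are hypothetical placeholders, and the actual components will be chosen and evaluated during the internship.

```python
# Illustrative sketch of the two pipelines to be compared; not a prescribed implementation.
# Assumptions: the Hugging Face `transformers` library is available; EN_NER_MODEL and
# FR_NER_MODEL are hypothetical placeholders for an English and a French biomedical NER
# model; map_term_to_cui stands in for the final UMLS normalization step, left unspecified.
from typing import Optional

from transformers import pipeline

EN_NER_MODEL = "placeholder/english-biomedical-ner"  # hypothetical model name
FR_NER_MODEL = "placeholder/french-biomedical-ner"   # hypothetical model name

# Helsinki-NLP/opus-mt-fr-en is a publicly available French-to-English translation model.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")
ner_en = pipeline("token-classification", model=EN_NER_MODEL, aggregation_strategy="simple")
ner_fr = pipeline("token-classification", model=FR_NER_MODEL, aggregation_strategy="simple")


def map_term_to_cui(term: str, language: str) -> Optional[str]:
    """Stub for the term-to-CUI mapping step (e.g. a UMLS terminology lookup)."""
    return None  # to be implemented during the internship


def translation_pipeline(french_report: str) -> set[str]:
    """Approach 1: translate the French report to English, then run English NER."""
    english_text = translator(french_report)[0]["translation_text"]
    entities = ner_en(english_text)
    return {cui for e in entities if (cui := map_term_to_cui(e["word"], "en"))}


def monolingual_pipeline(french_report: str) -> set[str]:
    """Approach 2: run French NER directly on the original report."""
    entities = ner_fr(french_report)
    return {cui for e in entities if (cui := map_term_to_cui(e["word"], "fr"))}


def evaluate(predicted_cuis: set[str], gold_cuis: set[str]) -> tuple[float, float]:
    """Document-level precision and recall over CUI sets."""
    true_positives = len(predicted_cuis & gold_cuis)
    precision = true_positives / len(predicted_cuis) if predicted_cuis else 0.0
    recall = true_positives / len(gold_cuis) if gold_cuis else 0.0
    return precision, recall
```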
CUIs are language-independent, which will allow a fair evaluation of the two pipelines. In this work, we will focus only on two categories: signs and symptoms (e.g. fever, anuria, vomiting) and diseases (e.g. Crohn's disease, myocardial infarction). The annotations of 200 texts were produced by a medical doctor before the internship. During the internship, the student will compare different algorithms and their performance.

Expected skills of the student

Background in computer science and programming skills, with good knowledge of deep learning and/or natural language processing. The intern will receive a stipend (around 600 €/month) and a contribution to transport costs.

In order to apply, the student must send a motivation letter and her/his CV to christel.ducroz-gerardin@iplesp.upmc.fr

Bibliographic references

[Vaswani2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).
[Devlin2018] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (pp. 4171-4186).
[Lee2020] Lee, J., et al. (2020). BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234-1240.
[Huang2019] Huang, K., Altosaar, J., & Ranganath, R. (2019). ClinicalBERT: Modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv:1904.05342.
[Neveol2018] Névéol, A., Dalianis, H., Velupillai, S., Savova, G., & Zweigenbaum, P. (2018). Clinical natural language processing in languages other than English: opportunities and challenges. Journal of Biomedical Semantics, 9(1), 1-3.
[Campos2017] Campos, L., Pedro, V., & Couto, F. (2017). Impact of translation on named-entity recognition in radiology texts. Database, 2017, article ID bax064.
[Suarez2021] Suárez-Paniagua, V., Dong, H., & Casey, A. (2021). A multi-BERT hybrid system for named entity recognition in Spanish radiology reports. CLEF eHealth 2021.
[Wajsbürt2021] Wajsbürt, P., Sarfati, A., & Tannier, X. (2021). Medical concept normalization in French using multilingual terminologies and contextual embeddings. Journal of Biomedical Informatics, 114, 103684.