An NLP internship position of 6 months is opening at Sanofi R&D (based at Chilly-Mazarin) in the NLP team. The objective is to develop machine learning models to perform Entity Linking on a biomedical document dataset. More details are available in the internship offer below.

To apply: https://sanofi.wd3.myworkdayjobs.com/fr-FR/StudentPrograms/job/Chilly-Mazarin/UN-STAGIAIRE---DATA-SCIENTIST--H-F-_R2681433

For information, please contact: maali.mnasri@sanofi.com

Best regards,

Exploring Joint Learning and Domain Adaptation Approaches for Named Entity Disambiguation in the Biomedical Domain

*Introduction:*

Named Entity Disambiguation (NED), or Entity Linking (EL), is a subfield of natural language processing (NLP) that has gained considerable interest in recent years. EL consists of identifying entities mentioned in text and linking them to their corresponding entries in a target Knowledge Base (KB), such as Wikipedia URLs or Medical Subject Headings (MeSH) terms. This task is crucial for many natural language understanding applications, such as information retrieval, question answering, and text summarization. The increasing availability of large-scale knowledge bases and the development of new techniques for entity recognition, candidate generation, and disambiguation have contributed to the growth of interest in this field. Additionally, the use of pre-trained contextual language models, such as BERT and GPT, has also been shown to improve performance on the entity linking task, further increasing interest in this field. In the biomedical domain, this task is particularly important and challenging, as it allows for accurate identification and understanding of medical concepts and terminology in information retrieval use cases.

*Methodology:*

A standard architecture for Entity Linking consists of three steps:
1. Mention Detection: i.e., named entity recognition - identify text spans of possible entities
2. Candidate Generation: for each mention, generate the set of possible corresponding named entities from the knowledge base
3. Entity Disambiguation: find the entity link, i.e., which of the candidates best fits the entity mention

Usually, these three steps are performed separately. Recent research has investigated the possibility of jointly learning these steps within the same model (Broscheit, 2020), where the downstream task is a multi-label classification of entities over the vocabulary entries of the KB. Another interesting research work (Sakor et al., 2020) proposed a zero-shot entity linking system that uses entity descriptions from a textual entity dictionary to disambiguate unseen entities (without any labeled training examples). This task is more challenging since there is no metadata for unseen examples. After generating candidates, a Transformer-based architecture is used to rank them. Finally, adaptation to the target domain is performed through transfer learning techniques. This is particularly useful for specialized domains where dedicated KBs are expensive to obtain.

In this internship, we will explore the possibility of adapting these approaches to the biomedical domain by using entity descriptions from medical ontologies (such as MeSH (Medical Subject Headings), SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms), etc.) and databases (The Cancer Genome Atlas (TCGA), PharmGKB, UniProt, etc.). The first goal will be to build an Entity Knowledge Base for our domain. This includes:
1. flattening existing medical ontologies
2. merging the ontologies with the medical databases
3. de-duplicating the output by keeping unique entries

The next goal is to leverage pretrained contextual language models such as BERT, GPT, etc., to implement a disambiguation tool for entities in the biomedical domain using the produced KB.
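To make the pipeline concrete, here is a toy, dictionary-based sketch of the KB-construction and linking steps (the entity identifiers are MeSH-style IDs used purely for illustration, and the word-overlap ranker stands in for the Transformer-based candidate ranking described above; this is not the system to be built, only a minimal illustration of its structure):

```python
# Toy sketch: build a KB from two (hypothetical) ontology exports,
# then run candidate generation and a naive disambiguation step.

# Step 0 -- two hypothetical ontology/database exports to be merged.
ONTOLOGY_A = [
    {"id": "D001241", "name": "Aspirin",
     "description": "analgesic drug used to reduce pain and fever"},
    {"id": "D006973", "name": "Hypertension",
     "description": "persistently high arterial blood pressure"},
]
ONTOLOGY_B = [
    {"id": "D001241", "name": "Acetylsalicylic acid",
     "description": "analgesic drug used to reduce pain and fever"},
    {"id": "D003924", "name": "Diabetes Mellitus, Type 2",
     "description": "metabolic disorder with high blood sugar"},
]

def build_kb(*sources):
    """Merge ontology exports, keeping one entry per unique ID and
    pooling the surface forms of duplicates as aliases."""
    kb = {}
    for source in sources:
        for entry in source:
            record = kb.setdefault(
                entry["id"],
                {"aliases": set(), "description": entry["description"]},
            )
            record["aliases"].add(entry["name"].lower())
    return kb

def generate_candidates(mention, kb):
    """Step 2: all KB entries with an alias matching the mention string."""
    return [eid for eid, rec in kb.items() if mention.lower() in rec["aliases"]]

def disambiguate(mention, context, kb):
    """Step 3: rank candidates by word overlap between the mention's
    context and each candidate's entity description."""
    candidates = generate_candidates(mention, kb)
    context_words = set(context.lower().split())
    def overlap(eid):
        return len(context_words & set(kb[eid]["description"].split()))
    return max(candidates, key=overlap) if candidates else None

kb = build_kb(ONTOLOGY_A, ONTOLOGY_B)
# Step 1 (mention detection) is assumed done upstream by an NER model.
print(disambiguate("aspirin", "patient took aspirin to reduce fever", kb))
```

In the actual internship, the alias lookup would be replaced by a proper candidate generator over the merged biomedical KB, and the overlap scorer by a fine-tuned contextual language model.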
The use of these pretrained models should improve performance on this task, as they have been trained on large amounts of text data and can therefore better capture the context in which entities are mentioned (Zhang et al., 2019). Our approach will involve fine-tuning these models on our domain and then using them for disambiguation in real-world cases.

*Objectives:*
1. Review the state of the art in named entity disambiguation for the biomedical domain
2. Explore existing biomedical ontologies and databases and merge them into a custom Knowledge Base for Entity Linking
3. Implement a baseline system using spaCy's Entity Linker along with an existing biomedical thesaurus
4. Leverage a pretrained language model (ideally one further pretrained on a medical dataset) and fine-tune it on the downstream tasks of multi-label classification of entities over the unique identifiers in the produced Knowledge Base and/or candidate ranking
5. Evaluate the disambiguation tool on Sanofi's internal data and on available public data for publishing purposes

*Expected outcomes:*
1. A custom biomedical knowledge base for Entity Linking, built from existing external resources
2. A working baseline based on spaCy's Entity Linker
3. A neural end-to-end entity linking model for the biomedical domain
4. A report documenting the research, implementation, evaluation, and comparison of the two systems

*Required skills:*
1. Strong programming skills, particularly in Python, with knowledge of one deep learning framework (PyTorch, TensorFlow, etc.)
2. Knowledge of NLP, BERT or BERT-like models, and deep learning
3. Experience with fine-tuning and evaluating deep learning models
4. Familiarity with spaCy and/or other NLP libraries
5. Experience in developing and evaluating machine learning models in a real-world setting

*The conditions of this internship are as follows:*
1. This internship will be based at the Sanofi research facility, and you will be expected to work onsite for the duration of the internship (1 Av. Pierre Brossolette, 91380 Chilly-Mazarin).
2. The duration of the internship will be 6 months; the starting date is to be discussed.
3. You will be provided with a computer and the necessary software for the duration of the internship.
4. You will be expected to work full-time during the internship, with a schedule to be determined by your supervisor.

*References:*
Broscheit, S. 2020. Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking.
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., and Liu, Q. 2019. ERNIE: Enhanced Language Representation with Informative Entities. ACL 2019.