An NLP internship position of 6 months is opening at Sanofi R&D (based at Chilly-Mazarin) in the NLP team. The objective is to develop machine learning models to perform Entity Linking on a biomedical document dataset. More details are available in the internship offer below.

To apply: https://sanofi.wd3.myworkdayjobs.com/fr-FR/StudentPrograms/job/Chilly-Mazarin/UN-STAGIAIRE---DATA-SCIENTIST--H-F-_R2681433

For information, please contact: maali.mnasri@sanofi.com

Best regards,

Exploring Joint Learning and Domain Adaptation Approaches for Named Entity Disambiguation in the Biomedical Domain

*Introduction:*

Named Entity Disambiguation (NED), or Entity Linking (EL), is a subfield of natural language processing (NLP) that has gained considerable interest in recent years. EL consists of identifying entities mentioned in text and linking them to their corresponding entries in a target Knowledge Base (KB), such as Wikipedia URLs or Medical Subject Headings (MeSH) terms. This task is crucial for many natural language understanding applications, such as information retrieval, question answering, and text summarization. The increasing availability of large-scale knowledge bases and the development of new techniques for entity recognition, candidate generation, and disambiguation have contributed to the growth of interest in this field. Additionally, the use of pre-trained contextual language models, such as BERT and GPT, has also been shown to improve performance on the entity linking task, further increasing interest in this field. In the biomedical domain, this task is particularly important and challenging, as it allows for accurate identification and understanding of medical concepts and terminology in information retrieval use cases.

*Methodology:*

A standard architecture for Entity Linking consists of three steps:
1. Mention Detection: i.e., named entity recognition - identify text spans of possible entities
2. Candidate Generation: for each mention, generate the set of possible corresponding named entities from the knowledge base
3. Entity Disambiguation: find the entity link, i.e., which of the candidates best fits the entity mention

Usually, these three steps are performed separately. Recent research has investigated the possibility of jointly learning these steps within the same model (Broscheit, 2020), where the downstream task is a multi-label classification of entities over the vocabulary entries of the KB. Another interesting research work (Sakor et al., 2020) proposed a zero-shot entity linking system that uses entity descriptions from a textual entity dictionary to disambiguate unseen entities (without any labeled training examples). This task is more challenging since there is no metadata for unseen examples. After generating candidates, a Transformer-based architecture is used to rank them. Finally, adaptation to the target domain is performed through transfer learning techniques. This is particularly useful for specialized domains where dedicated KBs are expensive to obtain.

In this internship, we will explore the possibility of adapting these approaches to the biomedical domain by using entity descriptions from medical ontologies (such as MeSH (Medical Subject Headings), SNOMED CT (Systematized Nomenclature of Medicine - Clinical Terms), etc.) and databases (The Cancer Genome Atlas (TCGA), PharmGKB, UniProt, etc.). The first goal will be to build an Entity Knowledge Base for our domain. This includes:
1. flattening existing medical ontologies
2. merging the ontologies with the medical databases
3. de-duplicating the output by keeping unique entries

The next goal is to leverage pretrained contextual language models such as BERT, GPT, etc., to implement a disambiguation tool for entities in the biomedical domain using the produced KB.
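To make the pipeline concrete, here is a toy, dictionary-based sketch of the KB-construction and linking steps (the entity identifiers are MeSH-style IDs used purely for illustration, and the word-overlap ranker stands in for the Transformer-based candidate ranking described above; this is not the system to be built, only a minimal illustration of its structure):

```python
# Toy sketch: build a KB from two (hypothetical) ontology exports,
# then run candidate generation and a naive disambiguation step.

# Step 0 -- two hypothetical ontology/database exports to be merged.
ONTOLOGY_A = [
    {"id": "D001241", "name": "Aspirin",
     "description": "analgesic drug used to reduce pain and fever"},
    {"id": "D006973", "name": "Hypertension",
     "description": "persistently high arterial blood pressure"},
]
ONTOLOGY_B = [
    {"id": "D001241", "name": "Acetylsalicylic acid",
     "description": "analgesic drug used to reduce pain and fever"},
    {"id": "D003924", "name": "Diabetes Mellitus, Type 2",
     "description": "metabolic disorder with high blood sugar"},
]

def build_kb(*sources):
    """Merge ontology exports, keeping one entry per unique ID and
    pooling the surface forms of duplicates as aliases."""
    kb = {}
    for source in sources:
        for entry in source:
            record = kb.setdefault(
                entry["id"],
                {"aliases": set(), "description": entry["description"]},
            )
            record["aliases"].add(entry["name"].lower())
    return kb

def generate_candidates(mention, kb):
    """Step 2: all KB entries with an alias matching the mention string."""
    return [eid for eid, rec in kb.items() if mention.lower() in rec["aliases"]]

def disambiguate(mention, context, kb):
    """Step 3: rank candidates by word overlap between the mention's
    context and each candidate's entity description."""
    candidates = generate_candidates(mention, kb)
    context_words = set(context.lower().split())
    def overlap(eid):
        return len(context_words & set(kb[eid]["description"].split()))
    return max(candidates, key=overlap) if candidates else None

kb = build_kb(ONTOLOGY_A, ONTOLOGY_B)
# Step 1 (mention detection) is assumed done upstream by an NER model.
print(disambiguate("aspirin", "patient took aspirin to reduce fever", kb))
```

In the actual internship, the alias lookup would be replaced by a proper candidate generator over the merged biomedical KB, and the overlap scorer by a fine-tuned contextual language model.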
The use of these pretrained models should improve performance on this task, as they have been trained on large amounts of text data and can therefore better capture the context in which entities are mentioned (Zhang et al., 2019). Our approach will involve fine-tuning these models on our domain and then using them for disambiguation in real-world cases.

*Objectives:*
1. Review the state of the art in named entity disambiguation for the biomedical domain
2. Explore existing biomedical ontologies and databases and merge them into a custom Knowledge Base for Entity Linking
3. Implement a baseline system using spaCy's Entity Linker along with an existing biomedical thesaurus
4. Leverage a pretrained language model (ideally one further pretrained on a medical dataset) and fine-tune it on the downstream tasks of multi-label classification of entities over the unique identifiers in the produced Knowledge Base and/or candidate ranking
5. Evaluate the disambiguation tool on Sanofi's internal data and on available public data for publishing purposes

*Expected outcomes:*
1. A custom biomedical knowledge base for Entity Linking, built from existing external resources
2. A working baseline based on spaCy's Entity Linker
3. A neural end-to-end entity linking model for the biomedical domain
4. A report documenting the research, implementation, evaluation, and comparison of the two systems

*Required skills:*
1. Strong programming skills, particularly in Python, with knowledge of one deep learning framework (PyTorch, TensorFlow, etc.)
2. Knowledge of NLP, BERT or BERT-like models, and deep learning
3. Experience with fine-tuning and evaluating deep learning models
4. Familiarity with spaCy and/or other NLP libraries
5. Experience in developing and evaluating machine learning models in a real-world setting

*The conditions of this internship are as follows:*
1. This internship will be based at the Sanofi research facility, and you will be expected to work onsite for the duration of the internship (1 Av. Pierre Brossolette, 91380 Chilly-Mazarin).
2. The duration of the internship will be 6 months; the starting date is to be discussed.
3. You will be provided with a computer and the necessary software for the duration of the internship.
4. You will be expected to work full-time during the internship, with a schedule to be determined by your supervisor.

*References:*
Broscheit, S. 2020. Investigating Entity Knowledge in BERT with Simple Neural End-To-End Entity Linking.
Zhang, Z., Han, X., Liu, Z., Jiang, X., Sun, M., and Liu, Q. 2019. ERNIE: Enhanced Language Representation with Informative Entities. ACL 2019.