Matching contextual and definitional embeddings for a sense-aware reading assistant

Internship proposal - Carlos Ramisch and Alexis Nasr, SELEXINI ANR project

Imagine you are reading a book in a foreign language that you understand quite well but are not totally fluent in. At some point, you come across a word in a sentence that you do not understand. Imagine you can click on the word on your screen and its definition shows up (like in the ebook reader shown in Figure 1). A definition is a text snippet that explains the meaning of a word using other words that you are more likely to be familiar with. It gives you access to the meaning of the unknown word and lets you fully grasp the meaning of the whole sentence.

Now suppose that the unknown word (e.g. cell) has multiple senses (e.g. "a small room in which a prisoner is locked up" or "the smallest structural and functional unit of an organism", among others). Instead of a single definition, you will get a list of definitions, and you will have to read through all of them to decide which one is most appropriate in this context. As a human, you can probably interpret the known words in the context of the unknown word and choose the definition that best matches this context.

Figure 1: ebook excerpt from Amazon's web Kindle reader (https://read.amazon.com/). When clicking on a word (court), a definition pops up. Only one (incorrect) definition is shown here. It is also possible to see all 6 definitions from the dictionary. The presence of the context word tennis could help disambiguate and show only the relevant definitions.

The goal of this internship is to develop and evaluate an original NLP model capable of aligning a word's context with its correct definition, even if the word is ambiguous, i.e. has more than one definition listed in the dictionary. To achieve this ambitious goal, the recruited intern will address three challenges that constitute the major milestones of the project:
1. Representations: current NLP models represent words and sentences as real-valued vectors. A first baseline would be to apply the Lesk algorithm, which counts the number of overlapping words between the context and the candidate definitions (Basile et al. 2014). Another simple method consists in embedding both the context of the word and its candidate definitions with a pre-trained language model such as BERT (Devlin et al. 2018), returning the definition whose embedding is closest to the context embedding. Since definitions are often composed of a genus-differentia pair, their embeddings could be fine-tuned to represent this structure, using techniques for automatic hypernym extraction (Camacho-Collados et al. 2018), hyperbolic spaces (Nickel & Kiela 2017), or graph embeddings that encode the relations between word senses (Nguyen 2020). SemEval 2022 Task 1 (CODWOE) can also provide useful insights into definition embeddings.

2. Alignment: when definitions and contexts are embedded into a shared space, comparing them is trivial. If this is not the case, however, it might be necessary to learn a transformation between the two spaces, analogous to cross-lingual embedding alignment (Lample et al. 2018). Moreover, contextual models such as BERT often represent wordpiece tokens instead of full words (Schuster & Nakajima 2012). Finally, the occurrence of multiword phrases both in texts and in dictionaries complicates the alignment further, requiring some technique to identify them in advance (Ramisch et al. 2020).

3. Evaluation: traditionally, the task of assigning a sense to a word occurrence is word sense disambiguation (Navigli 2009). Beyond this straightforward possibility, the model can also be evaluated on a word-in-context task (Pilehvar & Camacho-Collados 2019), on sense-aware similarity datasets (Huang et al. 2012), etc. Since the project's target language is French, evaluation datasets must cover this language.
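As a concrete illustration of the Lesk baseline mentioned in challenge 1, here is a minimal sketch that scores each definition by simple word overlap with the context. The example sentences are hypothetical toy inputs, and a real implementation would normalise the text further (stopword removal, lemmatisation):

```python
def lesk_overlap(context, definitions):
    """Return the index of the definition sharing the most words with the context."""
    ctx_words = set(context.lower().split())
    scores = [len(ctx_words & set(d.lower().split())) for d in definitions]
    return max(range(len(definitions)), key=scores.__getitem__)

# Toy example: disambiguating "court" in a tennis context.
context = "the player hit the ball over the net on the tennis court"
definitions = [
    "a room in which a prisoner is locked up",
    "an area marked out for ball games such as tennis",
]
print(lesk_overlap(context, definitions))  # → 1 (the tennis sense)
```

In this toy case the second definition wins because it shares "ball" and "tennis" with the context, while the first shares nothing.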
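The embedding-based baseline of challenge 1 then reduces to nearest-neighbour search by cosine similarity. The sketch below uses tiny hand-made vectors standing in for sentence embeddings; in the actual project these would come from a pre-trained model such as BERT:

```python
import numpy as np

def closest_definition(context_vec, definition_vecs):
    """Return the index of the definition embedding most cosine-similar to the context."""
    defs = np.asarray(definition_vecs, dtype=float)
    ctx = np.asarray(context_vec, dtype=float)
    sims = defs @ ctx / (np.linalg.norm(defs, axis=1) * np.linalg.norm(ctx))
    return int(np.argmax(sims))

# Illustrative 3-d vectors only; real sentence embeddings have hundreds of dimensions.
ctx = [0.9, 0.1, 0.2]
defs = [[0.1, 0.9, 0.3],   # stand-in for the "prison cell" sense
        [0.8, 0.2, 0.1]]   # stand-in for the "tennis court" sense
print(closest_definition(ctx, defs))  # → 1
```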
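For challenge 2, one standard way to learn a transformation between two embedding spaces, used for cross-lingual embedding alignment, is orthogonal Procrustes: given paired vectors X (contexts) and Y (definitions), find the orthogonal map W minimising ||XW - Y|| via an SVD. A sketch on synthetic data, where the random rotation Q plays the role of the unknown relation between the spaces:

```python
import numpy as np

def procrustes_align(X, Y):
    """Orthogonal Procrustes: W = U V^T where X^T Y = U S V^T, so that X @ W ≈ Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))          # 50 paired points in a 4-d toy space
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))  # hidden orthogonal transformation
Y = X @ Q
W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # → True: the mapping is recovered exactly
```

With real, noisy context/definition pairs the recovery is of course only approximate, but the same closed-form solution applies.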
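Regarding the wordpiece issue in challenge 2, a common (though not the only) way to recover full-word vectors is to average the vectors of the pieces belonging to each word. The `word_ids` mapping below is a hypothetical piece-to-word index of the kind produced by subword tokenisers:

```python
import numpy as np

def pool_wordpieces(piece_vecs, word_ids):
    """Average the wordpiece vectors that belong to the same word."""
    piece_vecs = np.asarray(piece_vecs, dtype=float)
    n_words = max(word_ids) + 1
    return np.stack([
        piece_vecs[[i for i, w in enumerate(word_ids) if w == j]].mean(axis=0)
        for j in range(n_words)
    ])

# Toy 2-d vectors: "playing" split into "play" + "##ing" (both map to word 0).
vecs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
word_ids = [0, 0, 1]
print(pool_wordpieces(vecs, word_ids))  # word 0 → [0.5, 0.5], word 1 → [2.0, 2.0]
```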
This internship will take place in the context of the recently funded ANR SELEXINI project, which aims at developing lexicon-induction methods to create a large structured semantic lexicon for French. One of the by-products of this internship is a large French corpus with contextual embeddings aligned to Wiktionary entries. The intern will join the TALEP team in Luminy, Marseille, will have the opportunity to interact with researchers at the partner universities (Univ. de Saclay, Univ. de Paris, Univ. de Lorraine), and, depending on the results of the internship, may submit a paper to an international conference.

The development of a showcase web interface for clicking on words and retrieving definitions is a possible extension after the internship (e.g. as a summer project). Other possible extensions include generating definitions on the fly for words or senses not present in Wiktionary (Bevilacqua et al. 2020), or generating definitions for whole phrases and multiword expressions. These could become the topic of a PhD thesis if the intern shows interest and skills compatible with the project.

Requirements:
- Familiarity with word embeddings
- Python programming (highly recommended)
- Fluent English reading, reasonable English writing
- Curiosity, autonomy, rigour, scientific methodology
- Interest in linguistics, languages, dictionaries, semantics

Email Carlos Ramisch and Alexis Nasr (first.last@lis-lab.fr) before November 1st, 2021.