Title: A study on sampling datasets for linguistic diversity (Natural Language Processing)

Subject in brief: This internship can be adapted to the candidate's preferences, with two possible topics:
1. Impact of diversity-driven raw data sampling on downstream applications (possibly based on deep learning)
2. When evaluating NLP systems, how much diversity does the test corpus add to the training corpus, and how does this affect the systems' scores?

For the first proposal, the intern will focus on a task involving the discovery of multiword expressions (MWEs). MWEs are groups of words whose overall meaning is not necessarily directly related to the meanings of the individual words (e.g. "They spilled the beans"). One possible discovery method is to select sentences in which a given word (e.g. "spill") appears, then cluster these sentences so that those with a similar meaning are grouped together. In theory, this could yield one cluster containing all the sentences with the MWE "spill the beans", and another cluster containing the sentences where "spill" is used in the sense "to make fall". These sentences are currently selected at random. We would like to investigate diversity-driven data sampling methods and study the impact of this selection.

For the second proposal, we look at the diversity of one corpus relative to another. Given two corpora, a train corpus and a test corpus, does the test corpus increase the diversity of the train corpus? We also want to study the impact of a biased test corpus on the results of a task that relies on these train and test corpora. Diversity can be measured over different linguistic objects (words, syntactic subtrees, word meanings, etc.).

Both proposals include development, although the second one is more focused on linguistic analysis.

Possible outcomes of this internship: a pipeline including diversity quantification, and a scientific paper.

A full description of the subject is available here: https://perso.limsi.fr/savary/Projects/2024_Internship_subject.pdf
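To make the two proposals more concrete, here is a minimal Python sketch. It is an illustration only and not the method used by the team: it assumes that the linguistic objects are plain word types, that diversity is measured by the number of distinct types and by Shannon entropy, and that sentences are already tokenized into lists of words. All function names (shannon_entropy, diversity_gain, greedy_diverse_sample) are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy (in bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def diversity_gain(train_tokens, test_tokens):
    """Proposal 2 (assumed measure): what does test add to train?

    Returns the set of word types unseen in train and the change in
    entropy when the test tokens are appended to the train tokens.
    """
    new_types = set(test_tokens) - set(train_tokens)
    delta = shannon_entropy(train_tokens + test_tokens) - shannon_entropy(train_tokens)
    return new_types, delta


def greedy_diverse_sample(sentences, k):
    """Proposal 1 (assumed strategy): pick k sentences greedily.

    At each step, take the sentence that adds the most word types not
    yet covered by the selection. Ties go to the earliest sentence.
    Returns the indices of the selected sentences.
    """
    selected, seen = [], set()
    remaining = list(range(len(sentences)))
    for _ in range(min(k, len(sentences))):
        best = max(remaining, key=lambda i: len(set(sentences[i]) - seen))
        selected.append(best)
        seen |= set(sentences[best])
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    train = ["they", "spilled", "the", "milk"]
    test = ["they", "spilled", "the", "beans"]
    print(diversity_gain(train, test))

    candidates = [
        ["spill", "the", "beans"],
        ["spill", "the", "milk"],
        ["spill", "the", "beans"],
    ]
    print(greedy_diverse_sample(candidates, 2))
```

A random baseline (for example random.sample over the sentence indices) can be compared with the greedy selection on the same downstream task. Richer linguistic objects such as syntactic subtrees or word senses would require a parser or a sense inventory, which this sketch does not cover.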

Location: France, Orsay, Université Paris Saclay, LISN

Duration: 6 months (starting around March)

How to apply: Applicants should send a CV and transcripts of their bachelor's and master's grades to Louis Estève, Manon Scholivet, Agata Savary and Thomas Lavergne (firstname.lastname@universite-paris-saclay.fr without accents) before January 6. Applicants may also include links to past work or to git repositories of their code.