Title: Investigating Multilingual Word Associations Networks with Link Prediction

Location: LORIA, Vandoeuvre-lès-Nancy, France

Supervision:
- Pr Karën FORT, Professor in Computer Science (SÉMAGRAMME, LORIA, Université de Lorraine) -- karen.fort@loria.fr
- Hee-Soo Choi, PhD Student in Language Science (SÉMAGRAMME LORIA - Modélisation et Ressources, ATILF) -- hee-soo.choi@loria.fr
- Dr Simon De Deyne, Senior Research Associate (Computational Cognitive Science Lab, University of Melbourne) -- simon.dedeyne@unimelb.edu.au

Duration: 5-6 months (starting around March)

Prerequisites: Be a Master 2 student in NLP or Computational Linguistics. We expect the candidate to be proficient in Python programming, rigorous in handling large amounts of data, and interested in linguistics.

To apply: Send your CV and Master's grades to hee-soo.choi@loria.fr. An e-mail explaining your motivations is sufficient.

Please note: students from outside the Schengen zone should apply as early as possible (applications must be accepted at least 2 months before the internship start date) in order to obtain FSD authorization for access to the research laboratory (LORIA).

*** Detailed description of subject ***

Motivation and context

Lexicography has long focused on dictionaries, which describe the words of a language in the form of a list, often presented in alphabetical order. However, representing words as graphs provides a more flexible alternative to traditional dictionaries for conveying how words are related semantically and syntactically. In the early days of NLP, the need to represent lexical knowledge in a machine-readable format led to the development of lexical-semantic networks such as WordNet [1], one of the most widely used NLP resources. Compared with dictionary definitions, encoding linguistic knowledge as a graph better captures the links between words, in a way that resembles how humans understand language.
Encoding knowledge as a set of labelled nodes and edges also means that the world's knowledge can be represented transparently in knowledge graphs, which are particularly useful for improving recommendation systems, question answering, etc. [2]. Recent developments in Natural Language Processing (NLP) have dramatically changed lexicography by improving the scale and quality of knowledge graphs without needing extensive human annotation. A key challenge is that these graphs are almost always incomplete, because it is impossible to describe the world or a language exhaustively. To overcome this issue, the task of Knowledge Graph Completion (KGC), or Link Prediction, has been developed in NLP. It consists of training models to infer missing information in the graph [3].

Goals and Objectives

In this internship, we propose to reproduce the experiments of Choi et al. [4], which leverage the predictions of a link prediction model to enrich French lexical-semantic graphs (Réseau lexical du français (RL-fr) [5] and JeuxDeMots [6]), on graphs of word associations created by crowdsourcing via the Small World of Words (SWOW) project [7] (see https://smallworldofwords.org/project). SWOW aims to measure meaning as it would be encoded in a mental lexicon, in several world languages. It is based on a word association task in which participants respond to a given cue word with the first words that come to mind. Importantly, recent work has shown that knowledge graphs derived from SWOW encode common-sense knowledge that captures additional information not encoded in much larger graphs such as WordNet and ConceptNet [7]. Moreover, this extra knowledge is critical, as it improves prediction across a wide range of tasks, such as similarity [8], sentiment [9], and moral reasoning [10].

In contrast to other knowledge graphs, SWOW graphs do not state the nature of the relation between a cue and a response, since participants don't always explain it.
For example, for the cue "dog" and the response "animal", the taxonomic relation is not encoded in the graph. However, recent work on semantic role labelling [11] suggests that these latent edge labels can be recovered. This leads to an exciting opportunity to study how link prediction can be further improved when combined with semantic role labelling.

In this internship, the intern will learn to manipulate graphs and to use link prediction models and grids for experiments. Part of the experiments will be devoted to adapting SWOW data for the existing pipeline.

The main objectives are as follows, but can be adapted depending on the student's background and interests:
- Adapt the link prediction pipeline of Choi et al. [4] to the English SWOW word association graph, in order to extract potential new links and increase graph coverage.
- Propose a methodology for automatically typing relations in SWOW, where, unlike in RL-fr and JeuxDeMots, edge labels are not given and must be inferred.
- Benchmark the performance of the augmented graph on existing human data.
- Compare differences in vocabulary for the same cue word in different languages.

References

[1] Fellbaum, C. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[2] Ji, Shaoxiong et al. A Survey on Knowledge Graphs: Representation, Acquisition, and Applications. IEEE Transactions on Neural Networks and Learning Systems 33 (2020): 494-514.
[3] Chen, Z., Wang, Y., Zhao, B., Cheng, J., Zhao, X. & Duan, Z. Knowledge Graph Completion: A Review. IEEE Access, vol. 8, pp. 192435-192456, 2020, doi: 10.1109/ACCESS.2020.3030076.
[4] Choi, H.-S., Trivedi, P., Constant, M., Fort, K. & Guillaume, B. Beyond Model Performance: Can Link Prediction Enrich French Lexical Graphs? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2329-2341, Torino, Italia.
[5] Lux-Pogodalla, V. & Polguère, A. Construction of a French Lexical Network: Methodological Issues. First International Workshop on Lexical Resources, WoLeR 2011, 2011, 54-61.
[6] Lafourcade, M. & Joubert, A. JeuxDeMots : un prototype ludique pour l'émergence de relations entre termes. JADT'08 : Journées internationales d'Analyse statistiques des Données Textuelles, 2008, 657-666.
[7] De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms, G. The "Small World of Words" English word association norms for over 12,000 cue words. Behavior Research Methods, 51(3), 987-1006, 2019. DOI: 10.3758/s13428-018-1115-7.
[8] Liu, C., Cohn, T., & Frermann, L. (2021). Commonsense knowledge in word associations and ConceptNet. arXiv preprint arXiv:2109.09309.
[9] Vankrunkelsven, H., Verheyen, S., Storms, G., & De Deyne, S. (2018). Predicting lexical norms: A comparison between a word association model and text-based word co-occurrence models. Journal of Cognition, 1(1).
[10] Ramezani, A., & Xu, Y. (2024). Moral association graph: A cognitive model for moral inference. In Proceedings of the Annual Meeting of the Cognitive Science Society (Vol. 46).
[11] Liu, C., Cohn, T., De Deyne, S., & Frermann, L. (2022). WAX: A new dataset for word association explanations. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 106-120).