Title: Investigating Multilingual Word Association Networks with Link
Prediction

Location : LORIA, Vandoeuvre-lès-Nancy, France

Supervision :
-   Pr Karën FORT, Professor in Computer Science (SÉMAGRAMME, LORIA,
    Université de Lorraine) -- karen.fort@loria.fr
-   Hee-Soo Choi, PhD Student in Language Science (SÉMAGRAMME, LORIA -
    Modélisation et Ressources, ATILF) -- hee-soo.choi@loria.fr
-   Dr Simon De Deyne, Senior Research Associate (Computational
    Cognitive Science Lab, University of Melbourne) --
    simon.dedeyne@unimelb.edu.au

Duration : 5 - 6 months (starting around March)

Prerequisites : Be a Master 2 student in NLP or Computational
Linguistics. We expect the candidate to be proficient in Python
programming, rigorous in handling large amounts of data and interested
in linguistics.

To apply: Send your CV and Master's grades to hee-soo.choi@loria.fr.
An e-mail explaining your motivations is sufficient.

Please note: students from outside the Schengen zone should apply as
early as possible (applications must be accepted at least 2 months
before the internship start date) in order to obtain FSD authorization
for access to the research laboratory (LORIA).


*** Detailed description of subject ***

Motivation and context

Lexicography has long focused on dictionaries to describe the words of
a language in the form of a list, often presented in alphabetic order.
However, representing words as graphs provides a more flexible
alternative to traditional dictionaries for conveying how words are
related semantically and syntactically. In the early days of NLP, the
need to
represent lexical knowledge in machine-readable format led to the
development of lexical-semantic networks such as WordNet [1], one of
the most widely used NLP resources. Unlike dictionary definitions,
encoding linguistic knowledge as a graph better captures the links
between words, resembling how humans understand language. Encoding
knowledge as a set of labelled nodes and edges also means that the
world's knowledge can be represented transparently in knowledge graphs,
which are particularly useful for improving recommendation systems,
question-answering, etc. [2]. Recent developments in Natural Language
Processing (NLP) have dramatically changed lexicography by improving
the scale and quality of knowledge graphs without needing extensive
human annotation. A key challenge is that these graphs are almost
always incomplete due to the impossibility of describing the world or a
language exhaustively. To overcome this issue, the task of Knowledge
Graph Completion (KGC), or Link Prediction, has been developed in NLP,
consisting of training models to infer missing information in the
graph [3].


Goals and Objectives

In this internship, we propose to reproduce the experiments of Choi et
al. [4], which leverage the predictions of a link prediction model to
enrich French lexical-semantic graphs (Réseau lexical du français
(RL-fr) [5] and JeuxDeMots [6]), on graphs of word associations created
by crowdsourcing via the Small World of Words (SWOW) project [7, see
https://smallworldofwords.org/project]. SWOW aims to measure meaning as
it would be encoded in a mental lexicon in several world languages. It
is based on a word association task where participants respond with the
first words that come to mind to a given cue word. Importantly, recent
work has shown that knowledge graphs derived from SWOW encode
common-sense knowledge that captures additional information not encoded
in much larger graphs such as WordNet and ConceptNet [7]. Moreover,
this extra knowledge is critical as it improves the prediction across a
wide range of tasks, such as similarity [8], sentiment [9], and moral
reasoning [10]. Unlike in other knowledge graphs, the edges are
unlabelled: participants do not explain the nature of the relation
between a cue and a response. For example, for the cue "dog" and the
response "animal", the taxonomic <is a> relation is not encoded in the
graph. However, recent
work on semantic role labelling [11] suggests that these latent edge
labels can be recovered. This leads to an exciting opportunity to study
how link prediction can be further improved when combined with semantic
role labelling.

In this internship, the intern will learn to manipulate graphs and use
link prediction models and grids for experiments. Part of the
experiments will be devoted to adapting SWOW data for the existing
pipeline. The main objectives are as follows but can be adapted
depending on the student's background and interests:

-   Adapt the link prediction pipeline of Choi et al. (2024) to the
    English SWOW word association graph to extract potential new links
    and increase graph coverage.
-   Propose a methodology for automatically typing relations in SWOW,
    where, unlike in RL-fr and JeuxDeMots, edge labels are not given
    but need to be inferred.
-   Benchmark the performance of the augmented graph on existing
    human data.
-   Compare differences in vocabulary for the same cue word in
    different languages.
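
To make the first objective concrete, here is a minimal sketch of the
kind of data involved. It assumes a toy list of (cue, response) pairs
(hypothetical, not actual SWOW data) and ranks unlinked cue pairs with
a simple shared-response count. This heuristic is only a stand-in for
illustration; it is not the link prediction model used by Choi et
al. [4].

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: one (cue, response) pair per participant answer.
responses = [
    ("dog", "animal"), ("dog", "cat"), ("dog", "bark"),
    ("cat", "animal"), ("cat", "dog"), ("cat", "purr"),
    ("wolf", "animal"), ("wolf", "bark"),
]


def build_graph(pairs):
    """Return cue -> Counter(response -> count): a weighted directed graph."""
    graph = defaultdict(Counter)
    for cue, response in pairs:
        graph[cue][response] += 1
    return graph


def candidate_links(graph):
    """Rank unlinked cue pairs by the number of responses they share.

    Assumption: a shared-response count is used here as a toy scoring
    function; a trained link prediction model would replace it.
    """
    scores = {}
    for a, b in combinations(sorted(graph), 2):
        if b in graph[a] or a in graph[b]:
            continue  # already linked in at least one direction
        shared = set(graph[a]) & set(graph[b])
        if shared:
            scores[(a, b)] = len(shared)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


graph = build_graph(responses)
ranked = candidate_links(graph)
for (a, b), score in ranked:
    print(f"{a} -- {b}: {score} shared response(s)")
```

Note that the candidate edges produced this way carry no relation
label, which is why typing relations is listed as a separate
objective.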

References

[1] Fellbaum, C. WordNet: An Electronic Lexical Database. The MIT
    Press, 1998
[2] Ji, Shaoxiong et al. A Survey on Knowledge Graphs: Representation,
    Acquisition, and Applications. IEEE Transactions on Neural
    Networks and Learning Systems 33 (2020): 494-514.
[3] Chen, Z.; Wang, Y.; Zhao, B.; Cheng, J.; Zhao, X. & Duan, Z.
    Knowledge Graph Completion: A Review. IEEE Access, vol. 8,
    pp. 192435-192456, 2020, doi: 10.1109/ACCESS.2020.3030076.
[4] Choi, H-S., Trivedi, P., Constant, M., Fort, K. & Guillaume, B.
    Beyond Model Performance: Can Link Prediction Enrich French Lexical
    Graphs? In Proceedings of the 2024 Joint International Conference
    on Computational Linguistics, Language Resources and Evaluation
    (LREC-COLING 2024), pages 2329-2341, Torino, Italia
[5] Lux-Pogodalla, V. & Polguère, A. Construction of a French Lexical
    Network: Methodological Issues. First International Workshop on
    Lexical Resources, WoLeR 2011, 2011, 54-61
[6] Lafourcade, M. & Joubert, A. JeuxDeMots : un prototype ludique pour
    l'émergence de relations entre termes. JADT'08 : Journées
    internationales d'Analyse statistiques des Données Textuelles,
    2008, 657-666
[7] De Deyne, S., Navarro, D. J., Perfors, A., Brysbaert, M., & Storms,
    G. The "Small World of Words" English word association norms for
    over 12,000 cue words. Behavior Research Methods, 51(3), 987-1006,
    DOI 10.3758/s13428-018-1115-7. 2019
[8] Liu, C., Cohn, T., & Frermann, L. (2021). Commonsense knowledge in
    word associations and ConceptNet. arXiv preprint arXiv:2109.09309.
[9] Vankrunkelsven, H., Verheyen, S., Storms, G., & De Deyne, S.
    (2018). Predicting lexical norms: A comparison between a word
    association model and text-based word co-occurrence models. Journal
    of Cognition, 1(1).
[10] Ramezani, A., & Xu, Y. (2024). Moral association graph: A
    cognitive model for moral inference. In Proceedings of the Annual
    Meeting of the Cognitive Science Society (Vol. 46).
[11] Liu, C., Cohn, T., De Deyne, S., & Frermann, L. (2022). WAX: A new
    dataset for word association explanations. In Proceedings of the
    2nd Conference of the Asia-Pacific Chapter of the Association for
    Computational Linguistics and the 12th International Joint
    Conference on Natural Language Processing (Volume 1: Long Papers)
    (pp. 106-120).