We are offering two internship positions in the MELODI team at IRIT,
Toulouse.
To apply, please send an email with a CV and a few lines explaining
your motivation to: chloe.braud@irit.fr and philippe.muller@irit.fr

Multilingual discourse relation prediction

In Natural Language Processing, "discourse structure" corresponds to
the semantic links between sentences or paragraphs, e.g. "explanation",
"contrast", "elaboration", that organize a document in a coherent
manner.
Predicting such links remains a difficult task, all the more so because
supervised data is scarce, or missing altogether, for most languages
other than English. The goal of this internship is to develop
models that can leverage data in multiple languages to improve results
on relation prediction.
This task is crucial for improving the performance of current discourse
parsers, and could also be used to develop larger datasets annotated
with relations for other tasks, such as machine reading and question
generation.

The first step of this internship will be dedicated to a review of the
state of the art on discourse relation prediction, including a review
of the existing datasets for English and for other languages. Many of
them have been pre-processed and made available through the DISRPT
shared task, organized in 2019 and 2021 (see
https://sites.google.com/georgetown.edu/disrpt2021/call-for-participation).
Through this shared task, we have access to datasets for 11 languages
in the same format, but each corpus presents specific features,
especially in terms of relation sets.

The second step will be to develop a system for the identification of
discourse relations in a multilingual setting. Several methods will be
compared: merging corpora at training time with a multilingual
pretrained language model, using different merging strategies;
multi-task learning with different architectures; and automated
translation of corpora.
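
As an illustration of what "merging strategies" could mean in practice,
here is a minimal sketch in plain Python. It is not the method that will
be used in the internship: the label mapping, the toy examples and the
function names (harmonize, merge_concat, merge_balanced) are all
hypothetical. It maps corpus-specific relation labels to a shared coarse
label set, then compares simple concatenation with a balanced merge that
upsamples the smaller corpora.

```python
import random
from collections import Counter

# Hypothetical mapping from corpus-specific labels to a shared coarse set.
# Real corpora use different and much larger relation inventories.
LABEL_MAP = {
    "cause": "causal",
    "explanation": "causal",
    "contrast": "contrast",
    "concession": "contrast",
    "elaboration": "elaboration",
}


def harmonize(corpus, label_map=LABEL_MAP, fallback="other"):
    """Rewrite each (unit1, unit2, label) triple with a shared label."""
    return [(u1, u2, label_map.get(label.lower(), fallback))
            for u1, u2, label in corpus]


def merge_concat(corpora):
    """Strategy 1: concatenate all corpora, tagging examples by language."""
    merged = []
    for lang, corpus in corpora.items():
        merged.extend((lang, *ex) for ex in harmonize(corpus))
    return merged


def merge_balanced(corpora, seed=0):
    """Strategy 2: upsample smaller corpora to the size of the largest."""
    rng = random.Random(seed)
    target = max(len(c) for c in corpora.values())
    merged = []
    for lang, corpus in corpora.items():
        examples = harmonize(corpus)
        if not examples:
            continue  # nothing to sample from
        extra = [rng.choice(examples) for _ in range(target - len(examples))]
        merged.extend((lang, *ex) for ex in examples + extra)
    return merged


# Toy data, invented for this sketch only.
corpora = {
    "eng": [
        ("It rained.", "The match was cancelled.", "Cause"),
        ("He is tall.", "She is short.", "Contrast"),
        ("We left.", "It was late.", "Explanation"),
    ],
    "fra": [("Il pleut.", "Je reste.", "cause")],
}

concat = merge_concat(corpora)
balanced = merge_balanced(corpora)
print(Counter(lang for lang, *_ in concat))
print(Counter(lang for lang, *_ in balanced))
```

The merged list could then be fed to any classifier built on a
multilingual pretrained language model; the choice between the two
strategies (or others, such as temperature-based sampling) is exactly
the kind of question the internship would explore.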

Supervision: Chloé Braud chloe.braud@irit.fr, Philippe Muller
philippe.muller@irit.fr

Location: IRIT, University of Toulouse, France, within the MELODI team

Duration: 5-6 months

Compensation: 546€/month

Requirements: Master 2 or equivalent in Computer Science or
Mathematics; good programming skills and knowledge of Machine Learning
principles and tools. Some knowledge of NLP would be a plus but is not
required.

References:

-   Amir Zeldes, Yang Janet Liu, Mikel Iruskieta, Philippe Muller,
    Chloé Braud, and Sonia Badene. 2021. The DISRPT 2021 Shared Task on
    Elementary Discourse Unit Segmentation, Connective Detection, and
    Relation Classification. In Proceedings of the 2nd Shared Task on
    Discourse Relation Parsing and Treebanking (DISRPT 2021)

-   Chloé Braud, Barbara Plank, and Anders Søgaard. 2016. Multi-view
    and multi-task training of RST discourse parsers. In Proceedings of
    COLING 2016

-   Mathieu Dehouck and Pascal Denis. 2019. Phylogenetic Multi-Lingual
    Dependency Parsing. In Proceedings of NAACL 2019, Minneapolis, USA

-   Zhengyuan Liu, Ke Shi, and Nancy Chen. 2020. Multilingual Neural
    RST Discourse Parsing. In Proceedings of the 28th International
    Conference on Computational Linguistics, pages 6730-6738