Job: Internship - Weak supervision for discourse parsing

Internship at IRIT, Toulouse (France) - ANR AnDiAMO

This internship will be co-supervised by Chloé Braud and Philippe Muller, and the intern will work within the MELODI team at IRIT. They will participate in group meetings and reading groups, and will collaborate with other members of the project.

- Contract duration: 5-6 months
- Starting date: March 2024 (flexible)
- Location: IRIT, Université P. Sabatier (Toulouse III)
- Application deadline: 21 January 2024, or until the position is filled
- Send applications by email to chloe.braud@irit.fr

Description of the project:

Natural Language Processing (NLP) is a subfield of Artificial Intelligence at the interface of Computer Science, Machine Learning, and Linguistics. Its ultimate goal is to build computational models of human languages. NLP is a science of data: current approaches based on machine learning rely on annotated corpora for training and evaluation, all the more so for the currently dominant neural architectures, which are notoriously data-hungry. However, for most languages and domains, and for specific high-level semantic and pragmatic tasks, annotations are unavailable or available only in small quantities. This leads to low performance and, more generally, to robustness issues, with systems unable to generalize to new situations.

In this internship, we propose to explore Weak Supervision approaches to develop hybrid systems that tackle low-resource NLP. Weak Supervision aims at automatically building large labeled datasets without the need for gold seed instances. Many weak strategies have been applied in NLP, such as distant supervision, crowdsourcing, and ensemble methods. All these approaches make it possible to leverage synthetic, noisy datasets and improve performance in low-resource settings, but a key challenge is understanding how to combine them to enhance both performance and coverage.
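As a toy illustration of how such weak labeling sources can be combined, the sketch below writes a few heuristic labeling functions for discourse relation classification and merges their votes by majority, in the spirit of data programming [Ratner et al. 2016]. All connective cues, relation labels, and function names here are invented for this example and do not reflect the actual internship codebase:

```python
# Toy weak supervision sketch: heuristic labeling functions for discourse
# relations, combined by majority vote over non-abstaining votes.
# The cue words and labels are illustrative assumptions only.
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

def lf_because(text):
    # Heuristic: "because" often signals an Explanation relation.
    return "Explanation" if "because" in text.lower() else ABSTAIN

def lf_but(text):
    # Heuristic: "but" often signals a Contrast relation.
    return "Contrast" if " but " in text.lower() else ABSTAIN

def lf_however(text):
    # Heuristic: "however" often signals a Contrast relation.
    return "Contrast" if "however" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_because, lf_but, lf_however]

def weak_label(text):
    """Combine labeling functions by majority vote; abstain if no rule fires."""
    votes = [v for lf in LABELING_FUNCTIONS
             if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN  # leave the instance unlabeled
    return Counter(votes).most_common(1)[0][0]

print(weak_label("He stayed home because it rained."))   # Explanation
print(weak_label("It rained, but he went out anyway."))  # Contrast
```

In real settings the rules conflict and correlate, which is precisely what a PWS label model (rather than a plain majority vote) is designed to handle.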
To this end, we will explore the paradigm of Programmatic Weak Supervision (PWS) [Ratner et al. 2016, Zhang et al. 2021], which subsumes all weak supervision strategies while also handling conflicting and dependent rules and noisy labels. We will apply this paradigm to discourse parsing, e.g. [Wang et al. 2017, Nishida & Matsumoto 2022], a high-level task - crossing sentence boundaries - and a complex learning problem that typically requires large amounts of annotations. Discourse parsing consists in building structures in which spans of text are linked by semantic-pragmatic relations such as Explanation or Contrast. It is a crucial task for many applications, such as machine translation or question answering, but performance remains low for now. In this internship, we will focus on discourse relation classification, while also evaluating the impact of the proposed approach on full parsing.

Requirements:
- Master's degree in Computer Science / Natural Language Processing, or equivalent
- Good knowledge of Machine Learning
- Good programming skills, preferably in Python; knowledge of PyTorch is a plus

Application procedure: please send a CV, your grades for the last 2 years, and a short letter motivating your application, detailing the following elements:
- your **skills / experience in machine learning**
- your **interest and/or experience in natural language processing**

More about AnDiAMO: https://pagesperso.irit.fr/~Chloe.Braud/andiamo/

References:

Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., & Ré, C. (2016). Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems, 29.

Wang, Y., Li, S., & Wang, H. (2017). A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 184-188).

Nishida, N., & Matsumoto, Y. (2022). Out-of-Domain Discourse Dependency Parsing via Bootstrapping: An Empirical Analysis on Its Effectiveness and Limitation. Transactions of the Association for Computational Linguistics, 10, 127-144.

Zhang, J., Yu, Y., Li, Y., Wang, Y., Yang, Y., Yang, M., & Ratner, A. (2021). WRENCH: A Comprehensive Benchmark for Weak Supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).