Job: Internship - Weak supervision for discourse parsing

Internship at IRIT, Toulouse (France) - ANR AnDiAMO

This internship will be co-supervised by Chloé Braud and Philippe Muller, and the intern will work within the MELODI team at IRIT. They will participate in group meetings and reading groups, and will collaborate with other members of the project.

- Contract duration: 5-6 months
- Starting date: March 2024 (flexible)
- Location: IRIT, Université P. Sabatier (Toulouse III)
- Application deadline: 21 January 2024, or until the position is filled
- Send applications by email to chloe.braud@irit.fr

Description of the project:

Natural Language Processing (NLP) is a subfield of Artificial Intelligence at the interface of Computer Science, Machine Learning, and Linguistics. Its ultimate goal is to build computational models of human languages. NLP is a science of data: current approaches based on machine learning rely on annotated corpora for training and evaluation, all the more so for the currently dominant neural architectures, which are notoriously data-hungry. However, for most languages and domains, and for specific high-level semantic and pragmatic tasks, annotations are unavailable or available only in small quantities. This leads to low performance and, more generally, to robustness issues, with systems unable to generalize to new situations.

In this internship, we propose to explore Weak Supervision approaches to develop hybrid systems that tackle low-resource NLP. Weak Supervision aims at automatically building large labeled datasets without the need for gold seed instances. Many weak strategies have been applied in NLP, such as distant supervision, crowdsourcing, and ensemble methods. All these approaches make it possible to leverage synthetic, noisy datasets and improve performance in low-resource settings, but a key challenge is understanding how to combine them to enhance both performance and coverage.
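As a toy illustration of how such weak labeling sources can be combined, the sketch below writes a few heuristic labeling functions for discourse relation classification and merges their votes by majority, in the spirit of data programming [Ratner et al. 2016]. All connective cues, relation labels, and function names here are invented for this example and do not reflect the actual internship codebase:

```python
# Toy weak supervision sketch: heuristic labeling functions for discourse
# relations, combined by majority vote over non-abstaining votes.
# The cue words and labels are illustrative assumptions only.
from collections import Counter

ABSTAIN = None  # a labeling function may decline to vote

def lf_because(text):
    # Heuristic: "because" often signals an Explanation relation.
    return "Explanation" if "because" in text.lower() else ABSTAIN

def lf_but(text):
    # Heuristic: "but" often signals a Contrast relation.
    return "Contrast" if " but " in text.lower() else ABSTAIN

def lf_however(text):
    # Heuristic: "however" often signals a Contrast relation.
    return "Contrast" if "however" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_because, lf_but, lf_however]

def weak_label(text):
    """Combine labeling functions by majority vote; abstain if no rule fires."""
    votes = [v for lf in LABELING_FUNCTIONS
             if (v := lf(text)) is not ABSTAIN]
    if not votes:
        return ABSTAIN  # leave the instance unlabeled
    return Counter(votes).most_common(1)[0][0]

print(weak_label("He stayed home because it rained."))   # Explanation
print(weak_label("It rained, but he went out anyway."))  # Contrast
```

In real settings the rules conflict and correlate, which is precisely what a PWS label model (rather than a plain majority vote) is designed to handle.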
To this end, we will explore the paradigm of Programmatic Weak Supervision (PWS) [Ratner et al. 2016, Zhang et al. 2021], which subsumes all weak supervision strategies while also handling conflicting and dependent rules and noisy labels. We will apply this paradigm to discourse parsing, e.g. [Wang et al. 2017, Nishida & Matsumoto 2022], a high-level task - crossing sentence boundaries - and a complex learning problem that typically requires large amounts of annotations. Discourse parsing consists in building structures in which spans of text are linked by semantic-pragmatic relations such as Explanation or Contrast. It is a crucial task for many applications, such as machine translation or question answering, but performance remains low for now. In this internship, we will focus on discourse relation classification, while also evaluating the impact of the proposed approach on full parsing.

Requirements:
- Master's degree in Computer Science / Natural Language Processing, or equivalent
- Good knowledge of Machine Learning
- Good programming skills, preferably in Python; knowledge of PyTorch is a plus

Application procedure: please send a CV, your grades for the last 2 years, and a short letter motivating your application, detailing the following elements:
- your **skills / experience in machine learning**
- your **interest and/or experience in natural language processing**

More about AnDiAMO: https://pagesperso.irit.fr/~Chloe.Braud/andiamo/

References:

Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., & Ré, C. (2016). Data programming: Creating large training sets, quickly. Advances in Neural Information Processing Systems, 29.

Wang, Y., Li, S., & Wang, H. (2017). A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 184-188).

Nishida, N., & Matsumoto, Y. (2022). Out-of-Domain Discourse Dependency Parsing via Bootstrapping: An Empirical Analysis on Its Effectiveness and Limitation. Transactions of the Association for Computational Linguistics, 10, 127-144.

Zhang, J., Yu, Y., Li, Y., Wang, Y., Yang, Y., Yang, M., & Ratner, A. (2021). WRENCH: A Comprehensive Benchmark for Weak Supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).