Offensive Content Mitigation Research - Internship

Category: Internship
Start date: As soon as possible
Duration: 6 months

Description

Modern LLMs acquire impressive language understanding from training on massive amounts of data, but they may still struggle to generate responses that align with user preferences and expectations for a given request. In deployed systems such as chatbots and conversational agents, a crucial task is to ensure that generated content is free of offensive expressions and patronizing language. While considerable effort has been invested in the alignment of LLMs [1,2,3,4,5], safety risks persist, especially for non-English content [6,7,8,9]. Moreover, many aligned models tend to overreact to certain "trigger patterns" (e.g. swear words or mentions of protected attributes) and may wrongly refuse to answer inoffensive questions, creating a tension between "helpfulness" and "safety". Models' over-reliance on such patterns also makes the detection of implicit hate speech more challenging [11,14].

The goal of this internship is to investigate strategies to mitigate offensive content generation, focusing on implicit offensive speech in multilingual settings. The internship is part of an ANR project called DIKÉ (https://www.anr-dike.fr/), which studies the bias, fairness and ethics of compressed NLP models. Results are expected to be reported in a paper by the end of the internship (or soon after). The internship will be hosted at NAVER LABS Europe and co-supervised by NAVER LABS and Lyon 2 University researchers.

Supervisors: Caroline Brun and Vassilina Nikoulina

Required skills

- PhD or final-year MSc student in an NLP-related domain
- Solid deep learning and NLP background
- Strong programming skills, with knowledge of PyTorch, NumPy and the HF Transformers library
- Familiarity with recent preference optimization techniques, such as DPO [4], is a plus (a minimal sketch is given below)
- Ability to communicate in English; knowledge of French is an advantage
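For candidates less familiar with DPO [4], the sketch below illustrates the core of the technique in PyTorch: given preference pairs where the chosen response is, say, a non-offensive continuation and the rejected one a toxic continuation, the policy is trained to widen its preference margin relative to a frozen reference model. This is a minimal illustration with toy log-probability values, not the internship codebase; in practice a full trainer (e.g. the DPOTrainer in HF TRL) would be used.

```python
# Minimal DPO loss sketch (Rafailov et al. [4]). Assumes sequence-level
# log-probabilities of the chosen/rejected responses have already been
# computed under the trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen (safe) response over the
    rejected (offensive) one, relative to the reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin scaled by beta, passed through -log(sigmoid(.)).
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of two preference pairs (values are illustrative only).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-11.0, -10.0]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8]),
    ref_rejected_logps=torch.tensor([-10.5, -9.9]),
)
print(loss.item())  # scalar loss to backpropagate through the policy model
```

Detoxification with DPO amounts to constructing such preference pairs from toxic vs. non-toxic continuations, an approach studied across languages in [6].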
References

[1] Go, D., et al. Compositional Preference Models for Aligning LMs. ICLR 2024.
[2] Ahmadian, A., et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. 2024.
[3] Ivison, H., et al. Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2. 2023.
[4] Rafailov, R., et al. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. 2023.
[5] Goodtriever: Adaptive Toxicity Mitigation with Retrieval-Augmented Models. Findings of EMNLP 2023. https://aclanthology.org/2023.findings-emnlp.339/
[6] Li, X., Yong, Z.-X., and Bach, S. H. Preference Tuning for Toxicity Mitigation Generalizes Across Languages. arXiv preprint arXiv:2406.16235, 2024.
[7] Ermis, B., et al. From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models. Findings of ACL 2024. https://aclanthology.org/2024.findings-acl.893.pdf
[8] PolygloToxicityPrompts. https://arxiv.org/html/2405.09373v1
[9] Brun, C. and Nikoulina, V. FrenchToxicityPrompts: A Large Benchmark for Evaluating and Mitigating Toxicity in French Texts. 2024.
[10] Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection.
[11] Ocampo, N. B., et al. An In-depth Analysis of Implicit and Subtle Hate Speech Messages. EACL 2023. https://aclanthology.org/2023.eacl-main.147.pdf
[12] Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. EMNLP 2021. https://aclanthology.org/2021.emnlp-main.29/
[14] Don't Go to Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection. https://arxiv.org/abs/2402.11406

Application instructions

Please note that applicants must be registered students at a university or other academic institution, and that this establishment will need to sign an 'Internship Convention' with NAVER LABS Europe before the student is accepted. You can apply for this position online. Don't forget to upload your CV and cover letter before you submit; incomplete applications will not be accepted.

About NAVER LABS

NAVER is the #1 Internet portal in Korea, with activities spanning a wide range of businesses including search, commerce, content, financial services and cloud platforms. NAVER LABS, co-located in Korea and France, is the organization dedicated to preparing NAVER's future. NAVER LABS Europe is located in a spectacular setting in Grenoble, in the heart of the French Alps. Scientists at NAVER LABS Europe are empowered to pursue long-term research problems that, if successful, can have significant impact and transform NAVER. We take our ideas as far as research can take them to create the best technology of its kind. Active participation in the academic community and collaborations with world-class public research groups are, among other things, important means of achieving these goals. Teamwork, focus and persistence are important values for us. NAVER LABS Europe is an equal opportunity employer.

Apply online: https://europe.naverlabs.com/job/offensive-content-mitigation-research-internship/