Offensive Content Mitigation Research - Internship

Category: Internship
Start date: As soon as possible
Duration: 6 months

Description

Modern LLMs acquire impressive language understanding from training on massive amounts of data, but they may still struggle to generate responses that align with user preferences and expectations for a given request. In deployed systems such as chatbots and conversational agents, a crucial task is to ensure that generated content is free of offensive expressions and patronizing language. While considerable effort has been invested in the alignment of LLMs [1,2,3,4,5], safety risks persist, especially for non-English content [6,7,8,9]. Moreover, many aligned models tend to overreact to certain "trigger patterns" (e.g. swear words or mentions of protected attributes) and may wrongly refuse to answer inoffensive questions, creating a tension between "helpfulness" and "safety". Models' over-reliance on such patterns also makes the detection of implicit hate speech more challenging [11,14].

The goal of this internship is to investigate strategies to mitigate offensive content generation, focusing on implicit offensive speech in multilingual settings. The internship is part of an ANR project called DIKÉ (https://www.anr-dike.fr/), which studies the bias, fairness and ethics of compressed NLP models. Results are expected to be reported in a paper by the end of the internship (or soon after). The internship will be hosted at NAVER LABS Europe and co-supervised by NAVER LABS and Lyon 2 University researchers.

Supervisors: Caroline Brun and Vassilina Nikoulina

Required skills

- PhD or final-year MSc student in an NLP-related domain
- Solid deep learning and NLP background
- Strong programming skills, with knowledge of PyTorch, NumPy and the HF Transformers library
- Familiarity with recent preference optimization techniques, such as DPO [4], is a plus (a minimal sketch is given below)
- Ability to communicate in English; knowledge of French is an advantage
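For candidates less familiar with DPO [4], the sketch below illustrates the core of the technique in PyTorch: given preference pairs where the chosen response is, say, a non-offensive continuation and the rejected one a toxic continuation, the policy is trained to widen its preference margin relative to a frozen reference model. This is a minimal illustration with toy log-probability values, not the internship codebase; in practice a full trainer (e.g. the DPOTrainer in HF TRL) would be used.

```python
# Minimal DPO loss sketch (Rafailov et al. [4]). Assumes sequence-level
# log-probabilities of the chosen/rejected responses have already been
# computed under the trainable policy and a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer the chosen (safe) response over the
    rejected (offensive) one, relative to the reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin scaled by beta, passed through -log(sigmoid(.)).
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy batch of two preference pairs (values are illustrative only).
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.0, -9.5]),
    policy_rejected_logps=torch.tensor([-11.0, -10.0]),
    ref_chosen_logps=torch.tensor([-12.5, -9.8]),
    ref_rejected_logps=torch.tensor([-10.5, -9.9]),
)
print(loss.item())  # scalar loss to backpropagate through the policy model
```

Detoxification with DPO amounts to constructing such preference pairs from toxic vs. non-toxic continuations, an approach studied across languages in [6].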
References

[1] Go, D., et al. Compositional Preference Models for Aligning LMs. ICLR 2024.
[2] Ahmadian, A., et al. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. 2024.
[3] Ivison, H., et al. Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2. 2023.
[4] Rafailov, R., et al. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. 2023.
[5] Goodtriever: Adaptive Toxicity Mitigation with Retrieval-Augmented Models. Findings of EMNLP 2023. https://aclanthology.org/2023.findings-emnlp.339/
[6] Li, X., Yong, Z.-X., and Bach, S. H. Preference Tuning for Toxicity Mitigation Generalizes Across Languages. arXiv preprint arXiv:2406.16235, 2024.
[7] Ermis, B., et al. From One to Many: Expanding the Scope of Toxicity Mitigation in Language Models. Findings of ACL 2024. https://aclanthology.org/2024.findings-acl.893.pdf
[8] PolygloToxicityPrompts. https://arxiv.org/html/2405.09373v1
[9] Brun, C. and Nikoulina, V. FrenchToxicityPrompts: A Large Benchmark for Evaluating and Mitigating Toxicity in French Texts. 2024.
[10] Playing the Part of the Sharp Bully: Generating Adversarial Examples for Implicit Hate Speech Detection.
[11] Ocampo, N. B., et al. An In-depth Analysis of Implicit and Subtle Hate Speech Messages. EACL 2023. https://aclanthology.org/2023.eacl-main.147.pdf
[12] Latent Hatred: A Benchmark for Understanding Implicit Hate Speech. EMNLP 2021. https://aclanthology.org/2021.emnlp-main.29/
[14] Don't Go to Extremes: Revealing the Excessive Sensitivity and Calibration Limitations of LLMs in Implicit Hate Speech Detection. https://arxiv.org/abs/2402.11406

Application instructions

Please note that applicants must be registered students at a university or other academic institution, and that this establishment will need to sign an 'Internship Convention' with NAVER LABS Europe before the student is accepted. You can apply for this position online. Don't forget to upload your CV and cover letter before you submit; incomplete applications will not be accepted.

About NAVER LABS

NAVER is the #1 Internet portal in Korea, with activities spanning a wide range of businesses including search, commerce, content, financial services and cloud platforms. NAVER LABS, co-located in Korea and France, is the organization dedicated to preparing NAVER's future. NAVER LABS Europe is located in a spectacular setting in Grenoble, in the heart of the French Alps. Scientists at NAVER LABS Europe are empowered to pursue long-term research problems that, if successful, can have significant impact and transform NAVER. We take our ideas as far as research can take them to create the best technology of its kind. Active participation in the academic community and collaborations with world-class public research groups are, among other things, important means of achieving these goals. Teamwork, focus and persistence are important values for us. NAVER LABS Europe is an equal opportunity employer.

Apply online: https://europe.naverlabs.com/job/offensive-content-mitigation-research-internship/