The objective of this 6-month internship with the Lingua Custodia Lab (Research and Development team) is to study the creation of high-quality synthetic data for fine-tuning task-specific LLMs. The tasks will be defined during the internship. You will build a pipeline to generate both instruction-tuning and RLHF/DPO synthetic data with guardrails and personas, then fine-tune LLMs on the synthetic datasets and perform extensive evaluation. Theoretical and practical knowledge of training language models with Hugging Face or similar libraries is essential to carry out the internship.

*Purpose*
- Internship supervised by the Lingua Custodia Lab (R&D team)
- 6-month internship in Paris (fully on-site)
- Full time, 35 hours weekly
- Starting date: early 2025

*Responsibilities*
1. Study the state of the art in synthetic data generation for Large Language Models (LLMs).
2. Compare different methods through systematic experiments.
3. Fine-tune models on the generated data (plus public sources).
4. Perform evaluations using the lm-evaluation-harness library.
5. Publish the results at a workshop or conference.

*Qualifications*
- Master's degree student in Computer Science, Machine Learning, or Natural Language Processing.
- Experience with LLMs (coursework or personal projects) (required).
- Experience with the Hugging Face libraries (required).
- Proficient in English and/or French.

*Benefits*
- Meal vouchers (tickets restaurant)
- 50% reimbursement of your Navigo pass
- An innovative and supportive environment at the forefront of AI developments!

To apply: please send your CV to raheel.qader@linguacustodia.com

Lingua Custodia is a Paris-based Fintech company and a leader in Natural Language Processing (NLP) for Finance. It was created in 2011 by finance professionals, initially to offer specialized machine translation.
Leveraging its state-of-the-art NLP expertise, the company now offers a growing range of applications beyond its initial machine translation offering: speech-to-text automation, document classification, linguistic data extraction from unstructured documents, mass web crawling and data collection, and more. It achieves superior quality thanks to highly domain-focused machine learning algorithms.