The objective of this 6-month internship with the Lingua Custodia Lab (Research and Development team) is to study the creation of high-quality synthetic data for fine-tuning task-specific LLMs. The tasks will be defined during the internship. You will build a pipeline to generate both instruction-tuning and RLHF/DPO synthetic data with guardrails and personas, then fine-tune LLMs on the synthetic datasets and perform extensive evaluation. Theoretical and practical knowledge of training language models with Hugging Face or similar libraries is essential to carry out the internship.

*Purpose*
- Internship supervised by the Lingua Custodia Lab (R&D team)
- 6-month internship in Paris (fully on-site)
- Full time, 35 hours weekly
- Starting date: early 2025

*Responsibilities*
1. Study the state of the art in synthetic data generation for Large Language Models (LLMs).
2. Compare different methods through systematic experiments.
3. Fine-tune models on the generated data (plus public sources).
4. Perform evaluations using the lm-evaluation-harness library.
5. Publish the results at a workshop or conference.

*Qualifications*
- Master's degree student in Computer Science, Machine Learning, or Natural Language Processing.
- Experience with LLMs (coursework or personal projects) (required).
- Experience with the Hugging Face libraries (required).
- Proficient in English and/or French.

*Benefits*
- Meal vouchers (tickets restaurant)
- 50% reimbursement of your Navigo pass
- An innovative and supportive environment at the forefront of AI developments!

To apply: please send your CV to raheel.qader@linguacustodia.com

Lingua Custodia is a Paris-based Fintech company and a leader in Natural Language Processing (NLP) for Finance. It was created in 2011 by finance professionals, initially to offer specialized machine translation.
Leveraging its state-of-the-art NLP expertise, the company now offers a growing range of applications beyond its initial machine translation offering: speech-to-text automation, document classification, linguistic data extraction from unstructured documents, mass web crawling and data collection, and more. It achieves superior quality thanks to highly domain-focused machine learning algorithms.