INTERNSHIP OFFER
Natural Language Processing / Information Extraction / Digital Humanities
(4 months, Bac+5: M2 or Engineering degree)

Internship Topic
Named Entity Recognition and Entity Linking in Early-Modern Prints and Manuscripts

Keywords: handwritten text recognition (HTR), optical character recognition (OCR), named entity recognition (NER), entity linking (EL)
Duration: 4 months (start: March or April 2026)
Application deadline: March 12, 2026
Compensation: legal French internship allowance (approximately €650/month)
Host Laboratory: BdTln Team, LIFAT (EA 6300), Université de Tours, 64 avenue Jean Portalis, 37200 Tours, France

Context
Early-modern printed books and manuscripts present significant challenges for automatic text processing: non-standardized spelling, language variation, complex document layouts (paratexts, marginalia), and transcription noise introduced by OCR and HTR systems. These factors greatly reduce the performance of downstream natural language processing (NLP) tasks, especially named entity recognition (NER) and entity linking (EL). Overcoming these issues requires specialized methods, including normalization strategies for historical spelling, domain-specific annotation guidelines, and training or fine-tuning models on historical corpora.

In collaboration with the ERC project PRIMA - Manuscripts in the Age of Print (Grant Agreement No. 101142242), hosted by the Centre d'Études Supérieures de la Renaissance (CESR, UMR 7323, Université de Tours), this internship focuses on developing and evaluating NER and EL techniques applied to early-modern prints, manuscripts, and bibliographic catalogs.

Tasks
The intern will be responsible for:
- Annotating named entities in early-modern texts to create a reference (gold-standard) corpus
- Developing, applying, and assessing NER and EL approaches on OCR/HTR outputs from historical documents
- Analyzing results in relation to transcription noise and document features, and measuring their impact on entity extraction and linking performance

Internship Framework
This internship is funded by the Région Centre-Val de Loire (France) and involves collaboration between the LIFAT and CESR laboratories at Université de Tours, as well as partners specializing in manuscript studies and NLP. Regular interactions with the PRIMA research team are planned. The intern will be hosted at the LIFAT laboratory, Université de Tours (Tours). The intern is expected to travel between Tours and Blois to take part in team meetings; travel costs are reimbursed. The internship may continue as a PhD project, depending on results and funding.

Desired Profile / Required Skills
- Final-year Master's student or engineering student in Computer Science, Computational Linguistics, or related fields
- Strong programming skills in Python and experience with NLP libraries and model fine-tuning (e.g., NLTK, spaCy, Hugging Face)
- Familiarity with large language models (LLMs), including practical experience with APIs
- Interest in early-modern texts and digital humanities (prior experience is a plus but not required)
- Ability to document methods accurately and conduct rigorous experimental assessments

Application
Applications should include a CV and a cover letter, and must be sent by March 12, 2026, to both of the following addresses:
- Carlos-Emiliano González-Gallardo, LIFAT/CESR, Université de Tours: gonzalezgallardo@univ-tours.fr
- Cyril de Runz, LIFAT/IUT Blois, Université de Tours: cyril.derunz@univ-tours.fr

References
- https://hal.science/hal-05248289v1
- https://univ-tours.hal.science/hal-04662000/
- https://ceur-ws.org/Vol-3180/paper-84.pdf