Internship (M1/M2): Large language models for ecological information extraction

Duration: 6 months
Location: Laboratoire d'Écologie Alpine, Grenoble, France
Supervision: Nicolas Le Guillarme (research engineer), Wilfried Thuiller (senior researcher)
Contact: nicolas.leguillarme@univ-grenoble-alpes.fr

Aim of the project

Understanding community structure and dynamics is a key element of modern ecology, especially in the light of global change. Whether they are species-based or trait-based, approaches that aim to improve our understanding of how communities assemble and how they affect ecosystem functioning require extensive information on the organisms that make them up. This includes information about organism traits and roles, which represent their physiological, morphological, or life-history characteristics, and about the interactions they have with other organisms and the environment.

Several open-access databases centralize some of the available knowledge on organism traits and interactions. However, most of the information remains dispersed in unstructured form throughout the scientific and grey literature, making it challenging to use in large-scale, multi-taxa biodiversity studies. The goal of this project is to develop NLP tools that automatically extract information on organism traits and interactions from textual documents, in order to complement existing databases.

Large Language Models (LLMs), such as GPT, have demonstrated a remarkable ability to retrieve and analyze natural language, including its context and nuances in meaning, and to transform it, potentially delivering it in structured forms. These capabilities make LLMs a promising general-purpose tool for extracting and structuring information from textual data in ecology (Castro et al., 2024).
However, it is also apparent that their performance in these tasks depends on the models used, the type of data, and the desired structure of the extracted information. The aim of this internship is to evaluate the capacity of these models for information extraction in ecological research. More specifically, we aim to determine whether it is possible to build robust ecological information extraction models by directly prompting LLMs.

Activities

The work will consist of the following tasks:
- Create a gold-standard corpus for organism trait and/or interaction extraction.
- Design and implement a few-shot, prompt-based method for taxonomic entity recognition and organism trait/interaction extraction.
- Compare the performance of the proposed method with that of TaxoNERD (Le Guillarme and Thuiller, 2022) for taxonomic entity recognition.
- Compare the performance of different LLMs (e.g. GPT, Llama 2, Mistral 7B) on the two downstream tasks.

Profile sought

A student in the final years of an engineering degree or research master's program (M1-M2), or in a gap year, specializing in applied mathematics or artificial intelligence. The candidate must have a sound knowledge of machine learning (deep learning, reinforcement learning) and good Python programming skills. Previous experience in text mining, information extraction, NLP or LLMs would be appreciated. The candidate should also be able to work autonomously and have good communication skills.

References

Castro, A., Pinto, J., Reino, L., Pipek, P., & Capinha, C. (2024). Large language models overcome the challenges of unstructured text data in ecology. bioRxiv, 2024-01.
Le Guillarme, N., & Thuiller, W. (2022). TaxoNERD: deep neural models for the recognition of taxonomic entities in the ecological and evolutionary literature. Methods in Ecology and Evolution, 13(3), 625-641.

The internship, lasting between 4 and 6 months, will take place at the Laboratoire d'Écologie Alpine in Grenoble (https://leca.osug.fr/-Equipes-themes-transversaux-), within the BIOM (Biodiversity Monitoring) team. Applications are open until 10 May 2024. Interested candidates are invited to contact me directly: nicolas.leguillarme@univ-grenoble-alpes.fr