Context

The objective of the NanoBubbles ERC Synergy project (https://nanobubbles.hypotheses.org) is to understand how, when and why science fails to correct itself. The project focuses on claims made within the field of nanobiology. Project members combine approaches from the natural sciences, computer science, and the social sciences and humanities (Science and Technology Studies) to understand how error correction in science works and what obstacles it faces. For this purpose, we aim to trace claims and corrections through various channels of scientific communication (journals, social media, advertisements, conference programs, etc.) using both qualitative and digital methods.

Internship objectives

Entity recognition is an important step for downstream processing in natural language processing. It consists of identifying the entities in a corpus that belong to a specific domain and labeling them. Training methods relying on large annotated corpora are usually used for this purpose. However, such resources are not always available for specific domains, and alternative methods have to be employed (Hedderich 2020). Distant supervision (Mintz 2009) is a technique for automatically labeling textual data using an external resource such as dictionaries (Shang 2018), gazetteers, ontologies (Wang 2021) and knowledge bases (Sun 2019). This enables the construction of a training corpus without the need for manual annotation. In specialized domains, this is especially useful for annotating complex and discontinuous entities with which human annotators may struggle (Khandelwal 2022).

The objective of this internship is to implement a method to automatically annotate a corpus of scientific documents in the nanobiology domain, using existing resources. The intern will then employ existing deep learning approaches (Liang 2020) to train an entity extraction model for the nanobiology domain.
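
To give candidates a concrete idea of dictionary-based distant supervision, the sketch below tags tokens with BIO labels by greedy longest match against a small dictionary. It is only an illustration under stated assumptions: the dictionary entries, the label names (`NANOMATERIAL`, `BIO_STRUCTURE`) and the function name `distant_label` are hypothetical and do not describe the project's actual resources or method.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: lowercased token sequence -> entity label.
# Entries and labels are illustrative assumptions, not project resources.
TOY_DICTIONARY: Dict[Tuple[str, ...], str] = {
    ("gold", "nanoparticles"): "NANOMATERIAL",
    ("liposomes",): "NANOMATERIAL",
    ("protein", "corona"): "BIO_STRUCTURE",
}


def distant_label(tokens: List[str],
                  dictionary: Dict[Tuple[str, ...], str]) -> List[str]:
    """Assign BIO tags by greedy longest match against the dictionary."""
    max_len = max((len(key) for key in dictionary), default=0)
    lowered = [token.lower() for token in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = dictionary.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    sentence = "Gold nanoparticles acquire a protein corona in serum".split()
    for token, tag in zip(sentence, distant_label(sentence, TOY_DICTIONARY)):
        print(f"{token}\t{tag}")
```

Labels produced this way are noisy and incomplete (unlisted entities stay tagged `O`, and this simple matcher cannot handle discontinuous entities), which is why approaches such as those cited above add noise-robust training on top of the automatic annotation.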
Skills

- Enrollment in a Master's program in natural language processing, computer science or data science.
- Good programming skills in Python, including experience with natural language processing tools and methods, and knowledge of machine learning and deep learning frameworks and of the semantic web.
- Ability to communicate and write in English is a plus.

Scientific environment

The work will be conducted within the Sigma team of the LIG laboratory (http://sigma.imag.fr). The recruited person will join the team, which offers a stimulating, multinational and pleasant working environment.

Instructions for applying

Applications must contain a CV, a motivation letter or message, Master's grades, and letter(s) of recommendation (or names of potential referees), and be addressed to Cyril Labbé (cyril.labbe@imag.fr) and Amira Barhoumi (amira.barhoumi@univ-grenoble-alpes.fr). Applications will be considered on a rolling basis. It is therefore advisable to apply as soon as possible.

References

- Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009, August). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP (pp. 1003-1011).
- Shang, J., Liu, L., Ren, X., Gu, X., Ren, T., & Han, J. (2018). Learning named entity tagger using domain-specific dictionary. arXiv preprint arXiv:1809.03599.
- Sun, Y., & Loparo, K. (2019, July). Information extraction from free text in clinical trials with knowledge-based distant supervision. In 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC) (Vol. 1, pp. 954-955). IEEE.
- Wang, X., Hu, V., Song, X., Garg, S., Xiao, J., & Han, J. (2021, November). CHEMNER: Fine-Grained Chemistry Named Entity Recognition with Ontology-Guided Distant Supervision. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (pp. 5227-5240).
- Liang, C., Yu, Y., Jiang, H., Er, S., Wang, R., Zhao, T., & Zhang, C. (2020, August). BOND: BERT-assisted open-domain named entity recognition with distant supervision. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 1054-1064).
- Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., & Klakow, D. (2020). A survey on recent approaches for natural language processing in low-resource scenarios. arXiv preprint arXiv:2010.12309.
- Khandelwal, A., Kar, A., Chikka, V. R., & Karlapalem, K. (2022, May). Biomedical NER using Novel Schema and Distant Supervision. In Proceedings of the 21st Workshop on Biomedical Language Processing (pp. 155-160).