Call for Internship Applications in Natural Language Processing

Title : Study on the accuracy of citations in scientific papers
Starting date : February 2023
Application deadline : December 5th, 2022
Location : LIG laboratory, Grenoble Alps University, France
Keywords : Natural language processing, scientific literature, citation accuracy

Context :
The objective of the NanoBubbles ERC Synergy project (https://nanobubbles.hypotheses.org) is to understand how, when and why science fails to correct itself. The project focuses on claims made within the field of nanobiology. Project members combine approaches from the natural sciences, computer science, and the social sciences and humanities (Science and Technology Studies) to understand how error correction in science works and what obstacles it faces. For this purpose, we aim to trace claims and corrections through various channels of scientific communication (journals, social media, advertisements, conference programs, etc.) via both qualitative and digital methods.

Internship objectives :
In scientific papers, citations acknowledge sources and help the reader find more information about the citation context. Citations are also an important indicator used to identify significant publications in a specific scientific field (Aragon 2013). They serve different purposes, e.g. referring to the state of the art or to a specific method or result. They also reflect how authors frame their work, and this diversity affects future academic adoption (Jurgens 2018). Recently, there has been a considerable amount of research in Natural Language Processing on citation analysis in scientific literature. Studies of citation behavior aim to understand how researchers cite a paper in their work. Existing work on citation analysis addresses determining citation sentiment (Liu 2017, Athar 2011), identifying citation function (Yu 2020, Pride 2019, Bakhti 2018) and identifying critical citation contexts (Te 2022).
Nevertheless, studies that evaluate the accuracy of citations are scarce. Studies on the accuracy of citations in various scientific disciplines report an error rate of 25%-54% (Jergas 2015, Siebers 2000, Kristof 1997, Key 1977). These errors alter the original content and meaning of the cited paper, and they range from minor to major. Several studies describe various issues that may arise when citing original research done by others. For example, consider the following sentence: "it has been shown that bubblegum is much more pink than flamingo while running very fast [Einstein A., 1916]".
- "[Einstein A., 1916]" is the citation.
- The citation refers to the following scientific paper: "Einstein, A. (1916 (translation 1920)), Relativity: The Special and General Theory".
- Einstein's paper is the cited paper.
- The cited paper is unrelated to the meaning of the sentence, i.e. there is no relation between colors and the notion of relativity.

The aim of this internship is to assess the content of both cited and citing papers in scientific literature, i.e. to study the correlation between the citation and its context in the citing paper in order to identify miscitations. The intern's tasks will be to (1) test and compare unsupervised NLP methods and pre-trained embedding models (SciBERT, BioBERT, etc.) in order to measure the accuracy of citations using available datasets, and (2) provide project members with a set of reliable tools.

Skills :
- Enrollment in a Master's program in Natural Language Processing, computer science or data science.
- Good programming skills in Python, experience with natural language processing tools and frameworks, knowledge of machine learning methods and deep learning techniques.
- Ability to communicate and write in English is a plus.

Scientific environment :
The work will be conducted within the Sigma team of the LIG laboratory (http://sigma.imag.fr).
The recruited person will be welcomed into the team, which offers a stimulating, multinational and pleasant working environment.

Instructions for applying :
Applications must contain a CV, a letter or message of motivation, Master's grades, and letter(s) of recommendation (or names of potential referees), and be addressed to Cyril Labbé (cyril.labbe@imag.fr) and Amira Barhoumi (amira.barhoumi@univ-grenoble-alpes.fr). Applications will be considered on a rolling basis. It is therefore advisable to apply as soon as possible.

References :
(Aragon 2013) Aragon M. A measure for the impact of research. Scientific Reports. 2013;3(1):1-5.
(Jurgens 2018) Jurgens D, Kumar S, Hoover R, McFarland D, Jurafsky D. Measuring the Evolution of a Scientific Field through Citation Frames. Transactions of the Association for Computational Linguistics. 2018;6:391-406.
(Jergas 2015) Jergas H, Baethge C. Quotation accuracy in medical journal articles-a systematic review and meta-analysis. PeerJ. 2015;3:e1364.
(Kristof 1997) Kristof C. Accuracy of reference citations in five entomology journals. Am Entomol. 1997;43(4):246-251.
(Key 1977) Key JD, Roland CG. Reference accuracy in articles accepted for publication in the Archives of Physical Medicine and Rehabilitation. Arch Phys Med Rehabil. 1977;58(3):136-137.
(Siebers 2000) Siebers R, Holt S. Accuracy of references in five leading medical journals. Lancet. 2000;356(9239):1445.
(Te 2022) Te S, Barhoumi A, Lentschat M, Bordignon F, Labbé C, Portet F. Citation Context Classification: Critical vs Non-critical. In Proceedings of the Third Workshop on Scholarly Document Processing. 2022:49-53.
(Liu 2017) Liu H. Sentiment analysis of citations using word2vec. CoRR. 2017;abs/1704.00177.
(Athar 2011) Athar A. Sentiment analysis of citations using sentence structure-based features. In Proceedings of the ACL 2011 Student Session. 2011:81-87.
(Bakhti 2018) Bakhti K, Niu Z, Yousif A, Nyamawe A. Citation Function Classification Based on Ontologies and Convolutional Neural Networks. 2018:105-115.
(Pride 2019) Pride D, Knoth P, Harag J. ACT: An annotation platform for citation typing at scale. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL). 2019:329-330.
(Yu 2020) Yu W, Yu M, Zhao T, Jiang M. Identifying referential intention with heterogeneous contexts. 2020:962-972.