REPRESENTATION AND DECODING ALGORITHMS FOR TEXTUAL SPAN EXTRACTION

HOST: RCLN team, Laboratoire d'Informatique de Paris Nord (LIPN).

CONTACTS:
- Urchade Zaratiana (zaratiana@lipn.fr)
- Nadi Tomeh (tomeh@lipn.fr)
- Thierry Charnois (thierry.charnois@lipn.univ-paris13.fr)

QUALIFICATION:
- Master's degree in Computer Science or a related field
- Strong programming skills
- Good reading and writing skills in English
- Experience in training deep learning models (with PyTorch/JAX)

DURATION: 6 months

HOW TO APPLY? Send your CV and a transcript of your available Master's grades by email to zaratiana@lipn.fr and tomeh@lipn.fr.

CONTEXT

Span extraction, which consists in extracting subsequences of tokens (or words) from a text, is an important task in natural language processing, with applications in a wide range of domains such as named entity recognition, keyphrase extraction and extractive question answering. There are several approaches to span extraction, including:

- Span boundary prediction [1], [2], commonly used in extractive question answering to predict the positions of the start and end tokens of the answer span.
- Sequence labelling [3], [4], usually used in named entity recognition, where individual tokens are classified, e.g. according to the BIO scheme (Begin, Inside, Outside).
- Span enumeration and classification (span-based) [5]-[7], where the representation of every possible span (up to a maximum span width) is computed by a neural network and then fed to a softmax classifier for prediction.
- Prompt-based approaches [8], which cast span extraction as template filling by fine-tuning a pre-trained BART model.

Furthermore, most of these approaches require specialised decoding to ensure valid output. For instance, CRFs [9] are often used to constrain sequence-labelling models to valid BIO tag sequences, and semi-Markov CRFs [10] to guarantee non-overlapping spans in span-based approaches.

GOAL

A key focus of this internship is the development of an advanced decoding algorithm for extractive question answering that goes beyond traditional span boundary prediction (SPB). SPB is typically performed by feeding the input text T and the question Q into a pre-trained encoder such as BERT [2], producing a contextualized representation for each token in the sequence. To predict the start (or end) position of the answer span, a probability distribution is then induced over the entire sequence by computing the dot product of a learned vector with every token representation. In this setting, the learning objective is to maximize the sum of the log-likelihoods of the gold start and end positions (a minimal sketch is given at the end of this section).

While empirical studies have shown that this approach gives very competitive results, it has several shortcomings:

- It ignores the span structure of the answer, i.e. there is no explicit constraint specifying that the end position must not precede the start position.
- It is not straightforward to obtain the probability of an answer span, since the start and end positions are predicted independently; it is therefore difficult to obtain an accurate ranking of candidate answer spans, which can be useful in some settings [11].
- Previous studies have shown that this approach can suffer from position bias [12], [13], where the QA model relies on spurious positional cues to locate the answer in the input text.

The main goal of this internship is to propose a new architecture and decoding algorithm for extractive question answering that alleviate these issues.
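To make the SPB baseline described above concrete, here is a minimal sketch of the standard start/end scoring on top of a pre-trained encoder. It assumes PyTorch and the Hugging Face transformers library; the class and function names (SpanBoundaryPredictor, spb_loss) are purely illustrative and not taken from any existing codebase.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class SpanBoundaryPredictor(nn.Module):
    """Standard SPB head: one start score and one end score per token."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Each head is a single learned vector; its dot product with every
        # token representation gives the boundary scores.
        self.start_head = nn.Linear(hidden, 1)
        self.end_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        # Contextualized representations of "[CLS] question [SEP] text [SEP]".
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        start_logits = self.start_head(h).squeeze(-1)  # (batch, seq_len)
        end_logits = self.end_head(h).squeeze(-1)      # (batch, seq_len)
        return start_logits, end_logits


def spb_loss(start_logits, end_logits, gold_start, gold_end):
    # Sum of the negative log-likelihoods of the gold start and end positions,
    # each normalized independently over the whole sequence.
    ce = nn.CrossEntropyLoss()
    return ce(start_logits, gold_start) + ce(end_logits, gold_end)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SpanBoundaryPredictor()
enc = tokenizer("Where is the internship hosted?",
                "The internship is hosted by the RCLN team at LIPN.",
                return_tensors="pt")
start_logits, end_logits = model(enc["input_ids"], enc["attention_mask"])

At inference time, the usual heuristic is to pick the pair (i, j) maximizing start_logits[i] + end_logits[j] subject to j >= i, which already hints at the decoding issues listed above.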
One possible approach could use a learning objective that optimizes the score of the gold answer span against all possible spans in the input text [14], under some well-formedness constraint. However, the quadratic number of candidate spans can pose efficiency and scalability problems unless some form of approximation is used [15]. (An illustrative sketch of such a span-level objective is given after the reference list.)

REFERENCES

[1] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, 'Bidirectional Attention Flow for Machine Comprehension'. 2018.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'. 2019.
[3] Z. Huang, W. Xu, and K. Yu, 'Bidirectional LSTM-CRF Models for Sequence Tagging'. 2015.
[4] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, 'Neural Architectures for Named Entity Recognition'. 2016.
[5] Y. Luan, D. Wadden, L. He, A. Shah, M. Ostendorf, and H. Hajishirzi, 'A General Framework for Information Extraction using Dynamic Span Graphs', in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 3036-3046. doi: 10.18653/v1/N19-1308.
[6] D. Wadden, U. Wennberg, Y. Luan, and H. Hajishirzi, 'Entity, Relation, and Event Extraction with Contextualized Span Representations'. 2019.
[7] Y. Li, L. Liu, and S. Shi, 'Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition'. 2021.
[8] L. Cui, Y. Wu, J. Liu, S. Yang, and Y. Zhang, 'Template-Based Named Entity Recognition Using BART'. 2021.
[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, 'Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data', in Proceedings of the Eighteenth International Conference on Machine Learning, San Francisco, CA, USA, 2001, pp. 282-289.
[10] S. Sarawagi and W. W. Cohen, 'Semi-Markov Conditional Random Fields for Information Extraction', in Advances in Neural Information Processing Systems, 2005, vol. 17. [Online]. Available: https://proceedings.neurips.cc/paper/2004/file/eb06b9db06012a7a4179b8f3cb5384d3-Paper.pdf
[11] S. Levy, K. Mo, W. Xiong, and W. Y. Wang, 'Open-Domain Question-Answering for COVID-19 and Other Emergent Domains'. 2021.
[12] Y. Niu and H. Zhang, 'Introspective Distillation for Robust Question Answering'. 2021.
[13] M. Ko, J. Lee, H. Kim, G. Kim, and J. Kang, 'Look at the First Sentence: Position Bias in Question Answering'. 2021.
[14] K. Lee, S. Salant, T. Kwiatkowski, A. Parikh, D. Das, and J. Berant, 'Learning Recurrent Span Representations for Extractive Question Answering'. 2017.
[15] J. Raiman and J. Miller, 'Globally Normalized Reader'. 2017.
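As mentioned at the end of the GOAL section, here is an illustrative sketch of a globally normalized span-level objective in the spirit of [14]. It is a sketch under stated assumptions rather than a specification of the internship's approach: all names are ours, the additive span score is only one possible parameterization, and the single softmax over all valid spans is what makes both the well-formedness constraint and the O(L^2) cost explicit.

import torch
import torch.nn.functional as F


def span_log_likelihood(start_logits, end_logits, gold_start, gold_end,
                        attention_mask, max_width=30):
    """start_logits, end_logits: (batch, L); gold_start, gold_end: (batch,)."""
    batch, seq_len = start_logits.shape
    device = start_logits.device

    # Score every candidate span (i, j) as the sum of its boundary scores
    # (other parameterizations, e.g. bilinear, are possible).
    scores = start_logits.unsqueeze(2) + end_logits.unsqueeze(1)  # (batch, L, L)

    # Well-formedness: j >= i, bounded width, both endpoints on real tokens.
    i = torch.arange(seq_len, device=device).unsqueeze(1)  # (L, 1)
    j = torch.arange(seq_len, device=device).unsqueeze(0)  # (1, L)
    valid = (j >= i) & (j - i < max_width)                 # (L, L)
    mask = attention_mask.bool()
    valid = valid.unsqueeze(0) & mask.unsqueeze(2) & mask.unsqueeze(1)
    scores = scores.masked_fill(~valid, float("-inf"))

    # A single softmax over all valid spans yields proper span probabilities,
    # at the cost of O(L^2) candidates (hence the need for approximations [15]).
    log_probs = F.log_softmax(scores.reshape(batch, -1), dim=-1)
    gold_index = gold_start * seq_len + gold_end
    return log_probs.gather(1, gold_index.unsqueeze(1)).squeeze(1)  # (batch,)

Training would minimize the negative of this log-likelihood averaged over a batch; decoding would simply take the argmax over the masked score matrix, which yields a well-formed span and a proper span probability by construction.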