REPRESENTATION AND DECODING ALGORITHMS FOR TEXTUAL SPAN EXTRACTION

HOST: RCLN team, Laboratoire d'Informatique de Paris Nord (LIPN).

CONTACTS:
- Urchade Zaratiana (zaratiana@lipn.fr)
- Nadi Tomeh (tomeh@lipn.fr)
- Thierry Charnois (thierry.charnois@lipn.univ-paris13.fr)

QUALIFICATION:
- Master's degree in Computer Science or a related field
- Strong programming skills
- Good reading and writing skills in English
- Experience in training deep learning models (with PyTorch/JAX)

DURATION: 6 months

HOW TO APPLY? Send your CV and a transcript of your available Master's grades by email to zaratiana@lipn.fr and tomeh@lipn.fr.

CONTEXT

Span extraction, which consists in extracting subsequences of tokens (or words) from a text, is an important task in natural language processing, with applications in a wide range of domains such as named entity recognition, keyphrase extraction and extractive question answering. There are several approaches to span extraction, including:

- Span boundary prediction [1], [2], commonly used in extractive question answering to predict the positions of the start and end tokens of the answer span.
- Sequence labelling [3], [4], usually used in named entity recognition, where individual tokens are classified, e.g. according to the BIO scheme (Begin, Inside, Outside).
- Span enumeration and classification (span-based) [5]-[7], where the representation of every possible span (up to a maximum span width) is computed by a neural network and then fed to a softmax classifier for prediction.
- Prompt-based approaches [8], which cast span extraction as template filling by fine-tuning a pre-trained BART model.

Furthermore, most of these approaches require specialised decoding to ensure valid output. For instance, CRFs [9] are often used to constrain sequence-labelling models to valid BIO tag sequences, and semi-Markov CRFs [10] to guarantee non-overlapping spans in span-based approaches.

GOAL

A key focus of this internship is the development of an advanced decoding algorithm for extractive question answering that goes beyond traditional span boundary prediction (SPB). SPB is typically performed by feeding the input text T and the question Q into a pre-trained encoder such as BERT [2], producing a contextualized representation for each token in the sequence. To predict the start (or end) position of the answer span, a probability distribution is then induced over the entire sequence by computing the dot product of a learned vector with every token representation. In this setting, the learning objective is to maximize the sum of the log-likelihoods of the gold start and end positions (a minimal sketch is given at the end of this section).

While empirical studies have shown that this approach gives very competitive results, it has several shortcomings:

- It ignores the span structure of the answer, i.e. there is no explicit constraint specifying that the end position must not precede the start position.
- It is not straightforward to obtain the probability of an answer span, since the start and end positions are predicted independently; it is therefore difficult to obtain an accurate ranking of candidate answer spans, which can be useful in some settings [11].
- Previous studies have shown that this approach can suffer from position bias [12], [13], where the QA model relies on spurious positional cues to locate the answer in the input text.

The main goal of this internship is to propose a new architecture and decoding algorithm for extractive question answering that alleviate these issues.
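To make the SPB baseline described above concrete, here is a minimal sketch of the standard start/end scoring on top of a pre-trained encoder. It assumes PyTorch and the Hugging Face transformers library; the class and function names (SpanBoundaryPredictor, spb_loss) are purely illustrative and not taken from any existing codebase.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class SpanBoundaryPredictor(nn.Module):
    """Standard SPB head: one start score and one end score per token."""

    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        # Each head is a single learned vector; its dot product with every
        # token representation gives the boundary scores.
        self.start_head = nn.Linear(hidden, 1)
        self.end_head = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        # Contextualized representations of "[CLS] question [SEP] text [SEP]".
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        start_logits = self.start_head(h).squeeze(-1)  # (batch, seq_len)
        end_logits = self.end_head(h).squeeze(-1)      # (batch, seq_len)
        return start_logits, end_logits


def spb_loss(start_logits, end_logits, gold_start, gold_end):
    # Sum of the negative log-likelihoods of the gold start and end positions,
    # each normalized independently over the whole sequence.
    ce = nn.CrossEntropyLoss()
    return ce(start_logits, gold_start) + ce(end_logits, gold_end)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = SpanBoundaryPredictor()
enc = tokenizer("Where is the internship hosted?",
                "The internship is hosted by the RCLN team at LIPN.",
                return_tensors="pt")
start_logits, end_logits = model(enc["input_ids"], enc["attention_mask"])

At inference time, the usual heuristic is to pick the pair (i, j) maximizing start_logits[i] + end_logits[j] subject to j >= i, which already hints at the decoding issues listed above.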
One possible approach could use a learning objective that optimizes the score of the gold answer span against all possible spans in the input text [14], under some well-formedness constraint. However, the quadratic number of candidate spans can pose efficiency and scalability problems unless some form of approximation is used [15]. (An illustrative sketch of such a span-level objective is given after the reference list.)

REFERENCES

[1] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi, 'Bidirectional Attention Flow for Machine Comprehension'. 2018.
[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'. 2019.
[3] Z. Huang, W. Xu, and K. Yu, 'Bidirectional LSTM-CRF Models for Sequence Tagging'. 2015.
[4] G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer, 'Neural Architectures for Named Entity Recognition'. 2016.
[5] Y. Luan, D. Wadden, L. He, A. Shah, M. Ostendorf, and H. Hajishirzi, 'A General Framework for Information Extraction using Dynamic Span Graphs', in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, Jun. 2019, pp. 3036-3046. doi: 10.18653/v1/N19-1308.
[6] D. Wadden, U. Wennberg, Y. Luan, and H. Hajishirzi, 'Entity, Relation, and Event Extraction with Contextualized Span Representations'. 2019.
[7] Y. Li, L. Liu, and S. Shi, 'Empirical Analysis of Unlabeled Entity Problem in Named Entity Recognition'. 2021.
[8] L. Cui, Y. Wu, J. Liu, S. Yang, and Y. Zhang, 'Template-Based Named Entity Recognition Using BART'. 2021.
[9] J. D. Lafferty, A. McCallum, and F. C. N. Pereira, 'Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data', in Proceedings of the Eighteenth International Conference on Machine Learning, San Francisco, CA, USA, 2001, pp. 282-289.
[10] S. Sarawagi and W. W. Cohen, 'Semi-Markov Conditional Random Fields for Information Extraction', in Advances in Neural Information Processing Systems, 2005, vol. 17. [Online]. Available: https://proceedings.neurips.cc/paper/2004/file/eb06b9db06012a7a4179b8f3cb5384d3-Paper.pdf
[11] S. Levy, K. Mo, W. Xiong, and W. Y. Wang, 'Open-Domain Question-Answering for COVID-19 and Other Emergent Domains'. 2021.
[12] Y. Niu and H. Zhang, 'Introspective Distillation for Robust Question Answering'. 2021.
[13] M. Ko, J. Lee, H. Kim, G. Kim, and J. Kang, 'Look at the First Sentence: Position Bias in Question Answering'. 2021.
[14] K. Lee, S. Salant, T. Kwiatkowski, A. Parikh, D. Das, and J. Berant, 'Learning Recurrent Span Representations for Extractive Question Answering'. 2017.
[15] J. Raiman and J. Miller, 'Globally Normalized Reader'. 2017.
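As mentioned at the end of the GOAL section, here is an illustrative sketch of a globally normalized span-level objective in the spirit of [14]. It is a sketch under stated assumptions rather than a specification of the internship's approach: all names are ours, the additive span score is only one possible parameterization, and the single softmax over all valid spans is what makes both the well-formedness constraint and the O(L^2) cost explicit.

import torch
import torch.nn.functional as F


def span_log_likelihood(start_logits, end_logits, gold_start, gold_end,
                        attention_mask, max_width=30):
    """start_logits, end_logits: (batch, L); gold_start, gold_end: (batch,)."""
    batch, seq_len = start_logits.shape
    device = start_logits.device

    # Score every candidate span (i, j) as the sum of its boundary scores
    # (other parameterizations, e.g. bilinear, are possible).
    scores = start_logits.unsqueeze(2) + end_logits.unsqueeze(1)  # (batch, L, L)

    # Well-formedness: j >= i, bounded width, both endpoints on real tokens.
    i = torch.arange(seq_len, device=device).unsqueeze(1)  # (L, 1)
    j = torch.arange(seq_len, device=device).unsqueeze(0)  # (1, L)
    valid = (j >= i) & (j - i < max_width)                 # (L, L)
    mask = attention_mask.bool()
    valid = valid.unsqueeze(0) & mask.unsqueeze(2) & mask.unsqueeze(1)
    scores = scores.masked_fill(~valid, float("-inf"))

    # A single softmax over all valid spans yields proper span probabilities,
    # at the cost of O(L^2) candidates (hence the need for approximations [15]).
    log_probs = F.log_softmax(scores.reshape(batch, -1), dim=-1)
    gold_index = gold_start * seq_len + gold_end
    return log_probs.gather(1, gold_index.unsqueeze(1)).squeeze(1)  # (batch,)

Training would minimize the negative of this log-likelihood averaged over a batch; decoding would simply take the argmax over the masked score matrix, which yields a well-formed span and a proper span probability by construction.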