- Title: Extracting Semantic Information from Noisy Data
- Duration: 6-month internship, starting February-March 2020
- Location: IRIT, Toulouse (https://www.irit.fr)
- Research Team: MELODI (https://www.irit.fr/-Equipe-MELODI-)
- Supervisors: Nicholas Asher, Tim Van de Cruys
- Contact: nicholas.asher@irit.fr, tim.vandecruys@irit.fr

In cooperation with AIRBUS

Description

A prominent research subject within machine learning is adapting learning and training to noisy data. The subject has been actively studied in computer vision (Goldberger and Ben-Reuven, 2017; Vahdat, 2017; Veit et al., 2017); in our case we are interested in training on noisy linguistic data. While many groups study noisy linguistic data for general-purpose learning in open domains (Baldwin et al., 2015), the goal of this research internship is to extract semantic information from noisy linguistic data in closed or relatively closed domains. Use cases include notices to airmen (NOTAMs) and automatic terminal information service (ATIS) bulletins, maintenance logs or notes for aircraft and other industrial products, and notes from meetings. Such use cases are interesting because they allow us to apply a variety of learning systems and compare them.

Within relatively closed domains, we have had success with distant supervision models, in which we learn weights for a set of expert-coded rules based on an estimation of ground-truth labels for unannotated data; the rules themselves are derived from the study of a small but representative and meticulously annotated corpus (Badene et al., 2019). A minimal sketch of this idea is given below. We would like to compare such models with neural-network-based approaches, such as word embeddings that incorporate character-based representations (Joulin et al., 2017), as well as transformer networks (viz. BERT; Devlin et al., 2019), which we can adapt to the task with domain-specific pretraining; see the second sketch below.
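To make the distant supervision idea concrete, here is a minimal sketch in Python. The rules, the toy NOTAM-style snippets, the binary "closure" task, and the agreement-based weighting are illustrative assumptions for this posting, not the actual rule set or label model of Badene et al. (2019).

    import re

    # Expert-coded rules for a hypothetical binary task ("does this NOTAM
    # report a closure or outage?"); each returns +1, -1, or 0 (abstain).
    RULES = [
        lambda t: 1 if re.search(r"\b(CLSD|CLOSED)\b", t) else 0,
        lambda t: 1 if "U/S" in t else 0,               # unserviceable
        lambda t: -1 if re.search(r"\b(OPN|AVBL)\b", t) else 0,
    ]

    unlabeled = [
        "RWY 09/27 CLSD DUE WIP",
        "ILS RWY 14 U/S",
        "TWY C AVBL FOR ACFT UP TO CODE C",
    ]

    # Apply every rule to every unannotated example.
    votes = [[rule(t) for rule in RULES] for t in unlabeled]

    # Estimate ground-truth labels by majority vote, then weight each rule
    # by how often it agrees with that estimate when it does not abstain
    # (a crude stand-in for the generative label model of data programming).
    majority = [1 if sum(v) >= 0 else -1 for v in votes]
    weights = []
    for j in range(len(RULES)):
        fired = [(v[j], m) for v, m in zip(votes, majority) if v[j] != 0]
        agree = sum(1 for vote, m in fired if vote == m)
        weights.append(agree / len(fired) if fired else 0.5)

    # Label new text with the weighted vote of the rules.
    def label(text):
        score = sum(w * rule(text) for w, rule in zip(weights, RULES))
        return 1 if score > 0 else -1

    for t in unlabeled:
        print(t, "->", label(t))

The resulting weighted-vote labels would then serve as (noisy) training targets for a downstream classifier, which is the role the estimated ground truth plays in the distant supervision setup described above.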
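For the transformer route, the sketch below shows one way to adapt BERT with domain-specific pretraining, i.e. continuing masked-language-model training on raw in-domain text before fine-tuning. It assumes a recent version of the Hugging Face transformers library; the model name, masking rate, and toy texts are illustrative choices, not project requirements.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # Raw in-domain text (hypothetical NOTAM-style fragments).
    texts = ["RWY 09/27 CLSD DUE WIP", "ILS RWY 14 U/S"]
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

    # Masked-language-model objective: hide ~15% of the (non-padding) tokens
    # and train the model to recover them. For simplicity this may also mask
    # special tokens, which a real pretraining loop would avoid.
    labels = batch["input_ids"].clone()
    mask = (torch.rand(labels.shape) < 0.15) & (batch["attention_mask"] == 1)
    labels[~mask] = -100  # cross-entropy is computed only on masked positions
    batch["input_ids"][mask] = tokenizer.mask_token_id

    # One optimization step; real adaptation loops over many in-domain batches.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()

After this adaptation step, the same weights can be loaded into a sequence classification head (e.g. AutoModelForSequenceClassification) and fine-tuned on the supervised extraction task.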
This internship is suitable for a second-year master's student (M2) with knowledge of machine learning and natural language processing algorithms. Experience with Python libraries for neural network implementations (in particular PyTorch) is a plus.

References

Sonia Badene, Kate Thompson, Jean-Pierre Lorré, and Nicholas Asher (2019). Weak Supervision for Learning Discourse Structure. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2296-2305.

Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu (2015). Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition. In Proceedings of the Workshop on Noisy User-generated Text, pp. 126-135.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186.

Jacob Goldberger and Ehud Ben-Reuven (2017). Training Deep Neural-Networks Using a Noise Adaptation Layer. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov (2017). Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427-431.

Arash Vahdat (2017). Toward Robustness against Label Noise in Training Deep Discriminative Neural Networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, pp. 5596-5605.

Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge J. Belongie (2017). Learning from Noisy Large-Scale Datasets with Minimal Supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, pp. 6575-6583.