- Title: Extracting Semantic Information from Noisy Data
- Duration: 6-month internship, starting February-March 2020
- Location: IRIT, Toulouse (https://www.irit.fr)
- Research Team: MELODI (https://www.irit.fr/-Equipe-MELODI-)
- Supervisors: Nicholas Asher, Tim Van de Cruys
- Contact: nicholas.asher@irit.fr, tim.vandecruys@irit.fr

In cooperation with AIRBUS

Description

A prominent research subject within machine learning is adapting learning and training to noisy data. The subject has been actively studied in computer vision (Goldberger and Ben-Reuven, 2017; Vahdat, 2017; Veit et al., 2017); in our case we are interested in training on noisy linguistic data. While many groups study noisy linguistic data for general-purpose learning in open domains (Baldwin et al., 2015), the goal of this research internship is to extract semantic information from noisy linguistic data in closed or relatively closed domains. Use cases include notices to airmen (NOTAMs) and automatic terminal information service (ATIS) bulletins, maintenance logs or notes for aircraft and other industrial products, and notes from meetings. Such use cases are interesting because they allow us to apply a variety of learning systems and compare them.

Within relatively closed domains, we have had success with distant supervision models, in which we learn weights for a set of expert-coded rules based on an estimation of ground-truth labels for unannotated data; the rules themselves are derived from the study of a small but representative and meticulously annotated corpus (Badene et al., 2019). A minimal sketch of this idea is given below. We would like to compare such models with neural-network-based approaches, such as word embeddings that incorporate character-based representations (Joulin et al., 2017), as well as transformer networks (viz. BERT; Devlin et al., 2019), which we can adapt to the task with domain-specific pretraining; see the second sketch below.
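To make the distant supervision idea concrete, here is a minimal sketch in Python. The rules, the toy NOTAM-style snippets, the binary "closure" task, and the agreement-based weighting are illustrative assumptions for this posting, not the actual rule set or label model of Badene et al. (2019).

    import re

    # Expert-coded rules for a hypothetical binary task ("does this NOTAM
    # report a closure or outage?"); each returns +1, -1, or 0 (abstain).
    RULES = [
        lambda t: 1 if re.search(r"\b(CLSD|CLOSED)\b", t) else 0,
        lambda t: 1 if "U/S" in t else 0,               # unserviceable
        lambda t: -1 if re.search(r"\b(OPN|AVBL)\b", t) else 0,
    ]

    unlabeled = [
        "RWY 09/27 CLSD DUE WIP",
        "ILS RWY 14 U/S",
        "TWY C AVBL FOR ACFT UP TO CODE C",
    ]

    # Apply every rule to every unannotated example.
    votes = [[rule(t) for rule in RULES] for t in unlabeled]

    # Estimate ground-truth labels by majority vote, then weight each rule
    # by how often it agrees with that estimate when it does not abstain
    # (a crude stand-in for the generative label model of data programming).
    majority = [1 if sum(v) >= 0 else -1 for v in votes]
    weights = []
    for j in range(len(RULES)):
        fired = [(v[j], m) for v, m in zip(votes, majority) if v[j] != 0]
        agree = sum(1 for vote, m in fired if vote == m)
        weights.append(agree / len(fired) if fired else 0.5)

    # Label new text with the weighted vote of the rules.
    def label(text):
        score = sum(w * rule(text) for w, rule in zip(weights, RULES))
        return 1 if score > 0 else -1

    for t in unlabeled:
        print(t, "->", label(t))

The resulting weighted-vote labels would then serve as (noisy) training targets for a downstream classifier, which is the role the estimated ground truth plays in the distant supervision setup described above.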
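For the transformer route, the sketch below shows one way to adapt BERT with domain-specific pretraining, i.e. continuing masked-language-model training on raw in-domain text before fine-tuning. It assumes a recent version of the Hugging Face transformers library; the model name, masking rate, and toy texts are illustrative choices, not project requirements.

    import torch
    from transformers import AutoTokenizer, AutoModelForMaskedLM

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

    # Raw in-domain text (hypothetical NOTAM-style fragments).
    texts = ["RWY 09/27 CLSD DUE WIP", "ILS RWY 14 U/S"]
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

    # Masked-language-model objective: hide ~15% of the (non-padding) tokens
    # and train the model to recover them. For simplicity this may also mask
    # special tokens, which a real pretraining loop would avoid.
    labels = batch["input_ids"].clone()
    mask = (torch.rand(labels.shape) < 0.15) & (batch["attention_mask"] == 1)
    labels[~mask] = -100  # cross-entropy is computed only on masked positions
    batch["input_ids"][mask] = tokenizer.mask_token_id

    # One optimization step; real adaptation loops over many in-domain batches.
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    model.train()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()

After this adaptation step, the same weights can be loaded into a sequence classification head (e.g. AutoModelForSequenceClassification) and fine-tuned on the supervised extraction task.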
This internship is suitable for a second-year master's student (M2) with knowledge of machine learning and natural language processing algorithms. Experience with Python libraries for neural network implementations (in particular PyTorch) is a plus.

References

Sonia Badene, Kate Thompson, Jean-Pierre Lorré, and Nicholas Asher (2019). Weak Supervision for Learning Discourse Structure. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2296-2305.

Timothy Baldwin, Marie-Catherine de Marneffe, Bo Han, Young-Bum Kim, Alan Ritter, and Wei Xu (2015). Shared Tasks of the 2015 Workshop on Noisy User-generated Text: Twitter Lexical Normalization and Named Entity Recognition. In Proceedings of the Workshop on Noisy User-generated Text, pp. 126-135.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4171-4186.

Jacob Goldberger and Ehud Ben-Reuven (2017). Training Deep Neural-Networks Using a Noise Adaptation Layer. In 5th International Conference on Learning Representations (ICLR 2017), Toulon, France.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov (2017). Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 427-431.

Arash Vahdat (2017). Toward Robustness against Label Noise in Training Deep Discriminative Neural Networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, pp. 5596-5605.

Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge J. Belongie (2017). Learning from Noisy Large-Scale Datasets with Minimal Supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, HI, USA, pp. 6575-6583.