Stage de master 2 / Graduate internship Distant supervision for event extraction from a newswire corpus Keywords: natural language processing, text mining, machine learning, distant supervision. Ideal starting date: March/April 2017 Duration: 4-6 months Advisor: Xavier Tannier (LIMSI-CNRS), Olivier Ferret (CEA-LIST) Location: LIMSI, Orsay, Univ. Paris-Saclay 1 Context 1.1 ANR Project ASRAEL Information and communication society led to the production of huge volumes of content. This content is still generally non-structured (text, images, videos) and the promises of a "Web of Knowledge" are still long ahead. This situation evolves with the development of Open Data portals or resources such as DBPedia, that have made easier the access to information stored in databases (economic or demographic statistics, world knowledge contained in Wikipedia infoboxes, etc). However, most of the knowledge is still produced by textual data. Among the information concerned by the difficulty of accessing textual data, those related to events are of great interest, notably in the context of the emergence of data journalism. Data journalism has been fed until now by publicly available, statistical data, but it has paradoxically made only little use of the very journalistic materials that are events. The project ASRAEL aims at bridging this gap. Our proposal comes within the scope of the general scientific framework of information extraction (IE). We aim at extracting events from a large set of textual documents, without prior knowledge about them, and at populating and publishing a knowledge base of events. This knowledge base will be the support of a dedicated event search engine. 1.2 Event extraction We define event in a traditional information extraction way. An event is a structured representation of something that happens, with a nucleus, a spatiotemporal context and some arguments. The "event type" gathers comparable instances of events, as "earthquake", "election" or "car race". Arguments are attribute/value pairs that characterize an event type (for an earthquake, its location, date, magnitude, casualties...). A template is the set of arguments that can describe an event type (earthquake template, election template). The generic representation of an event is based on the rule of the "5 Ws" (What, Who, Where, When, Why) that prevails in the "Anglo-Saxon" way of writing articles. This rule stipulates that a good description of an event must make these five elements explicit. In automatic information extraction, the information about "Who", "Where" and "When" are extracted by a traditional and quite generic named entity recognition approach. On the other hand, the "What" is very domain-specific. For this reason, traditional IE systems lean on templates predefined by experts and identify events in texts with either rule-based systems or statistical models. However, in the general domain, where the huge number of possible events makes the manual definition of these templates impossible, information retrieval ("bag of words") methods take over, but do not provide a structured answer. 2 Description The global aim of the ASRAEL project is to build a fully-unsupervised event extraction system. However, the goal of this proposed internship can be seen as an intermediate goal, seeking at reducing the amount of necessary supervision in event extraction. Agence France Presse (AFP) is one of the partners of the project. They provide us with their newswire article corpus from 2004 to present, as well as textual chronologies of events and a few structured datasets containing the attributes of events of the same kinds (for example, a list of plane crashes, together with their date, location, plane type, casualties, cause, etc.). The intern will work on a distantly supervised system aiming at consolidating and updating such datasets. The different steps of such a system will be the following: 1. Use structured instances of events as described in the existing datasets as seed for a bootstrapping approach; 2. Find textual descriptions of these events in the newswire corpus; 3. Build a classifier from these descriptions; 4. Run the classifier on the entire corpus to find new instances or news descriptions of existing instances; 5. Build an update procedure for the analysis of new articles. Two main differences exist between the proposed approach and existing distant supervision approaches [1, 3]: - The eventive nature of the relations, making them temporally constrained and not always true (also explored in [2]); - The fact to some attributes may not been named entities (e.g. the cause of a crash). 3 Application We are particularly interested in candidates with a solid background in computer science and strong programming skills, having a good knowledge of machine learning and/or natural language processing. As most of the data are in French, knowledge of French basics is a plus. Applications should include: - Cover letter outlining interest in the position - Names of two referees - Curriculum Vitae (CV) The intern will be given a "bonus" (was 546,01 e in 2016) + half a "Navigo" (or "Imagine R") pass. Contact for questions and applications: Xavier.Tannier[at]limsi.fr 3 References [1] Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, pages 1003-1011, Suntec, Singapore, 2009. Association for Computational Linguistics. [2] Kevin Reschke, Martin Jankowiak, Mihai Surdeanu, Christopher Manning, and Daniel Jurafsky. Event Extraction Using Distant Supervision. In Proceedings of the 9th International Language Resources and Evaluation (LREC'2014), Reykjavik, Iceland, May 2014. [3] Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks. In Lluis Marquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015. Association for Computational Linguistics, Morristown, NJ, USA.