Master 2 / Graduate Internship

Automatic Classification of Claims from Political Debates and Declarations

Keywords: natural language processing, text mining, machine learning, computational journalism, fact-checking.
Ideal starting date: March/April 2017
Duration: 4-6 months
Advisor: Xavier Tannier (LIMSI-CNRS)
Location: LIMSI, Orsay, Univ. Paris-Saclay

Context (ANR Project ContentCheck)

Fact-checking is the task of assessing the factual accuracy of claims, generally made by public figures such as politicians or entrepreneurs. It is part and parcel of journalists' everyday work, whether they are working independently on an article or vetting content in the newsroom before publication to prevent the publication of inaccurate information. Modern fact-checking faces a triple revolution in scale, complexity, and visibility: many more claims are made and disseminated through the Web and social media, they represent a complex reality, and investigating them requires multiple heterogeneous data sources.

Our project (https://team.inria.fr/cedar/contentcheck/) brings together academic labs with expertise in data management, natural language processing, automated reasoning and data mining, and a fact-checking team of journalists from a major French Web media outlet.

In recent years, journalists and computer scientists have started talking to each other to identify which technologies could help journalists in their everyday work. This space of exchanges is known as Computational Journalism [1]. At this level, it encompasses very diverse uses and tools, such as learning how to use a database correctly or more effectively, producing simple spreadsheet-based visualisations that are nevertheless personalized or better adapted to the Web format, applying optical character recognition to scanned texts to enable computer-based keyword search, or using statistics to analyse public data from an interesting angle and thus highlight notable trends.
These latter uses are referred to as Data Journalism [2].

Description

The fact-checking process involves many challenging steps. One of them is to separate factual claims from opinions, beliefs, hyperboles, questions, etc., and to discern which are "check-worthy", i.e. deserve to be considered and checked by journalists [3].

The intern will work on this particular task: (s)he will build a tool that automatically extracts check-worthy claims and classifies them into predefined classes (such as "doubtful number", "doubtful fact", "opinion", "contextualization needed", etc.), in order to make monitoring easier for the journalist.

Examples of claims to classify could be (original French, followed by an English translation):
- "40 % de la taxe ont été détourné pour rémunérer le capital d'une société italienne privée" ("40% of the tax was diverted to remunerate the capital of a private Italian company") (number to check)
- "25 % du chiffre d'affaires d'Amazon se fait le dimanche." ("25% of Amazon's revenue is made on Sundays.") (number to check)
- "On peut continuer à ne vouloir laisser travailler que les multinationales anglo-saxonnes qui paient peu d'impôts dans notre pays le dimanche mais ça n'est pas la bonne solution." ("We can keep wanting to let only the Anglo-Saxon multinationals, which pay little tax in our country, work on Sundays, but that is not the right solution.") (opinion, need for contextualization)
- "J'ai été le premier avec Wolfgang Schäuble à signer une lettre pour que nous soyions capables de mettre en place cette coopération renforcée à onze." ("I was the first, together with Wolfgang Schäuble, to sign a letter so that we would be able to set up this enhanced cooperation among eleven.") (fact to check)
- "Je veux abroger le droit du sol" ("I want to abolish birthright citizenship") (possible contradiction with a former claim by the same person)
- "Je ne peux pas accepter que les Etats-Unis soient devenus du point de vue de l'énergie indépendants grâce au gaz de schiste et que la France ne puisse pas profiter de cette nouvelle énergie" ("I cannot accept that the United States has become energy-independent thanks to shale gas while France cannot benefit from this new energy source") (opinion, need for contextualization)

Dataset

A labeled dataset in French will be provided by our partners from the newspaper Le Monde. It will contain political claims coming from different sources:
- Newspaper articles
- Debates
- Speeches
- Twitter and other social networks
- etc.

Approach

We will model this problem as a classification task and follow a supervised learning approach to tackle it.
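To make the supervised classification framing concrete, here is a minimal sketch of a claim classifier. The announcement does not specify a model, features, or label set, so everything below is an assumption made for illustration: the multinomial Naive Bayes model, the bag-of-words tokenizer, the two label names ("number", "opinion"), and the toy training sentences (some adapted from the examples above, others invented). The real dataset and classes would come from the project partners.

```python
import math
import re
from collections import Counter, defaultdict

# Toy training set, invented for illustration only.
# The label names are hypothetical simplifications of the classes in the description.
TRAIN = [
    ("25 % du chiffre d'affaires se fait le dimanche", "number"),
    ("40 % de la taxe ont été détournés", "number"),
    ("le chômage a baissé de 3 % en un an", "number"),
    ("ça n'est pas la bonne solution", "opinion"),
    ("je ne peux pas accepter cette situation", "opinion"),
    ("je pense que c'est une mauvaise idée", "opinion"),
]


def tokenize(text):
    """Lowercase, split into word tokens and '%', and map digit runs to '<num>'."""
    tokens = re.findall(r"\w+|%", text.lower())
    return ["<num>" if t.isdigit() else t for t in tokens]


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts, with add-one smoothing."""

    def fit(self, examples):
        self.class_counts = Counter()            # number of training claims per label
        self.word_counts = defaultdict(Counter)  # token counts per label
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        tokens = [t for t in tokenize(text) if t in self.vocab]  # skip unseen tokens
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = NaiveBayes().fit(TRAIN)
print(model.predict("30 % des recettes ont disparu"))
print(model.predict("je pense que ça n'est pas acceptable"))
```

Mapping every number to a single `<num>` token is one simple way to let the model generalize from "25 %" to "30 %"; a realistic system would likely need richer features and far more labeled data than this toy set.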
Application

We are particularly interested in candidates with a solid background in computer science, strong programming skills, and good knowledge of machine learning and/or natural language processing. As most of the data are in French, basic knowledge of French is a plus.

Applications should include:
- Cover letter outlining interest in the position
- Names of two referees
- Curriculum Vitae (CV)

The intern will receive a "bonus" (546.01 € in 2016) plus half of a "Navigo" (or "Imagine R") pass.

Contact for questions and applications: Xavier.Tannier[at]limsi.fr

References

[1] Sarah Cohen, James T. Hamilton, and Fred Turner. Computational Journalism. Communications of the ACM, 54(11):66-71, 2011.
[2] Jonathan Gray, Lucy Chambers, and Liliana Bounegru. The Data Journalism Handbook. O'Reilly, 2012.
[3] Naeemul Hassan, Chengkai Li, and Mark Tremayne. Detecting Check-worthy Factual Claims in Presidential Debates. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management (CIKM 2015), Melbourne, Australia, October 2015.