Team V: Natural Language Processing


Coordinator: Ludovic Tanguy
Members
Cécile Fabre
Bruno Gaume
Nabil Hathout
Anna Kupść
Marie-Paule Péry-Woodley
PhD Students and Post-Docs
Clémentine Adam
Lydia-Mai Ho-Dac
Fanny Lalleman
Simon Leva
François Morlane-Hondère
Nikola Tulechki
Assaf Urieli
Associate Members
Lionel Clément

Overview

Research in the NLP team focuses on the use, adaptation and conception of computer systems to process linguistic data.
  • A corpus-driven approach: linguistic data are the main and starting point of our work. Computer-aided techniques allow us to get a new view of these data and to set up new procedures to investigate language.
  • Extensive data: we deal with large amounts of data, including several corpora of hundreds of millions of words each (textbanks, newspaper archives, encyclopedias). We also use the Web as a corpus.
  • Rich data: Our work systematically targets rich and complex data, rather than raw texts. Our datasets are either rich by nature (structured documents, dictionaries, manually annotated corpora) or enriched by automatic tagging and parsing tools.
  • Multiple objectives: We aim at the description and modelling of language, but we also contribute to NLP applications (information retrieval mostly) and the development of generic linguistic data and tools.
  • Multiple domains: Our work addresses different linguistic domains: morphology, lexicon, syntax, semantics, discourse, psycholinguistics.

Method

Although we address language engineering through the development and use of computer tools, our main target is the study of language. We see NLP as an experimental apparatus that enables us to confront linguistic theories with data, and to make new phenomena emerge from data.

Consequently, we work on massive data, using specific tools and addressing new questions. These questions can be related to already well-defined topics or categories (lexical relations, syntactic constituents) but also to fuzzier and more controversial aspects of language (lexical cohesion, discourse relations). This methodology leads us to design original methods for exploring data (lexical graph walks, large-scale distributional analysis). It relies on our ability to use tools and data efficiently to investigate complex phenomena at different levels of description (structure of the lexicon, discourse and document organisation, as detailed below).

This versatility is possible thanks to the diversity of linguistic skills available in the team. The different approaches are mutually enriched by sharing both methods and results.

Main Research Topics

Structure of the lexicon

  • Study of the structure of the lexicon through dictionary graphs (B. Gaume)

    Most lexical graphs, like the majority of field graphs, are small-world networks and have specific structural properties: low edge density (P1), low topological distance between vertices (P2), a large proportion of high-density subgraphs (P3), and a fat-tailed distribution of vertex incidence (P4). These properties signal fundamental linguistic phenomena and enable us to better understand and use the underlying data.
    For example, property P3 indicates the presence of clusters in a synonym graph; these clusters reflect concepts of the language under study. In this area, our work consists in proposing new measures based on a stochastic method (PROX) on lexical graphs in order to model linguistic phenomena across different data and languages (M3 project): synonymy, hyperonymy, metaphor, disambiguation. This method also allows us to shed new light on some psycholinguistic phenomena (learning, deficit, approximation), and to propose new tools for information retrieval (for both the Web and text collections such as Wikipedia).

  • Acquisition of lexico-syntactic information and lexical resources from corpora (C. Fabre and A. Kupść)

    The availability of large annotated corpora gives linguistics new means of investigation and makes the development of lexical resources possible. We use two different methods (both separately and in conjunction) to study the complementation properties of verbs and adjectives. The first approach explores a treebank and is driven by linguistic knowledge. It resulted in the creation of the Treelex lexicon, which has been manually validated. The second method is corpus-driven: it exploits large corpora that are automatically parsed (using Syntex, a parser developed by D. Bourigault) and processed with statistical techniques. It allows us to perform a large-scale study of the argument-adjunct continuum in French.
    Parsed corpora are also used to compute distributional similarities and to identify semantic relations between words that share the same syntactic contexts. We have therefore built distributional databases from several large corpora, and use this data to investigate lexical relations and discourse cohesion.
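
The idea of comparing words through the syntactic contexts they share can be illustrated with a minimal sketch. It is not the Upery implementation: the toy triples, the function names and the choice of cosine similarity over raw counts are all assumptions made for illustration, and a real system would work on parser output from large corpora with a weighting scheme of its own.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy input: (word, syntactic relation, co-occurring word) triples,
# standing in for the output of a dependency parser. This is not real Syntex output.
TRIPLES = [
    ("wine", "obj_of", "drink"),
    ("wine", "obj_of", "pour"),
    ("wine", "mod_by", "red"),
    ("beer", "obj_of", "drink"),
    ("beer", "obj_of", "pour"),
    ("beer", "mod_by", "cold"),
    ("car", "obj_of", "drive"),
    ("car", "mod_by", "red"),
]


def context_vectors(triples):
    """Map each word to a Counter of its (relation, co-occurring word) contexts."""
    vectors = defaultdict(Counter)
    for word, relation, other in triples:
        vectors[word][(relation, other)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(count * v[ctx] for ctx, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def neighbours(word, vectors):
    """Rank all other words by decreasing similarity to `word`."""
    target = vectors[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


vectors = context_vectors(TRIPLES)
ranking = neighbours("wine", vectors)
for other, score in ranking:
    print(f"{other}\t{score:.3f}")
```

On this toy data, "beer" ranks above "car" as a neighbour of "wine" because the two words share two of their three syntactic contexts, whereas "car" shares only one.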

Discourse and Documents

  • Study of discourse organisation (M.-P. Péry-Woodley, C. Fabre, L. Tanguy)

    Our work on discourse organisation reflects the following choices:

    1. we consider texts as functional units (in the Systemic Functional Linguistics framework);
    2. we take their social dimension into account (documents at work);
    3. we design methods involving NLP tools and oriented towards NLP applications.

    The ANR project ANNODIS is the embodiment of these research directions: it aims at the construction of a corpus of French texts enriched with discourse annotations. These annotations are of two kinds: multi-level structures (in particular enumerative structures) in a top-down approach (NLP team), and discourse relations in a bottom-up approach (S'caladis team). The method combines automated premarking and manual annotation via a specific user interface (GREYC lab, Caen), which also provides corpus querying tools and performs data mining. The VOILADIS project (PRES Toulouse) adds a lexical dimension to this study: its objective is to use lexical cohesion indices to help the identification of discourse structures (Clémentine Adam's PhD thesis).
    Research topics within ANNODIS, along with our collaboration with GREYC and IRIT, build on the former research project GEOSEM (CNRS, 2005-07), which aimed at the exploitation of discourse structures for intra-document navigation (Ho-Dac's PhD thesis, defended in 2007).

  • Focused corpus studies (C. Fabre, L. Tanguy, M.-P. Péry-Woodley)

    A number of studies make use of automated methods for the annotation, exploration and characterisation of specific corpora, in order to meet the needs emanating from private companies or academic fields. Here are a few examples of these studies:

    • query topics from Information Retrieval evaluation campaigns (TREC and CLEF), where linguistic features have been studied in order to explain and predict IR systems' behaviour (ARIEL project);
    • research articles in Humanities and Social Sciences, focusing on citation analysis and the automatic classification of articles based on their citation profiles (RHECITAS project);
    • general medicine consultation transcripts, where doctor-patient interaction is studied and profiled with a focus on intercomprehension (INTERMEDE project);
    • encyclopedia entries, where obsolete information that requires an update is automatically identified and extracted (Marion Laignelet's PhD);
    • accident investigation reports from civil aviation authorities, where the attribution of specific coded events is checked automatically (joint research with BEA and CFH).

Tools and Data

As previously shown, one of the team's objectives is to build computer tools and linguistic datasets that are made available to the scientific community. Here is a short list:

  • PROX: a tool to navigate lexical graphs
  • Syntex: a robust parser developed through a joint effort by CLLE-ERSS and the Synomia company
  • Upery: a system for distributional analysis based on Syntex's output. It was used to build the following lexical databases from large corpora: Voisins de Le Monde and Voisins d'En Face
  • Leximédia 2007: a lexical and terminological system that enabled the real-time observation of political topics addressed by the candidates in the French presidential election campaign in 2007.
  • Treelex: a subcategorisation lexicon (verbs and adjectives)
  • TELOC: the TELOC project (Textes En Langue Occitane) aims at the development of an Occitan language textbank, named BaTelÒc.

These tools and resources are also available from the REDAC website.

Events

Members of the team actively participate in international conference committees, among which:

Teaching Activities

Members of the team are involved in teaching activities at Toulouse and Bordeaux Universities. Our research topics are directly connected to the following NLP courses and programmes: