Team V: Natural Language Processing


Overview
Research in the NLP team focuses on the use, adaptation and conception
of computer systems to process linguistic data.
-
A corpus-driven approach: linguistic data are the main focus and
starting point of our work. Computer-aided techniques allow us to get
a new view of these data and to set up new procedures to investigate
language.
-
Extensive data: we deal with large amounts of data, among which
several corpora containing hundreds of millions of words each (textbanks,
newspaper archives, encyclopedias). We also use the Web as a corpus.
-
Rich data: Our work systematically targets rich and complex
data, rather than raw texts. Our datasets are either rich by nature
(structured documents, dictionaries, manually annotated corpora) or
enriched by automatic tagging and parsing tools.
-
Multiple objectives: We aim at the description and modelling of
language, but we also contribute to NLP applications (information
retrieval mostly) and the development of generic linguistic data and
tools.
-
Multiple domains: Our work addresses different
linguistic domains: morphology, lexicon, syntax, semantics, discourse,
psycholinguistics.
Method
Although we address language engineering through the development and
use of computer tools, our main target is the study of language. We
see NLP as an experimental apparatus that enables us to confront
linguistic theories with data, and to make new phenomena emerge from
data.
Consequently, we work on massive data, using specific tools and
addressing new questions. These questions can be related to
already well-defined topics or categories (lexical relations,
syntactic constituents) but also to fuzzier and more
controversial aspects of language (lexical cohesion, discourse
relations). This methodology leads us to design original methods
for exploring data (lexical graph walk, large scale
distributional analysis). It relies on our ability to
efficiently use tools and data to investigate complex phenomena
at different levels of description (structure of the lexicon,
discourse and document organisation, as detailed below).
This versatility is possible thanks to the diversity of linguistic
skills available in the team. The different approaches are mutually
enriched by sharing both methods and results.
Main Research Topics
Structure of the lexicon
-
Study of the structure of the lexicon through dictionary graphs (B. Gaume)
Most lexical graphs, like the majority of field graphs, are small-world
networks and have specific structural properties: low edge
density (P1), low topological distance between vertices (P2), large
representation of high-density subgraphs (P3), fat-tailed distribution
of incidence (P4). These properties signal fundamental linguistic
phenomena and enable us to better understand and use the underlying
data. For example, property P3 indicates the presence of clusters
in a synonym graph; these clusters reflect concepts of the
language under study. In this area, our work consists in proposing new measures
based on a stochastic method (PROX) on lexical graphs in order to
model linguistic phenomena across different data and languages (M3
project): synonymy, hyperonymy, metaphor, disambiguation. This method
also allows us to shed new light on some psycho-linguistic phenomena
(learning, deficit, approximation), and to propose new tools for
information retrieval (for both the Web and text collections such as
Wikipedia).
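Properties P1 to P4 can be made concrete on a small example. The following is a minimal Python sketch, not the PROX method itself: the toy synonym graph and all function names are hypothetical, and the mean local clustering coefficient is used here only as one common proxy for the presence of high-density subgraphs (P3).

```python
from collections import deque
from itertools import combinations

# Toy undirected synonym graph (hypothetical data, for illustration only).
EDGES = [
    ("big", "large"), ("big", "huge"), ("large", "huge"),
    ("huge", "vast"), ("vast", "wide"), ("wide", "broad"),
    ("vast", "broad"), ("big", "great"),
]


def build_adjacency(edges):
    """Map each vertex to the set of its neighbours."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def edge_density(adj):
    """P1: number of edges divided by the number of possible edges."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return m / (n * (n - 1) / 2)


def mean_distance(adj):
    """P2: average shortest-path length over connected ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


def mean_clustering(adj):
    """P3 (proxy): mean share of a vertex's neighbour pairs that are linked."""
    coeffs = []
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            coeffs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)


def degree_histogram(adj):
    """P4: number of vertices for each degree value."""
    hist = {}
    for nb in adj.values():
        hist[len(nb)] = hist.get(len(nb), 0) + 1
    return dict(sorted(hist.items()))


if __name__ == "__main__":
    graph = build_adjacency(EDGES)
    print(edge_density(graph), mean_distance(graph))
    print(mean_clustering(graph), degree_histogram(graph))
```

A graph this small cannot exhibit small-world behaviour; the sketch only shows how each of the four quantities is computed. On a real dictionary graph, a graph library would be a more practical choice than these hand-written functions.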
-
Acquisition of lexico-syntactic information and lexical resources from corpora (C. Fabre and A. Kupść)
The availability of large annotated corpora gives linguistics new
means of investigation and makes possible the development of lexical
resources. We use two different methods (both separately and in
conjunction) to study the complementation properties of verbs and
adjectives. The first approach explores a treebank and is driven by
linguistic knowledge. It resulted in the creation of the Treelex lexicon,
which has been manually validated. The second method is corpus-driven,
and consists in the exploitation of large corpora that are
automatically parsed (using Syntex, a parser developed by
D. Bourigault) and processed by applying statistical techniques; it
allows us to perform a large-scale study of the argument-adjunct
continuum in French.
Parsed corpora are also used to compute distributional similarities
and to identify semantic relations between words that share the same
syntactic contexts. We have therefore built distributional databases
from several large corpora, and use this data to investigate lexical
relations and discourse cohesion.
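The idea of relating words that share syntactic contexts can be illustrated with a short Python sketch. It is an assumption-laden toy, not the procedure used to build the databases mentioned above: the dependency triples are invented, the relation labels (`obj_of`, `mod_by`) are hypothetical, and cosine similarity over raw counts is only one of several possible similarity measures.

```python
from collections import Counter
from math import sqrt

# Hypothetical (word, relation, other word) triples, as a syntactic parser
# might produce them. Invented data, for illustration only.
TRIPLES = [
    ("wine", "obj_of", "drink"), ("wine", "mod_by", "red"),
    ("beer", "obj_of", "drink"), ("beer", "mod_by", "cold"),
    ("wine", "obj_of", "pour"), ("beer", "obj_of", "pour"),
    ("car", "obj_of", "drive"), ("car", "mod_by", "red"),
]


def context_vectors(triples):
    """Count, for each word, the syntactic contexts (relation, other word)."""
    vectors = {}
    for word, relation, other in triples:
        vectors.setdefault(word, Counter())[(relation, other)] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse context-count vectors."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm = sqrt(sum(x * x for x in a.values())) * sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def neighbours(word, vectors):
    """Rank all other words by decreasing similarity to `word`."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    vecs = context_vectors(TRIPLES)
    for other, score in neighbours("wine", vecs):
        print(other, round(score, 3))
```

In this toy data "wine" and "beer" share two contexts and therefore rank as closer neighbours than "wine" and "car", which share only one. Work on large corpora typically weights contexts by an association measure rather than using raw counts.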
Discourse and Documents
-
Study of discourse organisation (M.-P. Péry-Woodley, C. Fabre, L. Tanguy)
Our work on discourse organisation reflects the following
choices:
- we consider texts as functional units (in the Systemic Functional
Linguistics framework);
- we take their social dimension into account (documents at work);
- we design methods involving NLP tools and oriented towards NLP
applications.
The ANR project ANNODIS is the embodiment
of these research directions: it aims at the construction of a corpus
of French texts enriched with discourse annotations. These annotations
are of two kinds: multi-level structures (in particular enumerative
structures) in a top-down approach (NLP team), and discourse relations
in a bottom-up approach (S'caladis team). The method combines
automated premarking and manual annotation via a specific user
interface (GREYC lab, Caen) which also provides corpus querying tools
and performs data mining. The VOILADIS project (PRES Toulouse) adds a
lexical dimension to this study: its objective is to use lexical
cohesion indices to help the identification of discourse structures
(Clémentine Adam's PhD thesis).
Research topics within ANNODIS, along with our collaboration with GREYC and
IRIT, build on the former research project GEOSEM (CNRS, 2005-07),
which aimed at the exploitation of discourse structures for
intra-document navigation (Ho-Dac's PhD thesis, defended in
2007).
-
Focused corpus studies (C. Fabre, L. Tanguy, M.-P. Péry-Woodley)
A number of studies make use of automated methods for the
annotation, exploration and characterisation of specific corpora, in
order to meet the needs emanating from private companies or
academic fields. Here are a few examples of these studies:
- query topics from Information Retrieval evaluation campaigns (TREC
and CLEF), where linguistic features have been studied in order to
explain and predict IR systems' behaviour (ARIEL project);
- research articles in Humanities and Social Sciences, focusing on
citation analysis and the automatic classification of articles based
on their citation profiles (RHECITAS project);
- general medicine consultation transcripts, where doctor-patient
interaction is studied and profiled with a focus on intercomprehension
(INTERMEDE project);
- encyclopedia entries, where obsolete information that requires an
update is automatically identified and extracted (Marion Laignelet's
PhD);
- accident investigation reports from civil aviation authorities,
where the attribution of specific coded events is
checked automatically (joint research with BEA and CFH).
Tools and Data
As previously shown, one of the team's objectives is to build
computer tools and linguistic datasets that are made available to the
scientific community. Here is a short list:
-
PROX: a tool to
navigate lexical graphs
-
Syntex:
A robust parser developed through a joint effort by CLLE-ERSS and the
Synomia company
- Upery: a system for distributional analysis based on Syntex's
output. It was used to build the following lexical databases from
large corpora: Voisins de Le
Monde and Voisins d'En Face
-
Leximédia 2007:
a lexical and terminological system that enabled the real-time
observation of political topics addressed by the
candidates in the French presidential election campaign
in 2007.
-
Treelex:
a subcategorisation lexicon (verbs and adjectives)
-
TELOC:
The TELOC project (Textes En Langue Occitane) aims at
the development of an Occitan language textbank, named
BaTelÒc.
These tools and resources are also available from the
REDAC website.
Events
Members of the team actively participate in international conference
committees, among which:
Teaching Activities
Members of the team are involved in teaching activities at Toulouse
and Bordeaux Universities. Our research topics are directly connected
to the following NLP courses and programmes: