INRIA Postdoc: Large scale error detection using surface realisation

Because they are manually developed, symbolic grammars used in NLP systems are usually imperfect. In particular, they often over-generate, that is, they accept ungrammatical sentences. Over-generation has a negative impact both on the quality of the results (ungrammatical strings are produced) and on the efficiency of the realiser or analyser (it leads to a proliferation of illicit intermediate structures and complicates disambiguation).

The proposed postdoc aims to remedy these shortcomings by exploring how a surface realiser can be used to detect and correct over-generation.

Program of work

A surface realiser generates from a meaning representation a text verbalising that meaning. More precisely, a realiser takes as input a grammar and a semantic representation (often a logical formula) and produces as output the sentences associated by the grammar with that semantic representation.

(Gardent & Kow 2007) have shown how the surface realiser GenI can be used to detect the sources of over-generation in a grammar. More specifically, they propose a three-step procedure. First, the realiser is used to generate, from a suite of semantic representations, the strings associated by the grammar with these representations. Second, the realiser output is inspected and manually annotated as either pass or over-generation (the string produced is not a sentence of the language modelled by the grammar). Third, the information associated by the realiser with the output strings is used to identify the possible sources of over-generation.

The aim of the postdoc is to improve on this approach and in particular to turn it into a fully automated and large scale procedure.

More specifically, the programme of work will concentrate on the following three points.

Creating a large input set. The suite of semantic representations used in (Gardent & Kow 2007) is small (roughly 120 entries). By contrast, error mining techniques used to detect under-generation in parsing exploit corpora of several hundred thousand sentences. In order to extend the suite of semantic representations used to detect over-generation, the postdoc will work on producing a large test suite based on parsing (when used in parsing mode, the grammar can output semantic representations). The main challenge is to devise a disambiguation method that selects, from among the many parses produced by the parser, the correct one and hence the correct semantic representation. Here, existing symbolic (optimality theory) and stochastic (e.g., expectation maximisation) methods will be examined and adapted to the Talaris parser.

Automated validation of the realiser output. In (Gardent & Kow 2007), the realiser output is manually annotated as either pass or over-generation. To speed up this part of the process and support large-scale evaluation, the postdoc will explore ways in which this classification process could be automated. Several directions are possible here, including the use of an analyser (a string that cannot be parsed by another parser is more likely to be ungrammatical than one that can) and/or of a language model (using bigrams or, more generally, n-grams to estimate the probability of a given string).
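As an illustration of the language-model direction, here is a minimal sketch of a bigram model with add-one smoothing that assigns each realiser output a length-normalised log-probability. It is not part of the proposed work programme as stated: the function names, the toy training corpus and the choice of add-one smoothing are all assumptions made for the example, and a real setting would train on a large corpus of French text.

```python
import math
from collections import Counter

def train_bigram_model(sentences):
    """Collect context and bigram counts from tokenised sentences.

    Hypothetical helper: each sentence is a list of tokens and is padded
    with <s> and </s> boundary markers.
    """
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])             # every token that has a successor
        bigrams.update(zip(padded, padded[1:]))  # adjacent token pairs
    # Number of distinct tokens that can follow a context (words plus </s>).
    vocab_size = len({tok for s in sentences for tok in s} | {"</s>"})
    return contexts, bigrams, vocab_size

def avg_log_prob(tokens, model):
    """Average add-one-smoothed bigram log-probability of a token list."""
    contexts, bigrams, vocab_size = model
    padded = ["<s>"] + tokens + ["</s>"]
    pairs = list(zip(padded, padded[1:]))
    total = sum(
        math.log((bigrams[pair] + 1) / (contexts[pair[0]] + vocab_size))
        for pair in pairs
    )
    return total / len(pairs)

# Toy corpus, invented for the example.
corpus = [
    ["jean", "aime", "marie"],
    ["marie", "aime", "jean"],
    ["jean", "dort"],
]
model = train_bigram_model(corpus)

well_formed = avg_log_prob(["jean", "aime", "marie"], model)
scrambled = avg_log_prob(["marie", "jean", "aime", "dort"], model)
```

On this toy data the scrambled string receives a lower average log-probability than the well-formed one. Turning the score into a pass or over-generation decision would require a threshold tuned on annotated data, which the sketch deliberately leaves out.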

Use of statistical techniques to identify the source of over-generation. In (Gardent & Kow 2007), a derivational element (word, grammar rule, sub-derivation) is considered suspect if it systematically occurs in derivations that are marked as "over-generation cases". In other words, only definite causes of over-generation are identified, which restricts the scope of the method. To extend the approach to probable causes of over-generation, the postdoc will aim to adapt existing stochastic methods used for under-generation detection (van Noord 2004; Sagot and de la Clergerie 2005; Nicolas et al. 2007) to the issue of over-generation detection.
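The difference between definite and probable causes can be sketched with a simple relative-frequency score: for each derivational element, the share of derivations containing it that were marked as over-generation. This is only an illustrative baseline, not the algorithm of any of the cited papers; the function names, the element labels and the threshold value are hypothetical.

```python
from collections import Counter

def suspicion_scores(derivations):
    """Map each element to the fraction of its derivations marked as over-generation.

    `derivations` is an iterable of (elements, is_overgeneration) pairs,
    where `elements` is a collection of hashable labels (words, rules, ...).
    """
    total, bad = Counter(), Counter()
    for elements, is_overgeneration in derivations:
        for element in set(elements):  # count each element once per derivation
            total[element] += 1
            if is_overgeneration:
                bad[element] += 1
    return {element: bad[element] / total[element] for element in total}

def definite_suspects(scores):
    """Elements that occur only in over-generation derivations."""
    return {element for element, score in scores.items() if score == 1.0}

def probable_suspects(scores, threshold=0.5):
    """Elements whose over-generation share reaches a (hypothetical) threshold."""
    return {element for element, score in scores.items() if score >= threshold}

# Invented annotated derivations; the rule names are placeholders.
derivations = [
    ({"rule_A", "rule_B"}, False),
    ({"rule_A", "rule_C"}, True),
    ({"rule_C", "rule_D"}, True),
    ({"rule_B", "rule_D"}, False),
    ({"rule_D", "rule_A"}, True),
]
scores = suspicion_scores(derivations)
definite = definite_suspects(scores)
probable = probable_suspects(scores, threshold=0.6)
```

Here only rule_C is a definite suspect, since it never appears in a passing derivation, whereas rule_A and rule_D also become probable suspects at a threshold of 0.6. A raw ratio of this kind ignores how often an element is observed and which co-occurring element is actually to blame, which is the sort of limitation the stochastic methods mentioned above are meant to address.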

The work will be based on SemTAG, an environment for parsing and generating French using a Tree Adjoining Grammar. The error mining techniques defined at the theoretical level will be tested on the French grammar, thus permitting a quantitative evaluation of their impact.

The project will be carried out in close collaboration with Claire Gardent and in connection with the ANR project PASSAGE (Produire des annotations syntaxiques à grande échelle pour aller de l'avant).

Candidate profile

The candidate will have a background in computational linguistics, good statistical knowledge and a solid computational background. Good knowledge of Haskell and of a scripting language (Perl, Python) is required. Knowledge of parsing and/or realisation algorithms is useful but not essential.

Conditions for applicants

You must have held a doctorate or Ph.D. for less than one year at the recruitment date. If the Ph.D. has not been defended at the application date, you should clearly indicate the defence date and the composition of the jury.
High priority will be given to French and foreign applicants who prepared their doctorate abroad.
No nationality requirement.
Please complete one application file per topic proposed by the research units.
If you have obtained your doctorate from within an INRIA research unit, you cannot apply to this unit. You may however apply to other research units.

Contact

Claire Gardent, Claire.Gardent@loria.fr