Surface realisation and large-scale over-generation detection

Position type: Post-doctoral Fellow
Functional area: Nancy
Research theme: Symbolic systems
Project: TALARIS

Environment

http://talaris.loria.fr/

In Natural Language Processing, a computational grammar is a central component which describes the association between linguistic expressions, their syntax and their semantics. On the basis of such a grammar, a parser produces the semantic representation that the grammar associates with a string. Conversely, a realiser generates all the sentences that the grammar associates with a given semantic representation.

Unfortunately, because they are developed manually, the symbolic grammars used in NLP systems are usually imperfect. In particular, they often over-generate, that is, they accept ungrammatical sentences. Over-generation has a negative impact both on the quality of the results (ungrammatical strings are produced) and on the efficiency of the realiser and of the parser (it leads to a proliferation of illicit intermediate structures and complicates disambiguation). The proposed postdoc aims to remedy these shortcomings by exploring how a surface realiser can be used to detect and correct over-generation.

- Team: TALARIS
- Supervisor: Claire GARDENT

Mission

A surface realiser generates, from a meaning representation, a text verbalising that meaning. More precisely, a realiser takes as input a grammar and a semantic representation (often a logical formula) and produces as output the sentences that the grammar associates with that semantic representation.

(Gardent & Kow 2007) have shown how a realiser can be used to detect the sources of over-generation in a grammar so that they can be corrected. More specifically, they propose a three-step procedure. First, the realiser is used to generate, from a suite of semantic representations, the strings that the grammar associates with these representations.
Second, the realiser output is inspected and each string is manually annotated as either pass or over-generation (the string produced is not a sentence of the language modelled by the grammar). Third, the information that the realiser associates with the output strings is used to identify the possible sources of over-generation.

The aim of the postdoc will be to improve on that approach and in particular to turn it into a fully automated, large-scale procedure.

Activities

The programme of work will concentrate on the following three points.

Creating a large input set. The suite of semantic representations used in (Gardent & Kow 2007) is small (roughly 120 entries). By contrast, the error mining techniques used to detect under-generation in parsing exploit corpora of several hundred thousand sentences. In order to extend the suite of semantic representations used to detect over-generation, the postdoc will work on producing a large test suite based on parsing (when used in parsing mode, the grammar can output semantic representations). The main challenge is to devise a disambiguation method for choosing, from amongst the many parses produced by the parser, the correct one and hence the correct semantic representation. Here, existing symbolic (optimality theory) and stochastic (e.g., expectation maximisation) methods will be examined and adapted to the Talaris parser.

Automated validation of the realiser output. In (Gardent & Kow 2007), the realiser output is manually annotated as either pass or over-generation. To speed up this part of the process and support large-scale evaluation, the postdoc will explore ways in which this classification could be automated. Several directions are possible here, including the use of an analyser (a string that cannot be parsed by another parser has a higher probability of being ungrammatical than one that can) and/or of a language model (using bigrams or, more generally, n-grams to estimate the probability of a given string).
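
As an illustration of the language-model direction only, the following minimal sketch scores a tokenised string with a bigram model and labels it pass or over-generation. It is not the method of (Gardent & Kow 2007) nor part of SemTAG: the function names, the add-one smoothing and the threshold are all assumptions made for this example, and a realistic model would be trained on a large corpus with a threshold tuned on annotated realiser output.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenised sentences.

    Each sentence is padded with <s> and </s> boundary markers.
    (Hypothetical helper, not part of any existing toolkit.)
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def avg_log_prob(tokens, unigrams, bigrams):
    """Average log-probability per bigram, with add-one smoothing.

    Averaging over the number of bigrams keeps long strings from
    being penalised merely for their length.
    """
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, word in zip(padded, padded[1:]):
        numerator = bigrams[(prev, word)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total / (len(padded) - 1)


def classify(tokens, unigrams, bigrams, threshold):
    """Label a realiser output string; the threshold is an assumed, tunable value."""
    score = avg_log_prob(tokens, unigrams, bigrams)
    return "pass" if score >= threshold else "over-generation"
```

For instance, after training on a few tokenised sentences such as "le chat dort" and "le chien dort", the scrambled string "dort chat le" receives a lower average log-probability than "le chat dort", so a suitably chosen threshold separates the two. Such a score is only a proxy for grammaticality, which is why the text above also considers combining it with a second parser.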
Use of statistical techniques to identify the source of over-generation. In (Gardent & Kow 2007), a derivational element (word, grammar rule, sub-derivation) is considered suspect if it systematically occurs in derivations that are marked as "over-generation cases". In other words, only definite causes of over-generation are identified, which restricts the scope of the method. To extend the approach to probable causes of over-generation, the postdoc will aim to adapt existing stochastic methods used for under-generation detection (van Noord 2004; Sagot and de la Clergerie 2005; Nicolas et al. 2007) to the issue of over-generation detection.

The work will be based on SemTAG (http://trac.loria.fr/~semconst/, http://trac.loria.fr/~geni/), an environment for parsing and generating French using a Tree Adjoining Grammar. The error mining techniques, once defined at the theoretical level, will be tested on the French grammar, allowing a quantitative evaluation of their impact.

The project will be carried out in close collaboration with Claire Gardent and in connection with the ANR project PASSAGE (Produire des annotations syntaxiques à grande échelle pour aller de l'avant, http://atoll.inria.fr/~clerger/ANRMD06/).

(Gardent & Kow 2007) Claire Gardent and Eric Kow. Spotting overgeneration suspects. 11th European Workshop on Natural Language Generation (ENLG), 2007.
(van Noord 2004) Gertjan van Noord. Error mining for wide coverage grammar engineering. Proceedings of ACL 2004.
(Sagot and de la Clergerie 2005) Benoit Sagot and Eric de la Clergerie. Error mining in parsing results. Proceedings of ACL 2006.
(Nicolas et al. 2007) L. Nicolas, J. Farré and E. de la Clergerie. Confondre le coupable: corrections à un lexique suggéré par une grammaire.

Skills and profile

The candidate will have a background in computational linguistics, good statistical knowledge and a solid computational background.
Knowledge of Haskell and of parsing and/or realisation algorithms is useful but not required.

Additional information

Contact:
* Telephone: (33) 383592039
* E-mail: Claire.Gardent@loria.fr

Duration of the contract: between 12 and 24 months