Research intern position (with the goal of pursuing a PhD) at Inria on
the generation of natural language datasets and metrics of
representativeness

*Place of work*: Inria center in Paris area
(Paris / Saclay / Rocquencourt)

*Duration*: 6 months internship + 3 years PhD

*Starting date*: Anytime in 2023

*Keywords*: artificial intelligence, natural language processing,
natural language generation, story generation, evaluation metrics

*Context*
This internship fits within the roadmap activities of Inria's Defense
& Security Department.

Inria's Defense & Security Department develops and maintains a serious
game platform which can simulate the activity of a crisis monitoring
cell. For instance, analysts in geopolitical crises are employed by the
French Ministry of Armed Forces to better identify emerging or ongoing
conflicts throughout the world. These analysts are typically
overwhelmed by continuous streams of plain-text information and can be
helped by Natural Language Processing (NLP) tooling. Among other uses,
the platform makes it possible to experiment with the NLP tools
developed by researchers and partners, in order to get practical
feedback from players on their usefulness during action.

For these simulations, a fictional world has been created by Inria,
including imaginary cities and an imaginary historical, social and
political background.

In order to test NLP solutions in a game scenario, it is necessary to
create large amounts of text documents, which need to be both realistic
in terms of their form and type of contents, and tailored to the
context of that imaginary world. To date, this creation remains mostly
manual, which is very time-consuming. To scale the scenarios up, the
plan is to gradually automate this data creation, and this internship
will contribute to that goal.

The intern will be supervised by Dr Lauriane Aufrant, who is the lead
NLP researcher within Inria's Defense & Security Department. PhD
supervision will be done jointly with Dr Frédérique Segond (Inria's
Defense & Security Director) or with a researcher from another Inria
team, depending on the exact PhD topic chosen (to be discussed, see
below).


*Candidate profile*

-   Pursuing a master's degree in Natural Language Processing,
    Computational Linguistics or Computer Science with a specialization
    in Machine Learning
-   Theoretical and practical knowledge of deep learning, as well as
    traditional machine learning and knowledge-driven AI
-   Strong programming skills (at least Python, git, Linux environment,
    command line and scripting)
-   Fluency in English. Knowledge of, or interest in, the French
    language. Knowledge of a second foreign language would be
    appreciated.


*How to apply*

Send a CV and a cover letter to lauriane.aufrant and frederique.segond
(both at inria.fr).
Names of referees or reference letters would be appreciated but are
not mandatory.


*Internship description*

Two configurations for Natural Language Generation (NLG) will be
considered throughout this work: topic-based NLG (to produce haystacks)
and knowledge base-to-text NLG (to produce needles).

The internship will be devoted to preparing the scientific framework
within which the PhD work will be developed, and in particular the
metrics that will be used to guide and evaluate future contributions on
the generation itself.

In the current NLG literature, text generation is usually evaluated at
the sentence or document level only, by considering aspects such as
fluency (is it good French?) or coherence (does the text make sense?).
Evaluation at the dataset level is considered much more rarely. It
involves accounting for higher-level properties, such as realism,
consistency and diversity:
-   Diversity, as in generating documents with sufficient variability
    in form and content, is increasingly considered in the literature,
    in particular as a result of some NLG models' tendency to repeat
    themselves. However, existing metrics are often rather basic and
    cover only some aspects of the diversity property (see for
    instance https://arxiv.org/pdf/2006.14799.pdf or
    https://aclanthology.org/2021.eacl-main.25.pdf ), so there is room
    for improvement.
-   Consistency, as in not generating one document where a given
    person is born on April 12 and another where the same person is
    born on April 13, is sometimes considered at the level of a single
    document (in the context of story generation), but not really
    across documents. Still, there is inspiration to be drawn here from
    the fact-checking literature, for instance considering settings
    where each document in turn is to be fact-checked against all
    others.
-   Realism is a much more complex, multi-faceted property,
    encompassing at least the adequacy of style (would the fictional
    author have written that way, in light of their background?), the
    nature of contents (would a transcript of a political debate cite
    football match results?), the nature of facts (would a politician
    be aged 15, or a World Cup football player aged 60?), and possibly
    many other aspects. On this part, most of the work remains to be
    done.
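
To illustrate what a "rather basic" diversity metric looks like when
applied at dataset level, here is a minimal sketch of a distinct-n
ratio (unique n-grams divided by total n-grams, pooled over all
documents). The function name, the whitespace tokenization and the toy
documents are assumptions made for this illustration, not part of the
project.

```python
from collections import Counter


def distinct_n(documents, n=2):
    """Ratio of unique n-grams to total n-grams, pooled over the dataset."""
    counts = Counter()
    for doc in documents:
        # Naive lowercase whitespace tokenization (an assumption of this sketch).
        tokens = doc.lower().split()
        # zip over shifted copies of the token list yields all n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


# Toy datasets (hypothetical): one repeats itself, the other does not.
repetitive = ["the army moved north", "the army moved north"]
varied = ["the army moved north", "talks resumed in the capital"]

print(distinct_n(repetitive))  # 0.5
print(distinct_n(varied))      # 1.0
```

Such a ratio only captures surface repetition; it says nothing about
variability in topics, styles or document types, which is the kind of
gap mentioned above.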

The intern's work will be to develop new metrics to quantify dataset
quality along those three properties, drawing inspiration both from
conceptual investigations and user studies (to build a taxonomy of
aspects to evaluate) and from empirical studies conducted with
state-of-the-art NLG models to observe how the generated data reacts to
various tentative metrics.
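
As one example of what a tentative metric could look like for the
consistency property, the sketch below flags attributes that receive
different values in different documents. It assumes that facts have
already been extracted as (entity, attribute, value) triples and that
each attribute is single-valued; the extraction step, the function
names and the toy data are all hypothetical.

```python
from collections import defaultdict


def find_conflicts(doc_facts):
    """Return the (entity, attribute) pairs that have more than one value.

    doc_facts maps a document id to a list of (entity, attribute, value).
    """
    # (entity, attribute) -> value -> set of document ids stating that value
    seen = defaultdict(lambda: defaultdict(set))
    for doc_id, facts in doc_facts.items():
        for entity, attribute, value in facts:
            seen[(entity, attribute)][value].add(doc_id)
    return {
        key: {value: sorted(ids) for value, ids in values.items()}
        for key, values in seen.items()
        if len(values) > 1
    }


def consistency_rate(doc_facts):
    """Share of (entity, attribute) pairs stated without contradiction."""
    keys = {(e, a) for facts in doc_facts.values() for e, a, _ in facts}
    if not keys:
        return 1.0
    return 1 - len(find_conflicts(doc_facts)) / len(keys)


# Toy example (hypothetical person and facts).
doc_facts = {
    "doc1": [("Jane Doe", "birth_date", "April 12")],
    "doc2": [
        ("Jane Doe", "birth_date", "April 13"),
        ("Jane Doe", "birth_place", "Port City"),
    ],
}

print(find_conflicts(doc_facts))
print(consistency_rate(doc_facts))  # 0.5
```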


*PhD follow-up*

Based on the scientific methodology developed during the internship to
evaluate NLG quality at dataset level, the work will be pursued as a
PhD to propose new generation methods that perform better on those
metrics.

Considering both settings of topic-based and knowledge base-to-text
NLG, the goal in each case is to compare pure generation methods (using
GPT-3-like models or more specialized models) with approaches based on
well-focused Web crawling followed by text substitutions, paraphrasing
and other automated transformations applied to the collected documents,
to modify their style or their base information.
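
The simplest form of the text substitution step mentioned above can be
sketched as a dictionary-driven replacement of real-world names with
names from the imaginary world. The mapping and the fictional names
below are invented for this illustration, and a realistic pipeline
would also need to handle inflections, coreference and derived forms.

```python
import re


def substitute_entities(text, mapping):
    """Replace whole-word occurrences of each key in mapping by its value."""
    if not mapping:
        return text
    # Try longer names first so that a long name wins over its prefix.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda match: mapping[match.group(1)], text)


# Hypothetical mapping from real names to imaginary-world names.
mapping = {"Paris": "Valmora", "France": "Arvelia"}
source = "Talks resumed in Paris, the capital of France."

print(substitute_entities(source, mapping))
# Talks resumed in Valmora, the capital of Arvelia.
```

Word boundaries keep derived forms such as "Parisian" untouched, which
shows why plain substitution alone is not enough to fully adapt a
crawled document.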

The PhD student will work on designing and implementing new approaches
along those lines, but will also use the new metrics to evaluate and
compare those approaches with existing models in the NLG literature. As
work progresses, it is also likely that the initially proposed metrics
will be refined or complemented to better match empirical evidence
gathered during the PhD.

The exact PhD topic will be written in coordination with the intern, to
fit their primary interests within that broad objective. In any case,
the work will focus as a first step on plain-text documents written in
correct language, but then it will tackle the generation of more
complex types of data, for which several options can be considered,
such as: generating tweets, documents with corrupted language
(simulating typos or grammar errors in the most realistic way),
multimodal documents that include and discuss tables or pictures,
corpora containing divergent views and (purposely) inconsistent facts,
etc.

Since PhD application processes take place early in the year
(February-April), the intern will be asked to commit early to that PhD
follow-up, possibly even before the internship begins, and to be ready
to devote some time to writing the application over that period.