Research intern position (with the goal of pursuing a PhD) at Inria on
information extraction and automated knowledge graph construction for
French

*Place of work*:
Inria center in Paris area (Paris / Saclay / Rocquencourt)
*Duration*: 6-month internship + 3-year PhD
*Starting date*: Anytime in 2023
*Keywords*: artificial intelligence, natural language processing,
information extraction, knowledge graph, French language

*Context*

This internship fits within the roadmap activities of Inria's Defense &
Security Department.

Analysts in geopolitical crises are employed by the French Ministry of
Armed Forces to better identify emerging or ongoing conflicts
throughout the world. These analysts are typically overwhelmed by
continuous streams of plain-text information. The goal is to structure
that information, so that it can be manipulated as graphs and therefore
better formalized, cross-referenced and corroborated; such a form would
in turn enable more advanced uses, such as automatically generated
reports or various indicators of escalating tensions, with a view to
better anticipation.

Taking the current situation in Ukraine as an example, one practical
application would be to get an overview of where in Ukraine Russian
tanks are currently located (and how many of them), based on reported
sightings posted by locals on Twitter. Another example is the
cross-referencing of live reports from online newspapers, to identify
which transport infrastructure (e.g. train stations, bridges) has been
damaged across the country, and thereby estimate the remaining options
for evacuating civilians.

The field of Natural Language Processing (NLP) offers numerous tools
and algorithms for information extraction, but they face several
limitations. First, many of them are disparate, isolated tools, with
few comprehensive and consistent pipelines. Second, while the first
steps of information structuring (extraction) are extensively studied,
fewer studies reach the deeper stage of knowledge graph construction.
And when they do, they are often developed for English only, whereas
here the information stream would be in French.

Inria's Defense & Security Department develops and maintains a serious
game platform which can simulate the activity of a crisis monitoring
cell. The platform will offer the opportunity to experiment with the
NLP tools developed by the intern, and thereby provide the intern with
practical feedback from players.

The intern will be supervised by Dr Lauriane Aufrant, who is the lead
NLP researcher within Inria's Defense & Security Department. PhD
supervision will be done jointly with Dr Frédérique Segond (Inria's
Defense & Security Director) or with a researcher from another Inria
team, depending on the exact PhD topic chosen (to be discussed, see
below).


*Candidate profile*

-   Pursuing a master's degree in Natural Language Processing,
    Computational Linguistics or Computer Science with a specialization
    in Machine Learning
-   Theoretical and practical knowledge of deep learning, as well as
    traditional machine learning and knowledge-driven AI
-   Strong programming skills (at least Python, git, Linux environment,
    command line and scripting)
-   Fluency in English. Knowledge of or interest in the French language.
    Knowledge of a second foreign language would be appreciated.


*How to apply*

Send a CV and a cover letter to lauriane.aufrant and frederique.segond
(both at inria.fr)
Names of referees or reference letters would be appreciated but
are not mandatory.


*Internship description*

Building a knowledge graph from text involves a number of diverse NLP
tasks, such as named entity recognition, named entity disambiguation
(aka entity linking), coreference resolution, open relation extraction,
relation clustering, document-level event extraction, slot filling,
etc.

The intern's work will touch upon the whole range of tasks, but in
varying depth. Considering the large number of open source code
releases in NLP research, priority is set on leveraging existing code
and models. When French models are not readily available, open source
code will need to be retrained on French corpora. And in some cases, it
will be necessary to re-implement an algorithm from its published
paper.

While deep learning approaches will be ubiquitous in that work, the
intern will need to remain open to alternative solutions, as the
large-scale datasets required by deep learning will not be available in
French for all these tasks.

The first research focus will be to study the best combination scheme
for these various tasks. For instance, relation clustering can inform
named entity disambiguation by providing more comprehensive and
consistent information on the named entities to disambiguate; and named
entity disambiguation can inform relation clustering by enabling access
to structured information on the arguments of those relations. To
leverage such interactions within the pipeline, several approaches are
possible: run one before the other, or vice versa, iterate between
both, or set up joint predictions. These choices will be made based on
both theoretical and empirical analyses.
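
As a purely illustrative sketch of the first two options (sequential
and iterative scheduling; joint prediction is not covered), the Python
snippet below wires two toy components together. The functions
`toy_ned` and `toy_clustering`, the example mention "Georgia" and all
labels are hypothetical placeholders, not actual models or project
code.

```python
from typing import Callable, Dict, List

# A pipeline state maps a task name to that task's current predictions.
State = Dict[str, dict]
# A component reads the state (possibly including other tasks' outputs)
# and returns its own predictions.
Component = Callable[[State], dict]


def run_sequential(components: Dict[str, Component], order: List[str]) -> State:
    """Run each task once, in order; later tasks see earlier outputs."""
    state: State = {}
    for name in order:
        state[name] = components[name](state)
    return state


def run_iterative(components: Dict[str, Component], order: List[str],
                  max_rounds: int = 5) -> State:
    """Alternate between tasks until predictions stop changing."""
    state: State = {}
    for _ in range(max_rounds):
        previous = dict(state)
        for name in order:
            state[name] = components[name](state)
        if state == previous:  # fixed point reached
            break
    return state


def toy_ned(state: State) -> dict:
    """Hypothetical disambiguator for the single mention 'Georgia'."""
    clusters = state.get("clustering")
    if clusters and clusters["cluster"] == "military_action":
        # Relation evidence favours the country reading.
        return {"Georgia": "Georgia (country)"}
    # Without relation evidence, fall back to an assumed default sense.
    return {"Georgia": "Georgia (US state)"}


def toy_clustering(state: State) -> dict:
    """Hypothetical relation clusterer; types its argument if NED ran."""
    links = state.get("ned")
    arg_type = None
    if links:
        is_country = links["Georgia"] == "Georgia (country)"
        arg_type = "country" if is_country else "us_state"
    return {"cluster": "military_action", "arg_type": arg_type}


COMPONENTS = {"ned": toy_ned, "clustering": toy_clustering}

ned_first = run_sequential(COMPONENTS, ["ned", "clustering"])
clustering_first = run_sequential(COMPONENTS, ["clustering", "ned"])
iterated = run_iterative(COMPONENTS, ["ned", "clustering"])
```

In this toy setting, running disambiguation first locks in the default
sense, running clustering first leaves the relation argument untyped,
and only the iterative schedule lets both components benefit from each
other. Real components would of course behave less neatly, which is
why the choice calls for empirical analysis.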

The second research focus will be a fine-grained evaluation of
performance across the pipeline, in order to identify where in the
pipeline the most information is lost, and how errors propagate
throughout the pipeline. A thorough analysis will help identify where
to focus future research efforts to improve qualitative and
quantitative performance.
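
One possible way to separate a stage's own errors from errors inherited
from upstream is to score each stage twice: once on gold upstream
input, and once on the predictions actually produced by the pipeline.
The Python sketch below illustrates this on toy data. The stage
functions, the `KB:` identifiers and the use of plain accuracy are
assumptions made for illustration; a real evaluation would rely on
task-specific metrics.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[list], list]]


def accuracy(pred: list, gold: list) -> float:
    """Fraction of positions where prediction and gold agree."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def error_breakdown(stages: List[Stage], raw_input: list,
                    gold_outputs: List[list]) -> List[Tuple[str, float, float]]:
    """For each stage, return (name, score on gold input, score in pipeline)."""
    report = []
    predicted_input = raw_input
    for i, (name, fn) in enumerate(stages):
        gold_input = raw_input if i == 0 else gold_outputs[i - 1]
        oracle_score = accuracy(fn(gold_input), gold_outputs[i])
        predicted_output = fn(predicted_input)
        pipeline_score = accuracy(predicted_output, gold_outputs[i])
        report.append((name, oracle_score, pipeline_score))
        predicted_input = predicted_output  # errors flow downstream
    return report


def toy_ner(tokens: list) -> list:
    """Hypothetical tagger: any capitalised token is tagged as a location."""
    return [(t, "LOC" if t[0].isupper() else "O") for t in tokens]


GAZETTEER = {"Kyiv": "KB:Kyiv"}  # hypothetical, deliberately incomplete


def toy_linker(tagged: list) -> list:
    """Hypothetical linker: look up LOC mentions, 'NIL' if unknown."""
    return [GAZETTEER.get(t, "NIL") if label == "LOC" else None
            for t, label in tagged]


tokens = ["Tanks", "near", "Kyiv", "Kharkiv"]
gold_ner = [("Tanks", "O"), ("near", "O"), ("Kyiv", "LOC"), ("Kharkiv", "LOC")]
gold_links = [None, None, "KB:Kyiv", "KB:Kharkiv"]

report = error_breakdown([("ner", toy_ner), ("linking", toy_linker)],
                         tokens, [gold_ner, gold_links])
for name, oracle_score, pipeline_score in report:
    print(f"{name}: gold input {oracle_score:.2f}, in pipeline {pipeline_score:.2f}")
```

For a given stage, the gap between the two scores estimates how much
of its loss is propagated from earlier stages, while the score on gold
input reflects the stage's intrinsic quality.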


*PhD follow-up*

The proposed internship is meant to serve as a proof of concept for a
broader project on building a unified framework for information
extraction. The goal is to implement a framework that is flexible
enough to integrate and evaluate any method from the state of the art
(with the prospect of becoming a new standard for the community), but
also to experiment more deeply with pipeline design, considering
innovative combination schemes or extra preprocessing (e.g. parsing, to
enable syntax-aware models).

It is therefore proposed to follow up the internship with a PhD, whose
exact topic will be defined in coordination with the intern, to fit
their primary interests within that broad objective.

In any case, the PhD topic will include at least work on the
framework itself (which will include extensive survey work to build an
accurate view of the diversity of existing approaches, and their
commonalities). Extensions of particular interest to the team are:
-   to pursue the work on the combination scheme, with the twofold goal
    of limiting error propagation and increasing the amount of
    information available as input,
-   to propose new task-specific models that better leverage the
    existing third-party information produced along the pipeline,
-   to work on algorithmic methods to speed up the building of a
    pipeline under that framework, for a new language that does not
    have as many datasets and existing models as English (including,
    but not limited to, transfer learning approaches),
-   to extend the framework with domain adaptation capabilities, to
    facilitate the application of the framework to a new domain (for a
    language in which a full pipeline already exists),
-   to focus on some specific tasks to improve in the French pipeline,
    according to the bottleneck analysis produced at the end of the
    internship.

Since PhD application processes take place early in the year
(February-April), the intern will be asked to commit early to that PhD
follow-up, possibly even before the internship begins, and to be ready
to devote some time to writing the application over that period.