Proposition de stage 2022 : Ť Building a French pipeline for
information extraction and automated knowledge graph construction ť

Research intern position at Inria on information extraction and
automated knowledge graph construction for French

Place of work: Inria's center in Rocquencourt (Paris area)

Duration: 6 months

Starting date: Anytime in 2022

Keywords: artificial intelligence, natural language processing,
information extraction, knowledge graph, French language

Context
This internship fits within the roadmap activities of Inria's Defense &
Security Department.

Analysts in geopolitical crises are employed by the French Ministry of
Armed Forces to better identify emerging or ongoing conflicts
throughout the world. These analysts are typically overwhelmed by
continuous streams of plain-text information. The goal is to structure
that information, so that it can be manipulated as graphs and therefore
better formalized, cross-referenced and corroborated; such form would
in turn enable more advanced visualizations such as automatically
generating reports or various indicators on escalating tensions, with
the perspective of better anticipation.

The field of Natural Language Processing (NLP) offers numerous tools
and algorithms for information extraction, but they face several
limitations.
First, many of those are disparate isolated tools, with few
comprehensive and pipelines. While the first steps of information
structuring (extraction) are extensively studied, fewer works reach the
deeper stage of knowledge graph construction. And when they do, they
are often developed for English only, whereas here the information
stream would be in French.

Inria's Defense & Security Department develops and maintains a serious
game platform which can simulate the activity of a crisis monitoring
cell. Within that platform there will be the opportunity to experiment
with the NLP tools developed by the intern, in order to provide the
intern with practical feedback from players.

The intern will be supervised by Dr Lauriane Aufrant, who is the lead
NLP researcher within Inria's Defense & Security Department.


Candidate profile

-   Pursuing a master's degree in Natural Language Processing,
    Computational Linguistics or Computer Science with a
    specialization in Machine Learning

-   Theoretical and practical knowledge of deep learning, as well as
    traditional machine learning and knowledge-driven AI

-   Strong programming skills (at least Python, git, Linux environment,
    command line and scripting)

-   Fluency in English. Knowledge or interest for the French language.
    Knowledge of a second foreign language would be appreciated.

How to apply

Send a CV and a cover letter to lauriane.aufrant@inria.fr and
frederique.segond@inria.fr

Indications of referees or reference letters would be appreciated but
are not mandatory.


Description

Building a knowledge graph from text involves a number of diverse NLP
tasks, such as named entity recognition, named entity disambiguation,
coreference resolution, open relation extraction, relation clustering,
document-level event extraction, slot filling, etc.

The intern's work will touch upon the whole panel of tasks, but in
varying depth. Considering the large amount of open source code
releases in NLP research, priority is set on leveraging existing code
and models.
When French models are not readily available, open source code will
need to be retrained on French corpora. And in some cases, it will be
necessary to re-implement an algorithm from its published paper.

While deep learning approaches will be ubiquitous in that work, the
intern will need to remain open to alternate solutions, as the
large-scale datasets required by deep learning will not be available in
French for all these tasks.

The first research focus will be to study the best combination scheme
for these various tasks. For instance, relation clustering can inform
named entity disambiguation by providing more comprehensive and
consistent information on the named entities to disambiguate; and named
entity disambiguation can inform relation clustering by enabling access
to structured information on the arguments of those relations. To
leverage such interactions within the pipeline, several approaches are
possible: run one before the other, or vice-versa, iterate between
both, or setup joint predictions. These choices will be made based on
both theoretical and empirical analyses.

The second research focus will be a fine-grained evaluation of the
performance all over the pipeline, in order to identify where in the
pipeline the information is lost the most, and how errors propagate
throughout the pipeline. A thorough analysis will lead to identify
where to put the most research efforts in the future to improve
qualitative and quantitative performance. Depending on the outcomes of
that analysis, the opportunity to pursue with a PhD position will be
discussed with the intern.