# Exploiting the structure of HTML to learn document representations

## Context

Information Retrieval (IR) models aim at predicting which documents within a potentially huge collection are relevant to a given user information need (usually a query). Like many other fields, IR is nowadays dominated by transformer-based models. More precisely, two types of models are now prevalent: (1) representation-based techniques, where the document and the query representations (dense or sparse vectors) are computed separately before being compared with a matching function (e.g. an inner product); and (2) interaction-based techniques, where the query and the document content are processed jointly to compute a relevance score. Current research focuses on how to (pre)train these models and on how to better model the task, i.e. how to compute the representation of the document, of the query, or of both. Improving the quality of the representation is key to building successful (transformer) models for IR, as shown by the best-performing models to date [Gao and Callan, 2021].

## Objectives

The internship will explore new ways to compute the representation of Web documents by considering various aspects of these documents, i.e. both their internal (DOM) and external (hyperlink) structure, in the context of Information Retrieval. In Web search, the Document Object Model (DOM) tree represents the structure of a web page [Gupta et al., 2003]. Recent work on transformer-based models shows that this structure can be encoded explicitly [Ainslie et al., 2020] or implicitly [Aghajanyan et al., 2021] in the model. One recent approach [Guo et al., 2022] proposes to separate the encoding of the text content from that of the node structure, before using both representations as a basis for dense ranking.

The goals of this internship are to study how the HTML structure can be leveraged to (1) build better document representations by exploiting the inner HTML structure and/or the hyperlinks between documents; and (2) provide better pre-training (i.e. without the supervision of queries paired with relevant documents). The intern is encouraged to develop their own ideas, to publish in (inter)national venues, and/or to participate in international evaluation campaigns (such as TREC).
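As a purely illustrative starting point (not the method to be developed during the internship), the sketch below shows one simple way to combine the two ingredients mentioned above: it extracts text fragments together with their DOM paths from an HTML page, encodes each fragment with a pretrained transformer, mean-pools the fragments into a single document vector, and scores it against a query vector with an inner product, as in representation-based retrieval. The choice of BeautifulSoup, the model name, the mean pooling, and the helper names `dom_text_nodes` / `encode` are arbitrary assumptions made for the example.

```python
# Minimal sketch: DOM-aware text extraction + dense inner-product scoring.
from bs4 import BeautifulSoup
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # arbitrary choice for the example
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)


def dom_text_nodes(html: str):
    """Yield (dom_path, text) pairs, e.g. ('html/body/p', 'Some text')."""
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text:
            continue
        # Path from the root element down to the text node's parent.
        path = "/".join(
            p.name for p in reversed(list(node.parents))
            if p.name and p.name != "[document]"
        )
        yield path, text


def encode(texts):
    """Mean-pooled transformer embeddings for a list of texts."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state      # (B, L, H)
    mask = batch["attention_mask"].unsqueeze(-1)          # (B, L, 1)
    return (hidden * mask).sum(1) / mask.sum(1)           # (B, H)


html = "<html><body><h1>HTML and IR</h1><p>Using the DOM tree for ranking.</p></body></html>"
paths, texts = zip(*dom_text_nodes(html))
doc_vec = encode(list(texts)).mean(0)                     # one vector per document
query_vec = encode(["ranking web pages with DOM structure"])[0]
print(list(paths))                                        # DOM paths, available for structure-aware models
print(float(query_vec @ doc_vec))                         # inner-product relevance score
```

In this toy version the DOM paths are extracted but not yet used; a structure-aware model along the lines of the cited work would feed them (or the tree they describe) into the encoder rather than discarding them.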
## Organization

The internship will take place at the Qwant offices, with visits to ISIR (remote work is also possible). The internship is supervised by Benjamin Piwowarski from ISIR, and by Lara Perinetti and Romain Deveaud from Qwant. The intern will potentially work with the following tools/technologies:

- Deep Learning libraries (PyTorch, TensorFlow, Jax/Flax, Huggingface ecosystem, etc.)
- Python
- Search engine tools (https://github.com/vespa-engine/pyvespa)
- Git version control
- Jupyter

## Environment

Qwant will provide the intern with a laptop and access to a remote compute server with GPU capabilities.

Candidates can send their questions, as well as their resumes and a short motivation statement (a few lines), to l.perinetti@qwant.com, r.deveaud@qwant.com and benjamin@piwowarski.fr

## References

[Gao and Callan, 2021] L. Gao and J. Callan, "Unsupervised Corpus Aware Language Model Pre-training for Dense Passage Retrieval," arXiv:2108.05540 [cs], Aug. 2021. Available: http://arxiv.org/abs/2108.05540

[Gupta et al., 2003] S. Gupta, G. Kaiser, D. Neistadt, and P. Grimm, "DOM-based content extraction of HTML documents," in Proceedings of the Twelfth International Conference on World Wide Web (WWW '03), Budapest, Hungary, 2003, p. 207. doi: 10.1145/775152.775182. Available: http://portal.acm.org/citation.cfm?doid=775152.775182

[Ainslie et al., 2020] J. Ainslie et al., "ETC: Encoding Long and Structured Inputs in Transformers," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 2020, pp. 268-284. doi: 10.18653/v1/2020.emnlp-main.19. Available: https://www.aclweb.org/anthology/2020.emnlp-main.19

[Aghajanyan et al., 2021] A. Aghajanyan, D. Okhonko, M. Lewis, M. Joshi, H. Xu, G. Ghosh, and L. Zettlemoyer, "HTLM: Hyper-Text Pre-Training and Prompting of Language Models," arXiv:2107.06955 [cs], Jul. 2021. Available: http://arxiv.org/abs/2107.06955

[Guo et al., 2022] Y. Guo, Z. Ma, J. Mao, H. Qian, X. Zhang, H. Jiang, Z. Cao, and Z. Dou, "Webformer: Pre-training with Web Pages for Information Retrieval," in Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), 2022, pp. 1502-1512. doi: 10.1145/3477495.3532086