Post-doc position (CEA List and LISN): Injecting Knowledge into Multimedia Entity Representation

In the context of the research project MultimEdia Entity Representation and Question Answering Tasks (ANR 2020-2024), a postdoctoral position is proposed for highly motivated candidates interested in multimedia understanding and natural language processing.

Context

Exploiting multimedia content often relies on the correct identification of entities in text and images. A major difficulty in understanding multimedia content lies in its ambiguity with regard to the actual user needs, for instance when identifying an entity from a given textual mention or matching a visual object to a query expressed through language [ADJ20a, ADJ20b]. The MEERQAT project (https://www.meerqat.fr) addresses the problem of analyzing ambiguous visual and textual content by learning and combining their representations and by taking into account the existing knowledge about entities. It aims at solving the Multimedia Question Answering (MQA) task, which requires answering a textual question associated with a visual input such as an image, given a knowledge base (KB) containing millions of unique entities and associated texts. The post-doc specifically addresses the problem of injecting knowledge into multimodal entity representations in order to answer questions that relate to these entities. Other partners of the project work on the visual, textual, and KB representations, and on entity disambiguation.

Main activities

Entities can be represented by different modalities, in particular by visual and textual content. In a common space, an entity can thus be represented by several vectors, which need to be combined into a unique representation that reflects the similarity of related entities. In such a context, a promising approach consists of learning a visual representation from natural language supervision [RAD21], relying on large datasets and a simple learning strategy based on contrastive predictive coding [OOR18], adapted to the textual and visual modalities [ZHA20]. The learned representation makes it possible to address multiple cross-modal tasks and provides a large-scale vocabulary adapted to a general audience in a given language. It exhibits state-of-the-art performance on several tasks and can even exceed humans on some of them. However, it does not include any structural information from a knowledge base. The main task of the post-doc will thus consist in injecting such prior knowledge into the entity representation to address Multimedia Question Answering; some approaches were recently proposed to do so in the context of caption generation [GOE20]. A minimal sketch of one possible starting point is given at the end of this section.

We consider entities such as persons, places, objects, or organizations (NGOs, companies...). Depending on the type of an entity, the information to take into account in its representation is not obvious. While a person can probably be associated with a few mentions and images, the choice is less obvious for other types of entities. For instance, a company can be associated with its logo, but also with its main products or even its main managers (CEO, CTO, ...). In the same vein, a location can be represented by many pictures, and a large one such as a city by some emblematic buildings or places. The second task of the post-doc will consist in determining the appropriate information to include in the representation of a given entity, depending on its type.
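For illustration only, the following PyTorch sketch combines the two ingredients mentioned above: an attention-based fusion of pre-computed visual, textual, and KB embeddings of an entity into a single vector, trained with a symmetric InfoNCE objective in the spirit of [OOR18, RAD21]. All names (EntityFusion, info_nce), dimensions, and design choices are hypothetical and do not refer to any project code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EntityFusion(nn.Module):
        """Fuse visual, textual, and KB embeddings of an entity into one vector."""
        def __init__(self, dim_img, dim_txt, dim_kb, dim_out):
            super().__init__()
            # project each modality into a common space
            self.proj_img = nn.Linear(dim_img, dim_out)
            self.proj_txt = nn.Linear(dim_txt, dim_out)
            self.proj_kb = nn.Linear(dim_kb, dim_out)
            # scalar attention score per modality view
            self.attn = nn.Linear(dim_out, 1)

        def forward(self, v_img, v_txt, v_kb):
            # stack the three projected views: shape (batch, 3, dim_out)
            views = torch.stack([self.proj_img(v_img),
                                 self.proj_txt(v_txt),
                                 self.proj_kb(v_kb)], dim=1)
            # attention weights over the views, then weighted sum
            weights = F.softmax(self.attn(torch.tanh(views)), dim=1)
            fused = (weights * views).sum(dim=1)
            return F.normalize(fused, dim=-1)

    def info_nce(anchors, positives, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of matching (anchor, positive)
        pairs, both assumed L2-normalized; in-batch items are negatives."""
        logits = anchors @ positives.t() / temperature
        targets = torch.arange(anchors.size(0), device=anchors.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage: fuse three views of a batch of entities and align them
    # with matching question embeddings (random stand-ins here).
    fusion = EntityFusion(dim_img=512, dim_txt=768, dim_kb=200, dim_out=256)
    v_img, v_txt, v_kb = torch.randn(8, 512), torch.randn(8, 768), torch.randn(8, 200)
    entity = fusion(v_img, v_txt, v_kb)                    # (8, 256), unit norm
    question = F.normalize(torch.randn(8, 256), dim=-1)    # stand-in question encoder
    loss = info_nce(entity, question)
    loss.backward()

How the KB embeddings are obtained, and whether the fusion should depend on the entity type, are precisely among the questions the post-doc is expected to investigate.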
Position and Application

The post-doc will be supervised by CEA and LISN. The candidate will be hired by CEA (Palaiseau, near Paris, France) for an 18-month post-doc. LISN is located close to CEA on the Paris-Saclay University campus. The salary depends on qualifications and experience and includes social coverage (health, unemployment, retirement). The post-doc will have access to large supercomputers equipped with multiple GPUs and large storage for experiments, in addition to a professional laptop. To apply, send a CV (including a publication list or a URL pointing to it, such as a Google Scholar profile) and a cover letter to Hervé Le Borgne, Olivier Ferret, Sahar Ghannay, and Anne Vilnat.

Profile

- PhD in Natural Language Processing, Computer Vision, Machine Learning, or another relevant field
- Strong publication record, with accepted articles in top-tier conferences and journals of the domain
- Solid programming skills (PyTorch/TensorFlow); publicly available projects will be appreciated
- Ability to communicate and collaborate at the highest technical level
- Experience using GPUs on a supercomputer (e.g., with SLURM or a similar tool) will be appreciated

References

[ADJ20a] O. Adjali, R. Besançon, O. Ferret, H. Le Borgne, and B. Grau. Multimodal Entity Linking for Tweets. 42nd European Conference on Information Retrieval (ECIR): Advances in Information Retrieval, 2020.
[ADJ20b] O. Adjali, R. Besançon, O. Ferret, H. Le Borgne, and B. Grau. Building a Multimodal Entity Linking Dataset From Tweets. 12th International Conference on Language Resources and Evaluation (LREC), 2020.
[GOE20] A. Goel, B. Fernando, T.-S. Nguyen, and H. Bilen. Injecting Prior Knowledge into Image Caption Generation. 2020.
[OOR18] A. van den Oord, Y. Li, and O. Vinyals. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2018.
[RAD21] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020, 2021.
[ZHA20] Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz. Contrastive Learning of Medical Visual Representations from Paired Images and Text. arXiv:2010.00747, 2020.