Generating instructions in natural language for robots

Laure Soulier, Nicolas Thome

Information

Supervisors: laure.soulier@isir.upmc.fr, nicolas.thome@isir.upmc.fr

Localization: Sorbonne University, France

Duration: 6 months, between February and August 2023.

Stipend: around 573,30 euros / month 

Expected profile: Master or engineering degree in Computer Science or
Applied Mathematics related to machine learning/natural language
processing. The candidate should have a strong scientific background
with good technical skills in programming, and be fluent in reading
and writing English.

How to apply? Send a CV, a motivation letter and Master records to
laure.soulier@isir.upmc.fr and
nicolas.thome@isir.upmc.fr. Recommendation letters would be
appreciated. Interviews will conducted as they arise and the position
will be filled as soon as possible - the latest application date is
set to 15th January.

A thesis can be considered at the end of the internship, depending on the progress of the
internship.

Context

Autonomous agents require reasoning and planning strategies for
performing tasks. We, therefore, believe that the semantics captured
by large language models can enhance the decision process at different
levels.

First, it can allow grounding object representations with common sense
to identify their intrinsic and actionable properties. Large language
models and also common sense knowledge bases, such as ConceptNet 2 ,
can be used as complementary information sources, implying to design
representation model leveraging multi-modal information. The
difficulty would be to identify which properties are relevant for
objects and how to fuse them into a single representation. Another
strategy can be to encode objects differently according to each
modality and then use self-attention to learn the possible
interactions that are relevant for the task solving. Object grounding
has been addressed in previous work [1, 6, 7], but we believe that a
larger point of view related to the object scene is crucial to better
model the context and object properties.

Second, natural language can serve for building and clarifying the
planning strategy, and therefore the actions done by a robot. Several
works have addressed instruction identification as abstract
representation [2, 4, 5] or natural language expression, but the
limited data supervision is often a challenge [3, 5]. To tackle this
issue, we propose to develop interactive training processes, which
imply asking humans to label situations with sentences, with strong
care on limiting interactions to a few relevant situations, to reduce
human effort. The underlying assumption is that the compositionality
of language is correlated to compositionality in the agent's world.

In this internship, we envision working on the generation of natural
language instructions and improve currents model. Our objective is to
enhance the semantics behind objects to identify the most relevant
actions/sub-actions. An example of expected outputs is presented in
Figure 1.  Figure 1: Example of expected outputs at the end of the
internship [5]

Objectives
The workplan proposed to the student are as follows :

1. Literature review on instruction generation for robotics.

2. Become familiar with the work of the previous work (can be run as
baselines)

3. Pursue the work by proposing novel models.

4. Conduct experiments on the proposed solution and evaluation schemes
with respect to baselines.

5. If the internship leads to publish work, we will provide support to
go present your work in a conference.

References

[1] Michael Ahn et al. "Do As I Can and Not As I Say: Grounding
Language in Robotic Affordances". In: arXiv preprint
arXiv:2204.01691. 2022.

[2] Jacob Andreas et al. "Learning with Latent Language". In:
Proceedings of the 2018 Conference of the North American Chapter of
the Association for Computational Linguistics: Human Language
Technologies, Volume 1 (Long Papers). New Orleans, Louisiana:
Association for Computational Linguistics, June 2018,
pp. 2166-2179. doi: 10.18653/v1/N18-1197.  url:
https://aclanthology.org/N18-1197.

[3] Haonan Chen et al. "Enabling Robots to Understand Incomplete
Natural Language Instructions Using Commonsense Reasoning". In: 2020
IEEE International Conference on Robotics and Automation, ICRA 2020,
Paris, France, May 31 - August 31, 2020. IEEE, 2020,
pp. 1963-1969. doi: 10.1109/ICRA40945.2020.9197315. url:
https://doi.org/10.  1109/ICRA40945.2020.9197315.

[4] Athul Paul Jacob et al. "Multitasking Inhibits Semantic
Drift". In: Proceedings of the 2021 Conference of the North American
Chapter of the Association for Computational Linguistics: Human
Language Technologies. Online: Association for Computational
Linguistics, June 2021, pp. 5351-5366. doi:
10.18653/v1/2021.naacl-main.421. url:
https://aclanthology.org/2021.naacl-main.421.

[5] Pratyusha Sharma et al. "Skill Induction and Planning with Latent
Language". In: Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers), ACL
2022, Dublin, Ireland, May 22-27, 2022. Ed. by Smaranda Muresan et
al. Association for Computational Linguistics, 2022,
pp. 1713-1726. doi: 10.18653/v1/2022.acl-long.120. url:
https://doi.org/10.18653/v1/2022.acl-long.120.

[6] Mohan Sridharan et al. "Combining Commonsense Reasoning and
Knowledge Acquisition to Guide Deep Learning in Robotics". In: CoRR
abs/2201.10266 (2022). arXiv: 2201.10266. url:
https://arxiv.org/abs/2201.10266.

[7] Antigoni Tsiami et al. "Multi3: Multi-Sensory Perception System
for Multi-Modal Child Interaction with Multiple Robots".  In: 2018
IEEE International Conference on Robotics and Automation, ICRA 2018,
Brisbane, Australia, May 21-25, 2018.  IEEE, 2018, pp. 1-8. doi:
10.1109/ICRA.2018.8461210. url:
https://doi.org/10.1109/ICRA.2018.8461210.