Existing foundation models often used in robot perception, such as Vision-Language Models (VLMs), are typically limited to reasoning over objects and actions currently visible in the image. We present a spatial extension to VLMs that leverages egocentric video demonstrations to predict where new tasks can be performed relative to the observer.
We define spatial task affordance as the region in an environment where a given task can be completed. We refer to the problem of predicting the location of this region relative to the observer as Simultaneous Localization and Affordance Prediction (SLAP).
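To make the problem interface concrete, the following is a minimal sketch of what a SLAP query and prediction could contain; the class and field names are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SLAPQuery:
    """A SLAP query: a natural-language task posed from the observer's current egocentric view."""
    task_text: str              # e.g. "chop the onions" (hypothetical example task)
    observer_image: np.ndarray  # egocentric RGB frame, H x W x 3

@dataclass
class SLAPPrediction:
    """A predicted spatial task affordance, expressed relative to the observer."""
    region_center: np.ndarray   # (x, y, z) offset of the affordance region from the observer
    region_points: np.ndarray   # N x 3 points outlining the region where the task can be completed
```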
Our approach to training a model for egocentric SLAP pairs tasks seen in a demonstration with unrelated frames, which encourages the model to capture the environment's spatial task affordances rather than only the content of the current view.
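One way this pairing could be realized is sketched below; the record layout, the 4x4 homogeneous camera poses, and the sampling scheme are assumptions made for illustration, not the released training code.

```python
import random
import numpy as np

def build_training_pairs(demonstration, pairs_per_task=4):
    """Pair each annotated task with frames taken from other moments of the demonstration.

    `demonstration` is assumed to be a list of records, each holding a camera frame,
    its 4x4 camera pose, and (optionally) the task annotated at that moment.
    Supervising the model with frames in which the task is not being performed
    pushes it to predict where the task *could* be done, i.e. the spatial task
    affordance, rather than merely recognizing the action on screen.
    """
    pairs = []
    for record in demonstration:
        if record["task"] is None:
            continue
        # Frames unrelated to this task, drawn from elsewhere in the demonstration.
        unrelated = [r for r in demonstration if r["task"] != record["task"]]
        for other in random.sample(unrelated, k=min(pairs_per_task, len(unrelated))):
            pairs.append({
                "query_frame": other["frame"],    # observer's unrelated view
                "task_text": record["task"],      # task to localize
                # Task location expressed relative to the query frame's observer.
                "relative_target": np.linalg.inv(other["camera_pose"]) @ record["camera_pose"],
            })
    return pairs
```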
Our model was trained on a dataset curated from Ego-Exo4D, spanning 12 kitchen environments with two hours of video containing 422 tasks. We evaluate the model on novel tasks, measuring affordance grounding and localization error. We compare against a KNN baseline that uses CLIP to encode all demonstration images and returns the camera pose of the image whose encoding maximizes cosine similarity with the query text encoding.
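The baseline can be reproduced roughly as follows using the Hugging Face `transformers` CLIP implementation; the checkpoint name and the shape of the demonstration data are assumptions, not the authors' exact configuration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def knn_baseline(query_text, demo_images, demo_poses):
    """Return the camera pose of the demonstration image most similar to the query text.

    `demo_images` is a list of PIL images sampled from the demonstration video and
    `demo_poses` holds the corresponding camera poses, one per image.
    """
    with torch.no_grad():
        image_feats = model.get_image_features(
            **processor(images=demo_images, return_tensors="pt"))
        text_feats = model.get_text_features(
            **processor(text=[query_text], return_tensors="pt", padding=True))

    # Cosine similarity between the query text and every demonstration frame.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = (image_feats @ text_feats.T).squeeze(-1)

    return demo_poses[int(sims.argmax())]
```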
Using the trained model, a mobile robot can query, from its own camera, tasks not previously seen in the environment. The stove is highlighted in both views to provide context.
A mobile robot may want to avoid busy areas while carrying out tasks. Multiple predicted task regions can be combined via a convex hull, representing a busy region the robot should avoid while navigating.
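A minimal sketch of this combination step, assuming each predicted affordance region is given as a set of 2-D floor-plane points and using SciPy's convex hull; the function names here are illustrative.

```python
import numpy as np
from scipy.spatial import ConvexHull

def busy_region_hull(predicted_regions):
    """Combine several predicted task regions (each an N x 2 array of floor-plane
    points) into a single convex hull describing the busy area to avoid."""
    return ConvexHull(np.vstack(predicted_regions))

def inside_busy_region(hull, point, tol=1e-9):
    """True if a 2-D point lies inside (or on the boundary of) the busy region."""
    # Each facet equation satisfies a*x + b*y + c <= 0 for points inside the hull.
    return bool(np.all(hull.equations[:, :-1] @ np.asarray(point) + hull.equations[:, -1] <= tol))
```

A planner could then penalize or prune candidate waypoints for which `inside_busy_region` returns True.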
@inproceedings{zachavis-slap-2025,
  author    = {Chavis, Zach and Park, Hyun Soo and Guy, Stephen J.},
  title     = {Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2025},
}