Existing foundation models often used in robot perception, such as Vision-Language Models (VLMs), are typically limited to reasoning over objects and actions currently visible in the image. We present a spatial extension to VLMs that leverages egocentric video demonstrations to predict where new tasks can be performed relative to the observer.
We define spatial task affordance as the region in an environment where a given task can be completed. We refer to the problem of predicting the location of this region relative to the observer as Simultaneous Localization and Affordance Prediction (SLAP).
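To make the problem interface concrete, the following is a minimal sketch of what a SLAP query and prediction could contain; the class and field names are illustrative assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SLAPQuery:
    """A SLAP query: a natural-language task posed from the observer's current egocentric view."""
    task_text: str              # e.g. "chop the onions" (hypothetical example task)
    observer_image: np.ndarray  # egocentric RGB frame, H x W x 3

@dataclass
class SLAPPrediction:
    """A predicted spatial task affordance, expressed relative to the observer."""
    region_center: np.ndarray   # (x, y, z) offset of the affordance region from the observer
    region_points: np.ndarray   # N x 3 points outlining the region where the task can be completed
```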
Our approach to training a model for egocentric SLAP pairs tasks seen in a demonstration with unrelated frames, which encourages the model to capture the environment's spatial task affordances rather than only the content of the current view.
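One way this pairing could be realized is sketched below; the record layout, the 4x4 homogeneous camera poses, and the sampling scheme are assumptions made for illustration, not the released training code.

```python
import random
import numpy as np

def build_training_pairs(demonstration, pairs_per_task=4):
    """Pair each annotated task with frames taken from other moments of the demonstration.

    `demonstration` is assumed to be a list of records, each holding a camera frame,
    its 4x4 camera pose, and (optionally) the task annotated at that moment.
    Supervising the model with frames in which the task is not being performed
    pushes it to predict where the task *could* be done, i.e. the spatial task
    affordance, rather than merely recognizing the action on screen.
    """
    pairs = []
    for record in demonstration:
        if record["task"] is None:
            continue
        # Frames unrelated to this task, drawn from elsewhere in the demonstration.
        unrelated = [r for r in demonstration if r["task"] != record["task"]]
        for other in random.sample(unrelated, k=min(pairs_per_task, len(unrelated))):
            pairs.append({
                "query_frame": other["frame"],    # observer's unrelated view
                "task_text": record["task"],      # task to localize
                # Task location expressed relative to the query frame's observer.
                "relative_target": np.linalg.inv(other["camera_pose"]) @ record["camera_pose"],
            })
    return pairs
```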
Our model was trained on a dataset curated from Ego-Exo4D, spanning 12 kitchen environments with two hours of video containing 422 tasks. We evaluate the model on novel tasks, measuring affordance grounding and localization error. We compare against a KNN baseline that uses CLIP to encode all demonstration images and returns the camera pose of the image whose encoding maximizes cosine similarity with the query text encoding.
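The baseline can be reproduced roughly as follows using the Hugging Face `transformers` CLIP implementation; the checkpoint name and the shape of the demonstration data are assumptions, not the authors' exact configuration.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def knn_baseline(query_text, demo_images, demo_poses):
    """Return the camera pose of the demonstration image most similar to the query text.

    `demo_images` is a list of PIL images sampled from the demonstration video and
    `demo_poses` holds the corresponding camera poses, one per image.
    """
    with torch.no_grad():
        image_feats = model.get_image_features(
            **processor(images=demo_images, return_tensors="pt"))
        text_feats = model.get_text_features(
            **processor(text=[query_text], return_tensors="pt", padding=True))

    # Cosine similarity between the query text and every demonstration frame.
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = (image_feats @ text_feats.T).squeeze(-1)

    return demo_poses[int(sims.argmax())]
```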
Using the trained model, a mobile robot can query, from its own camera, tasks not previously seen in the environment. The stove is highlighted in both views to provide context.
A mobile robot may want to avoid busy areas while carrying out tasks. Multiple predicted task regions can be combined via a convex hull, representing a busy region the robot should avoid while navigating.
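A minimal sketch of this combination step, assuming each predicted affordance region is given as a set of 2-D floor-plane points and using SciPy's convex hull; the function names here are illustrative.

```python
import numpy as np
from scipy.spatial import ConvexHull

def busy_region_hull(predicted_regions):
    """Combine several predicted task regions (each an N x 2 array of floor-plane
    points) into a single convex hull describing the busy area to avoid."""
    return ConvexHull(np.vstack(predicted_regions))

def inside_busy_region(hull, point, tol=1e-9):
    """True if a 2-D point lies inside (or on the boundary of) the busy region."""
    # Each facet equation satisfies a*x + b*y + c <= 0 for points inside the hull.
    return bool(np.all(hull.equations[:, :-1] @ np.asarray(point) + hull.equations[:, -1] <= tol))
```

A planner could then penalize or prune candidate waypoints for which `inside_busy_region` returns True.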
@inproceedings{zachavis-slap-2025,
  author    = {Chavis, Zach and Park, Hyun Soo and Guy, Stephen J.},
  title     = {Simultaneous Localization and Affordance Prediction of Tasks from Egocentric Video},
  booktitle = {Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)},
  year      = {2025},
}