Dexterous Focus
Egocentric videos often contain significant camera motion, head tilting, and distracting elements, especially during dexterous tasks. By restricting the ego video to a crop that tracks the region around the camera-wearer's hands, we let the model focus on the features most relevant for activity understanding, and we show improved performance on human activity understanding without needing to modify existing video network architectures.
Algorithm. First, we use a hand detector, D, a focus-point selector, F, and a stabilizer, S, to obtain a crop trajectory across the video that follows the ego-viewer's hands. We then crop the video frames along this trajectory to obtain the final hand-focused video.
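As a concrete illustration, below is a minimal Python sketch of the D → F → S → crop pipeline. The specific implementations of D, F, and S are not spelled out here, so everything in the sketch is an assumption: `detect_hands` is a hypothetical stand-in for the detector D, F is taken to be the centroid of the detected hand boxes, and S is a simple moving average over the trajectory.

```python
import numpy as np

def hand_focus_crop(frames, detect_hands, crop_size=224, win=9):
    """Sketch of the D -> F -> S -> crop pipeline (assumptions noted above).

    frames:       list of HxWx3 uint8 arrays (the ego video).
    detect_hands: hypothetical stand-in for the hand detector D; maps a
                  frame to a list of (x1, y1, x2, y2) hand boxes.
    """
    h, w = frames[0].shape[:2]

    # F: focus-point selection -- assumed here to be the centroid of all
    # detected hand boxes; reuse the previous point when no hands are found.
    points, prev = [], (w / 2.0, h / 2.0)
    for frame in frames:
        boxes = detect_hands(frame)
        if boxes:
            centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0)
                       for x1, y1, x2, y2 in boxes]
            prev = tuple(np.mean(centers, axis=0))
        points.append(prev)

    # S: stabilization -- assumed to be a moving average over the raw
    # trajectory, with edge padding so the endpoints are not biased.
    traj = np.asarray(points)                      # shape (T, 2)
    pad = win // 2
    padded = np.pad(traj, ((pad, pad), (0, 0)), mode="edge")
    kernel = np.ones(win) / win
    traj = np.stack([np.convolve(padded[:, i], kernel, mode="valid")
                     for i in range(2)], axis=1)   # back to shape (T, 2)

    # Crop each frame around its (clamped) trajectory point.
    half = crop_size // 2
    crops = []
    for frame, (cx, cy) in zip(frames, traj):
        cx = int(np.clip(cx, half, w - half))
        cy = int(np.clip(cy, half, h - half))
        crops.append(frame[cy - half:cy + half, cx - half:cx + half])
    return crops
```

Any temporal smoother could stand in for S here; the moving average simply keeps the crop window from jittering with per-frame detection noise.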
Results
Dexterous Focus Improvement
| Method (pretraining) | Training Data | Top-1 Accuracy (%) |
|---|---|---|
| TimeSFormer (K600) | ego | 39.18 |
| TimeSFormer (K600) | hands | 45.81 (+17%) |
| TimeSFormer (K600) | ego+hands | 47.75 (+22%) |
Keystep Recognition with Hands. Top-1 accuracy of keystep recognition on the held-out validation set. Focusing on hands yields a 17% relative improvement over ego alone, and combining both yields a 22% relative improvement.
Ablation
| Ablation | Top-1 Accuracy (Crop) | Top-1 Accuracy (Ego+Crop) |
|---|---|---|
| F_center, D_off | 34.89% | 42.27% |
| D_off | 41.93% | 45.17% |
| D_10% | 44.01% | 46.72% |
| S_off | 44.02% | 47.37% |
| All (D, F, S) | 45.81% | 47.75% |
Ablation Study. Top-1 accuracy of keystep recognition, reported with various components of the framework disabled or altered. The results indicate that each component of our method contributes to the overall performance of dexterous focus.
Benchmark Results Comparison
| Method (pretraining) | Training Data | Top-1 Accuracy |
|---|---|---|
| TimeSFormer (K600) | exo | 32.68% |
| TimeSFormer (K600) | ego | 35.13% |
| EgoVLPv2 (Ego4D) | ego, exo | 35.84% |
| EgoVLPv2 (EgoExo4D) | ego | 36.04% |
| Ego-Exo Transfer MAE | ego, exo | 37.17% |
| Viewpoint Distillation | ego, exo | 38.19% |
| EgoVLPv2 (EgoExo4D) | ego, exo | 39.10% |
| TimeSFormer* (K600) | ego | 39.18% |
| VI Encoder (EgoExo4D) | ego, exo | 40.34% |
| TimeSFormer* (K600) | hands | 45.81% |
| TimeSFormer* (K600) | ego+hands | 47.75% |
Keystep Recognition Benchmark. Top-1 accuracy of keystep recognition on the held-out validation set. Star (*) denotes our TimeSFormer re-implementation; all other results are reported directly from Ego-Exo4D. Rows are ranked by performance.
BibTeX
@misc{zachavis2025dexfocus,
  title         = {Improving Keystep Recognition in Ego-Video via Dexterous Focus},
  author        = {Zach Chavis and Stephen J. Guy and Hyun Soo Park},
  year          = {2025},
  archivePrefix = {arXiv},
}