Improving Keystep Recognition in Ego-Video via Dexterous Focus

University of Minnesota
CVPR 2025: Second Joint Egocentric Vision (EgoVis) Workshop
[Teaser figure]

Dexterous Focus. Egocentric videos often contain significant camera motion, head tilting, and distracting elements, especially during dexterous tasks. By restricting the ego video to track only the region around the camera-wearer's hands, we let the model focus on the features most relevant to activity understanding, and we show improved performance on human activity understanding without needing to modify existing video network architectures.

[Algorithm figure]

Algorithm. First, we use a hand detector, D, a focus point selection step, F, and a stabilizer, S, to obtain a sampling trajectory across the video that focuses on the camera-wearer's hands. We then crop the video frames along this trajectory to obtain the final hand-focused video.
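
Below is a minimal Python sketch of this D -> F -> S -> crop pipeline. It is illustrative only: detect_hands stands in for any off-the-shelf hand detector, and the focus-point rule (mean of detected hand centers, falling back to the previous point or the frame center) and the exponential-smoothing stabilizer are assumptions, not necessarily the exact choices used in the paper.

import numpy as np

def detect_hands(frame):
    """D: return a list of (x, y) hand-center points for one frame.
    Placeholder for any off-the-shelf hand detector."""
    raise NotImplementedError

def select_focus_point(hand_points, prev_point, frame_shape):
    """F: choose one focus point per frame -- the mean of detected hand
    centers, falling back to the previous point or the frame center."""
    if hand_points:
        return np.mean(np.asarray(hand_points, dtype=float), axis=0)
    if prev_point is not None:
        return prev_point
    h, w = frame_shape[:2]
    return np.array([w / 2.0, h / 2.0])

def stabilize(trajectory, alpha=0.9):
    """S: smooth the per-frame focus trajectory (exponential smoothing)."""
    smoothed = [trajectory[0]]
    for point in trajectory[1:]:
        smoothed.append(alpha * smoothed[-1] + (1.0 - alpha) * point)
    return smoothed

def hand_focused_video(frames, crop_size=224):
    """Crop each frame around the stabilized hand-focus trajectory."""
    trajectory, prev = [], None
    for frame in frames:                      # run D and F per frame
        prev = select_focus_point(detect_hands(frame), prev, frame.shape)
        trajectory.append(prev)
    crops = []
    for frame, (cx, cy) in zip(frames, stabilize(trajectory)):  # apply S, then crop
        h, w = frame.shape[:2]
        half = crop_size // 2
        x0 = int(np.clip(cx - half, 0, w - crop_size))
        y0 = int(np.clip(cy - half, 0, h - crop_size))
        crops.append(frame[y0:y0 + crop_size, x0:x0 + crop_size])
    return crops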

Results


Dexterous Focus Improvement

Method (pretraining)   Training Data   Top-1 Accuracy (%)
TimeSFormer (K600)     ego             39.18
TimeSFormer (K600)     hands           45.81 (+17%)
TimeSFormer (K600)     ego+hands       47.75 (+22%)

Keystep Recognition with Hands. The Top-1 Accuracy of keystep recognition on the hold-out validation set. Focusing on hands yields a 17% relative improvement over the ego view alone, and combining both yields a 22% relative improvement.
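
The percentages in parentheses are relative (not absolute) improvements over the ego-only baseline, as the quick check below confirms using the Top-1 values from the table above.

# Relative improvement over the ego-only baseline.
ego, hands, ego_hands = 39.18, 45.81, 47.75
print(f"hands vs. ego:     +{100 * (hands - ego) / ego:.1f}%")      # +16.9%, reported as +17%
print(f"ego+hands vs. ego: +{100 * (ego_hands - ego) / ego:.1f}%")  # +21.9%, reported as +22%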


Ablation

Ablation          Top-1 Accuracy (Crop)   Top-1 Accuracy (Ego+Crop)
F_center, D_off   34.89%                  42.27%
D_off             41.93%                  45.17%
D_10%             44.01%                  46.72%
S_off             44.02%                  47.37%
All (D, F, S)     45.81%                  47.75%

Ablation Study. Top-1 Accuracy of keystep recognition is reported with various components of the framework turned off or altered. Our ablation results indicate that each component of our method adds to the overall performance of dexterous focus.


Benchmark Results Comparison

Method (pretraining)     Training Data   Top-1 Accuracy
TimeSFormer (K600)       exo             32.68%
TimeSFormer (K600)       ego             35.13%
EgoVLPv2 (Ego4D)         ego, exo        35.84%
EgoVLPv2 (EgoExo4D)      ego             36.04%
Ego-Exo Transfer MAE     ego, exo        37.17%
Viewpoint Distillation   ego, exo        38.19%
EgoVLPv2 (EgoExo4D)      ego, exo        39.10%
TimeSFormer* (K600)      ego             39.18%
VI Encoder (EgoExo4D)    ego, exo        40.34%
TimeSFormer* (K600)      hands           45.81%
TimeSFormer* (K600)      ego+hands       47.75%

Keystep Recognition Benchmark. The Top-1 Accuracy of keystep recognition on the hold-out validation set. Star (*) denotes our TimeSFormer re-implementation; all other results are reported directly from Ego-Exo4D. Rows are ranked by performance.


BibTeX

@misc{zachavis2025dexfocus,
  title         = {Improving Keystep Recognition in Ego-Video via Dexterous Focus},
  author        = {Zach Chavis and Stephen J. Guy and Hyun Soo Park},
  year          = {2025},
  archivePrefix = {arXiv},
}