This repo is the official implementation of the ICLR 2025 paper EgoHOD:
"Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning"
Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang,
Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- HOD data release
- Pretraining code release
- Finetuning code release
- Pretrained model checkpoints release
- Finetuned model checkpoints release
- Evaluation code release
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks.
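For intuition about the adapter idea, here is a minimal, generic bottleneck-adapter sketch in PyTorch. It only illustrates what a lightweight adapter typically looks like; it is not the actual motion adapter used in EgoVideo (see the paper and the upcoming adapter code for that).

```python
# Generic bottleneck-adapter sketch, for illustration only --
# NOT the EgoVideo motion adapter described in the paper.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual MLP inserted alongside a (frozen) transformer block."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Toy usage: adapt per-frame token features of shape (batch, tokens, dim).
tokens = torch.randn(2, 196, 768)
adapted = BottleneckAdapter(dim=768)(tokens)
print(adapted.shape)  # torch.Size([2, 196, 768])
```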
```bash
git clone https://github.com/OpenRobotLab/EgoHOD.git
cd EgoHOD
conda env create -f environment.yml
conda activate hod
pip install -r requirements.txt
```
You can get our HOD annotations from this Huggingface link.
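If you prefer to fetch the annotations programmatically, a minimal sketch using `huggingface_hub` is shown below; the `repo_id` is a placeholder and should be replaced with the dataset id behind the link above.

```python
# Minimal sketch: download the HOD annotations from the Hugging Face Hub.
# NOTE: the repo_id below is a placeholder -- replace it with the dataset id
# behind the Hugging Face link above.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="OpenRobotLab/EgoHOD",  # placeholder dataset id
    repo_type="dataset",
    local_dir="./data/HOD",
)
print("Annotations downloaded to", local_dir)
```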
For training the EgoVideo model without the adapter, you can simply run:
```bash
bash ./exps/pretrain.sh
```
Notes:
- Modify the yml files in `./configs` before running the scripts.
- For training without the slurm script, you can simply run (a programmatic variant is sketched below):
  ```bash
  python main_pretrain.py --config_file configs/clip_base.yml
  ```
- For the model with the adapter, we will release the pretraining code soon.
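If you want to tweak settings programmatically, a minimal sketch is shown below. It relies only on the `--config_file` flag above; the config keys it overrides (`output_dir`, `batch_size`) are hypothetical examples, so check `configs/clip_base.yml` for the real field names.

```python
# Minimal sketch: override config fields and launch pretraining without slurm.
# NOTE: "output_dir" and "batch_size" are hypothetical keys -- check
# configs/clip_base.yml for the actual field names.
import subprocess
import yaml

with open("configs/clip_base.yml") as f:
    cfg = yaml.safe_load(f)

cfg["output_dir"] = "./outputs/clip_base_run1"  # hypothetical key
cfg["batch_size"] = 64                          # hypothetical key

with open("configs/clip_base_run1.yml", "w") as f:
    yaml.safe_dump(cfg, f)

subprocess.run(
    ["python", "main_pretrain.py", "--config_file", "configs/clip_base_run1.yml"],
    check=True,
)
```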
For our pretrained model, you can download the checkpoint from this link.
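After downloading, you can sanity-check the checkpoint with a sketch like the following; the file name is a placeholder, and the checkpoint may or may not nest its weights under a `state_dict` key.

```python
# Minimal sketch: inspect a downloaded pretrained checkpoint.
# NOTE: "egovideo_pretrained.pt" is a placeholder file name; adjust the path
# and the key lookup to whatever the released checkpoint actually contains.
import torch

ckpt = torch.load("egovideo_pretrained.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt

print("number of entries:", len(state_dict))
for name, value in list(state_dict.items())[:10]:
    shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
    print(name, shape)
```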
We will update the code soon.
For zero-shot evaluation, you can simply run the scripts in `exps` as follows:
```bash
bash exps/eval_ekcls.sh
```
We provide the evaluation code for EK100-MIR, EK100-CLS, EGTEA, and EGOMCQ.
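For reference, zero-shot classification in these benchmarks boils down to matching video embeddings against text embeddings of the class names. The sketch below illustrates this with random features; it is not the repository's evaluation code.

```python
# Illustrative sketch (not the repo's evaluation code): zero-shot action
# classification by matching video embeddings against class-name text
# embeddings with cosine similarity, CLIP-style.
import torch
import torch.nn.functional as F

def zero_shot_top1(video_emb: torch.Tensor,       # (N, D) video features
                   text_emb: torch.Tensor,        # (C, D) one text feature per class
                   labels: torch.Tensor) -> float:  # (N,) ground-truth class ids
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sims = video_emb @ text_emb.t()               # (N, C) cosine similarities
    pred = sims.argmax(dim=-1)
    return (pred == labels).float().mean().item()

# Toy usage with random features (replace with real EgoVideo embeddings).
acc = zero_shot_top1(torch.randn(8, 512), torch.randn(10, 512),
                     torch.randint(0, 10, (8,)))
print(f"top-1 accuracy: {acc:.3f}")
```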
If you find this repository useful, please use the following BibTeX entry for citation.
```bibtex
@misc{pei2025modeling,
  title={Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning},
  author={Baoqi Pei and Yifei Huang and Jilan Xu and Guo Chen and Yuping He and Lijin Yang and Yali Wang and Weidi Xie and Yu Qiao and Fei Wu and Limin Wang},
  year={2025},
  eprint={2503.00986},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
This repository is built upon mae and AVION. Thanks to the contributors of these great codebases.