EgoHOD

This repo is the official implementation of EgoHOD, accepted at ICLR 2025:

"Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning"
Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang,
Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang

Todo

  • HOD data release
  • Pretraining code release
  • Finetuning code release
  • Pretrained model checkpoints release
  • Finetuned model checkpoints release
  • Evaluation code release

Introduction

In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline employing a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including improvements of 6.3% in EK-100 multi-instance retrieval, 5.7% in EK-100 classification, and 16.3% in EGTEA classification in zero-shot settings. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks.

Installation

git clone https://github.com/OpenRobotLab/EgoHOD.git
cd EgoHOD
conda env create -f environment.yml
conda activate hod
pip install -r requirements.txt

Datasets

You can download our HOD annotations from this Hugging Face link.
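
If the annotations are hosted as a Hugging Face dataset repository, you can also fetch them programmatically with huggingface_hub. This is a minimal sketch; the repo id below is a placeholder, so substitute the repository behind the link above.

    # Minimal sketch: fetch the HOD annotations via huggingface_hub.
    # The repo id is a placeholder -- use the repository behind the link above.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id="OpenRobotLab/EgoHOD",   # placeholder, not confirmed
        repo_type="dataset",
        local_dir="./data/hod_annotations",
    )
    print(f"HOD annotations downloaded to {local_path}")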

Pretraining

To pretrain the EgoVideo model without the adapter, simply run:

bash ./exps/pretrain.sh

Notes:

  1. Modify the yml files in ./configs before running the scripts (see the config-patching sketch after these notes).
  2. To train without a Slurm script, you can simply run
    python main_pretrain.py --config_file configs/clip_base.yml
  3. For the model with the Adapter, we will release the pretraining code soon.
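
Since note 1 asks you to edit the configs first, here is a minimal sketch of patching a config programmatically before launching. The key names are assumptions about the schema, not the actual fields, so check configs/clip_base.yml for the real ones.

    # Minimal sketch: patch a pretraining config before launching.
    # The key names below are assumptions -- inspect configs/clip_base.yml
    # for the actual schema used by main_pretrain.py.
    import yaml

    with open("configs/clip_base.yml") as f:
        cfg = yaml.safe_load(f)

    cfg["output_dir"] = "./checkpoints/egovideo_base"  # hypothetical key
    # cfg["data_root"] = "/path/to/your/videos"        # hypothetical key

    with open("configs/clip_base_local.yml", "w") as f:
        yaml.safe_dump(cfg, f)

You can then launch with python main_pretrain.py --config_file configs/clip_base_local.yml.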

Pretrained Model

You can download our pretrained model checkpoint from this link.
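
Once downloaded, you can sanity-check the checkpoint before plugging it into the code. This is a minimal sketch with plain PyTorch; the filename and the "state_dict" key are assumptions about the checkpoint layout.

    # Minimal sketch: inspect a downloaded checkpoint with PyTorch.
    # The filename and the "state_dict" key are assumptions -- print the
    # top-level keys to see the actual layout of the released file.
    import torch

    ckpt = torch.load("egovideo_pretrained.pt", map_location="cpu")
    print(list(ckpt.keys()))

    # Unwrap nested weights if present, otherwise use the dict directly.
    state_dict = ckpt.get("state_dict", ckpt)
    print(f"{len(state_dict)} parameter tensors loaded")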

Finetuning

The finetuning code will be released soon.

Zero-shot Evaluation

For zero-shot evaluation, you can simply run the scripts in ./exps, for example:

bash exps/eval_ekcls.sh

We provide the evaluation code for EK100-MIR, EK100-CLS, EGTEA, and EGOMCQ.
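
To sweep all four benchmarks in one run, a small driver like the one below works. Only exps/eval_ekcls.sh is confirmed above; the other script names are assumptions that follow the same naming pattern, so check the exps folder for the actual files.

    # Minimal sketch: run the zero-shot evaluations in sequence.
    # Only exps/eval_ekcls.sh is confirmed; the other names are assumed
    # to follow the same pattern -- check the exps/ folder.
    import subprocess

    scripts = [
        "exps/eval_ekcls.sh",   # EK100-CLS (confirmed)
        "exps/eval_ekmir.sh",   # EK100-MIR (assumed name)
        "exps/eval_egtea.sh",   # EGTEA (assumed name)
        "exps/eval_egomcq.sh",  # EGOMCQ (assumed name)
    ]

    for script in scripts:
        print(f"Running {script} ...")
        subprocess.run(["bash", script], check=True)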

Cite

If you find this repository useful, please cite it with the following BibTeX entry.

@misc{pei2025modeling,
      title={Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning},
      author={Baoqi Pei and Yifei Huang and Jilan Xu and Guo Chen and Yuping He and Lijin Yang and Yali Wang and Weidi Xie and Yu Qiao and Fei Wu and Limin Wang},
      year={2025},
      eprint={2503.00986},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

This repository is built on top of mae and AVION. Thanks to the contributors of these great codebases.