
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

CVPR 2025


Max Planck Institute for Intelligent Systems, Tübingen · University of Amsterdam · Inria, France



InteractVLM estimates 3D contact points on both human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. We introduce a novel task, Semantic Human Contact, which goes beyond the traditional Binary Human Contact to infer object-specific contacts on bodies. By leveraging the rich visual knowledge of large Vision-Language Models, we address the limited availability of ground-truth 3D interaction data for training, resulting in better generalization to diverse real-world interactions.

[Figure: Joint Human-Object Reconstruction and Semantic Human Contact (input image, contact prediction, joint reconstruction)]

🎯 Model Zoo

| # | Model | Type | Training Datasets | Comment | Status |
|---|-------|------|-------------------|---------|--------|
| 1 | interactvlm-3d-hcontact-damon | hcontact | DAMON | Winner of RHOBIN Human Contact Challenge (CVPR 2025) | Available |
| 2 | interactvlm-3d-hcontact-wScene | hcontact | DAMON + LEMON-HU + RICH | Best in-the-wild 3D Human Contact Estimation (with foot-ground contact) | Available |
| 3 | interactvlm-3d-oafford-lemon-piad | oafford | LEMON-OBJ + PIAD | Estimates Object Affordance | Available |
| 4 | interactvlm-2d-hcontact | h2dcontact | Extended LISA by projecting DAMON contact on images | 2D Human Contact Segmentation via Referring Segmentation | Available |
| 5 | interactvlm-3d-hcontact-ocontact* | hcontact, ocontact | DAMON + LEMON-HU + RICH + LEMON-OBJ + PIAD + PICO + HOI-VQA# | Single Model for Joint 3D Human-Object Contact Estimation | Available |

* The interactvlm-3d-hcontact-ocontact model is trained with our new PICO dataset (CVPR 2025), which enables accurate 3D object contact estimation, unlike the object affordances learned from the LEMON-OBJ and PIAD datasets.

# We use the GPT-4o image model to generate the HOI-VQA dataset for training, using DAMON, LEMON, and PIAD images. The script for calling the OpenAI API, the raw data, and the preprocessing scripts are available here.
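
For orientation, generating such VQA pairs amounts to sending each interaction image to the GPT-4o chat completions endpoint together with a question-generation prompt; the actual prompt, raw data, and post-processing live in the linked scripts. Below is a minimal illustrative call, where the prompt text and image path are placeholders rather than the ones used in the repository:

# Illustrative only; the repo's own HOI-VQA scripts handle prompting and post-processing.
IMG_B64=$(base64 -w 0 path/to/interaction_image.jpg)   # placeholder image path
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Generate question-answer pairs about the human-object interaction in this image."},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF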


⚙️ Installation

🛠️ Setup Environment

  1. Install Micromamba (if not already installed):

    curl -Ls https://micro.mamba.pm/api/download/linux-64/latest | tar -xvj bin/micromamba
    sudo mv bin/micromamba /usr/local/bin/
  2. Create and activate environment:

    micromamba create -n interactvlm python=3.10 -c conda-forge
    micromamba activate interactvlm
  3. Install PyTorch with CUDA 12.1:

    pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
  4. Clone the repository:

    git clone https://github.com/saidwivedi/InteractVLM.git
    cd InteractVLM
  5. Install dependencies:

    micromamba install -c conda-forge gcc_linux-64=12.2.0 gxx_linux-64=12.2.0 ffmpeg x264 -y 
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation
    DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.15.1
  6. Set up CUDA environment variables:

    # Before running demo, training or evaluation scripts, ensure CUDA is properly configured
    export CUDA_HOME=/usr/local/cuda  # or your CUDA installation path
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
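
As an optional sanity check (not part of the official setup), verify that the CUDA toolkit and the GPU-enabled PyTorch build are picked up correctly:

# Optional: confirm the CUDA toolchain and PyTorch GPU support
nvcc --version                                                                    # should report the installed CUDA toolkit (12.1 here)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"     # expect 2.1.0+cu121 True on a GPU machine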

📁 Code Structure

InteractVLM/
├── 📁 model/                         # Core model implementation
├── 📁 datasets/                      # Data loading and processing
├── 📁 utils/                         # Utility functions
├── 📁 preprocess_data/               # Data preprocessing scripts
├── 📁 scripts/                       # Execution scripts
├── 📁 data/                          # Dataset folders, Body models, Demo samples
├── 📁 trained_models/                # Trained models
├── 📄 train.py                       # Main training script
├── 📄 evaluate.py                    # Main evaluation script
├── 📄 optim/fit.py                   # Main optimization script
├── 📄 run_demo.py                    # Run Demo
└── 📄 requirements.txt               # Python dependencies

📦 Data and Model Downloads

📁 Essential Data Files

To run InteractVLM, you need to download essential data files and pre-trained models. We provide a convenient script to handle this process.

🚀 Download Script Usage

  1. Register for access at https://interactvlm.is.tue.mpg.de/login.php to get your credentials

  2. Run the download script:

    bash fetch_data.sh
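
The same script also accepts arguments to fetch specific models or preprocessed datasets; the two used later in this README are shown below (running it without arguments fetches the essential files):

# Scene-aware 3D human contact model (used by the demo section)
bash fetch_data.sh hcontact-wScene

# Preprocessed DAMON dataset (used by the training section)
bash fetch_data.sh damon-dataset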

🎮 Demo

Run the demo on your own images using one of the following modes:

# For 3D human contact estimation
bash scripts/run_demo.sh hcontact data/demo_samples folder

# For 2D human contact segmentation
bash scripts/run_demo.sh h2dcontact data/demo_samples file

# For 3D object affordance estimation  
bash scripts/run_demo.sh oafford data/demo_samples folder

# For joint 3D fitting (human + object)
bash scripts/run_optim.sh 

For joint reconstruction, see the optim/ module.

Demo Requirements:

  • Human Contact Demo: The canonical human mesh and rendered inputs are already provided; simply run the script to estimate 3D contact points on human bodies. We now also support human contact estimation with the scene (e.g., the ground or undefined objects) using the latest released model: download it with the hcontact-wScene argument to fetch_data.sh and pass the same argument to the demo script (see the example after this list). The object name in the image filename serves as the query object for contact estimation (e.g., "bottle" or "chair"); to estimate contact with the scene or ground, use "scene" as the query or prefix the filename with "scene".

  • 2D Human Contact Demo: Performs 2D contact segmentation directly on the input image using referring segmentation. This extends LISA's capabilities for human-object contact detection in 2D space. The object name in the image filename serves as the query object for contact estimation.

  • Object Affordance Demo: Can work with either object meshes or single images. For single images, first use our Object Retrieval pipeline to retrieve the 3D object shape and save it as object_mesh.obj; the script will then render multiple views for affordance prediction.
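
For example, after downloading the scene-aware model with fetch_data.sh (see above), the scene-aware human contact demo can be run as sketched below; this follows the demo command pattern above, so treat the exact argument spelling as an assumption:

# Run the human contact demo with the scene-aware model
# (same sample folder as the default hcontact demo)
bash scripts/run_demo.sh hcontact-wScene data/demo_samples folder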

Input Modes:

The demo supports two input structures (a layout sketch follows the list):

  1. Folder-based mode (default): Each sample in its own folder (required for 3D human contact and object affordance)
  2. File-based mode: All samples as files in a single folder. Supported for:
    • 2D Human Contact (h2dcontact): Direct segmentation on input images
    • 3D Human Contact (hcontact): Estimating human contact for video frames
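
Below is a hedged sketch of the two layouts; all folder and file names are illustrative (they are not the shipped samples) and simply show how the object name or the "scene" keyword is embedded in the name to act as the contact query:

data/demo_samples/                      # folder-based mode (hcontact, oafford)
├── person_holding_bottle/              # "bottle" acts as the query object
│   └── ...                             # per-sample inputs (image, renders, object_mesh.obj for oafford)
└── scene_sitting_on_bench/             # "scene" prefix queries ground/scene contact

data/demo_frames/                       # file-based mode (h2dcontact, or hcontact on video frames)
├── frame_0001_chair.jpg                # "chair" is the query object
└── frame_0002_chair.jpg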

Sample Data: The data/demo_samples/ directory contains ready-to-use samples for testing both human contact and object affordance estimation.

🏋️ Training and Evaluation

🔧 Data Generation

To generate the data needed for training, run the script below. We also provide preprocessed datasets for DAMON, LEMON, PIAD, and PICO; run fetch_data.sh with the appropriate arguments to download them.

Available Preprocessed Datasets:

  • DAMON - Human contact annotations from DAMON dataset
  • LEMON - Human-object interaction data from LEMON dataset
  • PIAD - Object affordance annotations from PIAD dataset
  • PICO - Object Contact data from PICO dataset

To generate the data yourself, run the following command:

# Generate preprocessed data
bash scripts/run_datagen.sh

🚀 Training

To train 3D human contact estimation on the DAMON dataset, download the preprocessed data with the command below and place it under data/damon, then run the training script.

# Download preprocessed DAMON dataset
bash fetch_data.sh damon-dataset

# Train human contact with DAMON dataset
bash scripts/run_train.sh hcontact-damon

📊 Evaluation

Model Weight Preparation

If you have trained a new model, prepare the weights for evaluation:

# Prepare weights for model 0 (adjust number as needed)
bash scripts/run_prepare_weights.sh 0

Run Evaluation on Pre-trained Models

# Evaluate the model on either DAMON or PIAD. Adjust the configuration accordingly
bash scripts/run_eval.sh

📋 Code Release Status

Released

  • 3D Human Contact Estimation - Training, evaluation, and demo code available
  • 3D Object Contact/Affordance Estimation - Training, evaluation, and demo code available
  • Object Shape Retrieval from Single Image - Code available at Object_Retrieval
  • Optimization Framework for Joint Reconstruction - Code available at optim

🙏 Acknowledgements

We thank Alpár Cseke for his assistance with evaluating joint human-object reconstruction. We also thank Tsvetelina Alexiadis and Taylor Obersat for MTurk evaluation, Yao Feng, Peter Kulits, and Markos Diomataris for their valuable feedback, and Benjamin Pellkofer for IT support. SKD is supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). The UvA part of the team is supported by an ERC Starting Grant (STRIPES, 101165317, PI: D. Tzionas).

Code and Datasets

InteractVLM builds upon several excellent open-source projects and datasets:

  • LISA - InteractVLM is built on top of this foundational framework
  • LEMON, DECO, PIAD, PICO and RICH - For human contact and object affordance data
  • Blendify - For rendering

Optimization Framework

Our optimization framework integrates the following repositories (see optim for details):

  • OpenShape - For object shape retrieval
  • OSX - For SMPLX human pose estimation
  • Grounded-SAM - For object detection and segmentation

📝 Citation

If you find this code useful for your research, please consider citing the following paper:

@inproceedings{dwivedi_interactvlm_2025,
    title     = {{InteractVLM}: {3D} Interaction Reasoning from {2D} Foundational Models},
    author    = {Dwivedi, Sai Kumar and Antić, Dimitrije and Tripathi, Shashank and Taheri, Omid and Schmid, Cordelia and Black, Michael J. and Tzionas, Dimitrios},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
}

⚖️ License

This code is available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using this code you agree to the terms in the LICENSE. Third-party datasets and software are subject to their respective licenses.

📧 Contact

For code related questions, please contact [email protected]

For commercial licensing (and all related questions for business applications), please contact [email protected].
