
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models

CVPR 2025


Max Planck Institute for Intelligent Systems, Tübingen · University of Amsterdam · Inria, France



InteractVLM estimates 3D contact points on both human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. We introduce a novel task, Semantic Human Contact, which goes beyond the traditional Binary Human Contact to infer object-specific contacts on bodies. By leveraging the rich visual knowledge of large Vision-Language Models, we address the limited availability of ground-truth 3D interaction data for training, resulting in better generalization to diverse real-world interactions.

[Figure: Joint Human-Object Reconstruction and Semantic Human Contact (input image, contact prediction, joint reconstruction)]

🎯 Model Zoo

| # | Model | Type | Training Datasets | Comment | Status |
|---|-------|------|-------------------|---------|--------|
| 1 | interactvlm-3d-hcontact-damon | hcontact | DAMON | Winner of RHOBIN Human Contact Challenge (CVPR 2025) | Available |
| 2 | interactvlm-3d-hcontact-wScene | hcontact | DAMON + LEMON-HU + RICH | Best in-the-wild 3D Human Contact Estimation (with foot-ground contact) | Available |
| 3 | interactvlm-3d-oafford-lemon-piad | oafford | LEMON-OBJ + PIAD | Estimates Object Affordance | Available |
| 4 | interactvlm-2d-hcontact | h2dcontact | Extended LISA by projecting DAMON contact on images | 2D Human Contact Segmentation via Referring Segmentation | Available |
| 5 | interactvlm-3d-hcontact-ocontact* | hcontact, ocontact | DAMON + LEMON-HU + RICH + LEMON-OBJ + PIAD + PICO + HOI-VQA# | Single Model for Joint 3D Human-Object Contact Estimation | Available |

* The interactvlm-3d-hcontact-ocontact model is trained with our new PICO dataset (CVPR 2025), which enables accurate 3D object contact estimation, unlike the object affordances learned from the LEMON-OBJ and PIAD datasets.

# We use the GPT-4o image model to generate the HOI-VQA dataset for training, using DAMON, LEMON, and PIAD images. The script for calling the OpenAI API, the raw data, and the preprocessing scripts are available here.
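
For orientation, generating such VQA pairs amounts to sending each interaction image to the GPT-4o chat completions endpoint together with a question-generation prompt; the actual prompt, raw data, and post-processing live in the linked scripts. Below is a minimal illustrative call, where the prompt text and image path are placeholders rather than the ones used in the repository:

# Illustrative only; the repo's own HOI-VQA scripts handle prompting and post-processing.
IMG_B64=$(base64 -w 0 path/to/interaction_image.jpg)   # placeholder image path
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "gpt-4o",
  "messages": [{
    "role": "user",
    "content": [
      {"type": "text", "text": "Generate question-answer pairs about the human-object interaction in this image."},
      {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,${IMG_B64}"}}
    ]
  }]
}
EOF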


⚙️ Installation

🛠️ Setup Environment

  1. Install Micromamba (if not already installed):

    curl -Ls https://micro.mamba.pm/api/download/linux-64/latest | tar -xvj bin/micromamba
    sudo mv bin/micromamba /usr/local/bin/
  2. Create and activate environment:

    micromamba create -n interactvlm python=3.10 -c conda-forge
    micromamba activate interactvlm
  3. Install PyTorch with CUDA 12.1:

    pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
  4. Clone the repository:

    git clone https://github.com/saidwivedi/InteractVLM.git
    cd InteractVLM
  5. Install dependencies:

    micromamba install -c conda-forge gcc_linux-64=12.2.0 gxx_linux-64=12.2.0 ffmpeg x264 -y 
    pip install -r requirements.txt
    pip install flash-attn --no-build-isolation
    DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.15.1
  6. Set up CUDA environment variables:

    # Before running demo, training or evaluation scripts, ensure CUDA is properly configured
    export CUDA_HOME=/usr/local/cuda  # or your CUDA installation path
    export PATH=$CUDA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
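
As an optional sanity check (not part of the official setup), verify that the CUDA toolkit and the GPU-enabled PyTorch build are picked up correctly:

# Optional: confirm the CUDA toolchain and PyTorch GPU support
nvcc --version                                                                    # should report the installed CUDA toolkit (12.1 here)
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"     # expect 2.1.0+cu121 True on a GPU machine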

📁 Code Structure

InteractVLM/
├── 📁 model/                         # Core model implementation
├── 📁 datasets/                      # Data loading and processing
├── 📁 utils/                         # Utility functions
├── 📁 preprocess_data/               # Data preprocessing scripts
├── 📁 scripts/                       # Execution scripts
├── 📁 data/                          # Dataset folders, Body models, Demo samples
├── 📁 trained_models/                # Trained models
├── 📄 train.py                       # Main training script
├── 📄 evaluate.py                    # Main evaluation script
├── 📄 optim/fit.py                   # Main optimization script
├── 📄 run_demo.py                    # Run Demo
└── 📄 requirements.txt               # Python dependencies

📦 Data and Model Downloads

📁 Essential Data Files

To run InteractVLM, you need to download essential data files and pre-trained models. We provide a convenient script to handle this process.

🚀 Download Script Usage

  1. Register for access at https://interactvlm.is.tue.mpg.de/login.php to get your credentials

  2. Run the download script:

    bash fetch_data.sh
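
The same script also accepts arguments to fetch specific models or preprocessed datasets; the two used later in this README are shown below (running it without arguments fetches the essential files):

# Scene-aware 3D human contact model (used by the demo section)
bash fetch_data.sh hcontact-wScene

# Preprocessed DAMON dataset (used by the training section)
bash fetch_data.sh damon-dataset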

🎮 Demo

Run the demo on your own images using one of the following modes:

# For 3D human contact estimation
bash scripts/run_demo.sh hcontact data/demo_samples folder

# For 2D human contact segmentation
bash scripts/run_demo.sh h2dcontact data/demo_samples file

# For 3D object affordance estimation  
bash scripts/run_demo.sh oafford data/demo_samples folder

# For joint 3D fitting (human + object)
bash scripts/run_optim.sh 

For joint reconstruction, see the optim/ module.

Demo Requirements:

  • Human Contact Demo: The canonical human mesh and rendered inputs are already provided; simply run the script to estimate 3D contact points on human bodies. We now also support human contact estimation with the scene (e.g., the ground or undefined objects) using the latest released model: download it with the hcontact-wScene argument to fetch_data.sh and pass the same argument to the demo script (see the example after this list). The object name in the image filename serves as the query object for contact estimation (e.g., "bottle" or "chair"); to estimate contact with the scene or ground, use "scene" as the query or prefix the filename with "scene".

  • 2D Human Contact Demo: Performs 2D contact segmentation directly on the input image using referring segmentation. This extends LISA's capabilities for human-object contact detection in 2D space. The object name in the image filename serves as the query object for contact estimation.

  • Object Affordance Demo: Can work with either object meshes or single images. For single images, first use our Object Retrieval pipeline to retrieve the 3D object shape and save it as object_mesh.obj; the script will then render multiple views for affordance prediction.
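
For example, after downloading the scene-aware model with fetch_data.sh (see above), the scene-aware human contact demo can be run as sketched below; this follows the demo command pattern above, so treat the exact argument spelling as an assumption:

# Run the human contact demo with the scene-aware model
# (same sample folder as the default hcontact demo)
bash scripts/run_demo.sh hcontact-wScene data/demo_samples folder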

Input Modes:

The demo supports two input structures (a layout sketch follows the list):

  1. Folder-based mode (default): Each sample in its own folder (required for 3D human contact and object affordance)
  2. File-based mode: All samples as files in a single folder. Supported for:
    • 2D Human Contact (h2dcontact): Direct segmentation on input images
    • 3D Human Contact (hcontact): Estimating human contact for video frames
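
Below is a hedged sketch of the two layouts; all folder and file names are illustrative (they are not the shipped samples) and simply show how the object name or the "scene" keyword is embedded in the name to act as the contact query:

data/demo_samples/                      # folder-based mode (hcontact, oafford)
├── person_holding_bottle/              # "bottle" acts as the query object
│   └── ...                             # per-sample inputs (image, renders, object_mesh.obj for oafford)
└── scene_sitting_on_bench/             # "scene" prefix queries ground/scene contact

data/demo_frames/                       # file-based mode (h2dcontact, or hcontact on video frames)
├── frame_0001_chair.jpg                # "chair" is the query object
└── frame_0002_chair.jpg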

Sample Data: The data/demo_samples/ directory contains ready-to-use samples for testing both human contact and object affordance estimation.

🏋️ Training and Evaluation

🔧 Data Generation

To generate the data needed for training, run the script below. We also provide preprocessed datasets for DAMON, LEMON, PIAD, and PICO; run fetch_data.sh with the appropriate arguments to download them.

Available Preprocessed Datasets:

  • DAMON - Human contact annotations from DAMON dataset
  • LEMON - Human-object interaction data from LEMON dataset
  • PIAD - Object affordance annotations from PIAD dataset
  • PICO - Object Contact data from PICO dataset

To generate the data yourself, run the following command:

# Generate preprocessed data
bash scripts/run_datagen.sh

🚀 Training

To train 3D human contact estimation on the DAMON dataset, download the preprocessed data with the command below and place it under data/damon, then run the training script.

# Download preprocessed DAMON dataset
bash fetch_data.sh damon-dataset

# Train human contact with DAMON dataset
bash scripts/run_train.sh hcontact-damon

📊 Evaluation

Model Weight Preparation

If you have trained a new model, prepare the weights for evaluation:

# Prepare weights for model 0 (adjust number as needed)
bash scripts/run_prepare_weights.sh 0

Run Evaluation on Pre-trained Models

# Evaluate the model on either DAMON or PIAD. Adjust the configuration accordingly
bash scripts/run_eval.sh

📋 Code Release Status

Released

  • 3D Human Contact Estimation - Training, evaluation, and demo code available
  • 3D Object Contact/Affordance Estimation - Training, evaluation, and demo code available
  • Object Shape Retrieval from Single Image - Code available at Object_Retrieval
  • Optimization Framework for Joint Reconstruction - Code available at optim

🙏 Acknowledgements

We thank Alpár Cseke for his assistance with evaluating joint human-object reconstruction. We also thank Tsvetelina Alexiadis and Taylor Obersat for MTurk evaluation, Yao Feng, Peter Kulits, and Markos Diomataris for their valuable feedback, and Benjamin Pellkofer for IT support. SKD is supported by the International Max Planck Research School for Intelligent Systems (IMPRS-IS). The UvA part of the team is supported by an ERC Starting Grant (STRIPES, 101165317, PI: D. Tzionas).

Code and Datasets

InteractVLM builds upon several excellent open-source projects and datasets:

  • LISA - InteractVLM is built on top of this foundational framework
  • LEMON, DECO, PIAD, PICO and RICH - For human contact and object affordance data
  • Blendify - For rendering

Optimization Framework

Our optimization framework integrates the following repositories (see optim for details):

  • OpenShape - For object shape retrieval
  • OSX - For SMPLX human pose estimation
  • Grounded-SAM - For object detection and segmentation

📝 Citation

If you find this code useful for your research, please consider citing the following paper:

@inproceedings{dwivedi_interactvlm_2025,
    title     = {{InteractVLM}: {3D} Interaction Reasoning from {2D} Foundational Models},
    author    = {Dwivedi, Sai Kumar and Antić, Dimitrije and Tripathi, Shashank and Taheri, Omid and Schmid, Cordelia and Black, Michael J. and Tzionas, Dimitrios},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
}

⚖️ License

This code is available for non-commercial scientific research purposes as defined in the LICENSE file. By downloading and using this code you agree to the terms in the LICENSE. Third-party datasets and software are subject to their respective licenses.

📧 Contact

For code related questions, please contact [email protected]

For commercial licensing (and all related questions for business applications), please contact [email protected].
