Project Website
· Paper
· Github
· Data
· >> [PyTorch Utils + Weights] <<
· Annotator
Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans (ICCV 2021)
You can download our pretrained models for surface normal estimation and depth estimation. For each task there are two versions of the models: a V1 used in the paper, and a stronger V2 released in March 2022.
This repo also contains PyTorch dataloaders to load the starter dataset and other data generated with the Omnidata annotator, as well as code to train monocular depth and surface normal estimation models. It includes the first publicly available implementation of the MiDaS training loss, and an implementation of the 3D image refocusing augmentation introduced in the paper.
- Pretrained models for depth and surface normal estimation
- Run pretrained models locally
- Dataloaders: Single- and multi-view (3D) dataloaders
- Training: our public implementation of the MiDaS loss and our training code
- 3D Image refocusing augmentation
- Train state-of-the-art models on Omnidata
- Citing
We provide Hugging Face demos for monocular surface normal estimation and depth estimation.
You can load/run the models with the following code. The only dependencies are torch + timm:
import torch
# you may need to install timm for the DPT (we use 0.4.12)
# Surface normal estimation model: expects input images 384x384 normalized [0,1]
model_normal = torch.hub.load('alexsax/omnidata_models', 'surface_normal_dpt_hybrid_384')
# Depth estimation model: expects input images 384x384 normalized [-1,1]
model_depth = torch.hub.load('alexsax/omnidata_models', 'depth_dpt_hybrid_384')
# Without pre-trained weights
model_custom = torch.hub.load('alexsax/omnidata_models', 'dpt_hybrid_384', pretrained=False, task='normal')
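Once loaded, running the models is standard PyTorch inference. Below is a minimal sketch, assuming the usual torchvision preprocessing and that the normal model returns a 3-channel map while the depth model returns a 1-channel map; see demo.py below for the exact pre- and post-processing used in this repo.

```python
import torch
from PIL import Image
from torchvision import transforms

img = Image.open('assets/demo/test1.png').convert('RGB')

# Surface normals expect a 384x384 input in [0, 1]
normal_tf = transforms.Compose([
    transforms.Resize(384), transforms.CenterCrop(384), transforms.ToTensor()])
# Depth expects a 384x384 input in [-1, 1]
depth_tf = transforms.Compose([
    transforms.Resize(384), transforms.CenterCrop(384), transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])])

model_normal.eval()
model_depth.eval()
with torch.no_grad():
    normals = model_normal(normal_tf(img).unsqueeze(0))  # predicted surface normals
    depth = model_depth(depth_tf(img).unsqueeze(0))       # predicted (relative) depth
```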
You can also use this lite package: it contains only the model definition and demos for running the models, and is much smaller than this repo, which additionally contains utilities for generating and loading data.
Version 2 models (stronger than V1) [March 2022]:
These are DPT architectures trained on more data using both 3D Data Augmentations and Cross-Task Consistency. Here's the list of updates in the Version 2 models:
- Monocular Depth Estimation:
- Habitat-Matterport 3D Dataset (HM3D) and 5 MiDaS dataset components (RedWebDataset, HRWSIDataset, MegaDepthDataset, TartanAirDataset, BlendedMVS) are added to the training data.
- 1 week of training with 2D and 3D data augmentations and 1 week of training with cross-task consistency on 4xV100.
- Monocular Surface Normal Estimation:
- New model is based on DPT architecture.
- Habitat-Matterport 3D Dataset (HM3D) is added to the training data.
- 1 week of training with 2D and 3D data augmentations and 1 week of training with cross-task consistency on 4xV100.
Version 1 Models
- Monocular Depth Estimation:
- The depth models have DPT-based architectures (similar to MiDaS v3.0) and are trained with the scale- and shift-invariant loss and scale-invariant gradient matching term introduced in MiDaS, as well as a virtual normal loss. We're making our implementation available here, since there is currently no other public implementation. We provide 2 pretrained depth models, one each for the DPT-hybrid and DPT-large architectures, with input resolution 384.
- Monocular Surface Normal Estimation:
- The surface normal network is based on the UNet architecture (6 down/6 up). It is trained with both angular and L1 losses on input resolutions between 256 and 512.
You can see the complete list of required packages in requirements.txt. We recommend using conda or virtualenv for the installation.
conda create -n testenv -y python=3.8
source activate testenv
pip install -r requirements.txt
After downloading the pretrained models as described above, you can run them on your own image with the following command:
python demo.py --task $TASK --img_path $PATH_TO_IMAGE_OR_FOLDER --output_path $PATH_TO_SAVE_OUTPUT
The `--task` flag should be either `normal` or `depth`. To run the script for a `normal` target on an example image:
python demo.py --task normal --img_path assets/demo/test1.png --output_path assets/
We provide a set of modular PyTorch dataloaders in the `dataloaders` directory (here) that work for multiple components of the dataset or for any combination of modalities.
- The notebook here shows how to use the dataloader, how to load multiple overlapping views, and how to unproject the images into the same scene.
- New component datasets (e.g. those annotated with the annotator) can be added as a file in `dataloader/component_datasets` and used with the dataloader. The current dataloaders work for Taskonomy, Replica, GSO-in-Replica, Hypersim, HM3D, and BlendedMVS++.
- See the dataloader README for more info on how to load multiple views, get the camera intrinsics and pose, and use the dataloaders with PyTorch3D or PyTorch-Lightning.
(Figure: notebook visualization of multiple views.)
We provide a public implementation of the MiDaS loss that we used for training, based on this implementation. The loss is not available in the original MiDaS repo. Both the ssimae (scale- and shift-invariant MAE) loss and the scale-invariant gradient matching term are in `losses/midas_loss.py`. The MiDaS loss is useful for training depth estimation models on mixed datasets with different depth ranges and scales, similar to our dataset. An example usage is shown below:
from losses.midas_loss import MidasLoss

midas_loss = MidasLoss(alpha=0.1)
# returns the total loss, the scale/shift-invariant MAE term, and the gradient matching term
loss, ssi_mae_loss, reg_loss = midas_loss(depth_prediction, depth_gt, mask)
`alpha` specifies the weight of the gradient matching term in the total loss, and `mask` indicates the valid pixels of the image.
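For intuition, here is a rough sketch of what the scale- and shift-invariant MAE computes: align the prediction to the ground truth with a least-squares scale and shift over the valid pixels, then take the mean absolute error. This is only an illustration; the actual (batched, multi-term) implementation is the one in `losses/midas_loss.py`.

```python
import torch

def ssi_mae_sketch(pred, gt, mask):
    """Illustrative only: least-squares scale/shift alignment followed by MAE."""
    p, g = pred[mask], gt[mask]                        # keep only valid pixels
    A = torch.stack([p, torch.ones_like(p)], dim=-1)   # (N, 2) design matrix for s, t
    s, t = torch.linalg.lstsq(A, g.unsqueeze(-1)).solution  # argmin_{s,t} ||s*p + t - g||^2
    return torch.mean(torch.abs(s * p + t - g))
```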
Mid-level cues can be used for data augmentations in addition to training targets. The availability of full scene geometry in our dataset makes it possible to use image refocusing as a 3D data augmentation. You can find an implementation of this augmentation in `data/refocus_augmentation.py`. You can run it on some sample images from our dataset with the following command:
python demo_refocus.py --input_path assets/demo_refocus/ --output_path assets/demo_refocus
This will refocus the RGB images by blurring them according to `depth_euclidean` for each image. You can specify some parameters of the augmentation with the following flags: `--num_quantiles` (number of quantiles to use in the blur stack), `--min_aperture` (smallest aperture to use), and `--max_aperture` (largest aperture to use). The aperture size is selected log-uniformly in the range between the min and max aperture.
(Examples of the refocusing augmentation: shallow focus, mid focus, and far focus.)
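For intuition, here is a simplified sketch of the idea (not the repo's implementation, which lives in `data/refocus_augmentation.py`): split the image into depth quantiles, blur each bin by a circle of confusion that grows with its distance from the focal plane, and composite the bins back together.

```python
import torch
import torchvision.transforms.functional as TF

def refocus_sketch(rgb, depth, focus_depth, aperture, num_quantiles=8):
    """Illustrative only. rgb: (3, H, W) in [0, 1]; depth: (1, H, W) metric depth."""
    qs = torch.quantile(depth.flatten(), torch.linspace(0, 1, num_quantiles + 1))
    out = rgb.clone()
    for lo, hi in zip(qs[:-1], qs[1:]):
        m = (depth >= lo) & (depth <= hi)               # pixels in this depth bin
        if not m.any():
            continue
        d = depth[m].mean()
        # circle of confusion ~ aperture * |1/d - 1/focus|, used here as the blur sigma
        sigma = float((aperture * torch.abs(1.0 / d - 1.0 / focus_depth)).clamp(1e-3, 10.0))
        k = max(3, 2 * round(3 * sigma) + 1)            # odd Gaussian kernel size
        out = torch.where(m.expand_as(rgb), TF.gaussian_blur(rgb, k, sigma), out)
    return out
```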
Omnidata is a means to train state-of-the-art models in different vision tasks. Here, we provide the code for training our depth and surface normal estimation models. You can train the models with the following commands:
We train the DPT-based models on Omnidata using 3 different losses: the scale- and shift-invariant loss and the scale-invariant gradient matching term introduced in MiDaS, and also the virtual normal loss introduced here.
python train_depth.py --config_file config/depth.yml --experiment_name rgb2depth --val_check_interval 3000 --limit_val_batches 100 --max_epochs 10
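As a rough illustration of the gradient matching term above (an assumption-level sketch, not the code in `losses/`): penalize differences between the gradients of the (scale/shift-aligned) prediction and the ground truth at several scales.

```python
import torch

def gradient_matching_sketch(pred, gt, mask, num_scales=4):
    """Illustrative only. pred, gt: (B, H, W) aligned disparities; mask: (B, H, W) bool."""
    loss = 0.0
    for s in range(num_scales):
        step = 2 ** s
        r = (pred - gt)[:, ::step, ::step]              # residual at this scale
        m = mask[:, ::step, ::step]
        gx = torch.abs(r[:, :, 1:] - r[:, :, :-1]) * (m[:, :, 1:] & m[:, :, :-1])
        gy = torch.abs(r[:, 1:, :] - r[:, :-1, :]) * (m[:, 1:, :] & m[:, :-1, :])
        loss = loss + gx.mean() + gy.mean()
    return loss / num_scales
```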
We train a UNet architecture (6 down/6 up) for surface normal estimation using an L1 loss and a cosine angular loss.
python train_normal.py --config_file config/normal.yml --experiment_name rgb2normal --val_check_interval 3000 --limit_val_batches 100 --max_epochs 10
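For reference, a minimal sketch of combining the two normal-estimation losses (illustrative only, with assumed tensor shapes; the actual losses are defined in the training code):

```python
import torch
import torch.nn.functional as F

def normal_loss_sketch(pred, gt, mask):
    """Illustrative only. pred, gt: (B, 3, H, W) normals; mask: (B, 1, H, W) bool."""
    valid = mask.sum().clamp(min=1)
    l1 = (torch.abs(pred - gt) * mask).sum() / (3 * valid)            # L1 term
    cos = F.cosine_similarity(pred, gt, dim=1, eps=1e-6).unsqueeze(1)
    angular = ((1.0 - cos) * mask).sum() / valid                      # cosine angular term
    return l1 + angular
```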
If you find the code or models useful, please cite our paper:
@inproceedings{eftekhar2021omnidata,
title={Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets From 3D Scans},
author={Eftekhar, Ainaz and Sax, Alexander and Malik, Jitendra and Zamir, Amir},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={10786--10796},
year={2021}
}
If you use our latest pretrained models, please also cite the following paper:
@inproceedings{kar20223d,
title={3D Common Corruptions and Data Augmentation},
author={Kar, O{\u{g}}uzhan Fatih and Yeo, Teresa and Atanov, Andrei and Zamir, Amir},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18963--18974},
year={2022}
}