DepthPro is a foundation model for zero-shot monocular depth estimation. Built on a multi-scale vision transformer (ViT-based, DINOv2), it is optimized for dense predictions by processing images at multiple scales: each image is split into patches, encoded with a patch encoder shared across scales, then merged, upsampled, and fused via a DPT decoder.
- Research Paper: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
- Authors: Aleksei Bochkovskii, Amaël Delaunoy, et al.
- Official Code: apple/ml-depth-pro
- Official Weights: apple/DepthPro
- Unofficial Weights: geetu040/DepthPro
- Web UI Interface: spaces/geetu040/DepthPro
- Interface in Transformers (Open PR): huggingface/transformers#34583
In this repository, we use this architecture and the available pretrained weights for depth estimation to explore its capabilities on further image-processing tasks like image segmentation and image super-resolution.
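As a quick reference, here is a minimal depth-estimation inference sketch following the usage documented in the official apple/ml-depth-pro repository (the `transformers` interface is still an open PR at the time of writing, so this uses the official package):

```python
# Minimal zero-shot depth estimation with the official apple/ml-depth-pro package.
import depth_pro

# Load the pretrained model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is the focal length recovered from EXIF, if present.
image, _, f_px = depth_pro.load_rgb("example.jpg")
image = transform(image)

# Run inference: returns metric depth and the estimated focal length in pixels.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]                    # depth map in meters
focallength_px = prediction["focallength_px"]  # focal length in pixels
```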
Quick Links
Task | Web UI Interface | Code-Based Inference and Weights | Training Code on Colab | Training Code on Kaggle | Training Logs | Validation Outputs |
---|---|---|---|---|---|---|
Depth Estimation | DepthPro | geetu040/DepthPro | - | - | - | - |
Human Segmentation | DepthPro Segmentation Human | geetu040/DepthPro_Segmentation_Human | - | | Training Logs | Validation Outputs |
Super Resolution (4x 256p) | DepthPro SR 4x 256p | geetu040/DepthPro_SR_4x_256p | | | Training Logs | Validation Outputs |
Super Resolution (4x 384p) | DepthPro SR 4x 384p | geetu040/DepthPro_SR_4x_384p | | | Training Logs | Validation Outputs |
Human Segmentation

- For Web UI Interface: spaces/geetu040/DepthPro_Segmentation_Human
- For Code-Based Inference and model weights: geetu040/DepthPro_Segmentation_Human
- For Training, check the notebooks linked in the Quick Links table above.
Sample validation results: input image, ground-truth mask, and predicted mask (example images omitted; see the Validation Outputs link above).
We modify Apple's DepthPro for Monocular Depth Estimation model for the image segmentation task.
- The pretrained depth-estimation model is reused, with slight changes to the head layer to make it compatible with the segmentation task (a minimal sketch follows at the end of this section).
- Hidden feature maps are extracted to inspect the encoder and fusion stages of the model.
- For `training` and `validation`, we use the `Human Segmentation Dataset - Supervise.ly` from Kaggle: tapakah68/supervisely-filtered-segmentation-person-dataset
  - It contains 2667 samples, which are randomly split into 80% training and 20% validation.
  - Each sample contains an image and its corresponding mask.
- The model produces exceptional results on the validation set, with an `IoU score of 0.964` and a `Dice score of 0.982`, beating the previous state-of-the-art IoU score of 0.95 on this dataset.
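The actual head definition lives in the training notebook; the following is only a minimal sketch of the idea under stated assumptions: a hypothetical single-channel convolutional head (`SegmentationHead` is an illustrative name, not the repository's) replaces the depth head on top of the fused decoder features, and IoU/Dice are computed on thresholded binary masks.

```python
import torch
import torch.nn as nn

class SegmentationHead(nn.Module):
    """Hypothetical stand-in for DepthPro's depth head: maps fused decoder
    features to a single-channel person/background logit map."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // 2, 1, kernel_size=1),  # 1 channel: binary mask
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        return self.head(fused_features)  # raw logits; sigmoid gives probabilities

def iou_and_dice(pred_mask: torch.Tensor, true_mask: torch.Tensor, eps: float = 1e-7):
    """IoU and Dice for binary {0, 1} masks of identical shape."""
    pred_mask, true_mask = pred_mask.float(), true_mask.float()
    inter = (pred_mask * true_mask).sum()
    union = pred_mask.sum() + true_mask.sum() - inter
    iou = (inter + eps) / (union + eps)
    dice = (2 * inter + eps) / (pred_mask.sum() + true_mask.sum() + eps)
    return iou.item(), dice.item()
```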
Super Resolution (4x 256p)

- For Web UI Interface: spaces/geetu040/DepthPro_SR_4x_256p
- For Code-Based Inference and model weights: geetu040/DepthPro_SR_4x_256p
- For Training, check the notebooks linked in the Quick Links table above.
Sample results: low-resolution 256px input, 1024px Depth Pro super-resolution output, and 1024px ground truth (example images omitted).
We then modify Apple's DepthPro for Monocular Depth Estimation model for the image super-resolution task.
- The base model architecture is modified for the task of image super-resolution from 256px to 1024px (4x upsampling); a minimal sketch follows at the end of this section.
- For `training` and `validation`, we use the `Div2k` dataset, introduced in NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study.
  - It contains high-resolution images at 2K resolution, which have been downsampled to `LR_SIZE=256` and `HR_SIZE=1024` for training and validation.
  - It contains 800 training samples and 200 validation samples.
  - The dataset has been downloaded from Kaggle: soumikrakshit/div2k-high-resolution-images
- For `testing`, we use the `Urban100` dataset, introduced in Single Image Super-Resolution From Transformed Self-Exemplars.
  - It contains 100 samples, each available at two resolutions: 256 (low) and 1024 (high).
  - The dataset has been downloaded from Kaggle: harshraone/urban100
- Results:
  - The model achieves its best `PSNR score of 24.80` and `SSIM score of 0.74` on the validation set, and a `PSNR score of 21.36` and `SSIM score of 0.62` on the test set.
  - The model is able to restore some of the information lost in the low-resolution images.
  - Results are better than most of the generative techniques applied on Kaggle, but still fall well short of state-of-the-art results.
  - This is partly because Vision Transformers are not specifically designed for super-resolution tasks.
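The exact reconstruction head is defined in the training notebooks; as a rough illustration of the 4x modification, here is a minimal PixelShuffle-style sketch. The class name, layer sizes, and the assumption that the fused decoder features sit at the low-resolution spatial size are all illustrative, not the repository's exact configuration.

```python
import torch
import torch.nn as nn

class SuperResolutionHead(nn.Module):
    """Hypothetical 4x reconstruction head: fused decoder features at the
    low-resolution spatial size (e.g. 256px) -> RGB output at 4x (e.g. 1024px)."""
    def __init__(self, in_channels: int, scale: int = 4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 3 * scale ** 2, kernel_size=3, padding=1),
            nn.PixelShuffle(scale),  # rearranges channels into a 4x larger spatial grid
            nn.Conv2d(3, 3, kernel_size=3, padding=1),  # light refinement of the RGB output
        )

    def forward(self, fused_features: torch.Tensor) -> torch.Tensor:
        return self.head(fused_features)

# e.g. features of shape (1, 256, 256, 256) -> output of shape (1, 3, 1024, 1024)
```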
Super Resolution (4x 384p)

- For Web UI Interface: spaces/geetu040/DepthPro_SR_4x_384p
- For Code-Based Inference and model weights: geetu040/DepthPro_SR_4x_384p
- For Training, check the notebooks linked in the Quick Links table above.
Sample results: low-resolution 384px input, 1536px Depth Pro super-resolution output, and 1536px ground truth (example images omitted).
We use the same modification of Apple's DepthPro for Monocular Depth Estimation model for the image super-resolution task, at a higher input resolution.
- The base model architecture is modified, as in the 256p variant, for the task of image super-resolution from 384px to 1536px (4x upsampling).
- For `training` and `validation`, we use the `Div2k` dataset, introduced in NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study.
  - It contains high-resolution images at 2K resolution, which have been downsampled to `LR_SIZE=384` and `HR_SIZE=1536` for training and validation.
  - It contains 800 training samples and 200 validation samples.
  - The dataset has been downloaded from Kaggle: soumikrakshit/div2k-high-resolution-images
- Results:
  - The model achieves its best `PSNR score of 27.19` and `SSIM score of 0.81` on the validation set (see the metric sketch below for how these scores can be computed).
  - The model is able to restore some of the information lost in the low-resolution images.
  - Results are better than the generative techniques applied on Kaggle, but fall slightly short of state-of-the-art results.
  - This is partly because Vision Transformers are not specifically designed for super-resolution tasks.
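For reference, the PSNR/SSIM numbers quoted above can be computed with standard implementations such as `scikit-image`. A minimal sketch, assuming `sr` and `hr` are hypothetical same-shaped HWC uint8 arrays (e.g. 1536x1536x3) holding the super-resolved output and the ground truth:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(sr: np.ndarray, hr: np.ndarray) -> tuple[float, float]:
    """PSNR and SSIM between a super-resolved image and its ground truth.
    Both inputs are HWC uint8 arrays of identical shape."""
    psnr = peak_signal_noise_ratio(hr, sr, data_range=255)
    ssim = structural_similarity(hr, sr, channel_axis=-1, data_range=255)
    return psnr, ssim
```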