- [1998] A general framework for object detection [Paper]
- [2002] Learning a sparse representation for object detection [Paper]
- [2008] Visual perception and robotic manipulation: 3D object recognition, tracking and hand-eye coordination [Paper]
- [2010] Multi-source remote sensing data fusion: status and trends [Paper]
- [2013] Vision meets robotics: The kitti dataset [Paper]
- [2014] Beyond pascal: A benchmark for 3d object detection in the wild [Paper]
- [2014] Hand-crafted features or machine learnt features? together they improve RGB-D object recognition [Paper]
- [2015] Learning deep object detectors from 3d models [Paper]
- [2015] Sun rgb-d: A rgb-d scene understanding benchmark suite [Paper]
- [2015] Visual object recognition with 3D-aware features in KITTI urban scenes [Paper]
- [2015] An introduction to convolutional neural networks [Paper]
- [2016] You only look once: Unified, real-time object detection [Paper]
- [2016] Ssd: Single shot multibox detector [Paper]
- [2016] Faster R-CNN: Towards real-time object detection with region proposal networks [Paper]
- [2016] A large-scale 3D object recognition dataset [Paper]
- [2017] Spatial memory for context reasoning in object detection [Paper]
- [2017] Pointnet: Deep learning on point sets for 3d classification and segmentation [Paper]
- [2017] Multi-view 3d object detection network for autonomous driving [Paper]
- [2018] Volumetric object recognition using 3-D CNNs on depth data [Paper]
- [2018] Falling things: A synthetic dataset for 3d object detection and pose estimation [Paper]
- [2018] Voxelnet: End-to-end learning for point cloud based 3d object detection [Paper]
- [2018] Second: Sparsely embedded convolutional detection [Paper]
- [2018] Frustum pointnets for 3d object detection from rgb-d data [Paper]
- [2018] Joint 3d proposal generation and object detection from view aggregation [Paper]
- [2018] Pointfusion: Deep sensor fusion for 3d bounding box estimation [Paper]
- [2019] A survey on 3d object detection methods for autonomous driving applications [Paper]
- [2019] Mvx-net: Multimodal voxelnet for 3d object detection [Paper]
- [2019] Dfusenet: Deep fusion of rgb and sparse depth information for image guided dense depth completion [Paper]
- [2020] End-to-end multi-view fusion for 3d object detection in lidar point clouds [Paper]
- [2020] Pointpainting: Sequential fusion for 3d object detection [Paper]
- [2020] 3d-cvf: Generating joint camera and lidar features using cross-view spatial feature fusion for 3d object detection [Paper]
- [2020] Epnet: Enhancing point features with image semantics for 3d object detection [Paper]
- [2020] Pointaugment: an auto-augmentation framework for point cloud classification [Paper]
- [2020] Dsgn: Deep stereo geometry network for 3d object detection [Paper]
- [2020] A novel local geometry capture in pointnet++ for 3d classification [Paper]
- [2020] Pv-rcnn: Point-voxel feature set abstraction for 3d object detection [Paper]
- [2021] Recent advances in 3D object detection based on RGB-D: A survey [Paper]
- [2021] Towards a weakly supervised framework for 3D point cloud object detection and annotation [Paper]
- [2021] Objects are different: Flexible monocular 3d object detection [Paper]
- [2021] 3D point cloud multi-target detection method based on PointNet++ [Paper]
- [2021] Lidar r-cnn: An efficient and universal 3d object detector [Paper]
- [2021] Fusionpainting: Multimodal fusion with adaptive attention for 3d object detection [Paper]
- [2021] Bevdet: High-performance multi-camera 3d object detection in bird-eye-view [Paper]
- [2021] Hvpr: Hybrid voxel-point representation for single-stage 3d object detection [Paper]
- [2022] Deep learning-based object detection in augmented reality: A systematic review [Paper]
- [2022] 3D object detection for autonomous driving: A survey [Paper]
- [2022] Performance and challenges of 3D object detection methods in complex scenes for autonomous driving [Paper]
- [2022] A survey on deep-learning-based lidar 3d object detection for autonomous driving [Paper]
- [2022] When, where and how does it fail? A spatial-temporal visual analytics approach for interpretable object detection in autonomous driving [Paper]
- [2022] VPFNet: Improving 3D object detection with virtual point based LiDAR and stereo data fusion [Paper]
- [2022] Svga-net: Sparse voxel-graph attention network for 3d object detection from point clouds [Paper]
- [2022] Fully sparse 3d object detection [Paper]
- [2022] Cross-modal learning for domain adaptation in 3d semantic segmentation [Paper]
- [2022] Crossmodal few-shot 3d point cloud semantic segmentation [Paper]
- [2022] Sparse fuse dense: Towards high quality 3d detection with depth completion [Paper]
- [2022] Deepfusion: Lidar-camera deep fusion for multi-modal 3d object detection [Paper]
- [2022] Transfusion: Robust lidar-camera fusion for 3d object detection with transformers [Paper]
- [2022] Unifying voxel-based representation with transformer for 3d object detection [Paper]
- [2022] Deepinteraction: 3d object detection via modality interaction [Paper]
- [2022] Vista: Boosting 3d object detection via dual cross-view spatial attention [Paper]
- [2022] Srcn3d: Sparse r-cnn 3d for compact convolutional multi-view 3d object detection and tracking [Paper]
- [2022] Monocular 3d object detection with depth from motion [Paper]
- [2022] Language-grounded indoor 3d semantic segmentation in the wild [Paper]
- [2022] OccAM's laser: Occlusion-based attribution maps for 3D object detectors on LiDAR data [Paper]
- [2022] 3d-vfield: Adversarial augmentation of point clouds for domain generalization in 3d object detection [Paper]
- [2022] Learning to prompt for open-vocabulary object detection with vision-language model [Paper]
- [2022] Transformers in vision: A survey [Paper]
- [2022] A unified sequence interface for vision tasks [Paper]
- [2023] Object detection in 20 years: A survey [Paper]
- [2023] 3D object detection for autonomous driving: A comprehensive survey [Paper]
- [2023] 2D and 3D object detection algorithms from images: A Survey [Paper]
- [2023] An attention mechanism based AVOD network for 3D vehicle detection [Paper]
- [2023] Voxel graph attention for 3-D object detection from point clouds [Paper]
- [2023] Voxelnext: Fully sparse voxelnet for 3d object detection and tracking [Paper]
- [2023] Cross modal transformer: Towards fast and robust 3d object detection [Paper]
- [2023] Uni3d: A unified baseline for multi-dataset 3d object detection [Paper]
- [2023] Leveraging vision-centric multi-modal expertise for 3d object detection [Paper]
- [2023] VoxelNextFusion: A Simple, Unified, and Effective Voxel Fusion Framework for Multimodal 3-D Object Detection [Paper]
- [2023] FusionFormer: A Multi-sensory Fusion in Bird's-Eye-View and Temporal Consistent Transformer for 3D Object Detection [Paper]
- [2023] PaLM-E: An Embodied Multimodal Language Model [Paper]
- [2023] Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models [Paper]
- [2023] Clip goes 3d: Leveraging prompt tuning for language grounded 3d recognition [Paper]
- [2023] Instruct 3d-to-3d: Text instruction guided 3d-to-3d conversion [Paper]
- [2023] Openscene: 3d scene understanding with open vocabularies [Paper]
- [2023] 3d-llm: Injecting the 3d world into large language models [Paper]
- [2023] Openmask3d: Open-vocabulary 3d instance segmentation [Paper]
- [2023] Ulip: Learning a unified representation of language, images, and point clouds for 3d understanding [Paper]
- [2023] Openshape: Scaling up 3d shape representation towards open-world understanding [Paper]
- [2023] Unsupervised 3d perception with 2d vision-language distillation for autonomous driving [Paper]
- [2023] Omni3d: A large benchmark and model for 3d object detection in the wild [Paper]
- [2023] Coda: Collaborative novel box discovery and cross-modal alignment for open-vocabulary 3d object detection [Paper]
- [2023] Language-guided 3d object detection in point cloud for autonomous driving [Paper]
- [2023] Pointclip v2: Prompting clip and gpt for powerful 3d open-world learning [Paper]
- [2023] Robust fine-tuning of vision-language models for domain generalization [Paper]
- [2023] Leveraging vlm-based pipelines to annotate 3d objects [Paper]
- [2023] DLFusion: Painting-Depth Augmenting-LiDAR for Multimodal Fusion 3D Object Detection [Paper]
- [2023] Transformer for object detection: Review and benchmark [Paper]
- [2023] Visual instruction tuning towards general-purpose multimodal model: A survey [Paper]
- [2024] YOLOv10 to its genesis: a decadal and comprehensive review of the you only look once (YOLO) series [Paper]
- [2024] An empirical study of the generalization ability of lidar 3d object detectors to unseen domains [Paper]
- [2024] Dpft: Dual perspective fusion transformer for camera-radar-based object detection [Paper]
- [2024] Lidarformer: A unified transformer-based multi-task network for lidar perception [Paper]
- [2024] Unibevfusion: Unified radar-vision bevfusion for 3d object detection [Paper]
- [2024] Gafusion: Adaptive fusing lidar and camera with multiple guidance for 3d object detection [Paper]
- [2024] Is-fusion: Instance-scene collaborative fusion for multimodal 3d object detection [Paper]
- [2024] Diffubox: Refining 3d object detection with point diffusion [Paper]
- [2024] 3difftection: 3d object detection with geometry-aware diffusion features [Paper]
- [2024] Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data [Paper]
- [2024] Panopticus: Omnidirectional 3D Object Detection on Resource-constrained Edge Devices [Paper]
- [2024] MonoTAKD: Teaching assistant knowledge distillation for monocular 3D object detection [Paper]
- [2024] Llava-3d: A simple yet effective pathway to empowering lmms with 3d-awareness [Paper]
- [2024] Cogvlm: Visual expert for pretrained language models [Paper]
- [2024] Hpe-cogvlm: New head pose grounding task exploration on vision language model [Paper]
- [2024] M3d: Advancing 3d medical image analysis with multi-modal large language models [Paper]
- [2024] Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution [Paper]
- [2024] Find n’Propagate: Open-Vocabulary 3D Object Detection in Urban Environments [Paper]
- [2024] OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference [Paper]
- [2024] Text2loc: 3d point cloud localization from natural language [Paper]
- [2024] Agent3d-zero: An agent for zero-shot 3d understanding [Paper]
- [2024] VLA-3D: A dataset for 3D semantic scene understanding and navigation [Paper]
- [2024] Spatialvlm: Endowing vision-language models with spatial reasoning capabilities [Paper]
- [2024] Vision-Language Guidance for LiDAR-based Unsupervised 3D Object Detection [Paper]
- [2024] Unlocking textual and visual wisdom: Open-vocabulary 3d object detection enhanced by comprehensive guidance from text and image [Paper]
- [2024] Uni3DL: A unified model for 3D vision-language understanding [Paper]
- [2024] Vision-language pre-training with object contrastive learning for 3d scene understanding [Paper]
- [2024] Zero-shot automatic annotation and instance segmentation using llm-generated datasets: Eliminating field imaging and manual annotation for deep learning model development [Paper]
- [2024] Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency [Paper]
- [2024] G3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding [Paper]
- [2024] Advancing AI Understanding in Language & Vision [Paper]
- [2024] When llms step into the 3d world: A survey and meta-analysis of 3d tasks via multi-modal large language models [Paper]
- [2024] I Know About" Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction [Paper]
- [2024] Chat-scene: Bridging 3d scene and large language models with object identifiers [Paper]
- [2024] T2Vs Meet VLMs: A Scalable Multimodal Dataset for Visual Harmfulness Recognition [Paper]
- [2024] Multi-frame, lightweight & efficient vision-language models for question answering in autonomous driving [Paper]
- [2024] Vlm agents generate their own memories: Distilling experience into embodied programs of thought [Paper]
- [2024] A comprehensive study of the robustness for lidar-based 3d object detectors against adversarial attacks [Paper]
- [2024] Leveraging vision-language models for improving domain generalization in image classification [Paper]
- [2024] A survey of efficient fine-tuning methods for vision-language models—prompt and adapter [Paper]
- [2024] Multi-object hallucination in vision language models [Paper]
- [2024] VLDadaptor: Domain Adaptive Object Detection With Vision-Language Model Distillation [Paper]
- [2024] VLM-guided Explicit-Implicit Complementary novel class semantic learning for few-shot object detection [Paper]
- [2024] SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [Paper]
- [2024] Opensight: A simple open-vocabulary framework for lidar-based object detection [Paper]
- [2024] X-vila: Cross-modality alignment for large language model [Paper]
- [2024] Multi-modal visual tracking based on textual generation [Paper]
- [2024] Mmscan: A multi-modal 3d scene dataset with hierarchical grounded language annotations [Paper]
- [2024] Synthetic meets authentic: Leveraging llm generated datasets for yolo11 and yolov10-based apple detection through machine vision sensors [Paper]
- [2024] Comparing YOLOv11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard environment [Paper]
- [2024] Rgb-d cube r-cnn: 3d object detection with selective modality dropout [Paper]
- [2024] Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [Paper]
- [2024] Delving into multi-modal multi-task foundation models for road scene understanding: From learning paradigm perspectives [Paper]
- [2024] Vision-language models for vision tasks: A survey [Paper]
- [2024] Multimodal Alignment and Fusion: A Survey [Paper]
- [2024] Vcoder: Versatile vision encoders for multimodal large language models [Paper]
- [2025] MonoDFNet: Monocular 3D Object Detection with Depth Fusion and Adaptive Optimization [Paper]
- [2025] Multimodal large language models for image, text, and speech data augmentation: A survey [Paper]
- [2025] PromptDet: A Lightweight 3D Object Detection Framework with LiDAR Prompts [Paper]
- [2025] SparseVoxFormer: Sparse Voxel-based Transformer for Multi-modal 3D Object Detection [Paper]
- [2025] PillarFocusNet for 3D object detection with perceptual diffusion and key feature understanding [Paper]
- [2025] Unidet3d: Multi-dataset indoor 3d object detection [Paper]
- [2025] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models [Paper]
- [2025] Talk2PC: Enhancing 3D Visual Grounding through LiDAR and Radar Point Clouds Fusion for Autonomous Driving [Paper]
- [2025] 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning [Paper]
- [2025] OV-SCAN: Semantically Consistent Alignment for Novel Object Discovery in Open-Vocabulary 3D Object Detection [Paper]
- [2025] Language-driven active learning for diverse open-set 3d object detection [Paper]
- [2025] Scene-LLM: Extending Language Model for 3D Visual Reasoning [Paper]
- [2025] Contextual object detection with multimodal large language models [Paper]
- [2025] 3VL: Using Trees to Improve Vision-Language Models’ Interpretability [Paper]
- [2025] Vldbench: Vision language models disinformation detection benchmark [Paper]
- [2025] Object Detection with Multimodal Large Vision-Language Models: An In-depth Review [Paper]
- [2025] Benchmark evaluations, applications, and challenges of large vision language models: A survey [Paper]
- [2025] Exploring the Potential of Encoder-free Architectures in 3D LMMs [Paper]
- [2025] Regression in EO: Are VLMs Up to the Challenge? [Paper]
- [2025] Towards Robust and Secure Embodied AI: A Survey on Vulnerabilities and Attacks [Paper]
- [2025] VLM2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues [Paper]
- [2025] RoboFlamingo-Plus: Fusion of Depth and RGB Perception with Vision-Language Models for Enhanced Robotic Manipulation [Paper]
- [2025] How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model [Paper]
- [2025] Improved yolov12 with llm-generated synthetic data for enhanced apple detection and benchmarking against yolov11 and yolov10 [Paper]
- [2025] MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse [Paper]
- [2025] Comprehensive analysis of transparency and accessibility of chatgpt, deepseek, and other sota large language models [Paper]
If you find this list useful, please cite us in your papers!
```bibtex
@misc{sapkota2025review3dobjectdetection,
  title={A Review of 3D Object Detection with Vision-Language Models},
  author={Ranjan Sapkota and Konstantinos I Roumeliotis and Rahul Harsha Cheppally and Marco Flores Calero and Manoj Karkee},
  year={2025},
  eprint={2504.18738},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2504.18738},
}
```