vision-factory

Pick a model, pick a backend, get a ROS2 package — ready to colcon build and deploy. No dependency soup, no 30 GB images, no "works on my laptop."

Vision-factory is a code generator that produces self-contained ROS2 vision packages. Instead of one monolithic node that drags in PyTorch, TensorRT, ONNX Runtime, OpenVINO, Ultralytics, and Transformers all at once, each generated package ships only what it needs. YOLO on TensorRT? You get Ultralytics + TensorRT. Depth Anything on ONNX? You get onnxruntime + transformers. Clean, auditable, conflict-free.

Supported Models

Model	What it does	Variants
YOLO	Object detection	v8n/s/m/l/x, v11n/s/m/l
Grounding DINO	Open-vocabulary detection (text prompts)	tiny, base
Depth Anything	Monocular depth estimation	vits, vitb, vitl (v1 & v2)
ZoeDepth	Metric monocular depth	nyu, kitti, nyu-kitti
Segment Anything	Promptable segmentation	hiera_t/s/b+/l, large, huge
Florence	Multi-task vision-language	base, base-ft, large, large-ft
RTMPose	Keypoint / pose estimation	body-s/m/l, hand, face, wholebody-s/m/l
ByteTrack	Multi-object tracking	—

Runners & Hardware

Every generated package includes a runner — a thin layer that wires up device setup, model loading, and the preprocess -> forward -> postprocess loop. You choose the runner at generation time with --backend, and the generated code is specialized for that backend. No runtime dispatch, no unused code paths.

Backend	Runner	Device	What you get
`cuda`	`CudaRunner`	NVIDIA GPU	PyTorch + `torch.compile` for fused ops after warmup. Broadest model support.
`tensorrt`	`TensorRtRunner`	NVIDIA GPU	Exports and caches TensorRT engines (FP16/INT8). Fastest inference on NVIDIA hardware.
`rocm`	`RocmRunner`	AMD GPU	Same PyTorch code paths as CUDA, built against the ROCm stack.
`openvino`	`OpenVinoRunner`	Intel CPU / iGPU / NPU	Intel-optimized inference via OpenVINO Runtime. Great for NUCs and edge PCs.
`onnx`	`OnnxRunner`	Any	Portable ONNX Runtime with selectable execution providers (CPU, CUDA, ROCm, OpenVINO, NPU).

The CUDA and ROCm runners optionally apply torch.compile for graph-level fusion — the first frame compiles, then every subsequent frame runs the optimized graph. The TensorRT runner goes further: it exports a TensorRT engine on first launch, caches it to disk, and reloads it instantly on subsequent runs. The ONNX runner follows the same caching pattern for its exported .onnx models.

See docs/hardware.md for per-platform details, example commands, and NPU guidance.

Quick Start

pip install -e .

Generate a package:

vision-factory generate \
  --model yolo \
  --backend cuda \
  --variant yolo_v8s \
  --package-name my_detector \
  --output ./output

List available models:

vision-factory list-models

Documentation

Hardware Platforms — per-device guidance, example commands, NPU support
Backends — detailed backend descriptions (cuda, rocm, tensorrt, onnx, openvino)
Architecture — pipeline diagram, template layers, call chain, project structure
Development — adding new models, testing, installation

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs		docs
e2e-output		e2e-output
e2e		e2e
generator		generator
models		models
specs		specs
tests		tests
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vision-factory

Supported Models

Runners & Hardware

Quick Start

Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

vision-factory

Supported Models

Runners & Hardware

Quick Start

Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages