Reproducibly build + serve CUDA-accelerated AI/ML stacks with Flox and Nix

Flox is built on open-source Nix, a reproducible package / environment manager and build system. Nix defines build recipes as code and treats each build as a pure function of its declared inputs. The same Nix expression works one month, one year, even five years after testing.

Flox supplements Nix with private catalogs; FloxHub (a central site to share, version, and manage Flox environments); and declarative environment manifests expressed as TOML.

This repo hosts GPU- and CPU-targeted build recipes, production model-serving runtimes, model quantization tooling, monitoring environments, and other resources for the NVIDIA CUDA ecosystem.

This repo would not exist without the dedicated excellence of the Nix ecosystem and the trailblazing ingenuity and brilliance of the Nix CUDA team. Thanks to both.

What's here

Build recipes

Repository	What it does
pytorch-cuda	Parametric Nix builder for GPU- and CPU-targeted PyTorch. Covers PyTorch 2.8.0 through 2.10.0 on CUDA 12.8 through 13.1. Generates concrete, hardware-specific package definitions from metadata tables.
gpu-specific-pytorch-2.9.1	Example repo to complement this guide. Custom PyTorch 2.9.1 variants with CUDA 12.9.1 support, targeted for specific GPU architectures (SM61 through SM120) and CPU instruction sets (AVX, AVX2, AVX-512, ARMv9).
onnx-runtime	ONNX Runtime 1.18 through 1.24.2 for Python 3.12 and 3.13, CUDA 12.4 and 12.9. Versions segmented across Git branches.
magma	MAGMA 2.9.0 for NVIDIA GPUs. Single-architecture static builds that replace the ~10 GB all-architecture closure in nixpkgs.
vllm	vLLM 0.13.0 through 0.15.1, with 0.16.0 coming soon. Versions segmented across Git branches.
llamacpp	GPU-specific llama.cpp recipes pinned to specific versions, plus recipes that always build latest.

Model-serving runtimes

Each runtime is a declarative, reproducible Flox environment that runs directly on bare metal, in VMs, uncontained on Kubernetes, or as the basis for distroless OCI images. Containers aren't required, but (for container-based workflows) Flox and Nix make containers even better: minimal, truly declarative, deterministic.

Repository	What it serves
triton-runtime	NVIDIA Triton Inference Server v2.66.0 with Python, ONNX Runtime v1.24.2, TensorRT v10.23, and vLLM v0.15.1 backends. HTTP, gRPC, Prometheus metrics, and optional OpenAI-compatible frontend.
triton-runtime with tensorrt-llm	NVIDIA Triton Inference Server v2.66.0 with Python, ONNX Runtime v1.24.2, TensorRT v10.23, TensorRT-LLM 1.10, and vLLM v0.15.1 backends. HTTP, gRPC, Prometheus metrics, and optional OpenAI-compatible frontend.
triton-monitoring	Example Grafana + Prometheus monitoring stack for NVIDIA Triton Inference Server v2.66.0. Just works everywhere: on x86-64 and ARM Linux and macOS; locally, in CI, in prod.
vllm-runtime	vLLM v0.16.0 on CUDA 12.9. OpenAI-compatible API, multi-GPU tensor and pipeline parallelism, multi-source model provisioning (local, HuggingFace, S3, R2), and three-stage model validation.
vllm-monitoring	Example Grafana + Prometheus monitoring stack for vLLM. Just works everywhere: on x86-64 and ARM Linux and macOS; locally, in CI, in prod.
llamacpp-runtime	llama.cpp on CUDA 12.9. Serves GGUF-quantized models via llama-server with continuous batching, Flash Attention, multi-GPU layer splitting, and GGUF artifact validation (magic bytes, header parsing, optional SHA256 pinning).
llamacpp-monitoring	Example Grafana + Prometheus monitoring stack for llama.cpp. Just works everywhere: on x86-64 and ARM Linux and macOS; locally, in CI, in prod.

Quantization and conversion tooling

Repository	What it does
model-quantizer	Quantize HuggingFace models for offline inference. AWQ 4-bit, FP8 via torchao, LLM Compressor (FP8), and GGUF for llama.cpp. Local and production command variants with strict validation, locking, and structured error reporting. x86-64 Linux.
triton-trtllm-tools	Convert HuggingFace models into TensorRT-LLM checkpoints, then compile checkpoints into TensorRT engines for Triton serving. Includes benchmarking, evaluation, pruning, refitting, and local validation tools. x86-64 Linux.

Why build GPU-specific packages?

Generic PyTorch wheels pull in support for more than half a dozen CUDA architectures, plus Intel- and Apple-specific backends. Building for the hardware you actually run:

Shrinks artifacts. A targeted CUDA PyTorch closure is roughly 60% the size of upstream nixpkgs PyTorch. On macOS/Darwin, less than one-third: 2.66 GB. CPU-only builds clock in at just over 1.0 GB.
Improves performance. Compiling for one SM architecture and/or one CPU ISA lets the compiler optimize without compromise.
Reduces the attack surface. Fewer unused backends and code paths mean less is exposed at runtime.
Pins one artifact everywhere. Publish targeted builds for dev, CI, and production instead of relying on whichever upstream packages happen to exist. Prototype and train on GPUs, optimize for GPU or CPU inferencing in eval, run minimal CPU- or GPU-optimized builds in production.

A final reason is that building with Nix is both rewarding and surprisingly straightforward. AI coding tools and agents are exceptionally fluent in Nix's functional language. These tools can reliably generate Nix expressions or flakes that reproduce exactly the same behavior and outcome one month, one year, or five years on.

Why declarative environments for model serving?

Unlike a Dockerfile-first workflow, where the OCI image is the thing you build, tag, and promote, Flox and Nix make the declarative environment the unit of promotion.

The same Flox or Nix environment travels across the SDLC:

On developers' CUDA-accelerated laptops and desktops
On NVIDIA DGX Spark locally
On Slurm-managed GPU clusters for research, eval, and batch inference
On Kubernetes GPU clusters, VMs, or bare metal for eval and production

Changes are atomic edits committed to Git or published to FloxHub as a new generation. Rollbacks are switching to an earlier generation or, in GitOps flows, pointing to an earlier commit.

You can also build minimal, distroless containers from Flox or Nix environments.

NVIDIA CUDA Kickstart Program

These repositories are part of the Flox CUDA Kickstart Program. Flox can help you customize and implement build recipes and serving environments for your own AI/ML CUDA workloads.

@flox · flox.dev · FloxHub · LinkedIn · Bluesky

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reproducibly build + serve CUDA-accelerated AI/ML stacks with Flox and Nix

What's here

Build recipes

Model-serving runtimes

Quantization and conversion tooling

Why build GPU-specific packages?

Why declarative environments for model serving?

NVIDIA CUDA Kickstart Program

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Reproducibly build + serve CUDA-accelerated AI/ML stacks with Flox and Nix

What's here

Build recipes

Model-serving runtimes

Quantization and conversion tooling

Why build GPU-specific packages?

Why declarative environments for model serving?

NVIDIA CUDA Kickstart Program

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages