FastEmbed Inference Helm Chart

A Helm chart for deploying a suite of stateless, scalable text embedding and reranking microservices. Built on top of Qdrant's FastEmbed library, this chart provides standard REST APIs for Dense, Sparse, and Reranker models, complete with Prometheus metrics, network policies, and GPU support.

Overview

This Helm chart packages three independent but related inference services:

Service	Description	Default Model	Default Port
Dense	Generates dense vector embeddings from text.	`BAAI/bge-small-en-v1.5`	`8200`
Sparse	Generates sparse vector embeddings for text.	`Qdrant/minicoil-v1`	`8201`
Reranker	Re-ranks a list of documents based on a query.	`Xenova/ms-marco-MiniLM-L-6-v2`	`8202`

Each service is deployed as a separate Kubernetes Deployment, exposed via a ClusterIP Service, and can be independently scaled, configured, and enabled or disabled.

Architecture

                 ┌──────────────────────────────────────┐
                 │  Kubernetes Cluster                  │
                 │                                      │
┌──────────┐     │  ┌───────────┐   ┌──────────────┐    │
│ Client/  │────▶│  │  Dense    │   │  Reranker    │    │
│ RAG App  │     │  │  Service  │   │  Service     │    │
└──────────┘     │  │  (8200)   │   │  (8202)      │    │
                 │  └───────────┘   └──────────────┘    │
                 │  ┌───────────┐                       │
                 │  │  Sparse   │                       │
                 │  │  Service  │                       │
                 │  │  (8201)   │                       │
                 │  └───────────┘                       │
                 │                                      │
                 │  ┌──────────────────────────────┐    │
                 │  │ Prometheus Metrics Endpoint  │    │
                 │  │ (/metrics on each service)   │    │
                 │  └──────────────────────────────┘    │
                 └──────────────────────────────────────┘

Prerequisites

Kubernetes 1.21+
Helm 3.8+
(Optional) A CNI plugin that supports NetworkPolicy (e.g., Calico, Cilium) if network policies are enabled.
(Optional) NVIDIA GPU operator and nodes with nvidia.com/gpu resources for GPU acceleration.
(Optional) Prometheus Operator if using monitoring.mode: servicemonitor.

Quick Start (Idempotent)

Add the Helm repository and install the chart. Every command below is safe to re-run. Exporting HF_TOKEN is only needed for gated models; if the variable is unset, the hf-token Secret is created with an empty value.

export HF_TOKEN="your_huggingface_token_here"

# Add the Helm repository (errors are ignored if it is already registered)
helm repo add fastembed https://athithya-sakthivel.github.io/fastembed-inference-helm 2>/dev/null || true
helm repo update
# Create the namespace if it doesn't exist
kubectl create namespace fastembed --dry-run=client -o yaml | kubectl apply -f -
# Store the Hugging Face token in a Secret
kubectl create secret generic hf-token \
  --namespace fastembed \
  --from-literal=HF_TOKEN="${HF_TOKEN:-}" \
  --dry-run=client -o yaml | kubectl apply -f -
# Install or upgrade the release, preloading all three models
helm upgrade --install fastembed fastembed/fastembed-inference \
  --namespace fastembed \
  --set global.huggingface.existingSecret=hf-token \
  --set dense.preloadModel=true \
  --set sparse.preloadModel=true \
  --set reranker.preloadModel=true \
  --wait \
  --timeout 10m

This deploys all three services in CPU-only mode. Apart from the Hugging Face secret and model preloading, every value keeps its default.

Configuration

The chart is configured via a single values.yaml file. The primary configuration sections are:

Global Settings

Parameter	Description	Default
`global.createNamespace`	Create the release namespace if it doesn't exist.	`true`
`global.cuda`	Master switch for GPU support. Requires GPU-compatible images.	`false`
`global.monitoring.enabled`	Master switch to expose Prometheus `/metrics` endpoints on all services.	`true`
`global.monitoring.mode`	Scrape mode: `static` (manual Prometheus config) or `servicemonitor` (CRD).	`static`
`global.networkPolicy.enabled`	Enforce namespace-based network isolation.	`true`
`global.networkPolicy.allowedNamespaces`	List of namespaces allowed to call these services.	`[inference, indexing, monitoring, kube-prometheus-stack]`
`global.huggingface.existingSecret`	Name of a Kubernetes Secret containing an `HF_TOKEN` for gated models.	`""`

Service-Specific Settings

Each service (dense, sparse, reranker) can be configured with the following common parameters:

Parameter	Description	Dense Default	Sparse Default	Reranker Default
`enabled`	Enable or disable the service deployment.	`true`	`true`	`true`
`modelName`	Model ID from Hugging Face Hub or a local path.	`BAAI/bge-small-en-v1.5`	`Qdrant/minicoil-v1`	`Xenova/ms-marco-MiniLM-L-6-v2`
`batchSize`	Max number of texts/documents per request.	`16`	`16`	`16`
`gpuCount`	Number of `nvidia.com/gpu` resources to request (only when `global.cuda: true`).	`1`	`0`	`1`
`port`	Container HTTP port.	`8200`	`8201`	`8202`
`preloadModel`	Load the ML model on startup instead of lazily on the first request.	`false`	`false`	`true`
`replicas`	Number of pods to run when HPA is disabled.	`1`	`1`	`1`
`hpa.enabled`	Enable Horizontal Pod Autoscaling based on CPU.	`false`	`false`	`false`
`hpa.min`/`hpa.max`	Min/Max replicas for HPA.	`1` / `3`	`1` / `3`	`1` / `3`
`hpa.targetCPU`	Target average CPU utilization for HPA.	`60`	`60`	`60`
`pdb.enabled`	Create a PodDisruptionBudget when `replicas > 1`.	`true`	`true`	`true`

For a full list of all tunable parameters, see the values.yaml file.
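
As a sketch of how the per-service parameters combine, the snippet below writes an override file that enables CPU-based autoscaling for the dense service, raises the sparse replica count, and disables the reranker. The file name values-scaling.yaml and every number in it are illustrative choices, not chart recommendations. The nesting assumes the parameter names in the table sit under each service key, as the --set flags in the Quick Start suggest.

```shell
# Write an illustrative override file; all values are examples only.
cat > values-scaling.yaml <<'EOF'
dense:
  batchSize: 32
  preloadModel: true
  hpa:
    enabled: true      # replicas is only used while HPA is disabled
    min: 2
    max: 6
    targetCPU: 70
sparse:
  replicas: 2          # with pdb.enabled (default true) this also gets a PodDisruptionBudget
reranker:
  enabled: false
EOF
```

Passing the file with -f values-scaling.yaml to the helm upgrade --install command from the Quick Start should apply it on top of the chart defaults.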

Monitoring & Observability

Each service exposes Prometheus metrics at the /metrics endpoint. Metric names are prefixed with the service name (dense, sparse, or reranker), for example dense_requests_total:

<service>_requests_total - Total requests by status.
<service>_request_duration_seconds - Request latency histograms.
<service>_requests_in_progress - Gauge of in-flight requests.
<service>_errors_total - Error counts by type.

When global.monitoring.mode: servicemonitor, a Prometheus Operator ServiceMonitor is automatically created to scrape all services. For static mode, you must configure your Prometheus instance to scrape the service endpoints manually. See the monitoring documentation for examples.
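
For static mode, a scrape job along the following lines may be enough. This is a minimal sketch rather than the chart's documented configuration: the Service names and namespace are taken from the port-forward commands in Example Usage (release fastembed in namespace fastembed), the cluster domain cluster.local is an assumption, and the job name and file name are arbitrary.

```shell
# Write a Prometheus scrape job for the three services (names assumed as described above).
cat > fastembed-scrape.yaml <<'EOF'
scrape_configs:
  - job_name: fastembed
    metrics_path: /metrics
    static_configs:
      - targets:
          - fastembed-dense-svc.fastembed.svc.cluster.local:8200
          - fastembed-sparse-svc.fastembed.svc.cluster.local:8201
          - fastembed-reranker-svc.fastembed.svc.cluster.local:8202
EOF
```

Two caveats apply. A static target that points at a ClusterIP Service reaches only one pod per scrape, so per-pod metrics are incomplete once a service runs several replicas. With network policies enabled, the namespace Prometheus runs in must also appear in global.networkPolicy.allowedNamespaces.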

GPU Support

By default, the chart runs on CPU for maximum portability and simplicity. To enable GPU acceleration:

1. Set global.cuda: true.
2. Set the desired gpuCount on a service (e.g., reranker.gpuCount: 1).
3. Provide a custom CUDA-enabled container image. The default images are CPU-only.
4. Ensure your Kubernetes nodes have the necessary nvidia.com/gpu resources.

The sparse service is typically left on CPU. For detailed instructions, see the CUDA documentation.
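
The values side of these steps can be sketched as an override file. Only keys that appear in the configuration tables are used; the file name values-gpu.yaml is an arbitrary choice. The image override is deliberately left out because the image keys are not listed in this README.

```shell
# Write an illustrative GPU override file.
cat > values-gpu.yaml <<'EOF'
global:
  cuda: true       # master switch; gpuCount only applies when this is true
dense:
  gpuCount: 1
sparse:
  gpuCount: 0      # sparse stays on CPU
reranker:
  gpuCount: 1
# A CUDA-enabled image must be configured as well. The image keys live in the
# chart's values.yaml and are not shown here, so none are guessed.
EOF
```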

Network Security

The chart enforces a zero-trust networking model using Kubernetes NetworkPolicy resources when global.networkPolicy.enabled: true.

Default-Deny All: All ingress to pods is denied by default.
Explicit Allowed Ingress: Only pods in the namespaces listed under allowedNamespaces can reach the services.
Egress to DNS: Always allowed.
Egress to Internet: Controlled by global.networkPolicy.allowInternetEgress. This is required for Hugging Face model downloads on first use if models are not pre-cached.

See the network policy documentation for a detailed explanation.
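
A minimal sketch of admitting one more client namespace and allowing model downloads is shown below. The namespace rag-app and the file name values-netpol.yaml are hypothetical. The default namespaces are repeated because Helm replaces list values on override instead of merging them.

```shell
# Write an illustrative network policy override file.
cat > values-netpol.yaml <<'EOF'
global:
  networkPolicy:
    enabled: true
    allowedNamespaces:
      - inference
      - indexing
      - monitoring
      - kube-prometheus-stack
      - rag-app                # hypothetical extra client namespace
    allowInternetEgress: true  # needed for first-use Hugging Face downloads
EOF
```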

API Endpoints

All three services expose the same management endpoints (GET /health, GET /readyz, and GET /metrics) alongside one inference endpoint each.

Dense Service (`:8200`)

POST /embed - Generates dense embeddings for a list of texts.
GET /health - Liveness check. Returns service status and configuration.
GET /readyz - Readiness probe. Returns 200 OK when the model is loaded and ready.
GET /metrics - Prometheus metrics endpoint.

See the full dense service documentation for usage examples and supported models.

Sparse Service (`:8201`)

POST /embed - Generates sparse embeddings (indices and values) for a list of texts.
GET /health - Liveness check.
GET /readyz - Readiness probe.
GET /metrics - Prometheus metrics endpoint.

See the full sparse service documentation for usage examples and supported models.

Reranker Service (`:8202`)

POST /rerank - Re-ranks a list of documents based on a query string. Returns a list of relevance scores.
GET /health - Liveness check.
GET /readyz - Readiness probe.
GET /metrics - Prometheus metrics endpoint.

See the full reranker service documentation for usage examples and supported models.
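
Because /readyz returns 200 only once the model is loaded, a client script can poll it before sending the first request instead of sleeping for a fixed time. The helper below is a sketch of that idea. The function name wait_ready and its default 60-second timeout are assumptions made for illustration, not part of the chart.

```shell
# Poll a readiness URL until it answers with a 2xx status or the deadline passes.
# Usage: wait_ready URL [TIMEOUT_SECONDS]
wait_ready() {
  wr_url=$1
  wr_deadline=$(( $(date +%s) + ${2:-60} ))
  # curl -f makes non-2xx responses count as failures; --max-time bounds each attempt.
  until curl -sf -o /dev/null --max-time 2 "$wr_url"; do
    if [ "$(date +%s)" -ge "$wr_deadline" ]; then
      echo "timed out waiting for $wr_url" >&2
      return 1
    fi
    sleep 1
  done
}
```

With the port-forwards from Example Usage running, wait_ready http://localhost:8200/readyz 120 would block until the dense model is loaded or two minutes have passed.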

Example Usage

# Kill existing port-forwards and start all three
pkill -f "port-forward.*fastembed" 2>/dev/null || true
sleep 1
kubectl port-forward -n fastembed svc/fastembed-dense-svc 8200:8200 &>/dev/null &
kubectl port-forward -n fastembed svc/fastembed-sparse-svc 8201:8201 &>/dev/null &
kubectl port-forward -n fastembed svc/fastembed-reranker-svc 8202:8202 &>/dev/null &
sleep 2
# Exercise every endpoint, then print each service's request counter
echo "=== DENSE ==="
curl -sf http://localhost:8200/health && curl -sf http://localhost:8200/readyz \
  && curl -sf -X POST http://localhost:8200/embed -H "Content-Type: application/json" -d '{"texts":["test"]}' \
  && echo "" && curl -sf http://localhost:8200/metrics | grep "dense_requests_total{"
echo "=== SPARSE ==="
curl -sf http://localhost:8201/health && curl -sf http://localhost:8201/readyz \
  && curl -sf -X POST http://localhost:8201/embed -H "Content-Type: application/json" -d '{"texts":["test"]}' \
  && echo "" && curl -sf http://localhost:8201/metrics | grep "sparse_requests_total{"
echo "=== RERANKER ==="
curl -sf http://localhost:8202/health && curl -sf http://localhost:8202/readyz \
  && curl -sf -X POST http://localhost:8202/rerank -H "Content-Type: application/json" -d '{"query":"test","documents":["a","b"]}' \
  && echo "" && curl -sf http://localhost:8202/metrics | grep "reranker_requests_total{"
echo "All services running on localhost:8200-8202"
echo "Stop: pkill -f 'port-forward.*fastembed'"

Documentation

Service Docs:
Infrastructure Docs:
