A Helm chart for deploying a suite of stateless, scalable text embedding and reranking microservices. Built on top of Qdrant's FastEmbed library, this chart provides standard REST APIs for Dense, Sparse, and Reranker models, complete with Prometheus metrics, network policies, and GPU support.
This Helm chart packages three independent but related inference services:

| Service | Description | Default Model | Default Port |
|---|---|---|---|
| Dense | Generates dense vector embeddings from text. | `BAAI/bge-small-en-v1.5` | 8200 |
| Sparse | Generates sparse vector embeddings for text. | `Qdrant/minicoil-v1` | 8201 |
| Reranker | Re-ranks a list of documents based on a query. | `Xenova/ms-marco-MiniLM-L-6-v2` | 8202 |

Each service is deployed as a separate Kubernetes Deployment, exposed via a ClusterIP Service, and can be independently scaled, configured, and enabled or disabled.

```
                 ┌─────────────────────────────────────┐
                 │         Kubernetes Cluster          │
                 │                                     │
┌──────────┐     │  ┌───────────┐    ┌──────────────┐  │
│ Client/  │────▶│  │   Dense   │    │   Reranker   │  │
│ RAG App  │     │  │  Service  │    │   Service    │  │
└──────────┘     │  │  (8200)   │    │    (8202)    │  │
                 │  └───────────┘    └──────────────┘  │
                 │  ┌───────────┐                      │
                 │  │  Sparse   │                      │
                 │  │  Service  │                      │
                 │  │  (8201)   │                      │
                 │  └───────────┘                      │
                 │                                     │
                 │  ┌──────────────────────────────┐   │
                 │  │ Prometheus Metrics Endpoint  │   │
                 │  │ (/metrics on each service)   │   │
                 │  └──────────────────────────────┘   │
                 └─────────────────────────────────────┘
```

- Kubernetes 1.21+
- Helm 3.8+
- (Optional) A CNI plugin that supports `NetworkPolicy` (e.g., Calico, Cilium) if network policies are enabled.
- (Optional) NVIDIA GPU operator and nodes with `nvidia.com/gpu` resources for GPU acceleration.
- (Optional) Prometheus Operator if using `global.monitoring.mode: servicemonitor`.

Optionally export `HF_TOKEN` (needed only for gated models), then add the Helm repository and install the chart:

```bash
export HF_TOKEN="your_huggingface_token_here"

# Add the Helm repository
helm repo add fastembed https://athithya-sakthivel.github.io/fastembed-inference-helm 2>/dev/null || true
helm repo update

# Create namespace if it doesn't exist
kubectl create namespace fastembed --dry-run=client -o yaml | kubectl apply -f -

# Set up Hugging Face token
kubectl create secret generic hf-token \
  --namespace fastembed \
  --from-literal=HF_TOKEN="$HF_TOKEN" \
  --dry-run=client -o yaml | kubectl apply -f -

# Install or upgrade the release
helm upgrade --install fastembed fastembed/fastembed-inference \
  --namespace fastembed \
  --set global.huggingface.existingSecret=hf-token \
  --set dense.preloadModel=true \
  --set sparse.preloadModel=true \
  --set reranker.preloadModel=true \
  --wait \
  --timeout 10m
```

This deploys all three services in CPU-only mode, with model preloading enabled and the defaults for everything else.

The chart is configured via a single values.yaml file. The primary configuration sections are:

| Parameter | Description | Default |
|---|---|---|
| `global.createNamespace` | Create the release namespace if it doesn't exist. | `true` |
| `global.cuda` | Master switch for GPU support. Requires GPU-compatible images. | `false` |
| `global.monitoring.enabled` | Master switch to expose Prometheus `/metrics` endpoints on all services. | `true` |
| `global.monitoring.mode` | Scrape mode: `static` (manual Prometheus config) or `servicemonitor` (CRD). | `static` |
| `global.networkPolicy.enabled` | Enforce namespace-based network isolation. | `true` |
| `global.networkPolicy.allowedNamespaces` | List of namespaces allowed to call these services. | `[inference, indexing, monitoring, kube-prometheus-stack]` |
| `global.huggingface.existingSecret` | Name of a Kubernetes Secret containing an `HF_TOKEN` for gated models. | `""` |

Each service (dense, sparse, reranker) can be configured with the following common parameters:

| Parameter | Description | Dense Default | Sparse Default | Reranker Default |
|---|---|---|---|---|
| `enabled` | Enable or disable the service deployment. | `true` | `true` | `true` |
| `modelName` | Model ID from Hugging Face Hub or a local path. | `BAAI/bge-small-en-v1.5` | `Qdrant/minicoil-v1` | `Xenova/ms-marco-MiniLM-L-6-v2` |
| `batchSize` | Max number of texts/documents per request. | `16` | `16` | `16` |
| `gpuCount` | Number of `nvidia.com/gpu` resources to request (only when `global.cuda: true`). | `1` | `0` | `1` |
| `port` | Container HTTP port. | `8200` | `8201` | `8202` |
| `preloadModel` | Load the ML model on startup instead of lazily on the first request. | `false` | `false` | `true` |
| `replicas` | Number of pods to run when HPA is disabled. | `1` | `1` | `1` |
| `hpa.enabled` | Enable Horizontal Pod Autoscaling based on CPU. | `false` | `false` | `false` |
| `hpa.min` / `hpa.max` | Min/Max replicas for HPA. | 1 / 3 | 1 / 3 | 1 / 3 |
| `hpa.targetCPU` | Target average CPU utilization (percent) for HPA. | `60` | `60` | `60` |
| `pdb.enabled` | Create a PodDisruptionBudget when replicas > 1. | `true` | `true` | `true` |

For a full list of all tunable parameters, see the values.yaml file.
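
As an illustration of the per-service parameters, the sketch below writes a small override file that enables CPU-based autoscaling for the dense service, disables the sparse service, and preloads the reranker model. It assumes the dotted parameter names map to nested YAML keys (for example `hpa.min` becomes `hpa:` / `min:`), which matches the `--set dense.preloadModel=true` form used at install time but should be checked against values.yaml. The file name `values-override.yaml` and the chosen numbers are arbitrary.

```bash
# Write an override file using only parameter names listed in the tables above.
# Assumption: dotted names correspond to nested YAML keys.
cat > values-override.yaml <<'EOF'
dense:
  batchSize: 32        # allow larger request batches for the dense service
  hpa:
    enabled: true      # autoscale on CPU instead of using a fixed replica count
    min: 2
    max: 5
    targetCPU: 70
sparse:
  enabled: false       # do not deploy the sparse service
reranker:
  preloadModel: true   # load the model at startup, not on the first request
EOF
```

The file can then be passed to the same install command with `-f values-override.yaml`, for example `helm upgrade --install fastembed fastembed/fastembed-inference --namespace fastembed -f values-override.yaml`.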
Each service exposes a rich set of Prometheus metrics at the /metrics endpoint, including:

- `dense|sparse|reranker_requests_total` - Total requests by status.
- `dense|sparse|reranker_request_duration_seconds` - Request latency histograms.
- `dense|sparse|reranker_requests_in_progress` - Gauge of in-flight requests.
- `dense|sparse|reranker_errors_total` - Error counts by type.

When global.monitoring.mode: servicemonitor, a Prometheus Operator ServiceMonitor is automatically created to scrape all services. For static mode, you must configure your Prometheus instance to scrape the service endpoints manually. See the monitoring documentation for examples.
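
For `static` mode, a scrape job along the following lines may be enough as a starting point. This is a sketch, not the chart's documented configuration: it assumes the release and namespace are both named `fastembed`, that the Services are called `fastembed-dense-svc`, `fastembed-sparse-svc`, and `fastembed-reranker-svc` (the names used in the port-forward commands in this README), and that Prometheus runs inside the cluster in a namespace permitted by the network policy. The job name and file name are arbitrary.

```bash
# Write a Prometheus scrape_configs fragment targeting the three Services.
# Assumption: Service names and namespace match a release named "fastembed".
cat > fastembed-scrape.yaml <<'EOF'
scrape_configs:
  - job_name: fastembed
    metrics_path: /metrics
    static_configs:
      - targets:
          - fastembed-dense-svc.fastembed.svc:8200
          - fastembed-sparse-svc.fastembed.svc:8201
          - fastembed-reranker-svc.fastembed.svc:8202
EOF
```

The fragment would be merged into the existing Prometheus configuration. Because each target is a ClusterIP Service, every scrape reaches only one pod behind it, so with several replicas the `servicemonitor` mode is likely the better fit.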
By default, the chart runs on CPU for maximum portability and simplicity. To enable GPU acceleration:

- Set `global.cuda: true`.
- Set the desired `gpuCount` on a service (e.g., `reranker.gpuCount: 1`).
- Provide a custom CUDA-enabled container image. The default images are CPU-only.
- Ensure your Kubernetes nodes have the necessary `nvidia.com/gpu` resources.

The sparse service is typically left on CPU. For detailed instructions, see the CUDA documentation.
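
The GPU steps above could be captured in an override file like the one sketched here. Only `global.cuda` and the per-service `gpuCount` keys come from this README; the keys that select the container image are not listed here, so the sketch leaves them as a comment rather than guessing their names. The file name `values-gpu.yaml` is arbitrary.

```bash
# Write GPU-related overrides; image settings are deliberately left out.
cat > values-gpu.yaml <<'EOF'
global:
  cuda: true      # master switch for GPU support
dense:
  gpuCount: 1     # request one nvidia.com/gpu for the dense service
sparse:
  gpuCount: 0     # keep the sparse service on CPU
reranker:
  gpuCount: 1     # request one nvidia.com/gpu for the reranker service
# A CUDA-enabled image must also be configured; look up the image keys
# in values.yaml, since they are not shown in this README.
EOF
```

Before installing with `-f values-gpu.yaml`, it may help to confirm that the nodes advertise GPUs, for instance with `kubectl describe nodes | grep nvidia.com/gpu`.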
The chart enforces a zero-trust networking model using Kubernetes NetworkPolicy resources when global.networkPolicy.enabled: true.

- Default-Deny All: All ingress to pods is denied by default.
- Explicit Allowed Ingress: Only pods in the namespaces listed under `allowedNamespaces` can reach the services.
- Egress to DNS: Always allowed.
- Egress to Internet: Controlled by `global.networkPolicy.allowInternetEgress`. This is required for Hugging Face model downloads on first use if models are not pre-cached.

See the network policy documentation for a detailed explanation.
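
A network policy override might look like the sketch below. The keys are the ones named in this README; `my-rag-app` is a hypothetical client namespace, and the file name is arbitrary. Helm replaces list values wholesale rather than merging them, so any default namespace that still needs access (such as `monitoring` for Prometheus scrapes) has to be listed again.

```bash
# Write network policy overrides. "my-rag-app" is a hypothetical namespace.
cat > values-netpol.yaml <<'EOF'
global:
  networkPolicy:
    enabled: true
    allowedNamespaces:          # replaces the default list entirely
      - inference
      - monitoring
      - my-rag-app
    allowInternetEgress: true   # needed for first-use model downloads
EOF
```

Once models are pre-cached, `allowInternetEgress` could presumably be switched to `false` to tighten egress, though that behaviour should be verified against the network policy documentation.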
All three services share the same management endpoints (`/health`, `/readyz`, `/metrics`); each adds one inference endpoint.

Dense service:

- `POST /embed` - Generates dense embeddings for a list of texts.
- `GET /health` - Liveness check. Returns service status and configuration.
- `GET /readyz` - Readiness probe. Returns `200 OK` when the model is loaded and ready.
- `GET /metrics` - Prometheus metrics endpoint.

See the full dense service documentation for usage examples and supported models.

Sparse service:

- `POST /embed` - Generates sparse embeddings (indices and values) for a list of texts.
- `GET /health` - Liveness check.
- `GET /readyz` - Readiness probe.
- `GET /metrics` - Prometheus metrics endpoint.

See the full sparse service documentation for usage examples and supported models.

Reranker service:

- `POST /rerank` - Re-ranks a list of documents based on a query string. Returns a list of relevance scores.
- `GET /health` - Liveness check.
- `GET /readyz` - Readiness probe.
- `GET /metrics` - Prometheus metrics endpoint.

See the full reranker service documentation for usage examples and supported models.

To smoke-test all three services locally, port-forward them and call each endpoint:

```bash
# Kill existing port-forwards and start all 3
pkill -f "port-forward.*fastembed" 2>/dev/null || true
sleep 1
kubectl port-forward -n fastembed svc/fastembed-dense-svc 8200:8200 &>/dev/null &
kubectl port-forward -n fastembed svc/fastembed-sparse-svc 8201:8201 &>/dev/null &
kubectl port-forward -n fastembed svc/fastembed-reranker-svc 8202:8202 &>/dev/null &
sleep 2

# Test all endpoints compactly with metric values
echo "=== DENSE ===" && curl -sf http://localhost:8200/health && curl -sf http://localhost:8200/readyz && curl -sf -X POST http://localhost:8200/embed -H "Content-Type: application/json" -d '{"texts":["test"]}' && echo "" && curl -sf http://localhost:8200/metrics | grep "dense_requests_total{"
echo "=== SPARSE ===" && curl -sf http://localhost:8201/health && curl -sf http://localhost:8201/readyz && curl -sf -X POST http://localhost:8201/embed -H "Content-Type: application/json" -d '{"texts":["test"]}' && echo "" && curl -sf http://localhost:8201/metrics | grep "sparse_requests_total{"
echo "=== RERANKER ===" && curl -sf http://localhost:8202/health && curl -sf http://localhost:8202/readyz && curl -sf -X POST http://localhost:8202/rerank -H "Content-Type: application/json" -d '{"query":"test","documents":["a","b"]}' && echo "" && curl -sf http://localhost:8202/metrics | grep "reranker_requests_total{"

echo "All services running on localhost:8200-8202"
echo "Stop: pkill -f 'port-forward.*fastembed'"
```

- Service Docs:
- Infrastructure Docs: