
Conversation

liavweiss (Contributor) commented Dec 11, 2025

Re-enable Gemma Embedding Model Support

Task

Fix #790

Summary

This PR re-enables support for the google/embeddinggemma-300m gated model across the codebase, following the resolution of HuggingFace token access (HF_TOKEN is now configured by maintainers in CI).

Background

The EmbeddingGemma-300m model (google/embeddinggemma-300m) is a gated model on HuggingFace that requires authentication via HF_TOKEN. Due to CI/CD authentication limitations, Gemma support was previously disabled in work related to Issue #573 to allow tests to pass without the gated model.

Now that the maintainer has configured HF_TOKEN in the CI environment, we can restore full Gemma embedding model support.

Changes Made

1. Model Download Configuration (tools/make/models.mk)

  • ✅ Added embeddinggemma-300m to download-models-minimal target
  • ✅ Added embeddinggemma-300m to download-models-lora target
  • ✅ Updated comments to reflect that HF_TOKEN is now available in CI
  • ✅ Added graceful skip logic with informative messages when HF_TOKEN is not available

2. Go Test Constants (candle-binding/semantic-router_test.go)

  • ✅ Set GemmaEmbeddingModelPath = "../models/embeddinggemma-300m"
  • ✅ Removed t.Skip() from InitGemmaOnly test
  • ✅ Updated test assertions to handle both Qwen3 (1024-dim) and Gemma (768-dim) embeddings
  • ✅ Added isModelInitializationError checks for graceful test skipping when the model is unavailable (see the sketch below)
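
For illustration, a minimal Go sketch of the test shape this describes. GetEmbedding and the stub error are hypothetical stand-ins for the real candle-binding calls; the isModelInitializationError guard and the 1024/768 dimensions are the PR's own behavior:

```go
package candle_test

import (
	"errors"
	"strings"
	"testing"
)

// Stand-ins for the real binding; names and error text are hypothetical.
var errModelInit = errors.New("model initialization failed: models/embeddinggemma-300m not found")

func GetEmbedding(text string) ([]float32, error) { return nil, errModelInit }

func isModelInitializationError(err error) bool {
	return err != nil && strings.Contains(err.Error(), "initialization failed")
}

func TestEmbeddingDimensions(t *testing.T) {
	emb, err := GetEmbedding("hello world")
	if err != nil {
		if isModelInitializationError(err) {
			// Gated model not downloaded (e.g., no HF_TOKEN): skip, don't fail.
			t.Skip("embedding model unavailable; skipping:", err)
		}
		t.Fatalf("unexpected embedding error: %v", err)
	}
	// Qwen3 emits 1024-dim vectors and Gemma 768-dim; accept either,
	// depending on which backend was initialized.
	if len(emb) != 1024 && len(emb) != 768 {
		t.Fatalf("unexpected embedding dimension %d (want 1024 for Qwen3 or 768 for Gemma)", len(emb))
	}
}
```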

3. Rust Embedding Initialization (candle-binding/src/ffi/embedding.rs)

Key Fix: Made Gemma model initialization optional to prevent embedding initialization failures when Gemma cannot be loaded.

  • ✅ Modified init_embedding_models() to continue with Qwen3-only initialization if Gemma fails to load
  • ✅ Changed Gemma registration error handling from return false to warning log + continue
  • ✅ Added informative warning messages explaining that Gemma is optional
  • ✅ This fix ensures that embedding functionality remains available even when the Gemma model is not downloaded (e.g., missing HF_TOKEN for gated models); see the sketch below
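
The actual change lives in Rust (candle-binding/src/ffi/embedding.rs); purely as a language-neutral illustration of the continue-on-failure pattern, here is a Go sketch in which all function names and the Qwen3 path are hypothetical (only the Gemma path matches this PR's config):

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Hypothetical loaders standing in for the real Rust initialization.
func loadQwen3(path string) error { return nil }

func loadGemma(path string) error {
	return errors.New("gated model not downloaded (missing HF_TOKEN?)")
}

// initEmbeddingModels mirrors the described behavior: Qwen3 is required,
// Gemma is optional, and a Gemma failure is downgraded to a warning.
func initEmbeddingModels(qwen3Path, gemmaPath string) error {
	if err := loadQwen3(qwen3Path); err != nil {
		return fmt.Errorf("required Qwen3 model failed to load: %w", err)
	}
	if err := loadGemma(gemmaPath); err != nil {
		// Previously this aborted initialization; now we log and continue
		// so embedding stays available in Qwen3-only mode.
		log.Printf("warning: optional Gemma model not loaded: %v; continuing with Qwen3 only", err)
	}
	return nil
}

func main() {
	if err := initEmbeddingModels("models/qwen3-embedding", "models/embeddinggemma-300m"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("embedding initialized (Gemma optional)")
}
```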

4. Model Manager Graceful Skip Logic (src/model_manager/)

  • __init__.py: Added GatedModelError exception handling in ensure_all() to gracefully skip gated models when HF_TOKEN is not available

    • Logs warning messages instead of failing
    • Continues processing other models
    • Returns partial results (models that were successfully downloaded)
  • downloader.py: Enhanced error handling to properly detect and convert gated model errors:

    • Handles RepositoryNotFoundError (404) for gated models (HuggingFace returns 404 instead of 401 to avoid revealing repository existence)
    • Detects known gated models (e.g., "embeddinggemma", "gemma") in repository IDs
    • Converts authentication-related errors to GatedModelError for consistent handling
    • Handles GatedRepoError from the huggingface_hub library (see the sketch below)
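
As a sketch of the detection-and-conversion pattern described above, in Go (the files-changed list shows the Go-side downloader at src/semantic-router/pkg/modeldownload/downloader.go); the function and variable names here are assumptions, not the PR's actual identifiers:

```go
package modeldownload

import (
	"fmt"
	"strings"
)

// GatedModelError marks a download failure caused by a gated repository.
type GatedModelError struct {
	Repo string
	Err  error
}

func (e *GatedModelError) Error() string {
	return fmt.Sprintf("gated model %q requires HF_TOKEN: %v", e.Repo, e.Err)
}

func (e *GatedModelError) Unwrap() error { return e.Err }

// knownGatedHints are substrings of repository IDs known to be gated.
var knownGatedHints = []string{"embeddinggemma", "gemma"}

// classifyDownloadError converts authentication-shaped failures on known
// gated repositories into GatedModelError so callers can skip them uniformly.
// HuggingFace returns 404 (not 401) for gated repos to avoid revealing
// whether the repository exists, so 404 is treated as auth-shaped too.
func classifyDownloadError(repoID string, status int, err error) error {
	if err == nil {
		return nil
	}
	authShaped := status == 401 || status == 403 || status == 404
	if !authShaped {
		return err
	}
	for _, hint := range knownGatedHints {
		if strings.Contains(strings.ToLower(repoID), hint) {
			return &GatedModelError{Repo: repoID, Err: err}
		}
	}
	return err
}
```

A caller can then match with errors.As, log a warning, and continue with the remaining models, which is the ensure_all() behavior described above.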

5. E2E Profile Configurations

  • ✅ Updated e2e/profiles/ai-gateway/values.yaml: Added gemma_model_path
  • ✅ Updated e2e/profiles/dynamic-config/values.yaml:
    • Set gemma_model_path to "models/embeddinggemma-300m"
    • Changed EMBEDDING_MODEL_OVERRIDE from "qwen3" to "auto" for intelligent model selection
  • ✅ Verified e2e/profiles/routing-strategies/values.yaml: Already configured correctly

6. Helm Chart Configuration (deploy/helm/semantic-router/values.yaml)

  • ✅ Added embeddinggemma-300m to initContainer.models list
  • ✅ Configured initContainer.env to use HF_TOKEN from Kubernetes secret (hf-token-secret)
  • ✅ Set optional: true to allow deployment even if the secret doesn't exist (for local testing)
  • ✅ Simplified init container script to gracefully skip Gemma download if HF_TOKEN is not available

7. GitHub Actions Workflow Updates

The following GitHub Actions workflows were updated to include HF_TOKEN for downloading gated models:

  1. integration-test-k8s.yml - Kubernetes E2E integration tests
  2. test-and-build.yml - Main test and build workflow
  3. integration-test-docker.yml - Docker Compose integration tests
  4. performance-test.yml - Performance benchmarking tests
  5. performance-nightly.yml - Nightly performance baseline tests
  6. integration-test-helm.yml - Helm chart installation tests

All workflows now:

  • Pass HF_TOKEN: ${{ secrets.HF_TOKEN }} as an environment variable to the model download steps
  • Create Kubernetes secrets (hf-token-secret) when HF_TOKEN is available
  • Include graceful skip logic for forks where HF_TOKEN is not available

8. E2E Framework HF_TOKEN Secret Creation (e2e/pkg/framework/runner.go)

  • Added createHFTokenSecret() function: Creates a Kubernetes secret named hf-token-secret in the vllm-semantic-router-system namespace from the HF_TOKEN environment variable (a sketch follows this list)

    • Ensures the namespace exists (creating it if necessary) before creating the secret
    • Handles cases where the secret already exists (updates it) or the namespace doesn't exist yet
    • The secret is namespace-scoped and must be in the same namespace as the semantic-router deployment
  • Added secret creation call in Run() method: After creating the Kubernetes client, the framework checks for the HF_TOKEN environment variable and automatically creates the secret if present

    • Logs appropriate messages indicating whether the secret was created successfully, or if HF_TOKEN is not set
    • This fix resolves the 401 Unauthorized and GatedRepoError issues that were preventing Gemma model downloads in E2E tests
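
A hedged sketch of how such a helper might look with client-go; the secret and namespace names come from this PR, while the function signature and the HF_TOKEN key inside the secret are assumptions:

```go
package framework

import (
	"context"
	"fmt"
	"os"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const (
	hfSecretName    = "hf-token-secret"
	routerNamespace = "vllm-semantic-router-system"
)

// createHFTokenSecret creates (or updates) hf-token-secret from the
// HF_TOKEN environment variable, ensuring the namespace exists first.
func createHFTokenSecret(ctx context.Context, client kubernetes.Interface) error {
	token := os.Getenv("HF_TOKEN")
	if token == "" {
		fmt.Println("HF_TOKEN not set; skipping secret creation (Gemma download will be skipped)")
		return nil
	}

	// Ensure the namespace exists before creating a namespaced secret.
	ns := &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: routerNamespace}}
	if _, err := client.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil && !apierrors.IsAlreadyExists(err) {
		return fmt.Errorf("ensuring namespace %s: %w", routerNamespace, err)
	}

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: hfSecretName, Namespace: routerNamespace},
		Type:       corev1.SecretTypeOpaque,
		StringData: map[string]string{"HF_TOKEN": token},
	}
	_, err := client.CoreV1().Secrets(routerNamespace).Create(ctx, secret, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Secret exists from a previous run; update it in place.
		_, err = client.CoreV1().Secrets(routerNamespace).Update(ctx, secret, metav1.UpdateOptions{})
	}
	return err
}
```

Using StringData lets the API server handle base64 encoding, and the update-on-exists branch covers reruns against the same cluster.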

9. Helm Deployment Template Fix (deploy/helm/semantic-router/templates/deployment.yaml)

  • ✅ Fixed a YAML indentation issue in the initContainer.env section (nindent 10 changed to nindent 8) that caused Helm installation failures when initContainer.env was actually defined
  • ✅ Simplified init container script to conditionally download Gemma only when HF_TOKEN is available

Testing

  • ✅ Go linter: Passed (0 issues)
  • ✅ Go mod tidy: Passed
  • ✅ Configuration files verified: All changes in place
  • ✅ Model download: Correctly attempts to download Gemma (401 expected without HF_TOKEN)
  • ✅ Graceful skip: Verified that the system continues to work with Qwen3-only when Gemma is unavailable
  • With HF_TOKEN available (maintainer CI): Gemma model downloads successfully, full embedding functionality available
  • Without HF_TOKEN (fork PRs): Gemma download is skipped gracefully, system continues with Qwen3-only mode, tests pass
  • E2E Tests: dynamic-config profile now passes because embedding initialization no longer fails when Gemma is unavailable

netlify bot commented Dec 11, 2025

Deploy Preview for vllm-semantic-router ready!

Name | Link
🔨 Latest commit | 31800e3
🔍 Latest deploy log | https://app.netlify.com/projects/vllm-semantic-router/deploys/6953b71bbd2b710008873d12
😎 Deploy Preview | https://deploy-preview-816--vllm-semantic-router.netlify.app

github-actions bot commented Dec 11, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 Root Directory

Owners: @rootfs, @Xunzhuo
Files changed:

  • .github/workflows/integration-test-docker.yml
  • .github/workflows/integration-test-k8s.yml
  • .github/workflows/test-and-build.yml

📁 candle-binding

Owners: @rootfs
Files changed:

  • candle-binding/semantic-router_test.go
  • candle-binding/src/ffi/embedding.rs

📁 deploy

Owners: @rootfs, @Xunzhuo
Files changed:

  • deploy/docker-compose/docker-compose.ci.yml
  • deploy/docker-compose/docker-compose.yml
  • deploy/helm/semantic-router/templates/deployment.yaml
  • deploy/helm/semantic-router/values.yaml

📁 e2e

Owners: @Xunzhuo
Files changed:

  • e2e/pkg/framework/runner.go
  • e2e/profiles/ai-gateway/values.yaml
  • e2e/profiles/dynamic-config/values.yaml

📁 src

Owners: @rootfs, @Xunzhuo, @wangchen615
Files changed:

  • src/semantic-router/pkg/modeldownload/downloader.go

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

liavweiss force-pushed the feature/gemma-model-enable branch 3 times, most recently from 45e4516 to 52451b1 on December 11, 2025 at 21:05
liavweiss marked this pull request as draft on December 11, 2025 at 21:18
liavweiss (Contributor, Author) commented:

This PR will remain in draft for now—I’m waiting to confirm whether an HF_TOKEN is already configured.

rootfs (Collaborator) commented Dec 12, 2025

@liavweiss the HF_TOKEN is configured

liavweiss force-pushed the feature/gemma-model-enable branch 2 times, most recently from 8aa1c5e to cec298c on December 13, 2025 at 18:46
liavweiss marked this pull request as ready for review on December 13, 2025 at 18:47
liavweiss marked this pull request as draft on December 13, 2025 at 18:54
liavweiss (Contributor, Author) commented Dec 13, 2025

> @liavweiss the HF_TOKEN is configured

@rootfs I added debug lines into test-ci-compose (in the download models section) and the output confirms the issue:

  • Repository context: vllm-project/semantic-router (workflow runs in upstream ✅)
  • Event: pull_request
  • PR head repo: liavweiss/semantic-router (PR is from a fork)
  • HF_TOKEN: Not available ❌

Root cause: GitHub Actions does not make secrets available in pull_request events when the PR is from a fork, even though the workflow runs in the upstream repository context. This is a security feature to prevent malicious fork code from accessing secrets.

Current Solution

The workflow gracefully skips the Gemma download when the token is unavailable (currently implemented only in the Docker Compose integration workflow), allowing PRs to pass while still running full tests on push events (after merge), where secrets are available.

Options

  1. Keep current approach (graceful skip) - Secure, but contributors can't test Gemma on PRs (unless they configure their own HF_TOKEN secret in their fork)
  2. Use pull_request_target - Allows secrets, but has security implications
  3. Hybrid: Keep graceful skip + document manual testing via workflow_dispatch with fork secrets

What do you think about these approaches?
My suggestion is to keep the current approach (graceful skip). It's secure and allows contributors to test manually via workflow_dispatch with their fork secrets (but we need to document it).

rootfs (Collaborator) commented Dec 15, 2025

@liavweiss thanks for the analysis. Let's skip it then. If there is a need to use gemma, we'll revisit this issue.

liavweiss force-pushed the feature/gemma-model-enable branch from cec298c to fee395f on December 15, 2025 at 20:07
liavweiss marked this pull request as ready for review on December 15, 2025 at 20:08
liavweiss force-pushed the feature/gemma-model-enable branch 3 times, most recently from f745a5c to ee7c0c5 on December 15, 2025 at 20:33
liavweiss marked this pull request as draft on December 15, 2025 at 20:46
liavweiss force-pushed the feature/gemma-model-enable branch from ee7c0c5 to a156f28 on December 16, 2025 at 09:35
liavweiss marked this pull request as ready for review on December 16, 2025 at 09:36
liavweiss force-pushed the feature/gemma-model-enable branch 12 times, most recently from 5812d70 to d08388d on December 21, 2025 at 13:27
liavweiss (Contributor, Author) commented:

PR is ready for review

liavweiss force-pushed the feature/gemma-model-enable branch 2 times, most recently from 1389599 to 2cdb2bc on December 23, 2025 at 07:47
liavweiss (Contributor, Author) commented:

After merging #862, my branch needs a deep refactor. Please don't review it at the moment.

liavweiss force-pushed the feature/gemma-model-enable branch from f8407d4 to 66b621c on December 24, 2025 at 14:31
liavweiss force-pushed the feature/gemma-model-enable branch from 66b621c to e13013f on December 25, 2025 at 08:31
liavweiss (Contributor, Author) commented:

PR is ready for review

Development

Successfully merging this pull request may close these issues:

  • [Feature] Re-enable EmbeddingGemma-300m Support with HF_TOKEN Configuration (#790)