refactor: decouple model inference from the application and support pluggable inference backends #117

@yjing86

Description

While looking through the current model path, I noticed that Sugar-AI is still tightly coupled to local Hugging Face inference inside the application layer.

Right now, model loading and inference are part of the main app flow, instead of being treated as a separate runtime concern. That works for a small local setup, but it makes the system harder to extend, deploy, and operate as model support grows.

I think model inference should be decoupled from the business application and exposed through a provider-based runtime.

Why this is worth changing

Keeping inference tightly coupled to the app has several downsides:

  • the application is tied to one inference path
  • adding new backends becomes harder than it should be
  • deployment flexibility is limited
  • model operations are harder to manage in production
  • runtime model switching is awkward
  • future support for embedding / rerank services becomes harder to design cleanly

Proposed direction

Introduce a provider abstraction for inference and allow Sugar-AI to use external inference backends such as:

  • Ollama
  • vLLM
  • SGLang

instead of assuming that the app itself is responsible for directly loading and serving Hugging Face models.

Why these backends are useful

Ollama

Good fit for simple local and self-hosted setups:

  • easy to install and run
  • straightforward model management
  • practical for development and lightweight deployments
  • good default option for contributors who want a low-friction setup
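To make the low-friction claim concrete, here is a minimal sketch of talking to Ollama's local REST API using only the standard library. It assumes a default Ollama install listening on port 11434; the `build_ollama_request` helper name is mine, not an existing Sugar-AI or Ollama identifier:

```python
import json
import urllib.request

# Default endpoint for a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_ollama_request(model: str, prompt: str) -> dict:
    """Build a non-streaming generation request for Ollama's /api/generate."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate_with_ollama(model: str, prompt: str) -> str:
    """POST the request to Ollama and return the generated text."""
    payload = json.dumps(build_ollama_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

A provider wrapping this would need almost no dependencies, which is part of why Ollama makes a good contributor default.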

vLLM

Good fit for higher-throughput serving:

  • efficient batching
  • strong GPU utilization
  • OpenAI-compatible API support
  • better serving performance for production-style workloads
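Because vLLM exposes an OpenAI-compatible server (by default on port 8000), a vLLM provider could reuse any OpenAI-style client and would mostly be payload construction. A sketch, with `build_vllm_chat_request` being a hypothetical helper name:

```python
def build_vllm_chat_request(model: str, user_message: str, max_tokens: int = 256) -> dict:
    """Build a chat-completion request for vLLM's OpenAI-compatible
    /v1/chat/completions endpoint (default base URL http://localhost:8000/v1)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }
```

The same request shape works for other OpenAI-compatible servers, so one provider implementation could plausibly cover several backends.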

SGLang

Good fit for more advanced inference workflows:

  • optimized serving performance
  • good support for structured generation patterns
  • useful foundation if Sugar-AI grows toward more advanced agentic or multi-step generation flows

Suggested architecture

I think the app should move toward something like this:

  • application layer
  • provider abstraction
  • inference backend

The application would only request generation through a stable interface, and the actual backend could be swapped independently.

For example:

  • huggingface-local
  • ollama
  • vllm
  • sglang

could all implement the same generation contract.
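As a rough illustration of what that shared contract might look like (all names here are hypothetical, not existing Sugar-AI code), each backend would subclass one abstract interface and register under a provider key:

```python
from abc import ABC, abstractmethod

class InferenceProvider(ABC):
    """Stable generation contract the application layer codes against."""

    @abstractmethod
    def generate(self, prompt: str, **options) -> str:
        """Return generated text for the given prompt."""

class EchoProvider(InferenceProvider):
    """Trivial stand-in backend, used here only to demonstrate the contract."""

    def generate(self, prompt: str, **options) -> str:
        return f"echo: {prompt}"

# Real registrations might map "huggingface-local", "ollama", "vllm",
# and "sglang" to their respective provider classes.
PROVIDERS: dict[str, type[InferenceProvider]] = {
    "echo": EchoProvider,
}

def make_provider(name: str) -> InferenceProvider:
    """Instantiate the provider registered under the given key."""
    return PROVIDERS[name]()
```

The application would then only ever call `make_provider(...)` and `generate(...)`, never a backend-specific API.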

Model management

I also think model configuration should be persisted in the database instead of being only environment-driven.

That would allow Sugar-AI to store things like:

  • model name
  • provider type
  • base URL / endpoint
  • API key if needed
  • context length / metadata
  • active/inactive status

This would make it possible to manage models more cleanly from the application side.
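A minimal sketch of what that persistence could look like, using SQLite from the standard library purely for illustration (the table and function names are assumptions, and a real implementation would use whatever database layer Sugar-AI already has):

```python
import sqlite3

# Hypothetical schema covering the fields listed above.
SCHEMA = """
CREATE TABLE IF NOT EXISTS model_configs (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    provider TEXT NOT NULL,
    base_url TEXT,
    api_key TEXT,
    context_length INTEGER,
    is_active INTEGER NOT NULL DEFAULT 0
);
"""

def save_model_config(conn, name, provider, base_url=None,
                      api_key=None, context_length=None, active=False):
    """Persist one model configuration row."""
    conn.execute(
        "INSERT INTO model_configs "
        "(name, provider, base_url, api_key, context_length, is_active) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (name, provider, base_url, api_key, context_length, int(active)),
    )
    conn.commit()

def active_model(conn):
    """Return (name, provider) of the currently active model, or None."""
    return conn.execute(
        "SELECT name, provider FROM model_configs WHERE is_active = 1"
    ).fetchone()
```

With configuration in the database, the admin side of the app can list, add, and activate models without redeploying.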

Runtime model switching

On top of that, Sugar-AI should support runtime model switching.

That means:

  • register multiple model backends
  • mark one as active
  • rebuild or swap the in-memory runtime without restarting the whole app

This would make the system much more practical for real deployments and experimentation.
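The swap itself can be quite small once providers share a contract. A sketch of a runtime holder that replaces the active backend atomically (again, all names here are hypothetical):

```python
import threading

class StaticProvider:
    """Minimal duck-typed backend used only to demonstrate a swap."""

    def __init__(self, reply: str):
        self.reply = reply

    def generate(self, prompt: str) -> str:
        return self.reply

class ModelRuntime:
    """Holds the active provider and allows hot-swapping without a restart."""

    def __init__(self, provider):
        self._lock = threading.Lock()
        self._provider = provider

    def generate(self, prompt: str) -> str:
        # Snapshot the current provider under the lock, then generate
        # outside it so slow backends don't block a swap.
        with self._lock:
            provider = self._provider
        return provider.generate(prompt)

    def switch(self, new_provider) -> None:
        """Atomically replace the active backend."""
        with self._lock:
            self._provider = new_provider
```

Marking a model active in the database would then just trigger a `switch(...)` with a freshly built provider.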

Longer-term benefit

The benefit is not limited to text generation.

The same pattern would also make it easier later to support:

  • embedding providers
  • reranker providers
  • different retrieval backends
  • provider-specific health checks
  • model discovery from compatible APIs

So I think this is a good foundational refactor, not just a model-serving convenience change.

Acceptance criteria

  • inference is no longer hardwired to one local Hugging Face path
  • model invocation is abstracted behind a provider layer
  • at least one external inference backend is supported
  • model/provider configuration can be persisted in the database
  • the active model can be changed at runtime
  • the app remains usable for local development

If maintainers think this direction makes sense, I’d be happy to help with it.
