refactor: decouple model inference from the application and support pluggable inference backends
While looking through the current model path, I noticed that Sugar-AI is still tightly coupled to local Hugging Face inference inside the application layer.
Right now, model loading and inference are part of the main app flow, instead of being treated as a separate runtime concern. That works for a small local setup, but it makes the system harder to extend, deploy, and operate as model support grows.
I think model inference should be decoupled from the business application and exposed through a provider-based runtime.
## Why this is worth changing
Keeping inference tightly coupled to the app has a few downsides:
- the application is tied to one inference path
- adding new backends becomes harder than it should be
- deployment flexibility is limited
- model operations are harder to manage in production
- runtime model switching is awkward
- future support for embedding / rerank services becomes harder to design cleanly
## Proposed direction
Introduce a provider abstraction for inference and allow Sugar-AI to use external inference backends such as Ollama, vLLM, and SGLang, instead of assuming that the app itself is responsible for directly loading and serving Hugging Face models.
## Why these backends are useful

### Ollama
Good fit for simple local and self-hosted setups:
- easy to install and run
- straightforward model management
- practical for development and lightweight deployments
- good default option for contributors who want a low-friction setup
### vLLM
Good fit for higher-throughput serving:
- efficient batching
- strong GPU utilization
- OpenAI-compatible API support
- better serving performance for production-style workloads
### SGLang
Good fit for more advanced inference workflows:
- optimized serving performance
- good support for structured generation patterns
- useful foundation if Sugar-AI grows toward more advanced agentic or multi-step generation flows
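One practical consequence of the OpenAI-compatible APIs mentioned above is that vLLM, SGLang, and recent Ollama versions can all be reached through the same client path. As a hedged illustration (the base URLs and model names below are assumptions, not Sugar-AI configuration), a single request builder could target any of them:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request.

    vLLM, SGLang, and recent Ollama versions all expose an
    OpenAI-compatible /v1/chat/completions endpoint, so one client
    path can cover all of them by changing base_url and model.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative targets (ports and model names are examples only):
#   Ollama: build_chat_request("http://localhost:11434", "llama3", "hello")
#   vLLM:   build_chat_request("http://localhost:8000", "my-model", "hello")
```

Only the base URL and model name differ between backends, which is exactly what makes a shared provider layer cheap to maintain.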
## Suggested architecture
I think the app should move toward something like this:
- application layer
- provider abstraction
- inference backend
The application would only ask for generation through a stable interface, and the actual backend could be swapped independently.
For example, `huggingface-local`, `ollama`, `vllm`, and `sglang` could all implement the same generation contract.
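To make that shared contract concrete, here is a minimal sketch. All names (`InferenceProvider`, `OllamaProvider`, `PROVIDERS`) are illustrative, not from the current codebase:

```python
from abc import ABC, abstractmethod


class InferenceProvider(ABC):
    """Stable generation contract that every backend implements."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        ...


class OllamaProvider(InferenceProvider):
    """One possible backend; the HTTP call itself is elided in this sketch."""

    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # A real implementation would POST to the Ollama API here.
        raise NotImplementedError


# A registry maps provider-type strings to implementations, so the
# application layer never imports a concrete backend directly.
PROVIDERS = {"ollama": OllamaProvider}
```

The application only ever sees `InferenceProvider`; adding `vllm` or `sglang` later means registering one more class, not touching the app flow.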
## Model management
I also think model configuration should be persisted in the database instead of being only environment-driven.
That would allow Sugar-AI to store things like:
- model name
- provider type
- base URL / endpoint
- API key if needed
- context length / metadata
- active/inactive status
This would make it possible to manage models more cleanly from the application side.
## Runtime model switching
On top of that, Sugar-AI should support runtime model switching.
That means:
- register multiple model backends
- mark one as active
- rebuild or swap the in-memory runtime without restarting the whole app
This would make the system much more practical for real deployments and experimentation.
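The register/activate/swap steps above can be sketched as a small thread-safe registry. This is an assumption about shape, not existing Sugar-AI code, and the backends here are plain callables for brevity:

```python
import threading


class ModelRuntime:
    """Holds the currently active backend and swaps it atomically,
    so in-flight requests keep working while the active model changes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._backends = {}   # name -> callable(prompt) -> str
        self._active = None

    def register(self, name: str, backend) -> None:
        with self._lock:
            self._backends[name] = backend

    def activate(self, name: str) -> None:
        with self._lock:
            if name not in self._backends:
                raise KeyError(f"unknown backend: {name}")
            self._active = name

    def generate(self, prompt: str) -> str:
        # Grab a reference under the lock, then call outside it so a
        # slow generation never blocks a model switch.
        with self._lock:
            backend = self._backends[self._active]
        return backend(prompt)
```

Rebuilding a heavyweight in-process runtime (e.g. a local Hugging Face model) would happen before `activate`, so the swap itself stays cheap.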
## Longer-term benefit
This is not only about text generation.
The same pattern would also make it easier later to support:
- embedding providers
- reranker providers
- different retrieval backends
- provider-specific health checks
- model discovery from compatible APIs
So I think this is a good foundational refactor, not just a model-serving convenience change.
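To show that the same pattern extends beyond generation, here is a hedged sketch of an embedding-side contract (all names are hypothetical; `FakeEmbeddingProvider` exists only to show the shape):

```python
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Same provider pattern as generation, applied to embeddings."""

    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]:
        ...

    def health_check(self) -> bool:
        # Real backends would override this with an endpoint ping.
        return True


class FakeEmbeddingProvider(EmbeddingProvider):
    """Trivial stand-in used only to illustrate the contract."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        # One dummy dimension per text; a real provider returns model vectors.
        return [[float(len(t))] for t in texts]
```

Rerankers and retrieval backends would follow the same template, which is why the abstraction pays off beyond chat completion.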
## Acceptance criteria
- inference is no longer hardwired to one local Hugging Face path
- model invocation is abstracted behind a provider layer
- at least one external inference backend is supported
- model/provider configuration can be persisted in the database
- the active model can be changed at runtime
- the app remains usable for local development
If maintainers think this direction makes sense, I’d be happy to help with it.