refactor: decouple model inference from the application and support pluggable inference backends
While looking through the current model path, I noticed that Sugar-AI is still tightly coupled to local Hugging Face inference inside the application layer.
Right now, model loading and inference are part of the main app flow, instead of being treated as a separate runtime concern. That works for a small local setup, but it makes the system harder to extend, deploy, and operate as model support grows.
I think model inference should be decoupled from the business application and exposed through a provider-based runtime.
## Why this is worth changing
Keeping inference tightly coupled to the app has a few downsides:
- the application is tied to one inference path
- adding new backends becomes harder than it should be
- deployment flexibility is limited
- model operations are harder to manage in production
- runtime model switching is awkward
- future support for embedding / rerank services becomes harder to design cleanly
## Proposed direction
Introduce a provider abstraction for inference and allow Sugar-AI to use external inference backends such as Ollama, vLLM, and SGLang, instead of assuming that the app itself is responsible for directly loading and serving Hugging Face models.
## Why these backends are useful

### Ollama
Good fit for simple local and self-hosted setups:
- easy to install and run
- straightforward model management
- practical for development and lightweight deployments
- good default option for contributors who want a low-friction setup
### vLLM
Good fit for higher-throughput serving:
- efficient batching
- strong GPU utilization
- OpenAI-compatible API support
- better serving performance for production-style workloads
### SGLang
Good fit for more advanced inference workflows:
- optimized serving performance
- good support for structured generation patterns
- useful foundation if Sugar-AI grows toward more advanced agentic or multi-step generation flows
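One practical consequence of the OpenAI-compatible APIs mentioned above is that vLLM, SGLang, and recent Ollama versions can all be reached through the same client path. As a hedged illustration (the base URLs and model names below are assumptions, not Sugar-AI configuration), a single request builder could target any of them:

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request.

    vLLM, SGLang, and recent Ollama versions all expose an
    OpenAI-compatible /v1/chat/completions endpoint, so one client
    path can cover all of them by changing base_url and model.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative targets (ports and model names are examples only):
#   Ollama: build_chat_request("http://localhost:11434", "llama3", "hello")
#   vLLM:   build_chat_request("http://localhost:8000", "my-model", "hello")
```

Only the base URL and model name differ between backends, which is exactly what makes a shared provider layer cheap to maintain.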
## Suggested architecture
I think the app should move toward something like this:
- application layer
- provider abstraction
- inference backend
The application would only ask for generation through a stable interface, and the actual backend could be swapped independently.
For example, `huggingface-local`, `ollama`, `vllm`, and `sglang` could all implement the same generation contract.
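To make that shared contract concrete, here is a minimal sketch. All names (`InferenceProvider`, `OllamaProvider`, `PROVIDERS`) are illustrative, not from the current codebase:

```python
from abc import ABC, abstractmethod


class InferenceProvider(ABC):
    """Stable generation contract that every backend implements."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        ...


class OllamaProvider(InferenceProvider):
    """One possible backend; the HTTP call itself is elided in this sketch."""

    def __init__(self, base_url: str, model: str):
        self.base_url = base_url
        self.model = model

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        # A real implementation would POST to the Ollama API here.
        raise NotImplementedError


# A registry maps provider-type strings to implementations, so the
# application layer never imports a concrete backend directly.
PROVIDERS = {"ollama": OllamaProvider}
```

The application only ever sees `InferenceProvider`; adding `vllm` or `sglang` later means registering one more class, not touching the app flow.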
## Model management
I also think model configuration should be persisted in the database instead of being only environment-driven.
That would allow Sugar-AI to store things like:
- model name
- provider type
- base URL / endpoint
- API key if needed
- context length / metadata
- active/inactive status
This would make it possible to manage models more cleanly from the application side.
## Runtime model switching
On top of that, Sugar-AI should support runtime model switching.
That means:
- register multiple model backends
- mark one as active
- rebuild or swap the in-memory runtime without restarting the whole app
This would make the system much more practical for real deployments and experimentation.
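The register/activate/swap steps above can be sketched as a small thread-safe registry. This is an assumption about shape, not existing Sugar-AI code, and the backends here are plain callables for brevity:

```python
import threading


class ModelRuntime:
    """Holds the currently active backend and swaps it atomically,
    so in-flight requests keep working while the active model changes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._backends = {}   # name -> callable(prompt) -> str
        self._active = None

    def register(self, name: str, backend) -> None:
        with self._lock:
            self._backends[name] = backend

    def activate(self, name: str) -> None:
        with self._lock:
            if name not in self._backends:
                raise KeyError(f"unknown backend: {name}")
            self._active = name

    def generate(self, prompt: str) -> str:
        # Grab a reference under the lock, then call outside it so a
        # slow generation never blocks a model switch.
        with self._lock:
            backend = self._backends[self._active]
        return backend(prompt)
```

Rebuilding a heavyweight in-process runtime (e.g. a local Hugging Face model) would happen before `activate`, so the swap itself stays cheap.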
## Longer-term benefit
This is not only about text generation.
The same pattern would also make it easier later to support:
- embedding providers
- reranker providers
- different retrieval backends
- provider-specific health checks
- model discovery from compatible APIs
So I think this is a good foundational refactor, not just a model-serving convenience change.
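To show that the same pattern extends beyond generation, here is a hedged sketch of an embedding-side contract (all names are hypothetical; `FakeEmbeddingProvider` exists only to show the shape):

```python
from abc import ABC, abstractmethod


class EmbeddingProvider(ABC):
    """Same provider pattern as generation, applied to embeddings."""

    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]:
        ...

    def health_check(self) -> bool:
        # Real backends would override this with an endpoint ping.
        return True


class FakeEmbeddingProvider(EmbeddingProvider):
    """Trivial stand-in used only to illustrate the contract."""

    def embed(self, texts: list[str]) -> list[list[float]]:
        # One dummy dimension per text; a real provider returns model vectors.
        return [[float(len(t))] for t in texts]
```

Rerankers and retrieval backends would follow the same template, which is why the abstraction pays off beyond chat completion.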
## Acceptance criteria
- inference is no longer hardwired to one local Hugging Face path
- model invocation is abstracted behind a provider layer
- at least one external inference backend is supported
- model/provider configuration can be persisted in the database
- the active model can be changed at runtime
- the app remains usable for local development
If maintainers think this direction makes sense, I’d be happy to help with it.