[Feature Request] Add vLLM CPU inference image for SageMaker #5809

### What
Requesting a CPU-only vLLM inference container for SageMaker, similar to the existing GPU-based vLLM images.

### Why
This would enable running vLLM on CPU instances for cost-effective workloads that don't require GPU acceleration, such as:
- Reranking
- Scoring
- Embeddings
- Small generative models

### Reference Implementation
PR #5670, which I submitted previously, contains a working implementation (Dockerfile + buildspec). It was validated on EC2 (c5.4xlarge) with the following results:
- Successful image build (~3.5 GB)
- Health endpoint returning 200
- `/v1/completions` working with facebook/opt-125m
- Reranker endpoint working with Alibaba-NLP model
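
For anyone who wants to repeat the health and completions checks above, here is a minimal smoke-test sketch using only the Python standard library. It is not taken from the PR. The base URL (port 8000) and the `/health` path are assumptions based on the defaults of vLLM's OpenAI-compatible server, and the helper names are hypothetical; adjust them to however the container is actually started.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16):
    """Build (but do not send) a POST request for the /v1/completions route."""
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def health_ok(base_url, timeout=5.0):
    """Return True if GET <base_url>/health answers with HTTP 200.

    The /health path is an assumption (vLLM server default).
    Connection errors and HTTP error statuses are reported as False.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are subclasses of OSError
        return False


def complete(base_url, model, prompt, timeout=60.0):
    """Send a completion request and return the first generated text."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["text"]
```

With the container running locally, `health_ok("http://localhost:8000")` followed by `complete("http://localhost:8000", "facebook/opt-125m", "Hello, my name is")` would exercise the first two checks. The reranker check is left out because its request shape depends on the model and route used in the PR.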

### Key Implementation Details from PR
- vLLM v0.15.1 with CPU target build
- Python 3.12, Ubuntu 22.04
- tcmalloc + Intel OpenMP for performance
- Reuses existing SageMaker entrypoint script
- Tag format: `0.15.1-cpu-py312-ubuntu22.04-sagemaker`
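
To make the tag format concrete, the sketch below composes it from the version components listed above. The function name and parameters are hypothetical and only illustrate the naming pattern; the actual buildspec in the PR may derive the tag differently.

```python
def build_image_tag(vllm_version, python_version, ubuntu_version):
    """Compose a CPU image tag such as 0.15.1-cpu-py312-ubuntu22.04-sagemaker."""
    # "3.12" becomes "py312": the dot is dropped from the Python version.
    py = "py" + python_version.replace(".", "")
    return f"{vllm_version}-cpu-{py}-ubuntu{ubuntu_version}-sagemaker"


print(build_image_tag("0.15.1", "3.12", "22.04"))
```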

I understand that external contributions are not accepted, so I am filing this as a feature request for the team to consider. The PR can serve as a reference implementation.

