[Feature Request] Add vLLM CPU inference image for SageMaker #5809

@timelfrink

What

Requesting a CPU-only vLLM inference container for SageMaker, similar to the existing GPU-based vLLM images.

Why

This would enable running vLLM on CPU instances for cost-effective workloads that don't require GPU acceleration, such as:

  • Reranking
  • Scoring
  • Embeddings
  • Small generative models

Reference Implementation

I previously submitted PR #5670, which contains a working implementation (Dockerfile + buildspec) validated on an EC2 c5.4xlarge instance:

  • Successful image build (~3.5 GB)
  • Health endpoint returning 200
  • /v1/completions working with facebook/opt-125m
  • Reranker endpoint working with an Alibaba-NLP model
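
For anyone who wants to repeat the first three checks, here is a minimal smoke-test sketch using only the Python standard library. It is not the script used for the PR. The base URL, the `/health` path, the prompt, and the helper names (`build_health_request`, `build_completion_request`, `smoke_test`) are assumptions for illustration; only `/v1/completions` and `facebook/opt-125m` come from the validation above.

```python
import json
import urllib.request


def build_health_request(base_url: str, path: str = "/health") -> urllib.request.Request:
    """Build (but do not send) a GET request for the health endpoint.

    The "/health" default is an assumption; adjust it to whatever the
    container actually exposes.
    """
    return urllib.request.Request(base_url.rstrip("/") + path, method="GET")


def build_completion_request(
    base_url: str,
    model: str = "facebook/opt-125m",
    prompt: str = "Hello, my name is",  # hypothetical prompt
    max_tokens: int = 16,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/completions."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(base_url: str, timeout: float = 30.0) -> str:
    """Check the health endpoint, then request one completion and return its text."""
    # urlopen raises HTTPError for 4xx/5xx, so reaching the check means a 2xx/3xx.
    with urllib.request.urlopen(build_health_request(base_url), timeout=timeout) as resp:
        if resp.status != 200:
            raise RuntimeError(f"health check returned {resp.status}")
    with urllib.request.urlopen(build_completion_request(base_url), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

With the container running locally, `smoke_test("http://localhost:8000")` would exercise both endpoints; the port here is also an assumption.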

Key Implementation Details from PR

  • vLLM v0.15.1 with CPU target build
  • Python 3.12, Ubuntu 22.04
  • tcmalloc + Intel OpenMP for performance
  • Reuses existing SageMaker entrypoint script
  • Tag format: 0.15.1-cpu-py312-ubuntu22.04-sagemaker
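
To make the tag format explicit, the sketch below composes and splits a tag of that shape. The field names and the helpers `image_tag` and `parse_image_tag` are hypothetical and only mirror the single example tag above; the repository's buildspec may derive tags differently.

```python
def image_tag(
    vllm_version: str,
    python_version: str,
    os_release: str,
    device: str = "cpu",
    platform: str = "sagemaker",
) -> str:
    """Compose a tag shaped like <version>-<device>-py<XY>-<os>-<platform>."""
    py = "py" + python_version.replace(".", "")  # "3.12" becomes "py312"
    return f"{vllm_version}-{device}-{py}-{os_release}-{platform}"


def parse_image_tag(tag: str) -> dict:
    """Split a tag of the shape above back into its five fields."""
    parts = tag.split("-")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dash-separated fields, got {len(parts)}: {tag!r}")
    keys = ("vllm_version", "device", "python", "os_release", "platform")
    return dict(zip(keys, parts))


# Reproduces the tag listed above.
print(image_tag("0.15.1", "3.12", "ubuntu22.04"))
```

Splitting on `-` works here only because none of the individual fields contains a dash; a field such as a pre-release version with a dash would need a stricter pattern.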

I understand external contributions are not accepted, so I am filing this as a feature request for the team to consider. The PR can serve as a reference implementation.
