
Commit 7d238dd

nicolexin and robscott authored
Complete the InferencePool documentation (#673)
* Initial guide for inference pool
* Add extensionReference to the InferencePool spec
* Fix list formatting
* Remove unused labels
* Autogenerate the spec
* Update site-src/api-types/inferencepool.md (Co-authored-by: Rob Scott <[email protected]>)
* Update site-src/api-types/inferencepool.md (Co-authored-by: Rob Scott <[email protected]>)
* Update site-src/api-types/inferencepool.md (Co-authored-by: Rob Scott <[email protected]>)
* Update site-src/api-types/inferencepool.md (Co-authored-by: Rob Scott <[email protected]>)
* Update site-src/api-types/inferencepool.md (Co-authored-by: Rob Scott <[email protected]>)
* Update site-src/api-types/inferencepool.md (Co-authored-by: Rob Scott <[email protected]>)
* Rename llm-pool names in rollout example
* Add use cases for replacing an inference pool
* Rewording the background section
* Create replacing-inference-pool.md
* Replace instructions with a link for how to replace an inference pool
* Update replacing-inference-pool.md
* Update mkdocs.yml
* Update replacing-inference-pool.md
* Update inferencemodel_types.go
* Update inferencepool.md
* Update site-src/guides/replacing-inference-pool.md (Co-authored-by: Rob Scott <[email protected]>)

---------

Co-authored-by: Rob Scott <[email protected]>
1 parent 45209f6 commit 7d238dd

5 files changed, +352 -56 lines changed


api/v1alpha2/inferencemodel_types.go

+1 -1
@@ -126,7 +126,7 @@ type PoolObjectReference struct {
}

// Criticality defines how important it is to serve the model compared to other models.
-// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional(use a pointer), and set no default.
+// Criticality is intentionally a bounded enum to contain the possibilities that need to be supported by the load balancing algorithm. Any reference to the Criticality field must be optional (use a pointer), and set no default.
// This allows us to union this with a oneOf field in the future should we wish to adjust/extend this behavior.
// +kubebuilder:validation:Enum=Critical;Standard;Sheddable
type Criticality string
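
The Criticality enum above is set on an InferenceModel. As a minimal sketch of how that looks in practice, assuming the v1alpha2 InferenceModel spec fields `modelName`, `criticality`, and `poolRef` and purely illustrative names, a workload might declare its criticality like this:

```yaml
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: chat-assistant                 # illustrative name
spec:
  modelName: chat-assistant            # assumed field: the model name requested by clients
  criticality: Critical                # one of Critical, Standard, Sheddable
  poolRef:
    name: vllm-llama3-8b-instruct      # InferencePool serving this model
```

Because the field is an optional pointer with no default, omitting `criticality` leaves it unset rather than silently falling back to a particular level.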

mkdocs.yml

+1
@@ -63,6 +63,7 @@ nav:
- Getting started: guides/index.md
- Adapter Rollout: guides/adapter-rollout.md
- Metrics: guides/metrics.md
+- Replacing an Inference Pool: guides/replacing-inference-pool.md
- Implementer's Guide: guides/implementers.md
- Performance:
- Benchmark: performance/benchmark/index.md

site-src/api-types/inferencepool.md

+43 -15
@@ -7,28 +7,56 @@

## Background

-The InferencePool resource is a logical grouping of compute resources, e.g. Pods, that run model servers. The InferencePool would deploy its own routing, and offer administrative configuration to the Platform Admin.
+The **InferencePool** API defines a group of Pods (containers) dedicated to serving AI models. Pods within an InferencePool share the same compute configuration, accelerator type, base language model, and model server. This abstraction simplifies the management of AI model serving resources, providing a centralized point of administrative configuration for Platform Admins.

-It is expected for the InferencePool to:
+An InferencePool is expected to be bundled with an [Endpoint Picker](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) extension. This extension is responsible for tracking key metrics on each model server (e.g. KV-cache utilization, queue length of pending requests, and active LoRA adapters) and routing incoming inference requests to the optimal model server replica based on these metrics. An EPP can only be associated with a single InferencePool, specified by the [poolName](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L54) and [poolNamespace](https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/config/manifests/inferencepool-resources.yaml#L56) flags. An HTTPRoute can have multiple backendRefs that reference the same InferencePool (and therefore route to the same EPP), or backendRefs that reference different InferencePools (and therefore route to different EPPs).

-- Enforce fair consumption of resources across competing workloads
-- Efficiently route requests across shared compute (as displayed by the PoC)
-
-It is _not_ expected for the InferencePool to:
+Additionally, any Pod that seeks to join an InferencePool would need to support the [model server protocol](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/docs/proposals/003-model-server-protocol), defined by this project, to ensure the Endpoint Picker has adequate information to intelligently route requests.

-- Enforce any common set of adapters or base models are available on the Pods
-- Manage Deployments of Pods within the Pool
-- Manage Pod lifecycle of pods within the pool
+## How to Configure an InferencePool

-Additionally, any Pod that seeks to join an InferencePool would need to support a protocol, defined by this project, to ensure the Pool has adequate information to intelligently route requests.
+The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).

-`InferencePool` has some small overlap with `Service`, displayed here:
+In summary, the InferencePoolSpec consists of 3 major parts:
+
+- The `selector` field specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods.
+- The `targetPortNumber` field defines the port number that the Inference Gateway should route to on model server Pods that belong to this pool.
+- The `extensionRef` field references the [endpoint picker extension](https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/pkg/epp) (EPP) service that monitors key metrics from model servers within the InferencePool and provides intelligent routing decisions.
+
+### Example Configuration
+
+Here is an example InferencePool configuration:
+
+```
+apiVersion: inference.networking.x-k8s.io/v1alpha2
+kind: InferencePool
+metadata:
+  name: vllm-llama3-8b-instruct
+spec:
+  targetPortNumber: 8000
+  selector:
+    app: vllm-llama3-8b-instruct
+  extensionRef:
+    name: vllm-llama3-8b-instruct-epp
+    port: 9002
+    failureMode: FailClose
+```
+
+In this example:
+
+- An InferencePool named `vllm-llama3-8b-instruct` is created in the `default` namespace.
+- It will select Pods that have the label `app: vllm-llama3-8b-instruct`.
+- Traffic routed to this InferencePool will call out to the EPP service `vllm-llama3-8b-instruct-epp` on port `9002` for making routing decisions. If the EPP fails to pick an endpoint, or is not responsive, the request will be dropped.
+- Traffic routed to this InferencePool will be forwarded to port `8000` on the selected Pods.
+
+## Overlap with Service
+
+**InferencePool** has some small overlap with **Service**, displayed here:

<!-- Source: https://docs.google.com/presentation/d/11HEYCgFi-aya7FS91JvAfllHiIlvfgcp7qpi_Azjk4E/edit#slide=id.g292839eca6d_1_0 -->
<img src="/images/inferencepool-vs-service.png" alt="Comparing InferencePool with Service" class="center" width="550" />

-The InferencePool is _not_ intended to be a mask of the Service object, simply exposing the absolute bare minimum required to allow the Platform Admin to focus less on networking, and more on Pool management.
-
-## Spec
+The InferencePool is not intended to be a mask of the Service object. It provides a specialized abstraction tailored for managing and routing traffic to groups of LLM model servers, allowing Platform Admins to focus on pool-level management rather than low-level networking details.

-The full spec of the InferencePool is defined [here](/reference/spec/#inferencepool).
+## Replacing an InferencePool
+Please refer to the [Replacing an InferencePool](/guides/replacing-inference-pool) guide for details on use cases and how to replace an InferencePool.
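
The new Background section above notes that HTTPRoute `backendRefs` can reference an InferencePool. As a minimal sketch, assuming a Gateway named `inference-gateway` and the `vllm-llama3-8b-instruct` pool from the example configuration, a route that sends all matching traffic to a single pool might look like this:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route                      # illustrative name
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway            # assumed Gateway name
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool              # the backendRef targets the pool rather than a Service
      name: vllm-llama3-8b-instruct
```

Requests admitted by this route are then routed by the pool's EPP to a specific model server Pod based on the metrics described above.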

site-src/guides/replacing-inference-pool.md

+59
@@ -0,0 +1,59 @@
# Replacing an InferencePool

## Background

Replacing an InferencePool is a powerful technique for performing various infrastructure and model updates with minimal disruption and built-in rollback capabilities. This method allows you to introduce changes incrementally, monitor their impact, and revert to the previous state if necessary.

## Use Cases

Use cases for replacing an InferencePool include:

- Upgrading or replacing your model server framework
- Upgrading or replacing your base model
- Transitioning to new hardware

## How to replace an InferencePool

To replace an InferencePool:

1. **Deploy new infrastructure**: Create a new InferencePool configured with the new hardware / model server / base model that you chose.
1. **Configure traffic splitting**: Use an HTTPRoute to split traffic between the existing InferencePool and the new InferencePool. The `backendRefs.weight` field controls the traffic percentage allocated to each pool.
1. **Maintain InferenceModel integrity**: Keep your InferenceModel configuration unchanged. This ensures that the system applies the same LoRA adapters consistently across both base model versions.
1. **Preserve rollback capability**: Retain the original nodes and InferencePool during the rollout to facilitate a rollback if necessary.

### Example

You start with an existing InferencePool named `llm-pool-v1`. To replace the original InferencePool, you create a new InferencePool named `llm-pool-v2`. By configuring an **HTTPRoute**, as shown below, you can incrementally split traffic between the original `llm-pool-v1` and new `llm-pool-v2`.

1. Save the following sample manifest as `httproute.yaml`:

    ```yaml
    apiVersion: gateway.networking.k8s.io/v1
    kind: HTTPRoute
    metadata:
      name: llm-route
    spec:
      parentRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: inference-gateway
      rules:
      - backendRefs:
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llm-pool-v1
          weight: 90
        - group: inference.networking.x-k8s.io
          kind: InferencePool
          name: llm-pool-v2
          weight: 10
    ```

1. Apply the sample manifest to your cluster:

    ```
    kubectl apply -f httproute.yaml
    ```

    The original `llm-pool-v1` InferencePool receives most of the traffic, while the `llm-pool-v2` InferencePool receives the rest.

1. Increase the traffic weight gradually for the `llm-pool-v2` InferencePool to complete the new InferencePool rollout.
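
Once you are confident in the new pool, the split can be moved entirely to `llm-pool-v2`. A minimal sketch of that final step, reusing the route and Gateway names from the example above, shifts the full weight to the new pool (dropping the `llm-pool-v1` backendRef would work equally well):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-route
spec:
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: inference-gateway
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool
      name: llm-pool-v2
      weight: 100    # all traffic now flows to the new pool
```

After verifying that `llm-pool-v2` handles production traffic as expected, the old `llm-pool-v1` InferencePool and its nodes can be decommissioned.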
