Commit 7eb167a

simplify
Signed-off-by: Dmitry Shmulevich <dshmulevich@nvidia.com>
1 parent 5479531 commit 7eb167a

41 files changed

Lines changed: 724 additions & 659 deletions

charts/topograph/README.md

Lines changed: 2 additions & 2 deletions
@@ -43,7 +43,7 @@ global:
   provider:
     name: dra # or aws, gcp, oci, nebius, netq, infiniband-k8s, ...
   engine:
-    name: k8s # or slurm, slinky
+    name: k8s # or slurm, slinky, graph
 ```
 
 For the full list of values and their defaults, see [`values.yaml`](./values.yaml). Example values files for specific deployment patterns:
@@ -95,7 +95,7 @@ Both are installed together when you install this chart. Their values are access
 - **Project documentation site**: <https://topograph.docs.buildwithfern.com/topograph>
 - **Main repository**: <https://github.com/NVIDIA/topograph>
 - **Provider-specific setup**: `docs/providers/` in the main repository
-- **Engine documentation**: `docs/engines/k8s.md`, `docs/engines/slinky.md`, `docs/engines/slurm.md`
+- **Engine documentation**: `docs/engines/k8s.md`, `docs/engines/slinky.md`, `docs/engines/slurm.md`, `docs/engines/graph.md`
 - **Node-labels reference**: `docs/reference/node-labels.md`
 - **Contributing**: see [`CONTRIBUTING.md`](https://github.com/NVIDIA/topograph/blob/main/CONTRIBUTING.md) in the main repository

charts/topograph/values.schema.json

Lines changed: 1 addition & 1 deletion
@@ -45,7 +45,7 @@
     "name": {
       "type": "string",
       "description": "Scheduler-output engine. Must match a registered engine in pkg/registry/registry.go.",
-      "enum": ["k8s", "slinky", "slurm"]
+      "enum": ["graph", "k8s", "slinky", "slurm"]
     }
   },
   "required": ["name"]

charts/topograph/values.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,10 +4,10 @@
 
 global:
   provider:
-    # name: "aws", "oci", "gcp", "nebius", "netq", "infiniband-k8s", "dra" or "test".
+    # name: "aws", "oci", "gcp", "nebius", "nscale", "netq", "infiniband-k8s", "dra" or "test".
     name: test
   engine:
-    # name: "k8s" or "slinky"
+    # name: "k8s", "slinky", "slurm" or "graph"
     name: k8s
 
   service:

docs/api.md

Lines changed: 3 additions & 3 deletions
@@ -20,7 +20,7 @@ http:
 provider: test
 
 # engine: the engine that topograph will use (optional)
-# Valid options include "slurm", "k8s", or "slinky".
+# Valid options include "slurm", "k8s", "slinky", or "graph".
 # Can be overridden if the engine is specified in a topology request to topograph
 engine: slurm
 
@@ -73,11 +73,11 @@ Topograph exposes three endpoints for interacting with the service. Below are th
   - **creds**: (optional) A key-value map with provider-specific parameters for authentication.
   - **params**: (optional) A key-value map with provider-specific parameters. The `test` provider uses these parameters for response simulation; for complete behavior and examples, see [Test Mode and Test Provider](./providers/test.md).
 - **engine**: (optional) Selects the topology output and provides any engine-specific parameters.
-  - **name**: (optional) A string specifying the topology output, either `slurm`, `k8s`, or `slinky`. This parameter will override the engine set in the topograph config.
+  - **name**: (optional) A string specifying the topology output, either `slurm`, `k8s`, `slinky`, or `graph`. This parameter will override the engine set in the topograph config.
   - **params**: (optional) A key-value map with engine-specific parameters.
     - **plugin**: (optional) Used in: [`slurm`, `slinky`]. A string specifying the cluster-wide topology plugin: `topology/tree` or `topology/block`. For `slurm`, this defaults to `topology/tree` when neither `plugin` nor `topologies` is set. Do not set `plugin` together with `topologies`.
     - **blockSizes**: (optional) Used in: [`slurm`, `slinky`]. An array of block sizes for `topology/block`.
-    - **topologyConfigPath**: Used in: [`slurm`, `slinky`]. Optional for `slurm`; required for `slinky`. For `slurm`, a file path for the topology configuration; if omitted, the topology config content is returned in the HTTP response. For `slinky`, the key for the topology config in the ConfigMap.
+    - **topologyConfigPath**: Used in: [`slurm`, `slinky`, `graph`]. Optional for `slurm` and `graph`; required for `slinky`. For `slurm`, a file path for the topology configuration; if omitted, the topology config content is returned in the HTTP response. For `slinky`, the key for the topology config in the ConfigMap. For `graph`, an existing path on the Topograph host where instance JSON should be written; if omitted, the JSON is returned in the topology response.
     - **topologies**: (optional) Used in: [`slurm`, `slinky`]. A map of named per-partition topology settings. Do not set top-level `plugin` together with `topologies`.
       - **plugin**: Used in: [`slurm`, `slinky`]. A required string specifying the per-partition topology plugin: `topology/tree`, `topology/block`, or `topology/flat`.
       - **blockSizes**: (optional) Used in: [`slurm`, `slinky`]. An array of block sizes for `topology/block`.
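The engine rules listed in the `docs/api.md` hunk can be sketched as a small client-side check. This is a hedged illustration only: `check_engine` and `ENGINES` are hypothetical names, not part of Topograph, and the server may enforce these rules differently or enforce more of them.

```python
from typing import Any

# Engine names as listed in the updated docs/api.md text.
# ENGINES and check_engine are hypothetical helpers, not Topograph APIs.
ENGINES = {"slurm", "k8s", "slinky", "graph"}


def check_engine(engine: dict[str, Any]) -> list[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems: list[str] = []
    name = engine.get("name")  # optional: falls back to the configured engine
    params = engine.get("params") or {}

    if name is not None and name not in ENGINES:
        problems.append(f"unknown engine {name!r}")
    # Top-level plugin and per-partition topologies must not be combined.
    if "plugin" in params and "topologies" in params:
        problems.append("do not set plugin together with topologies")
    # topologyConfigPath is optional for slurm and graph, required for slinky.
    if name == "slinky" and not params.get("topologyConfigPath"):
        problems.append("slinky requires topologyConfigPath")
    # Every named per-partition topology needs its own plugin.
    for partition, cfg in (params.get("topologies") or {}).items():
        if not (cfg or {}).get("plugin"):
            problems.append(f"topology {partition!r} is missing plugin")
    return problems
```

A request with `{"name": "graph"}` and no `params` passes this check, which matches the documented behavior that `graph` returns the JSON in the response when `topologyConfigPath` is omitted.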

docs/engines/graph.md

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# Topograph Graph Engine
+
+The `graph` engine returns instance-oriented topology metadata as JSON. It is intended for clients that need per-instance GPU and placement context rather than scheduler-specific output such as `topology.conf`, Kubernetes node labels, or a Slinky ConfigMap.
+
+The engine preserves the provider/engine boundary: providers still discover topology and instance metadata, while the `graph` engine only assembles and emits the requested instance records.
+
+## Output
+
+By default, the generated JSON is returned in the `/v1/topology` response:
+
+```json
+{
+  "instances": [
+    {
+      "id": "I21",
+      "type": "H100",
+      "provider": "test",
+      "region": "us-west",
+      "network_layers": ["leaf-a", "spine-a"],
+      "attributes": {
+        "nvlink": "nvl-1",
+        "gpu": {
+          "status": "known",
+          "collected_at": "2026-01-01T13:59:00.000Z",
+          "gpus": [
+            {
+              "index": 0,
+              "pci_bus_id": "00000000:0F:00.0",
+              "uuid": "GPU-example",
+              "model": "NVIDIA H100 SXM5 80GB",
+              "memory_mib": 81920
+            }
+          ]
+        }
+      }
+    }
+  ]
+}
+```
+
+Set `engine.params.topologyConfigPath` to write the JSON to an existing validated path on the Topograph host. When `topologyConfigPath` is set, the HTTP result body is `OK`.
+
+## Request
+
+The engine needs the instance IDs to export. Supply `nodes` in the request, or use a provider that can supply compute instances directly. The initial implementation is covered by the `test` provider and model-backed simulation providers.
+
+```json
+{
+  "provider": {
+    "name": "test",
+    "params": {
+      "modelFileName": "small-tree.yaml"
+    }
+  },
+  "engine": {
+    "name": "graph"
+  },
+  "nodes": [
+    {
+      "region": "none",
+      "instances": {
+        "I21": "n-I21",
+        "I22": "n-I22"
+      }
+    }
+  ]
+}
+```
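As a rough client-side illustration of the request and response shapes documented in `docs/engines/graph.md`, the sketch below builds a `graph` request payload and reads GPU models out of a response body. `build_graph_request` and `gpu_models` are hypothetical helper names, not Topograph APIs. The field names come from the JSON examples in the new file, and no HTTP call is made because the submission flow is not shown in this diff.

```python
import json
from typing import Any, Optional


def build_graph_request(model_file: str,
                        instances: dict[str, str],
                        region: str = "none",
                        topology_config_path: Optional[str] = None) -> dict[str, Any]:
    """Build a request payload shaped like the example in docs/engines/graph.md."""
    engine: dict[str, Any] = {"name": "graph"}
    if topology_config_path is not None:
        # With this set, the docs say the JSON is written to the host path
        # and the HTTP result body is "OK" instead of the instance JSON.
        engine["params"] = {"topologyConfigPath": topology_config_path}
    return {
        "provider": {"name": "test", "params": {"modelFileName": model_file}},
        "engine": engine,
        "nodes": [{"region": region, "instances": instances}],
    }


def gpu_models(body: str) -> dict[str, list[str]]:
    """Map each instance id to the GPU models under attributes.gpu.gpus."""
    doc = json.loads(body)
    result: dict[str, list[str]] = {}
    for inst in doc.get("instances", []):
        gpu = (inst.get("attributes") or {}).get("gpu") or {}
        result[inst["id"]] = [g["model"] for g in gpu.get("gpus", [])]
    return result
```

Calling `build_graph_request("small-tree.yaml", {"I21": "n-I21", "I22": "n-I22"})` reproduces the request example from the new documentation page.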

docs/get-started/quickstart-k8s.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 # Install on Kubernetes
 
-Topograph installs on a Kubernetes cluster via a Helm chart. The same chart supports two engines:
+Topograph installs on a Kubernetes cluster via a Helm chart. This quickstart covers the two Kubernetes-facing scheduler engines:
 
 - **[`k8s` engine](#engine-k8s)** — labels Kubernetes nodes with topology keys so schedulers (native `podAffinity`, KAI Scheduler, Kueue TAS, etc.) can make topology-aware placement decisions
 - **[`slinky` engine](#engine-slinky)** — writes Slurm topology configuration into a `ConfigMap` for [Slinky](https://github.com/SlinkyProject) (Slurm-on-Kubernetes) deployments

docs/index.yml

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,8 @@ navigation:
         path: engines/k8s.md
       - page: Slinky
         path: engines/slinky.md
+      - page: Graph
+        path: engines/graph.md
 
   - section: Reference
     contents:

docs/modeling.md

Lines changed: 13 additions & 12 deletions
@@ -126,8 +126,9 @@ The `nodes` map describes compute nodes directly. Each key is the node name. The
 
 | Field | Description |
 |---|---|
-| `name` | Optional. If set, it must match the map key. Usually omitted. |
-| `capacity_block_id` | Optional accelerated domain ID. If set and `capacity_blocks` is omitted, Topograph creates the corresponding capacity block entry. |
+| `id` | Optional. If set, it must match the map key. Usually omitted. |
+| `type` | Optional instance type metadata used by instance-oriented exports. |
+| `capacity_block` | Optional accelerated domain ID. If set and `capacity_blocks` is omitted, Topograph creates the corresponding capacity block entry. |
 | `attributes.nvlink` | Optional accelerated-domain / NVLink identifier. Used by block topology simulation paths. |
 | `attributes.status` | Optional node status metadata. |
 | `attributes.timestamp` | Optional timestamp metadata. |
@@ -138,7 +139,7 @@ Example:
 ```yaml
 nodes:
   n1:
-    capacity_block_id: cb1
+    capacity_block: cb1
     attributes:
       nvlink: nvl1
   n2:
@@ -148,10 +149,10 @@ nodes:
 
 Node rules:
 
-- `capacity_block_id` is optional.
-- Nodes without `capacity_block_id` are still valid compute nodes.
-- If `capacity_block_id` is set and `capacity_blocks` is omitted, Topograph creates the capacity block and adds the node to it.
-- If a node is listed under `capacity_blocks.<id>.nodes`, Topograph fills in the node's missing `capacity_block_id`.
+- `capacity_block` is optional.
+- Nodes without `capacity_block` are still valid compute nodes.
+- If `capacity_block` is set and `capacity_blocks` is omitted, Topograph creates the capacity block and adds the node to it.
+- If a node is listed under `capacity_blocks.<id>.nodes`, Topograph fills in the node's missing `capacity_block`.
 - If both sides specify different capacity block IDs for the same node, model loading fails.
 
 ## Capacity Blocks
@@ -210,7 +211,7 @@ After YAML parsing, Topograph completes the model before simulation uses it:
 - Node names are copied from their map keys.
 - Switch names are copied from their map keys.
 - Missing nodes can be created from `capacity_blocks.<id>.nodes`.
-- Missing capacity block entries can be created from node `capacity_block_id` values.
+- Missing capacity block entries can be created from node `capacity_block` values.
 - Node `NetLayers` is derived from the switch path from leaf to root.
 - Node `Metadata` is built by merging switch metadata along the same path.
 - `Instances` is derived from node names and grouped by `metadata.region`; nodes without a region use `none`.
@@ -249,12 +250,12 @@ After loading:
 
 ### Capacity Blocks From Nodes
 
-This model omits `capacity_blocks`. Topograph creates `cb1` from `n1.capacity_block_id`.
+This model omits `capacity_blocks`. Topograph creates `cb1` from `n1.capacity_block`.
 
 ```yaml
 nodes:
   n1:
-    capacity_block_id: cb1
+    capacity_block: cb1
     attributes:
       nvlink: nvl1
   n2:
@@ -275,7 +276,7 @@ This is valid. It declares a capacity block that currently has no nodes.
 ```yaml
 nodes:
   n1:
-    capacity_block_id: cb1
+    capacity_block: cb1
 
 capacity_blocks:
   cb1: {}
@@ -368,5 +369,5 @@ Before using a new model in a regression test:
 - Confirm every switch child has only one parent.
 - Confirm every switched node is defined in `nodes` or generated from `capacity_blocks`.
 - Confirm no node appears under two switches.
-- Confirm capacity block membership does not conflict with node `capacity_block_id`.
+- Confirm capacity block membership does not conflict with node `capacity_block`.
 - Run the relevant provider simulation test or API flow with the target engine.
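The node and capacity-block completion rules described in the `docs/modeling.md` hunks can be sketched as a two-pass reconciliation over a parsed model. This is an assumption-laden illustration rather than Topograph's loader: `complete_capacity_blocks` is a hypothetical name, the model is taken to be a plain dictionary as produced by a YAML parser, and adding a node to an already-declared block it names is a guess the documentation does not spell out.

```python
from typing import Any


def complete_capacity_blocks(model: dict[str, Any]) -> dict[str, Any]:
    """Reconcile nodes[*].capacity_block with capacity_blocks[*].nodes in place."""
    nodes = model.setdefault("nodes", {})
    blocks = model.setdefault("capacity_blocks", {})

    # Pass 1 (blocks -> nodes): create missing nodes and fill in a node's
    # missing capacity_block; conflicting IDs make loading fail.
    for block_id in list(blocks):
        block = blocks[block_id] = blocks[block_id] or {}
        for name in block.setdefault("nodes", []):
            node = nodes.get(name)
            if node is None:
                node = nodes[name] = {}
            current = node.get("capacity_block")
            if current is not None and current != block_id:
                raise ValueError(
                    f"node {name!r}: capacity_block {current!r} conflicts with {block_id!r}")
            node["capacity_block"] = block_id

    # Pass 2 (nodes -> blocks): create missing block entries from node values.
    # Assumption: the node is also added to a block that already exists.
    for name in list(nodes):
        node = nodes[name] = nodes[name] or {}  # a bare "n2:" parses as None
        block_id = node.get("capacity_block")
        if block_id is None:
            continue  # nodes without capacity_block are still valid
        members = blocks.setdefault(block_id, {}).setdefault("nodes", [])
        if name not in members:
            members.append(name)
    return model
```

Running it on the "Capacity Blocks From Nodes" example, where `n1` sets `capacity_block: cb1` and `capacity_blocks` is omitted, yields a `cb1` entry containing `n1`, while `n2` stays a valid node with no block.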

docs/overview.md

Lines changed: 1 addition & 0 deletions
@@ -49,6 +49,7 @@ Currently supported engines:
 - [SLURM](./engines/slurm.md)
 - [Kubernetes](./engines/k8s.md)
 - [SLURM-on-Kubernetes (Slinky)](./engines/slinky.md)
+- [Graph](./engines/graph.md) - returns instance-oriented topology metadata as JSON
 
 ### Choosing a Provider
 

docs/providers/test.md

Lines changed: 1 addition & 1 deletion
@@ -93,7 +93,7 @@ The test provider is configured through `provider.params` in the `/v1/generate`
 }
 ```
 
-The `engine` object follows the normal Topograph engine configuration. For example, use `slurm` parameters to request `topology/tree` or `topology/block` output, use `k8s` parameters to write node labels, or use `slinky` parameters to update a Slinky ConfigMap.
+The `engine` object follows the normal Topograph engine configuration. For example, use `slurm` parameters to request `topology/tree` or `topology/block` output, use `k8s` parameters to write node labels, use `slinky` parameters to update a Slinky ConfigMap, or use `graph` to return model-backed instance metadata as JSON.
 
 ### Test Provider Parameters
 
