Commit c895eb6

ravisoundar authored and dmitsh committed
Added top level results attribute to the json marshalling
Signed-off-by: Ravi Shankar <ravish@nvidia.com>
1 parent c44d300 commit c895eb6

8 files changed

Lines changed: 415 additions & 50 deletions

docs/api.md

Lines changed: 2 additions & 2 deletions
@@ -15,7 +15,7 @@ http:
   ssl: false
 
 # provider: the provider that topograph will use (optional)
-# Valid options include "aws", "oci", "gcp", "nebius", "netq", "dra", "infiniband-k8s", "infiniband-bm" or "test".
+# Valid options include "aws", "oci", "gcp", "nebius", "nscale", "netq", "dra", "infiniband-k8s", "infiniband-bm" or "test".
 # Can be overridden if the provider is specified in a topology request to topograph
 provider: test
 
@@ -69,7 +69,7 @@ Topograph exposes three endpoints for interacting with the service. Below are th
 - **Payload:** The request body is a JSON object organized into three top-level sections:
 
   - **provider**: (optional) Selects the topology source and provides any provider-specific authentication or parameters.
-    - **name**: (optional) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `nebius`, `netq`, `dra`, `infiniband-k8s`, `infiniband-bm` or `test`. This parameter will override the provider set in the topograph config.
+    - **name**: (optional) A string specifying the Service Provider, such as `aws`, `oci`, `gcp`, `nebius`, `nscale`, `netq`, `dra`, `infiniband-k8s`, `infiniband-bm` or `test`. This parameter will override the provider set in the topograph config.
     - **creds**: (optional) A key-value map with provider-specific parameters for authentication.
     - **params**: (optional) A key-value map with provider-specific parameters. The `test` provider uses these parameters for response simulation; for complete behavior and examples, see [Test Mode and Test Provider](./providers/test.md).
   - **engine**: (optional) Selects the topology output and provides any engine-specific parameters.

docs/overview.md

Lines changed: 2 additions & 1 deletion
@@ -37,6 +37,7 @@ Currently supported providers:
 - [OCI](./providers/oci.md)
 - [GCP](./providers/gcp.md)
 - [Nebius](./providers/nebius.md)
+- [Nscale](./providers/nscale.md)
 - [NetQ](./providers/netq.md)
 - [DRA](./providers/dra.md) — reads `nvidia.com/gpu.clique` labels set by the NVIDIA GPU operator DRA driver
 - [InfiniBand (bare-metal)](./providers/infiniband.md#infiniband-bm-bare-metal)
@@ -53,7 +54,7 @@ Currently supported engines:
 
 | Scenario | Recommended provider |
 |---|---|
-| Cloud cluster (AWS, GCP, OCI, Nebius) | Use the matching CSP provider |
+| Cloud cluster (AWS, GCP, OCI, Nebius, Nscale) | Use the matching CSP provider |
 | Spectrum-X fabric | [NetQ](./providers/netq.md) |
 | Multi-Node NVLink (MNNVL), infrastructure visibility | [NetQ](./providers/netq.md) |
 | MNNVL on Kubernetes (scheduling) | [DRA](./providers/dra.md) |

docs/providers/nscale.md

Lines changed: 194 additions & 0 deletions
@@ -0,0 +1,194 @@
+# Nscale Topology Provider
+
+The `nscale` topology provider reads topology data from the Nscale Radar API and converts it into Topograph's canonical three-tier topology graph.
+
+The provider uses two Nscale APIs:
+
+- **Radar API**: returns each instance's network path via `GET /v1/topology`
+- **Instance API**: returns instance metadata via `GET /v2/instances?organizationID=<org>&regionID=<region>`
+
+The Radar response supplies the provider instance ID, switch path, and optional block ID. The Instance API response maps provider instance IDs to hostnames using `metadata.id` and `metadata.name`; this is used by the Slurm engine when Topograph discovers Slurm nodes automatically.
+
+## When to Use This Provider
+
+Use this provider for Nscale environments where Radar is the topology source. It is most commonly used with the Slurm engine to generate `topology.conf` from the current Slurm node list.
+
+If the request payload supplies explicit `nodes`, Topograph uses those instance ID to node name mappings directly. If `nodes` is omitted and the Slurm engine is used, Topograph runs `scontrol show nodes -o`, asks the Nscale Instance API for the instance catalog in the configured region, and keeps entries whose `metadata.name` matches a Slurm node name.
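
Reviewer sketch: the auto-discovery rule described above (keep only catalog entries whose `metadata.name` matches a Slurm node name) could look roughly like the Go below. The `instance` type and the `matchSlurmNodes` helper are hypothetical names chosen for illustration; they are not taken from the Topograph code in this commit.

```go
package main

import "fmt"

// instance holds the two Instance API fields the doc relies on
// (metadata.id and metadata.name). The struct itself is hypothetical.
type instance struct {
	ID   string
	Name string
}

// matchSlurmNodes returns an instance-ID -> hostname map restricted to
// instances whose name appears in the Slurm node list.
func matchSlurmNodes(catalog []instance, slurmNodes []string) map[string]string {
	known := make(map[string]struct{}, len(slurmNodes))
	for _, name := range slurmNodes {
		known[name] = struct{}{}
	}
	matched := make(map[string]string)
	for _, inst := range catalog {
		if _, ok := known[inst.Name]; ok {
			matched[inst.ID] = inst.Name
		}
	}
	return matched
}

func main() {
	catalog := []instance{
		{ID: "inst-a", Name: "node001"},
		{ID: "inst-b", Name: "node002"},
		{ID: "inst-c", Name: "login01"}, // not in the Slurm node list, so dropped
	}
	fmt.Println(matchSlurmNodes(catalog, []string{"node001", "node002"}))
}
```

Using a set for the Slurm names keeps the match linear in the size of the catalog, which matters if the region holds many instances that are not part of the cluster.
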
+
+## Prerequisites
+
+- A Radar API endpoint reachable from the Topograph host
+- An Instance API endpoint reachable from the Topograph host
+- An Nscale organization ID
+- An API token with permission to read topology and instance metadata
+- The Nscale region ID for the cluster
+- For Slurm auto-discovery, `scontrol` must be available to the Topograph process
+
+## Credentials
+
+| Field | Required | Description |
+|---|---|---|
+| `org` | Yes | Nscale organization ID |
+| `token` | Yes | Bearer token used for Radar and Instance API requests |
+| `region` | Required for Slurm auto-discovery | Nscale region ID used for Instance API lookup and Slurm region assignment |
+
+Store credentials in a YAML file:
+
+```yaml
+org: <ORGANIZATION_ID>
+token: <API_TOKEN>
+region: <REGION_ID>
+```
+
+Reference that file from the Topograph config:
+
+```yaml
+credentialsPath: /etc/topograph/nscale-credentials.yaml
+```
+
+Credentials can also be supplied directly in the topology request payload under `provider.creds`.
+
+## Parameters
+
+| Field | Required | Description |
+|---|---|---|
+| `radarApiUrl` | Yes | Base URL for the Radar API, for example `https://radar.example.com` |
+| `instanceApiUrl` | Yes | Base URL for the Instance API, for example `https://api.example.com` |
+| `trimTiers` | No | Number of highest topology tiers to trim from output. Defaults to `0` |
+
+The top-level Topograph `pageSize` setting controls pagination for the Radar topology request.
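
Reviewer sketch: one possible reading of `trimTiers`, assuming "highest tiers" means the entries closest to the core in a core-to-leaf switch path. That interpretation and the `trimTopTiers` helper are assumptions made for illustration; the doc does not spell out the exact semantics.

```go
package main

import "fmt"

// trimTopTiers drops the first n entries of a core-to-leaf switch path.
// Assumption: the "highest" tiers are the ones closest to the core.
func trimTopTiers(path []string, n int) []string {
	if n <= 0 {
		return path // default of 0 leaves the path untouched
	}
	if n >= len(path) {
		return nil // nothing left after trimming
	}
	return path[n:]
}

func main() {
	path := []string{"core-1", "spine-3", "leaf-12"}
	fmt.Println(trimTopTiers(path, 1))
}
```
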
+
+## Configuration
+
+Example Topograph config for Slurm:
+
+```yaml
+http:
+  port: 49021
+  ssl: false
+
+provider: nscale
+engine: slurm
+
+requestAggregationDelay: 15s
+credentialsPath: /etc/topograph/nscale-credentials.yaml
+
+providerParams:
+  radarApiUrl: https://radar.example.com
+  instanceApiUrl: https://api.example.com
+
+engineParams:
+  plugin: topology/tree
+  topologyConfigPath: /etc/slurm/topology.conf
+```
+
+Example request payload:
+
+```json
+{
+  "provider": {
+    "name": "nscale",
+    "creds": {
+      "org": "<ORGANIZATION_ID>",
+      "token": "<API_TOKEN>",
+      "region": "<REGION_ID>"
+    },
+    "params": {
+      "radarApiUrl": "https://radar.example.com",
+      "instanceApiUrl": "https://api.example.com"
+    }
+  },
+  "engine": {
+    "name": "slurm",
+    "params": {
+      "plugin": "topology/tree"
+    }
+  }
+}
+```
+
+If you already have the instance ID to hostname mapping, you can include it explicitly:
+
+```json
+{
+  "provider": {
+    "name": "nscale",
+    "creds": {
+      "org": "<ORGANIZATION_ID>",
+      "token": "<API_TOKEN>",
+      "region": "<REGION_ID>"
+    },
+    "params": {
+      "radarApiUrl": "https://radar.example.com",
+      "instanceApiUrl": "https://api.example.com"
+    }
+  },
+  "engine": {
+    "name": "slurm"
+  },
+  "nodes": [
+    {
+      "region": "<REGION_ID>",
+      "instances": {
+        "<INSTANCE_ID_1>": "node001",
+        "<INSTANCE_ID_2>": "node002"
+      }
+    }
+  ]
+}
+```
+
+## How It Works
+
+For each region in the compute instance list, the provider fetches topology pages from Radar:
+
+```text
+GET <radarApiUrl>/v1/topology?limit=<pageSize>&offset=<offset>
+Authorization: Bearer <token>
+X-Organization: <org>
+X-Region: <region>
+```
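
Reviewer sketch: the paged Radar request documented above could be assembled with Go's standard `net/http` and `net/url` packages roughly as follows. The path, query parameters, and header names come from the doc; the `newRadarRequest` helper and its signature are hypothetical and not taken from the provider implementation.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"strings"
)

// newRadarRequest builds one page request for GET <radarApiUrl>/v1/topology.
// The helper name and parameter list are illustrative assumptions.
func newRadarRequest(baseURL, token, org, region string, limit, offset int) (*http.Request, error) {
	query := url.Values{}
	query.Set("limit", strconv.Itoa(limit))
	query.Set("offset", strconv.Itoa(offset))

	// Tolerate a trailing slash on the configured base URL.
	endpoint := strings.TrimRight(baseURL, "/") + "/v1/topology?" + query.Encode()
	req, err := http.NewRequest(http.MethodGet, endpoint, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("X-Organization", org)
	req.Header.Set("X-Region", region)
	return req, nil
}

func main() {
	req, err := newRadarRequest("https://radar.example.com", "<API_TOKEN>", "<ORGANIZATION_ID>", "<REGION_ID>", 100, 0)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```

A caller would advance `offset` by the page size on each iteration; how the provider detects the last page is not stated in the doc, so that loop is left out here.
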
+
+Each returned instance is translated as follows:
+
+| Radar field | Topograph field |
+|---|---|
+| `instance_id` | Instance ID |
+| `network_node_path[0]` | Core tier |
+| `network_node_path[1]` | Spine tier |
+| `network_node_path[2]` | Leaf tier |
+| `block_id` | Accelerator / NVLink domain |
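
Reviewer sketch: the field translation in the table above, written as a small Go function. The JSON field names (`instance_id`, `network_node_path`, `block_id`) come from the doc; treating path entries and the block ID as strings, and the `radarInstance`, `placement`, `decodeRadarInstance`, and `toPlacement` names, are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// radarInstance models one Radar entry. Element types are assumed to be strings.
type radarInstance struct {
	InstanceID      string   `json:"instance_id"`
	NetworkNodePath []string `json:"network_node_path"`
	BlockID         string   `json:"block_id"` // optional in the Radar response
}

// placement is a hypothetical flattened view of the three tiers plus the block.
type placement struct {
	InstanceID  string
	Core        string
	Spine       string
	Leaf        string
	Accelerator string
}

// decodeRadarInstance parses a single Radar entry from JSON.
func decodeRadarInstance(raw []byte) (radarInstance, error) {
	var r radarInstance
	err := json.Unmarshal(raw, &r)
	return r, err
}

// toPlacement applies the index-to-tier mapping: [0]=core, [1]=spine, [2]=leaf.
func toPlacement(r radarInstance) (placement, error) {
	if len(r.NetworkNodePath) < 3 {
		return placement{}, fmt.Errorf("instance %q: want 3 path entries, got %d",
			r.InstanceID, len(r.NetworkNodePath))
	}
	return placement{
		InstanceID:  r.InstanceID,
		Core:        r.NetworkNodePath[0],
		Spine:       r.NetworkNodePath[1],
		Leaf:        r.NetworkNodePath[2],
		Accelerator: r.BlockID,
	}, nil
}

func main() {
	raw := []byte(`{"instance_id":"inst-a","network_node_path":["core-1","spine-3","leaf-12"],"block_id":"block-7"}`)
	r, err := decodeRadarInstance(raw)
	if err != nil {
		panic(err)
	}
	p, err := toPlacement(r)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", p)
}
```
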
+
+For Slurm auto-discovery, the provider also fetches instance metadata:
+
+```text
+GET <instanceApiUrl>/v2/instances?organizationID=<org>&regionID=<region>
+Authorization: Bearer <token>
+```
+
+It builds the same map produced by:
+
+```bash
+curl -s -H "Authorization: Bearer $TOKEN" \
+  "$INSTANCE_API_URL/v2/instances?organizationID=$ORG&regionID=$REGION" \
+  | jq -r '.[] | "\(.metadata.id)\t\(.metadata.name)"'
+```
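
Reviewer sketch: a Go equivalent of the `jq` filter above. The `.[]` in the filter implies the Instance API returns a top-level JSON array whose elements carry `metadata.id` and `metadata.name`; the `instanceRecord` type and `hostnamesByID` helper are hypothetical names, not the provider's own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// instanceRecord captures only the fields the jq filter reads.
type instanceRecord struct {
	Metadata struct {
		ID   string `json:"id"`
		Name string `json:"name"`
	} `json:"metadata"`
}

// hostnamesByID decodes an Instance API response body into an ID -> hostname map.
func hostnamesByID(body []byte) (map[string]string, error) {
	var records []instanceRecord
	if err := json.Unmarshal(body, &records); err != nil {
		return nil, fmt.Errorf("decode instance list: %w", err)
	}
	byID := make(map[string]string, len(records))
	for _, rec := range records {
		byID[rec.Metadata.ID] = rec.Metadata.Name
	}
	return byID, nil
}

func main() {
	body := []byte(`[{"metadata":{"id":"inst-a","name":"node001"}},{"metadata":{"id":"inst-b","name":"node002"}}]`)
	byID, err := hostnamesByID(body)
	if err != nil {
		panic(err)
	}
	fmt.Println(byID)
}
```
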
+
+## Verifying the Output
+
+First verify that the Instance API returns the hostnames Slurm knows:
+
+```bash
+curl -s -H "Authorization: Bearer $TOKEN" \
+  "$INSTANCE_API_URL/v2/instances?organizationID=$ORG&regionID=$REGION" \
+  | jq -r '.[] | "\(.metadata.id)\t\(.metadata.name)"'
+```
+
+Then trigger topology generation:
+
+```bash
+id=$(curl -s -X POST -H "Content-Type: application/json" -d @payload.json http://localhost:49021/v1/generate)
+curl -s "http://localhost:49021/v1/topology?uid=$id"
+```
+
+For the Slurm engine, verify that the generated `topology.conf` contains the expected switch hierarchy or block topology for the Nscale instances.

internal/httpreq/httpreq.go

Lines changed: 0 additions & 2 deletions
@@ -125,8 +125,6 @@ func DoRequestWithRetries(f RequestFunc, insecureSkipVerify bool) ([]byte, *http
 	for {
 		attempt++
 		resp, body, err := DoRequest(f, insecureSkipVerify)
-		// TODO: remove the line below after troubleshooting is completed
-		klog.Infof("BODY: %s", string(body))
 		if err == nil || attempt == maxRetries || !ShouldRetry(err.Code()) {
 			return body, err
 		}
