
Commit 5621975

make importExternalBuffer() async (among other changes)

1 parent d3e2be5

1 file changed: mltensor-explainer.md (+44 / -41 lines)
@@ -27,6 +27,7 @@ An `MLTensor` is an opaque tensor which may be created, written to, and read from
 ## Non-Goals
 
 * Guarantee *zero-copy* buffer-sharing between WebNN and WebGPU
+* Synchronize work between WebNN and WebGPU without CPU involvement
 * Provide partial views over an `MLTensor`
 
 ## Key Scenarios
@@ -55,13 +56,19 @@ await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer3});
 // Proposed approach to reuse a given input buffer, using an input MLTensor
 
 // Copy the image data into the required format.
-const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE_TO});
-mlContext.writeBuffer(imageAsMlTensor, imageAsArrayBuffer);
+const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE});
+mlContext.writeTensor(imageAsMlTensor, imageAsArrayBuffer);
 
 // Execute the graphs - no additional copies!
 mlContext.dispatch(graph1, {'input': imageAsMlTensor}, {'output': outputMlTensor1});
 mlContext.dispatch(graph2, {'input': imageAsMlTensor}, {'output': outputMlTensor2});
 mlContext.dispatch(graph3, {'input': imageAsMlTensor}, {'output': outputMlTensor3});
+
+await Promise.all([
+  mlContext.readTensor(outputMlTensor1, outputArrayBuffer1),
+  mlContext.readTensor(outputMlTensor2, outputArrayBuffer2),
+  mlContext.readTensor(outputMlTensor3, outputArrayBuffer3),
+]);
 ```
 
 ### Chained Inference
@@ -81,19 +88,19 @@ await mlContext.compute(graph2, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer2});
 await mlContext.compute(graph3, {'input': imageAsArrayBuffer}, {'output': outputArrayBuffer3});
 ```
 
-Using `MLTensor`s enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit - so far as data dependencies are respected such that each `MLTensor` is guaranteed to be modified in the order the methods using the tensor are called from script. In this example, the ML context should be working continuously from the `writeBuffer()` call until the work for the last `readBuffer()` completes. Better utilization of the ML context will result in significantly better throughput.
+Using `MLTensor` enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit, so long as data dependencies are respected: each `MLTensor` is guaranteed to be modified in the order in which the methods using the tensor are called from script. In this example, the ML context should be working continuously from the `writeTensor()` call until the work for the last `readTensor()` completes. Better utilization of the ML context will result in significantly better throughput.
 
 ```js
 // Proposed approach to queue tasks to the ML context timeline
 
 // Post a task to the ML context timeline to allocate and zero out a tensor,
 // then return to script.
-const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE_TO});
+const imageAsMlTensor = await mlContext.createTensor({..., usage: MLTensorUsage.WRITE});
 
 // Post a task to the ML context timeline to write to the tensor. Note that we do
 // not await completion of this write. The ML context will ensure any operations
 // which depend on the contents of `imageAsMlTensor` will queue behind this task.
-mlContext.writeBuffer(imageAsMlTensor, imageAsArrayBuffer);
+mlContext.writeTensor(imageAsMlTensor, imageAsArrayBuffer);
 
 // Post a task to the ML context timeline to execute the graph. The ML context will
 // ensure this queues behind the write above.
@@ -110,7 +117,7 @@ const outputs = await Promise.all([
   outputMlTensor1,
   outputMlTensor2,
   outputMlTensor3
-].map((tensor) => { return mlContext.readBuffer(tensor); }));
+].map((tensor) => { return mlContext.readTensor(tensor); }));
 ```
 
 Since the queueing mechanism respects data dependencies, chained inference allows an `MLTensor` to be passed as an output from one graph and then immediately as an input to the next. A collection of graphs and buffers may be repeatedly dispatched without the need for synchronization via script.
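
For instance, a minimal sketch of chaining two graphs through an intermediate tensor (using hypothetical `graphA`/`graphB` and tensors assumed to have been created as in the examples above):

```js
// Neither dispatch is awaited; the ML context queues the second dispatch
// behind the first because both touch `intermediateMlTensor`.
mlContext.dispatch(graphA, {'input': inputMlTensor}, {'output': intermediateMlTensor});
mlContext.dispatch(graphB, {'input': intermediateMlTensor}, {'output': outputMlTensor});

// Only the final readback needs to be awaited.
const result = await mlContext.readTensor(outputMlTensor);
```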
@@ -125,20 +132,20 @@ const add = builder.add(fn1, fn2);
 const graph = await builder.build({'F_n': add});
 
 const usages = [
-  MLTensorUsage.WRITE_TO, // To initialize F_0
-  MLTensorUsage.WRITE_TO, // To initialize F_1
+  MLTensorUsage.WRITE, // To initialize F_0
+  MLTensorUsage.WRITE, // To initialize F_1
   0
 ];
-usages[N % 3] |= MLTensorUsage.READ_FROM; // To read the output
+usages[N % 3] |= MLTensorUsage.READ; // To read the output
 
 const tensors = await Promise.all([
   mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[0]}),
   mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[1]}),
   mlContext.createTensor({dataType: "int32", shape: [1], usage: usages[2]})
 ]);
 
-mlContext.writeBuffer(tensors[0], new Int32Array([0])); // F_0 = 0
-mlContext.writeBuffer(tensors[1], new Int32Array([1])); // F_1 = 1
+mlContext.writeTensor(tensors[0], new Int32Array([0])); // F_0 = 0
+mlContext.writeTensor(tensors[1], new Int32Array([1])); // F_1 = 1
 
 for (let n = 2; n <= N; n++) {
   // Each dispatch depends on tensors used in the previous dispatch.
@@ -147,7 +154,7 @@ for (let n = 2; n <= N; n++) {
     {'F_n': tensors[n % 3]});
 }
 
-const f_n = new Int32Array(await mlContext.readBuffer(tensors[N % 3]))[0];
+const f_n = new Int32Array(await mlContext.readTensor(tensors[N % 3]))[0];
 ```
 
 ### Resource Management
@@ -179,7 +186,8 @@ mlContext.dispatch(graph1, inputs, outputs);
 // Explicitly ask for its resources to be released!
 graph1.destroy();
 
-// We can selectively release only the resources we expect won't be needed.
+// We can selectively release only the resources we expect won't be needed
+// by calling destroy() on a subset of MLTensors.
 destroyBuffers(inputs);
 // Don't destroy the output tensors yet, in case we want to reuse them later.
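
A minimal sketch of the `destroyBuffers()` helper used above, assuming `inputs` is a record of named `MLTensor`s (as passed to `dispatch()`) whose values each expose `destroy()`:

```js
// Release the resources backing each named MLTensor in the record.
function destroyBuffers(namedTensors) {
  for (const tensor of Object.values(namedTensors)) {
    tensor.destroy();
  }
}
```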

@@ -193,9 +201,9 @@ const constant = builder.constant(descriptor, veryLargeBufferOfWeights);
 
 A privacy-conscious user wants to perform real-time selfie segmentation of a video feed on their local device.
 
-Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`. This is unlikely to be performed in real-time.
+Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`.
 
-An `MLTensor` may be imported into WebGPU, which in the best case provides zero-copy buffer sharing between the two APIs, and in all cases provides a synchronization mechanism between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines), avoiding the need for expensive synchronization via script.
+An `MLTensor` may be imported into WebGPU, minimizing the number of buffer copies required to render the results of some ML compute. Zero-copy buffer sharing between the two APIs may be supported in some cases.
 
 ```js
 // Create a couple MLTensors to be used to facilitate WebGPU interop.
@@ -205,8 +213,8 @@ const mlTensor2 = await mlContext.createTensor({..., usage: MLTensorUsage.WEBGPU_INTEROP});
 const applyEffectToFrame = async () => {
   const gpuVideoTexture = gpuDevice.importExternalTexture({source: video});
 
-  // Rent out the MLTensor to WebGPU.
-  const tensorizedGpuBuffer = gpuDevice.importExternalBuffer(mlTensor1);
+  // Wait for all ML work involving `mlTensor1` to complete, then rent it out to WebGPU.
+  const tensorizedGpuBuffer = await gpuDevice.importExternalBuffer(mlTensor1);
 
   // Create a bind group for `gpuVideoTexture`, create a command encoder, etc.
   // to "tensorize" `gpuVideoTexture` and store the result in `tensorizedGpuBuffer`
@@ -225,8 +233,8 @@ const applyEffectToFrame = async () => {
     /*outputs=*/{'output': mlTensor2},
   );
 
-  // Rent the other MLTensor out to WebGPU.
-  const tensorizedGpuBufferAfterInference = gpuDevice.importExternalBuffer(mlTensor2);
+  // Wait for all ML work involving `mlTensor2` to complete, then rent it out to WebGPU.
+  const tensorizedGpuBufferAfterInference = await gpuDevice.importExternalBuffer(mlTensor2);
 
   // Create a bind group for `tensorizedGpuBufferAfterInference`,
   // create a command encoder, etc to feed `tensorizedGpuBufferAfterInference`
@@ -258,21 +266,15 @@ The WebNN API requires the developer to declare how an `MLTensor` will be used (
 
 For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.WEBGPU_INTEROP` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy.
 
-The `MLTensorUsage.READ_FROM` and `MLTensorUsage.WRITE_TO` flags likewise are hints to the user agent indicating that the underlying data will be read and written to, respectively, by script.
+The `MLTensorUsage.READ` and `MLTensorUsage.WRITE` flags are likewise hints to the user agent, indicating that the underlying data will be read from and written to, respectively, by script.
 
 ### Importing an `MLTensor` to WebGPU
 
-Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`, though cross-device buffer sharing may require expensive data copies. Sharing the tensor requires coordinating between the respective WebNN and WebGPU timelines. Below is an example of how the user agent may coordinate this handoff:
+Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, the import will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer.
+
+While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor.
 
-- Two fences are created:
-  1. a "start access" fence which is to be signaled by WebNN and waited on by WebGPU. A data copy may be required alongside the signaling of this fence
-  2. an "end access" fence which is to be signaled by WebGPU and waited on by WebNN. A data copy may be required alongside the signaling of this fence
-- The `GPUDevice` enqueues a command to its `GPUQueue` to wait for the "start access" fence to be signaled
-- WebNN will signal the "start access" fence after the completion of all currently-enqueued operations that use the `MLTensor` which is to be imported (this is very similar to how [`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync) works)
-- Until the "end access" fence is signaled:
-  - The `GPUDevice` has exclusive, read/write access to the imported buffer
-  - All WebNN work involving the imported `MLTensor` is blocked
-- When the `GPUBuffer` is destroyed, the "end access" fence is signaled and the `MLTensor` may be used again by WebNN
+Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands which use the imported buffer. This avoids making WebGPU workloads explicitly dependent on WebNN operations, which may not be possible on platforms that [don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364) and/or don't express ML compute in terms of GPU commands.
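
A minimal sketch of one rental round trip, assuming (per the removed steps above) that destroying the imported buffer is what returns exclusive access to WebNN; `graph`, `mlTensor`, and `outputMlTensor` are placeholders:

```js
// Import: resolves once all queued ML work involving `mlTensor` has
// completed; WebGPU then holds exclusive, read/write access.
const externalBuffer = await gpuDevice.importExternalBuffer(mlTensor);

// ... encode and submit GPU commands which read from or write to
// `externalBuffer` ...

// Return: destroying the imported buffer hands the tensor back to WebNN,
// unblocking any ML work that depends on `mlTensor`.
externalBuffer.destroy();
mlContext.dispatch(graph, {'input': mlTensor}, {'output': outputMlTensor});
```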

 ### `compute()` vs. `dispatch()`
278280

@@ -282,11 +284,12 @@ It's possible `compute()` may have a performance advantage on some platforms for
 
 ### Open Questions
 
-- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations sufficient](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878)? See [#477](https://github.com/webmachinelearning/webnn/issues/477)
-- On non-UMA systems, does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749)
-- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readBuffer()` and `writeBuffer()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697).
+- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878) and losing the `MLContext` sufficient? See [#477](https://github.com/webmachinelearning/webnn/issues/477)
+- Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749)
+- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697).
 - If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired?
 - What are the usage flags of a `GPUBuffer` created from an `MLTensor`?
+- Is a sync variant of the `importExternalBuffer()` method feasible on platforms where the WebNN timeline _is_ the WebGPU timeline? (i.e. ML compute is expressed in terms of GPU commands on the same `GPUDevice`)
 
 ## Considered Alternatives
 
@@ -371,8 +374,8 @@ Many thanks for valuable feedback and advice from:
 typedef [EnforceRange] unsigned long MLTensorUsageFlags;
 
 namespace MLTensorUsage {
-  const MLFlagsConstant READ_FROM = 0x0001;
-  const MLFlagsConstant WRITE_TO = 0x0002;
+  const MLFlagsConstant READ = 0x0001;
+  const MLFlagsConstant WRITE = 0x0002;
   const MLFlagsConstant WEBGPU_INTEROP = 0x0004;
 };
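
Since these usage values are bitflags, they may be combined with bitwise OR when creating a tensor. A sketch for a tensor that script both writes to and reads back, with descriptor fields as in the earlier examples:

```js
const tensor = await mlContext.createTensor({
  dataType: "float32",
  shape: [1],
  // Script intends to both initialize this tensor and read the result.
  usage: MLTensorUsage.READ | MLTensorUsage.WRITE,
});
```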

@@ -393,12 +396,12 @@ interface MLTensor {
 partial interface MLContext {
   Promise<MLTensor> createTensor(MLTensorDescriptor descriptor);
 
-  void writeBuffer(MLTensor dstTensor, [AllowShared] ArrayBuffer srcData);
-  void writeBuffer(MLTensor dstTensor, [AllowShared] ArrayBufferView srcData);
+  void writeTensor(MLTensor tensor, [AllowShared] ArrayBuffer sourceData);
+  void writeTensor(MLTensor tensor, [AllowShared] ArrayBufferView sourceData);
 
-  Promise<ArrayBuffer> readBuffer(MLTensor srcTensor);
-  Promise<void> readBuffer(MLTensor srcTensor, [AllowShared] ArrayBuffer dstData);
-  Promise<void> readBuffer(MLTensor srcTensor, [AllowShared] ArrayBufferView dstData);
+  Promise<ArrayBuffer> readTensor(MLTensor tensor);
+  Promise<void> readTensor(MLTensor tensor, [AllowShared] ArrayBuffer outputData);
+  Promise<void> readTensor(MLTensor tensor, [AllowShared] ArrayBufferView outputData);
 
   void dispatch(MLGraph graph, MLNamedTensors inputs, MLNamedTensors outputs);
 };
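
As a usage sketch of the overloads above (`tensor` and `existingBuffer` are placeholders): `readTensor()` without a destination resolves with a newly-allocated `ArrayBuffer`, while the other overloads copy into a caller-provided buffer.

```js
// Resolves with a new ArrayBuffer containing a copy of the tensor's data.
const data = await mlContext.readTensor(tensor);

// Resolves once the tensor's data has been copied into `existingBuffer`.
await mlContext.readTensor(tensor, existingBuffer);
```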
@@ -414,7 +417,7 @@ dictionary GPUExternalBufferDescriptor {
 };
 
 partial interface GPUDevice {
-  GPUExternalBuffer importExternalBuffer(GPUExternalBufferDescriptor descriptor);
+  Promise<GPUExternalBuffer> importExternalBuffer(GPUExternalBufferDescriptor descriptor);
 }
 
 partial interface ML {
