Using `MLTensor` enables a programming model similar to [WebGPU's](https://www.w3.org/TR/webgpu/#programming-model). Tasks are posted to the ML context's [timeline](#timelines) and are executed as the ML context sees fit, so long as data dependencies are respected: each `MLTensor` is guaranteed to be modified in the order in which the methods using it are called from script. In the example below, the ML context should be working continuously from the `writeTensor()` call until the work for the last `readTensor()` completes. Better utilization of the ML context results in significantly better throughput.
```js
// Proposed approach to queue tasks to the ML context timeline

// Post a task to the ML context timeline to allocate and zero out a tensor,
// then return to script. The descriptor fields below are illustrative;
// `graphA` and `graphB` are assumed to be previously-built MLGraphs.
const tensorA = await mlContext.createTensor({
  dataType: 'float32',
  shape: [2, 2],
  usage: MLTensorUsage.WRITE,
});
const tensorB = await mlContext.createTensor({
  dataType: 'float32',
  shape: [2, 2],
  usage: MLTensorUsage.READ,
});
const tensorC = await mlContext.createTensor({
  dataType: 'float32',
  shape: [2, 2],
  usage: MLTensorUsage.READ,
});

// Post a task to write to `tensorA`. Note that script does not await
// completion of the write.
mlContext.writeTensor(tensorA, new Float32Array([1, 2, 3, 4]));

// Post tasks to execute two graphs. The second dispatch consumes the output
// of the first, so the ML context will execute them in order.
mlContext.dispatch(graphA, {'input': tensorA}, {'output': tensorB});
mlContext.dispatch(graphB, {'input': tensorB}, {'output': tensorC});

// Post tasks to read back the tensors. Each returned promise resolves only
// after all previously-posted tasks using the respective tensor complete.
const intermediateResult = await mlContext.readTensor(tensorB);
const finalResult = await mlContext.readTensor(tensorC);
```
Since the queueing mechanism respects data dependencies, chained inference allows an `MLTensor` to be passed as an output from one graph and then immediately as an input to the next. A collection of graphs and buffers may be repeatedly dispatched without the need for synchronization via script.
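For instance, two tensors may be ping-ponged across repeated dispatches with no intervening `await`s. This is a minimal sketch assuming a hypothetical `graph` whose input and output shapes match:

```js
// Repeatedly dispatch the graph, swapping the two tensors so that each
// dispatch consumes the previous dispatch's output. The ML context timeline
// enforces the data dependencies; script never blocks between dispatches.
let inputTensor = tensorA;
let outputTensor = tensorB;
for (let i = 0; i < 10; ++i) {
  mlContext.dispatch(graph, {'input': inputTensor}, {'output': outputTensor});
  [inputTensor, outputTensor] = [outputTensor, inputTensor];
}

// A single synchronization point at the end: read back the final result.
// After the last swap, `inputTensor` holds the last dispatch's output.
const finalResult = await mlContext.readTensor(inputTensor);
```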
A privacy-conscious user wants to perform real-time selfie segmentation of a video feed on their local device.
Currently, using WebNN for this task would require - for each frame - an expensive readback of `GPUBuffer` data to script, uploading the data to the ML context device (which may be the same GPU!), copying the result back to script, and then uploading the frame to be rendered back into a `GPUBuffer`.
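A sketch of this status quo for a single frame, using hypothetical buffer and graph names and the current script-buffer-based `compute()` method:

```js
// 1. Expensive readback of `GPUBuffer` data to script.
const staging = gpuDevice.createBuffer({
  size: frameByteLength,
  usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
});
const encoder = gpuDevice.createCommandEncoder();
encoder.copyBufferToBuffer(frameGpuBuffer, 0, staging, 0, frameByteLength);
gpuDevice.queue.submit([encoder.finish()]);
await staging.mapAsync(GPUMapMode.READ);
const frameData = new Float32Array(staging.getMappedRange().slice(0));
staging.unmap();

// 2. Upload the data to the ML context device (which may be the same GPU!),
//    run inference, and copy the result back to script.
const {outputs} = await mlContext.compute(
    graph, {'input': frameData}, {'output': new Float32Array(maskLength)});

// 3. Upload the frame to be rendered back into a `GPUBuffer`.
gpuDevice.queue.writeBuffer(renderGpuBuffer, 0, outputs['output']);
```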
An `MLTensor` may be imported into WebGPU, minimizing the number of buffer copies required to render the results of some ML compute. Zero-copy buffer sharing between the two APIs may be supported in some cases.
```js
// Create a couple MLTensors to be used to facilitate WebGPU interop.
// This sketch assumes `importExternalBuffer()` hangs off the `GPUDevice`;
// the descriptor fields, `graph`, and `video` are illustrative.
const mlTensor1 = await mlContext.createTensor({
  dataType: 'float32',
  shape: [1, height, width, 4],
  usage: MLTensorUsage.WEBGPU_INTEROP,
});
const mlTensor2 = await mlContext.createTensor({
  dataType: 'float32',
  shape: [1, height, width, 4],
  usage: MLTensorUsage.WEBGPU_INTEROP,
});

const applyEffectToFrame = async () => {
  // Rent out `mlTensor1` to WebGPU as a `GPUBuffer`.
  const tensorizedGpuBuffer = await gpuDevice.importExternalBuffer(mlTensor1);

  // Create a bind group for `tensorizedGpuBuffer`, create a command encoder,
  // etc. to "tensorize" the current video frame into `tensorizedGpuBuffer`.
  // ...
  gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]);

  // Return the buffer to WebNN.
  tensorizedGpuBuffer.destroy();

  // Perform some inference described by `graph` on the frame
  // (e.g. selfie segmentation).
  mlContext.dispatch(graph, {'input': mlTensor1}, {'output': mlTensor2});

  // Rent out `mlTensor2` to WebGPU as a `GPUBuffer`.
  const tensorizedGpuBufferAfterInference =
      await gpuDevice.importExternalBuffer(mlTensor2);

  // Create a bind group for `tensorizedGpuBufferAfterInference`,
  // create a command encoder, etc to feed `tensorizedGpuBufferAfterInference`
  // into a GPU shader which renders the result.
  // ...
  gpuDevice.queue.submit([renderCommandEncoder.finish()]);

  // Return the buffer to WebNN for use on the next frame.
  tensorizedGpuBufferAfterInference.destroy();

  // Process the next frame.
  video.requestVideoFrameCallback(applyEffectToFrame);
};

video.requestVideoFrameCallback(applyEffectToFrame);
```
The WebNN API requires the developer to declare how an `MLTensor` will be used (via `MLTensorUsage` flags).
For example [an `MLContext` may be created with a `GPUDevice`](https://www.w3.org/TR/webnn/#dom-ml-createcontext-gpudevice), and creating an `MLTensor` from this context with the `MLTensorUsage.WEBGPU_INTEROP` flag expresses a clear intention to share the tensor with the given `GPUDevice`. However, there is no guarantee that sharing this tensor with WebGPU will be zero-copy.
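For instance (a minimal sketch; the tensor descriptor fields are illustrative):

```js
// Create an ML context bound to an existing WebGPU device.
const mlContext = await navigator.ml.createContext(gpuDevice);

// Declare up front that this tensor will be shared with `gpuDevice`.
const sharedTensor = await mlContext.createTensor({
  dataType: 'float32',
  shape: [1024],
  usage: MLTensorUsage.WEBGPU_INTEROP,
});
```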
The `MLTensorUsage.READ` and `MLTensorUsage.WRITE` flags likewise are hints to the user agent indicating that the underlying data will be read from and written to, respectively, by script.
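Since `MLTensorUsageFlags` is a plain `unsigned long` (see the appendix), the flags can presumably be combined bitwise. A sketch of a tensor that script both writes to and reads back:

```js
// Hint that script will both upload to and read back from this tensor.
const stagingTensor = await mlContext.createTensor({
  dataType: 'float32',
  shape: [4],
  usage: MLTensorUsage.READ | MLTensorUsage.WRITE,
});

mlContext.writeTensor(stagingTensor, new Float32Array([1, 2, 3, 4]));
const resultBytes = await mlContext.readTensor(stagingTensor);  // ArrayBuffer
```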
### Importing an `MLTensor` to WebGPU
Any `MLTensor` created with the `MLTensorUsage.WEBGPU_INTEROP` flag may be imported into any `GPUDevice`. In the best case, this requires no data copies. If the underlying buffer backing the `MLTensor` is not accessible to the `GPUDevice`, this will require copying the contents of the `MLTensor` to a new buffer, then copying the contents of this buffer back to the `MLTensor` once WebGPU releases its handle to the buffer.
While an `MLTensor` is rented to a `GPUDevice`, the `GPUDevice` has exclusive, read/write access to the imported buffer. All WebNN work depending - directly or indirectly - on the imported `MLTensor` is blocked until the `GPUDevice` returns the tensor.
Importing and returning the `MLTensor` are each points of synchronization between the respective WebNN and WebGPU [timelines](https://www.w3.org/TR/webgpu/#programming-model-timelines). The `importExternalBuffer()` method is asynchronous to allow the user agent to await completion of WebNN operations before posting WebGPU commands with the imported buffer. This avoids making WebGPU workloads explicitly dependent on WebNN operations, which may not be possible on platforms which [don't support enqueuing GPU work that waits on a fence to be later signaled by the CPU](https://github.com/webmachinelearning/webnn/pull/754#discussion_r1740841364) and/or don't express ML compute in terms of GPU commands.
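A sketch of the handoff, assuming `importExternalBuffer()` is exposed on the `GPUDevice` and that destroying the imported `GPUBuffer` is what returns the tensor to WebNN:

```js
// Post WebNN work producing `mlTensor`. Script does not await completion.
mlContext.dispatch(graph, {'input': inputTensor}, {'output': mlTensor});

// Synchronization point #1: the promise resolves only once all
// previously-posted WebNN work using `mlTensor` has completed, so the WebGPU
// commands below never need to wait on WebNN work directly.
const gpuBuffer = await gpuDevice.importExternalBuffer(mlTensor);

// The GPUDevice now has exclusive, read/write access to the buffer.
gpuDevice.queue.submit([/* command buffers reading/writing `gpuBuffer` */]);

// Synchronization point #2: return the tensor to WebNN, unblocking
// subsequently-posted WebNN work that depends on `mlTensor`.
gpuBuffer.destroy();
mlContext.dispatch(nextGraph, {'input': mlTensor}, {'output': outputTensor});
```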
### `compute()` vs. `dispatch()`
It's possible `compute()` may have a performance advantage on some platforms for one-shot inference, where the buffer reuse enabled by `MLTensor` offers no benefit.
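For comparison, a sketch of a single inference expressed both ways; `compute()` takes and returns script-owned buffer views, while `dispatch()` operates on device-owned `MLTensor`s with readback as a separate, optional step:

```js
// compute(): input and output data round-trip through script-owned buffers
// on every call.
const {outputs} = await mlContext.compute(
    graph,
    {'input': new Float32Array([1, 2, 3, 4])},
    {'output': new Float32Array(4)});

// dispatch(): data stays on the ML context's timeline in MLTensors; read
// back only when script actually needs the result.
mlContext.dispatch(graph, {'input': inputTensor}, {'output': outputTensor});
const outputData = await mlContext.readTensor(outputTensor);
```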
### Open Questions
- How will errors be surfaced? Do we need a concept similar to [WebGPU's error scopes](https://www.w3.org/TR/webgpu/#error-scopes), or is [returning errors via a promise for select operations](https://github.com/webmachinelearning/webnn/issues/697#issuecomment-2195656878) and losing the `MLContext` sufficient? See [#477](https://github.com/webmachinelearning/webnn/issues/477)
- Does the user agent have enough information to appropriately allocate an `MLTensor` if an `MLDeviceType` is not used for creating an `MLContext`? See [#350](https://github.com/webmachinelearning/webnn/issues/350) and [#749](https://github.com/webmachinelearning/webnn/issues/749)
- Should the `dispatch()` method be a part of the `MLGraph` interface rather than `MLContext`? Should `readTensor()` and `writeTensor()` exist on an `MLTensor`? See [#697](https://github.com/webmachinelearning/webnn/issues/697).
- If an `MLContext` is not created from a `GPUDevice`, does there need to be some mechanism - above and beyond the `MLTensorUsage.WEBGPU_INTEROP` flag - for identifying the specific `GPUDevice` with which interop is desired?
- What are the usage flags of a `GPUBuffer` created from an `MLTensor`?
- Is a sync variant of the `importExternalBuffer()` method feasible on platforms where the WebNN timeline _is_ the WebGPU timeline? (i.e. ML compute is expressed in terms of GPU commands on the same `GPUDevice`)
## Considered Alternatives
## Acknowledgements

Many thanks for valuable feedback and advice from:

## Appendix

```webidl
typedef [EnforceRange] unsigned long MLTensorUsageFlags;
```