# `MLBuffer` Exploration

By @a-sully

## What is this?

This is an exploration - primarily via code samples of key use cases - of what
ML compute might look like using a device-agnostic buffer, as proposed in
[#482](https://github.com/webmachinelearning/webnn/issues/482) as `MLBuffer`.

This is not intended to be a formal explainer, though it could become one if
that would be useful. My intention here is to describe our priorities (such that
we can ensure the design satisfies these priorities), bring attention to some
open questions and related issues, toss around some ideas, and encourage
discussion about how this proposal will be specified.

## Goals

- Minimize round-trips to JavaScript/CPU needed for synchronization of work on
  buffers which may not live on the CPU
- Minimize buffer copies
  - In particular, we should support zero-copy buffer sharing between WebNN and
    WebGPU if this is supported by the underlying hardware
- Support the XPU (i.e. CPU, GPU, NPU, TPU, etc.) with one consistent API
- Follow recommended [design
  principles](https://w3ctag.github.io/design-principles/)
  - In my opinion, this likely entails [mirroring WebGPU's design
    decisions](https://w3ctag.github.io/design-principles/#naming-consultation),
    where appropriate
## Overarching Questions

Many of these questions are not _specific_ to `MLBuffer`, but are important
enough that their answers will strongly influence the shape of the `MLBuffer`
proposal.

- What are WebNN's timelines and how do they interact with WebGPU's timelines?
  See [#529](https://github.com/webmachinelearning/webnn/issues/529)
- Where will an `MLBuffer`'s memory be allocated on systems where an `MLContext`
  may not be as closely tied to a given physical device as an
  [`IDMLDevice`](https://learn.microsoft.com/en-us/windows/win32/api/directml/nn-directml-idmldevice)?
  See [#350](https://github.com/webmachinelearning/webnn/issues/350)
- How will errors be surfaced? See
  [#477](https://github.com/webmachinelearning/webnn/issues/477). Do we need a
  concept similar to [WebGPU's error
  scopes](https://www.w3.org/TR/webgpu/#error-scopes)?
- Must an `MLBuffer` only be used with the `MLContext` it was created from?
  (or `MLGraph`s created from that `MLContext`, and so forth)
- If what we're building is a device-agnostic buffer, it will surely be used for
  things other than ML (in the long run). In the spirit of
  [future-proofing](https://w3ctag.github.io/design-principles/#naming-future-proofing),
  should we name it something other than `MLBuffer`?
## Use Case: Chained Inference

Here's a code sample showing how `MLBuffer`s can be used for chained inference
and then read back to an `ArrayBuffer`:

```js
// Create new MLBuffers to be used for chained inference.
const inputMlBuffer = mlContext.createBuffer({inputSize});
const intermediateMlBuffer = mlContext.createBuffer({intermediateSize});
const outputMlBuffer = mlContext.createBuffer({outputSize});

// Copy the contents of an ArrayBuffer into an MLBuffer, to be later used as inputs.
mlContext.writeBuffer(
  inputMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/someJsArrayBuffer,
);

// Perform some ✧*✧* machine learning *✧*✧ described by `graph`.
mlContext.dispatch(
  graph,
  /*inputs=*/{buffer: inputMlBuffer},
  /*outputs=*/{buffer: intermediateMlBuffer},
);

// Feed the output of one execution as the input to the next. Chained inference!
mlContext.dispatch(
  anotherGraph,
  /*inputs=*/{buffer: intermediateMlBuffer},
  /*outputs=*/{buffer: outputMlBuffer},
);

// Read back the results to script.
const resultBuffer = await outputMlBuffer.mapAsync();
```

Let's dive into what happens at each of these steps:

### `MLBuffer` creation

```js
const inputMlBuffer = mlContext.createBuffer({inputSize});
```

#### How it works:

- Enqueue a request on some WebNN timeline to allocate memory on the device
  associated with `mlContext`
- The memory allocation will be zeroed (as it is for [WebGPU's `createBuffer()`
  method](https://www.w3.org/TR/webgpu/#dom-gpudevice-createbuffer))

#### Questions:

- Can an `MLBuffer`'s size always be known at the time of buffer allocation?
  - In this case and many other cases it seems possible; it's presumably a
    function of the model and/or video input. But since WebNN always rents a
    buffer out to WebGPU - never the other way around - this introduces a
    constraint that the size of an `MLBuffer` must always be known at the time
    of buffer allocation
- When will `inputMlBuffer` be deallocated if `destroy()` is not called?

### Writing to an `MLBuffer`

```js
mlContext.writeBuffer(
  inputMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/someJsArrayBuffer,
);
```

#### How it works:

- Enqueue a request on some WebNN timeline to copy the contents of
  `someJsArrayBuffer` to `inputMlBuffer`. This is very similar to [the
  corresponding WebGPU
  method](https://www.w3.org/TR/webgpu/#dom-gpuqueue-writebuffer), though the
  implementation details will vary depending on which device `inputMlBuffer` is
  allocated on. For example, if allocated on:
  - a CPU, the buffer contents will be copied directly (i.e. `memcpy()`)
  - a GPU, the behavior will likely match `GPUQueue.writeBuffer()`. On UMA
    systems, a `memcpy()` might suffice. Other implementations may use a hidden
    "upload" buffer to get the data onto the GPU. This implies two copies:\
    *`ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*
  - an XPU... it depends!
- `someJsArrayBuffer` is unaffected, since the bytes are copied
  - Note that the aforementioned copies are _in addition_ to any copies needed
    to get the data into the `ArrayBuffer` in the first place. If the data is
    weights being read from a `File`, for example, this will require first
    copying the bytes from the `File` into the `ArrayBuffer`. This means
    **copying the weights into GPU-accessible memory could take as many as four
    copies!**
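
To make the copy count concrete, here is a hedged sketch (plain JavaScript, no WebNN API involved) that walks the worst-case non-UMA path hop by hop. Each `copy()` call stands in for one full copy of the bytes; a fourth copy can hide inside the `File` read itself (e.g. a cross-process transfer), which is how the count can reach four:

```js
// Each hop is modeled as an explicit byte copy. The names are illustrative
// stand-ins for the buffers described above, not real API objects.
let copies = 0;
const copy = (src) => { copies += 1; return Uint8Array.from(src); };

const fileBytes = Uint8Array.of(1, 2, 3, 4); // e.g. weights sitting in a File
const arrayBuffer = copy(fileBytes);     // File → ArrayBuffer
const uploadBuffer = copy(arrayBuffer);  // writeBuffer(): ArrayBuffer → "upload" buffer
const deviceBuffer = copy(uploadBuffer); // "upload" buffer → high-GPU-bandwidth buffer

console.log(copies); // 3 copies visible at this level
```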

#### Questions:

- Should there be a corresponding
  [`mappedAtCreation`](https://www.w3.org/TR/webgpu/#dom-gpubufferdescriptor-mappedatcreation)
  capability?
  - If the data is not already in an `ArrayBuffer`, this eliminates the data
    copy into an `ArrayBuffer` altogether, since we could write to the "upload"
    buffer directly:
    ```js
    const mlBuffer = mlContext.createBuffer({size, mappedAtCreation: true});

    const floatArray = new Float32Array(mlBuffer.getMappedRange());

    // Write to `floatArray`
    // ...

    // Write the buffer contents to the XPU
    mlBuffer.unmap();
    ```
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*\
    After: *some source → "upload" buffer → high-GPU-bandwidth buffer*
- Should there be the equivalent of
  [`MAP_WRITE`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_write) +
  [`COPY_SRC`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_src) for
  `MLBuffer`s?
  - If we know the usage of `inputMlBuffer` (e.g. that it's read-only by WebNN)
    then we may be able to eliminate the data copy from the "upload" buffer to
    the high-GPU-bandwidth buffer in the non-UMA case:
    ```js
    const mlBuffer = mlContext.createBuffer({size, usage: MAP_WRITE | INPUT_ONLY});
    ```
    This may not make a difference for DirectML, which appears to require [bound
    resources](https://learn.microsoft.com/en-us/windows/win32/api/directml/nf-directml-idmlbindingtable-bindpersistentresource)
    to use `D3D12_HEAP_TYPE_DEFAULT`, but it could eliminate a copy on other
    systems. I'm not familiar enough with other systems to know the answer here!
  - Combining this with the above techniques brings (as many as) 4 copies down
    to (as few as) 2:\
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*\
    After: *some source → "upload" buffer*

### Execute an `MLGraph`

```js
mlContext.dispatch(
  graph,
  /*inputs=*/{buffer: inputMlBuffer},
  /*outputs=*/{buffer: intermediateMlBuffer},
);
```

#### How it works:

- Enqueues a request to compute the graph onto some WebNN timeline
- Execution cannot start until all input and output `MLBuffer`s are available
- All input and output `MLBuffer`s are unavailable while execution is in
  progress
- All work submitted after this `dispatch()` call which relies on an input or
  output `MLBuffer` will be queued behind this execution
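
The availability rules above can be sketched with a toy synchronous model (hypothetical - `FakeBuffer` and `fakeDispatch` are illustrative, not real API): each buffer records the tick at which it becomes available again, and a dispatch cannot start before every buffer it touches is available:

```js
class FakeBuffer {
  constructor() { this.availableAt = 0; } // available from tick 0
}

function fakeDispatch(buffers) {
  // Execution cannot start until all input and output buffers are available.
  const start = Math.max(...buffers.map((buffer) => buffer.availableAt));
  const end = start + 1; // pretend every dispatch takes one tick
  // All buffers are unavailable while execution is in progress, so any
  // later work that touches them queues behind this execution.
  for (const buffer of buffers) buffer.availableAt = end;
  return { start, end };
}

const input = new FakeBuffer();
const intermediate = new FakeBuffer();
const output = new FakeBuffer();

const first = fakeDispatch([input, intermediate]);
// Shares `intermediate` with the first dispatch, so it queues behind it.
const second = fakeDispatch([intermediate, output]);

console.log(first.start, second.start); // 0 1
```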

#### Questions:

- This approach is flexible enough to allow for graph execution on all backends.
  Do we need a separate `compute()` method?
- Should this method be on the `MLGraph` (related to
  [#303](https://github.com/webmachinelearning/webnn/issues/303))? Is there a
  use case not satisfied by the following?
  ```js
  graph.dispatch(
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
  );
  ```
- Is it valid to pass the same `MLBuffer` as both an input and an output of the
  same `dispatch()` call? e.g.
  ```js
  graph.dispatch(
    /*inputs=*/{buffer: someMlBuffer},
    /*outputs=*/{buffer: someMlBuffer},
  );
  ```

### Read back data from an `MLBuffer`

```js
const resultBuffer = await outputMlBuffer.mapAsync();
```

#### How it works:

- After the completion of all currently-enqueued operations that use
  `outputMlBuffer`, WebNN will copy the contents of `outputMlBuffer` to
  `resultBuffer`. This is very similar to
  [`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync),
  with a key difference being that WebGPU only allows `mapAsync()` on buffers
  which have the `MAP_READ` usage flag. In this case, if using a GPU, we may
  need to create an intermediate "readback" buffer to facilitate the transfer.
  This may require two copies:\
  *high-GPU-bandwidth buffer → "readback" buffer → `ArrayBuffer`*

#### Questions:

- What should this method be called? I've proposed `mapAsync()` here to mirror
  WebGPU since the behavior is very similar.
- Should there be the equivalent of
  [`MAP_READ`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_read) +
  [`COPY_DST`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_dst) for
  `MLBuffer`s?
  - If we know the usage of `outputMlBuffer` (e.g. that it's
    write-once by WebNN) then we could eliminate the data copy from the
    high-GPU-bandwidth buffer to the "readback" buffer in the non-UMA case:
    ```js
    // The buffer may be allocated on a "readback" buffer
    const mlBuffer = mlContext.createBuffer({size, usage: MAP_READ | OUTPUT_ONLY});

    // `mlBuffer` may be used as an output to MLGraph execution
    // ...

    // Read back with fewer data copies!
    const resultBuffer = await mlBuffer.mapAsync();
    ```
    Again, this will not help DirectML and may or may not help other systems.

## Use Case: WebGPU Interop

Here's a code example in which WebNN performs selfie segmentation on a video
frame without needing round-trips to JavaScript to synchronize WebNN and WebGPU
compute:

```js
const applyEffectToFrame = () => {
  const gpuVideoTexture = gpuDevice.importExternalTexture({source: video});

  // Create a new MLBuffer to be used to facilitate WebGPU interop.
  //
  // Note that a more optimized implementation might allocate this buffer - or a
  // ring of buffers - ahead of time such that memory can be reused.
  const tensorizedMlBuffer = mlContext.createBuffer({size: tensorizedBufferSize});

  // Rent out the MLBuffer to WebGPU.
  const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `gpuVideoTexture`, create a command encoder, etc.
  // to "tensorize" `gpuVideoTexture` and store the result in `tensorizedGpuBuffer`
  // ...

  gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]);

  // Return the buffer to WebNN.
  tensorizedMlBuffer.unmapFromGpuBuffer();

  // Perform some inference described by `graph` on the frame
  // (e.g. selfie segmentation)
  mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: tensorizedMlBuffer},
    /*outputs=*/{buffer: tensorizedMlBuffer},
  );

  // Rent the MLBuffer back out to WebGPU.
  const tensorizedGpuBufferAfterInference = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `tensorizedGpuBufferAfterInference`,
  // create a command encoder, etc. to feed `tensorizedGpuBufferAfterInference`
  // into a GPU shader which may blur the frame or replace background sections,
  // and then render the result
  // ...

  gpuDevice.queue.submit([texturizeAndRenderCommandEncoder.finish()]);

  // Call this method for each frame.
  video.requestVideoFrameCallback(applyEffectToFrame);
};
```

Let's again dive into what happens at each of these steps, skipping the steps
already covered above:

### Rent out an `MLBuffer` to WebGPU

```js
const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);
```

#### How it works:

- Two fences are created:
  1. a "start access" fence which is to be signaled by WebNN and waited on by
     WebGPU
  2. an "end access" fence which is to be signaled by WebGPU and waited on by
     WebNN
- `gpuDevice` enqueues a command to its `GPUQueue` to wait for the "start
  access" fence to be signaled
- WebNN (on some queue or timeline yet to be specified) will signal the "start
  access" fence after the completion of all currently-enqueued operations that
  use `tensorizedMlBuffer`. This is very similar to how `mapAsync()` works
  - In this case, there is only one currently-enqueued operation:
    `MLContext.createBuffer()`
  - In the second `mapAsGpuBuffer()` call, the "start access" fence will not be
    signaled by WebNN until the `dispatch()` call is complete. This implicitly
    blocks execution of the commands in `texturizeAndRenderCommandEncoder` that
    are enqueued to WebGPU until WebNN is finished with `tensorizedMlBuffer`
- WebNN will wait for the "end access" fence to be signaled. In the meantime,
  all work involving `tensorizedMlBuffer` is blocked
- `gpuDevice` has exclusive, read/write access to this memory for as long as the
  "end access" fence is not signaled
- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this
  will be a zero-copy mapping. Otherwise a new buffer will be allocated on
  `gpuDevice` and the contents of `tensorizedMlBuffer` will be copied into this
  buffer
- The memory backing `tensorizedMlBuffer` becomes inaccessible to WebNN (or
  script, or anything else), regardless of whether a copy is made
  - Ideally these states and their transitions can be expressed similarly to a
    `GPUBuffer`'s [internal
    state](https://www.w3.org/TR/webgpu/#buffer-internals-state)
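
The fence dance above can be sketched as two in-order command queues plus a tiny scheduler (entirely hypothetical - the queue contents and fence names are illustrative stand-ins for the timelines described above): an op may wait on one fence and signal another, and the scheduler runs the front of each queue whenever its wait, if any, has been signaled:

```js
const fences = { startAccess: false, endAccess: false };
const log = [];

const op = (name, { waitFor = null, signals = null } = {}) =>
    ({ name, waitFor, signals });

// WebNN's queue: finish its own work on the buffer, hand it over, then
// block on the end-access fence before touching the buffer again.
const webnnQueue = [
  op("webnn: createBuffer()"),
  op("webnn: signal start-access", { signals: "startAccess" }),
  op("webnn: dispatch()", { waitFor: "endAccess" }),
];
// WebGPU's queue: wait for the handoff, do GPU work, return the buffer.
const webgpuQueue = [
  op("webgpu: tensorize", { waitFor: "startAccess" }),
  op("webgpu: signal end-access", { signals: "endAccess" }),
];

let progressed = true;
while (progressed) {
  progressed = false;
  for (const queue of [webnnQueue, webgpuQueue]) {
    while (queue.length && (!queue[0].waitFor || fences[queue[0].waitFor])) {
      const { name, signals } = queue.shift();
      log.push(name);
      if (signals) fences[signals] = true;
      progressed = true;
    }
  }
}

console.log(log.join(" | "));
```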

#### Questions:

- What are the usage flags of `tensorizedGpuBuffer`?
- While `tensorizedMlBuffer` is rented out to WebGPU as `tensorizedGpuBuffer`:
  - What happens if `destroy()` is called on `tensorizedMlBuffer`?
  - What happens if `destroy()` is called on `tensorizedGpuBuffer`?

### Return a rented-out `MLBuffer` back to WebNN

```js
tensorizedMlBuffer.unmapFromGpuBuffer();
```

#### How it works:

- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this
  will be a zero-copy unmapping. Otherwise the contents of `tensorizedGpuBuffer`
  will be copied into `tensorizedMlBuffer`
- Informs `gpuDevice` to signal the "end access" fence created in the
  `mapAsGpuBuffer()` method after the completion of currently-enqueued
  operations that use `tensorizedGpuBuffer`. This is very similar to how
  `mapAsync()` works
- The WebNN timeline receives the signal and may resume execution
- WebNN has exclusive, read/write access to this memory until further notice
- `tensorizedGpuBuffer` is
  [expired](https://gpuweb.github.io/gpuweb/#dom-gpuexternaltexture-expired-slot)

#### Questions:

- What happens to `tensorizedMlBuffer` if this method is never called?