# `MLBuffer` Exploration

By @a-sully

## What is this?

This is an exploration - primarily via code samples of key use cases - of what
ML compute might look like using a device-agnostic buffer, as proposed in
[#482](https://github.com/webmachinelearning/webnn/issues/482) as `MLBuffer`.

This is not intended to be a formal explainer, though it could become one if
that would be useful. My intention here is to describe our priorities (such that
we can ensure the design satisfies these priorities), bring attention to some
open questions and related issues, toss around some ideas, and encourage
discussion about how this proposal will be specified.

## Goals

- Minimize the round-trips to JavaScript/CPU needed to synchronize work on
  buffers which may not live on the CPU
- Minimize buffer copies
  - In particular, we should support zero-copy buffer sharing between WebNN and
    WebGPU if this is supported by the underlying hardware
- Support the XPU (i.e. CPU, GPU, NPU, TPU, etc.) with one consistent API
- Follow recommended [design
  principles](https://w3ctag.github.io/design-principles/)
  - In my opinion, this likely entails [mirroring WebGPU's design
    decisions](https://w3ctag.github.io/design-principles/#naming-consultation),
    where appropriate

## Overarching Questions

Many of these questions are not _specific_ to `MLBuffer`, but are important
enough that their answers will strongly influence the shape of the `MLBuffer`
proposal.

- What are WebNN's timelines and how do they interact with WebGPU's timelines?
  See [#529](https://github.com/webmachinelearning/webnn/issues/529)
- Where will an `MLBuffer`'s memory be allocated on systems where an `MLContext`
  may not be as closely tied to a given physical device as an
  [`IDMLDevice`](https://learn.microsoft.com/en-us/windows/win32/api/directml/nn-directml-idmldevice)?
  See [#350](https://github.com/webmachinelearning/webnn/issues/350)
- How will errors be surfaced? See
  [#477](https://github.com/webmachinelearning/webnn/issues/477). Do we need a
  concept similar to [WebGPU's error
  scopes](https://www.w3.org/TR/webgpu/#error-scopes)? (See the sketch after
  this list)
- Must an `MLBuffer` only be used with the `MLContext` it was created from?
  (or `MLGraph`s created from that `MLContext`, and so forth)
- If what we're building is a device-agnostic buffer, it will surely be used for
  things other than ML (in the long run). In the spirit of
  [future-proofing](https://w3ctag.github.io/design-principles/#naming-future-proofing),
  should we name it something other than `MLBuffer`?

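To make the error-scopes question concrete, here's a minimal sketch of what
error handling could look like if WebNN mirrored WebGPU's error scopes. The
`pushErrorScope()`/`popErrorScope()` methods on `MLContext` are hypothetical,
not proposed API:

```js
// Hypothetical API mirroring WebGPU's GPUDevice error scopes.
mlContext.pushErrorScope('validation');

const mlBuffer = mlContext.createBuffer({size: someSize});

// Resolves with an error if buffer creation failed, or null otherwise.
const error = await mlContext.popErrorScope();
if (error) {
  console.warn(`MLBuffer creation failed: ${error.message}`);
}
```
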
## Use Case: Chained Inference

Here's a code sample showing how `MLBuffer`s can be used for chained inference
and then read back to an `ArrayBuffer`:

```js
// Create new MLBuffers to be used for chained inference.
const inputMlBuffer = mlContext.createBuffer({size: inputSize});
const intermediateMlBuffer = mlContext.createBuffer({size: intermediateSize});
const outputMlBuffer = mlContext.createBuffer({size: outputSize});

// Copy the contents of an ArrayBuffer into an MLBuffer, to be later used as inputs.
mlContext.writeBuffer(
  inputMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/someJsArrayBuffer,
);

// Perform some ✧*✧* machine learning *✧*✧ described by `graph`.
mlContext.dispatch(
  graph,
  /*inputs=*/{buffer: inputMlBuffer},
  /*outputs=*/{buffer: intermediateMlBuffer},
);

// Feed the output of one execution as the input to the next. Chained inference!
mlContext.dispatch(
  anotherGraph,
  /*inputs=*/{buffer: intermediateMlBuffer},
  /*outputs=*/{buffer: outputMlBuffer},
);

// Read back the results to script.
const resultBuffer = await outputMlBuffer.mapAsync();
```

Let's dive into what happens at each of these steps:

### `MLBuffer` creation

```js
const inputMlBuffer = mlContext.createBuffer({size: inputSize});
```

#### How it works:

- Enqueue a request on some WebNN timeline to allocate memory on the device
  associated with `mlContext`
- The memory allocation will be zeroed (as it is for [WebGPU's `createBuffer()`
  method](https://www.w3.org/TR/webgpu/#dom-gpudevice-createbuffer))

#### Questions:

- Can an `MLBuffer`'s size always be known at the time of buffer allocation?
  - In this case and many other cases it seems possible; it's presumably a
    function of the model and/or video input. But since WebNN always rents a
    buffer to WebGPU - never the other way around - this introduces a constraint
    that the size of an `MLBuffer` must always be known at the time of buffer
    allocation
- When will `inputMlBuffer` be deallocated if `destroy()` is not called? (See
  the sketch below)

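For illustration, here's what explicit lifetime management might look like,
assuming `MLBuffer` has a `destroy()` method analogous to
[`GPUBuffer.destroy()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-destroy)
(the WebGPU interop questions below assume the same). Absent an explicit
`destroy()` call, deallocation presumably falls to garbage collection, whose
timing is unobservable:

```js
const scratchMlBuffer = mlContext.createBuffer({size: scratchSize});

// ... use `scratchMlBuffer` in one or more dispatch() calls ...

// Eagerly release the device allocation rather than waiting for GC.
scratchMlBuffer.destroy();
```
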
### Writing to an `MLBuffer`

```js
mlContext.writeBuffer(
  inputMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/someJsArrayBuffer,
);
```

#### How it works:

- Enqueue a request on some WebNN timeline to copy the contents of
  `someJsArrayBuffer` to `inputMlBuffer`. This is very similar to [the
  corresponding WebGPU
  method](https://www.w3.org/TR/webgpu/#dom-gpuqueue-writebuffer), though the
  implementation details will vary depending on which device `inputMlBuffer` is
  allocated on. For example, if allocated on:
  - a CPU, the buffer contents will be copied directly (i.e. `memcpy()`)
  - a GPU, the behavior will likely match `GPUQueue.writeBuffer()`. On UMA
    systems, a `memcpy()` might suffice. Other implementations may use a hidden
    "upload" buffer to get the data onto the GPU. This implies two copies:\
    *`ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*
  - an XPU... it depends!
- `someJsArrayBuffer` is unaffected, since the bytes are copied
- Note that the aforementioned copies are _in addition_ to any copies needed
  to get the data into the `ArrayBuffer` in the first place. If the data is
  weights being read from a `File`, for example, this will require first
  copying the bytes from the `File` into the `ArrayBuffer`. This means
  **copying the weights into GPU-accessible memory could take as many as four
  copies!** (See the sketch below)

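Here's a sketch of that worst-case path, under one plausible accounting of the
copies (`weightsFile` and `weightsMlBuffer` are illustrative names):

```js
// Copies 1 and 2: File bytes → browser-internal transfer → ArrayBuffer.
const weightsArrayBuffer = await weightsFile.arrayBuffer();

// Copy 3: ArrayBuffer → hidden "upload" buffer.
mlContext.writeBuffer(
  weightsMlBuffer,
  /*dstOffset=*/0,
  /*srcData=*/weightsArrayBuffer,
);

// Copy 4 happens internally: "upload" buffer → high-GPU-bandwidth buffer.
```
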
#### Questions:

- Should there be a corresponding
  [`mappedAtCreation`](https://www.w3.org/TR/webgpu/#dom-gpubufferdescriptor-mappedatcreation)
  capability?
  - If the data is not already in an `ArrayBuffer`, this eliminates the data
    copy into an `ArrayBuffer` altogether, since we could write to the "upload"
    buffer directly:
    ```js
    const mlBuffer = mlContext.createBuffer({size, mappedAtCreation: true});

    const floatArray = new Float32Array(mlBuffer.getMappedRange());

    // Write to `floatArray`
    // ...

    // Write the buffer contents to the XPU
    mlBuffer.unmap();
    ```
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*\
    After: *some source → "upload" buffer → high-GPU-bandwidth buffer*
- Should there be the equivalent of
  [`MAP_WRITE`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_write) +
  [`COPY_SRC`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_src) for
  `MLBuffer`s?
  - If we know the usage of `inputMlBuffer` (e.g. that it's read-only by WebNN)
    then we may be able to eliminate the data copy from the "upload" buffer to
    the high-GPU-bandwidth buffer in the non-UMA case:
    ```js
    const mlBuffer = mlContext.createBuffer({size, usage: MAP_WRITE | INPUT_ONLY});
    ```
    This may not make a difference for DirectML, which appears to require [bound
    resources](https://learn.microsoft.com/en-us/windows/win32/api/directml/nf-directml-idmlbindingtable-bindpersistentresource)
    to use `D3D12_HEAP_TYPE_DEFAULT`, but it could eliminate a copy on other
    systems. I'm not familiar enough with other systems to know the answer here!
  - Combining this with the above techniques brings (as many as) 4 copies down
    to (as few as) 2:\
    Before: *some source → `ArrayBuffer` → "upload" buffer → high-GPU-bandwidth buffer*\
    After: *some source → "upload" buffer*

### Execute an `MLGraph`

```js
mlContext.dispatch(
  graph,
  /*inputs=*/{buffer: inputMlBuffer},
  /*outputs=*/{buffer: intermediateMlBuffer},
);
```

#### How it works:

- Enqueues a request to compute the graph onto some WebNN timeline
- Execution cannot start until all input and output `MLBuffer`s are available
- All input and output `MLBuffer`s are unavailable while execution is in
  progress
- All work submitted after this `dispatch()` call which relies on an input or
  output `MLBuffer` will be queued behind this execution (see the sketch below)

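A minimal sketch of these ordering guarantees in action; no `await`s are needed
between calls that touch the same `MLBuffer` (`bufferA` through `bufferC` are
illustrative):

```js
mlContext.writeBuffer(bufferA, /*dstOffset=*/0, /*srcData=*/inputData);

// Queued behind the writeBuffer() above, since it reads `bufferA`.
mlContext.dispatch(graph, /*inputs=*/{buffer: bufferA}, /*outputs=*/{buffer: bufferB});

// Queued behind the first dispatch(), since it reads `bufferB`.
mlContext.dispatch(anotherGraph, /*inputs=*/{buffer: bufferB}, /*outputs=*/{buffer: bufferC});

// Resolves only once both dispatches above have completed.
const result = await bufferC.mapAsync();
```
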
#### Questions:

- This approach is flexible enough to allow for graph execution on all backends.
  Do we need a separate `compute()` method?
- Should this method be on the `MLGraph` (related to
  [#303](https://github.com/webmachinelearning/webnn/issues/303))? Is there a
  use case not satisfied by the following?
  ```js
  graph.dispatch(
    /*inputs=*/{buffer: inputMlBuffer},
    /*outputs=*/{buffer: intermediateMlBuffer},
  );
  ```
- Is it valid to pass the same `MLBuffer` as both an input and output of the
  same `dispatch()` call? e.g.
  ```js
  graph.dispatch(
    /*inputs=*/{buffer: someMlBuffer},
    /*outputs=*/{buffer: someMlBuffer},
  );
  ```

### Read back data from an `MLBuffer`

```js
const resultBuffer = await outputMlBuffer.mapAsync();
```

#### How it works:

- After the completion of all currently-enqueued operations that use
  `outputMlBuffer`, WebNN will copy the contents of `outputMlBuffer` to
  `resultBuffer`. This is very similar to
  [`GPUBuffer.mapAsync()`](https://www.w3.org/TR/webgpu/#dom-gpubuffer-mapasync),
  with a key difference being that WebGPU only allows `mapAsync()` on buffers
  which have the `MAP_READ` usage flag. In this case, if using a GPU, we may
  need to create an intermediate "readback" buffer to facilitate the transfer.
  This may require two copies:\
  *high-GPU-bandwidth buffer → "readback" buffer → `ArrayBuffer`*

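Since `resultBuffer` is a plain `ArrayBuffer`, reading the results is just a
matter of viewing it with the appropriate typed array (`Float32Array` here is
an assumption about the graph's output data type):

```js
const outputTensor = new Float32Array(resultBuffer);
```
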
#### Questions:

- What should this method be called? I've proposed `mapAsync()` here to mirror
  WebGPU since the behavior is very similar.
- Should there be the equivalent of
  [`MAP_READ`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-map_read) +
  [`COPY_DST`](https://www.w3.org/TR/webgpu/#dom-gpubufferusage-copy_dst) for
  `MLBuffer`s?
  - If we know the usage of `outputMlBuffer` (e.g. that it's
    write-once by WebNN) then we could eliminate the data copy from the
    high-GPU-bandwidth buffer to the "readback" buffer in the non-UMA case:
    ```js
    // The buffer may be allocated on a "readback" buffer
    const mlBuffer = mlContext.createBuffer({size, usage: MAP_READ | OUTPUT_ONLY});

    // `mlBuffer` may be used as an output to MLGraph execution
    // ...

    // Read back with fewer data copies!
    const resultBuffer = await mlBuffer.mapAsync();
    ```
    Again, this will not help DirectML and may or may not help other systems.

## Use Case: WebGPU Interop

Here's a code example in which WebNN performs selfie segmentation on a video
frame without needing round-trips to JavaScript to synchronize WebNN and WebGPU
compute:

```js
const applyEffectToFrame = () => {
  const gpuVideoTexture = gpuDevice.importExternalTexture({source: video});

  // Create a new MLBuffer to be used to facilitate WebGPU interop.
  //
  // Note that a more optimized implementation might allocate this buffer - or a
  // ring of buffers - ahead of time such that memory can be reused.
  const tensorizedMlBuffer = mlContext.createBuffer({size: tensorizedBufferSize});

  // Rent out the MLBuffer to WebGPU.
  const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `gpuVideoTexture`, create a command encoder, etc.
  // to "tensorize" `gpuVideoTexture` and store the result in `tensorizedGpuBuffer`
  // ...

  gpuDevice.queue.submit([tensorizationCommandEncoder.finish()]);

  // Return the buffer to WebNN.
  tensorizedMlBuffer.unmapFromGpuBuffer();

  // Perform some inference described by `graph` on the frame
  // (e.g. selfie segmentation)
  mlContext.dispatch(
    graph,
    /*inputs=*/{buffer: tensorizedMlBuffer},
    /*outputs=*/{buffer: tensorizedMlBuffer},
  );

  // Rent the MLBuffer back out to WebGPU.
  const tensorizedGpuBufferAfterInference = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);

  // Create a bind group for `tensorizedGpuBufferAfterInference`,
  // create a command encoder, etc. to feed `tensorizedGpuBufferAfterInference`
  // into a GPU shader which may blur the frame or replace background sections
  // and then render the result
  // ...

  gpuDevice.queue.submit([texturizeAndRenderCommandEncoder.finish()]);

  // Call this method for each frame.
  video.requestVideoFrameCallback(applyEffectToFrame);
};
```

Let's again dive into what happens at each of these steps, skipping those
already covered above:

### Rent out an `MLBuffer` to WebGPU

```js
const tensorizedGpuBuffer = tensorizedMlBuffer.mapAsGpuBuffer(gpuDevice);
```

#### How it works:

- Two fences are created:
  1. a "start access" fence which is to be signaled by WebNN and waited on by
     WebGPU
  2. an "end access" fence which is to be signaled by WebGPU and waited on by
     WebNN
- `gpuDevice` enqueues a command to its `GPUQueue` to wait for the "start
  access" fence to be signaled
- WebNN (on some queue or timeline yet to be specified) will signal the "start
  access" fence after the completion of all currently-enqueued operations that
  use `tensorizedMlBuffer`. This is very similar to how `mapAsync()` works
  - In this case, there is only one currently-enqueued operation:
    `MLContext.createBuffer()`
  - In the latter `mapAsGpuBuffer()` call, the "start access" fence will not be
    signaled by WebNN until the `dispatch()` call is complete. This implicitly
    blocks execution of the commands in `texturizeAndRenderCommandEncoder` that
    are enqueued to WebGPU until WebNN is finished with `tensorizedMlBuffer`
- WebNN will wait for the "end access" fence to be signaled. In the meantime,
  all work involving `tensorizedMlBuffer` is blocked
- `gpuDevice` has exclusive, read/write access to this memory for as long as the
  "end access" fence is not signaled
- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this
  will be a zero-copy mapping. Otherwise a new buffer will be allocated on
  `gpuDevice` and the contents of `tensorizedMlBuffer` will be copied into this
  buffer
- The memory backing `tensorizedMlBuffer` becomes inaccessible to WebNN (or
  script, or anything else), regardless of whether a copy is made.
- Ideally these states and their transitions can be expressed similarly to a
  `GPUBuffer`'s [internal
  state](https://www.w3.org/TR/webgpu/#buffer-internals-state) (see the sketch
  below)

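For concreteness, here's one way an implementation might track those states,
loosely mirroring `GPUBuffer`'s internal state. These names are illustrative
bookkeeping, not proposed API:

```js
// Hypothetical internal states for an MLBuffer.
const MLBufferInternalState = {
  AVAILABLE: 'available',  // WebNN may enqueue operations using this buffer
  PENDING: 'pending',      // enqueued WebNN work has not yet completed
  RENTED: 'rented',        // a GPUDevice holds exclusive read/write access
  DESTROYED: 'destroyed',  // destroy() was called; memory is reclaimed
};
```
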
#### Questions:

- What are the usage flags of `tensorizedGpuBuffer`?
- While `tensorizedMlBuffer` is rented out to WebGPU as `tensorizedGpuBuffer`:
  - What happens if `destroy()` is called on `tensorizedMlBuffer`?
  - What happens if `destroy()` is called on `tensorizedGpuBuffer`?

### Return a rented-out `MLBuffer` back to WebNN

```js
tensorizedMlBuffer.unmapFromGpuBuffer();
```

#### How it works:

- If `tensorizedMlBuffer` was allocated in memory shared by `gpuDevice`, this
  will be a zero-copy unmapping. Otherwise the contents of `tensorizedGpuBuffer`
  will be copied into `tensorizedMlBuffer`
- Informs `gpuDevice` to signal the "end access" fence created in the
  `mapAsGpuBuffer()` method after the completion of currently-enqueued
  operations that use `tensorizedGpuBuffer`. This is very similar to how
  `mapAsync()` works
- The WebNN timeline receives the signal and may resume execution
- WebNN has exclusive, read/write access to this memory until further notice
- `tensorizedGpuBuffer` is
  [expired](https://gpuweb.github.io/gpuweb/#dom-gpuexternaltexture-expired-slot)

#### Questions:

- What happens to `tensorizedMlBuffer` if this method is never called?
