OOM blow up in the worker #888

Closed
josephjclark opened this issue Mar 3, 2025 · 5 comments · Fixed by #890
@josephjclark
Collaborator

We've caught a case of a worker blowing up with an OOM error:

[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[resource.labels.containerName: global-worker] [R/T] ❯ Final memory usage: [step 482mb] [system 739mb]
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] <--- Last few GCs --->
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] [1:0x7a736db93000] 108822028 ms: Scavenge (interleaved) 1177.3 (1206.6) -> 1176.5 (1214.6) MB, pooled: 0 MB, 89.14 / 0.00 ms (average mu = 0.675, current mu = 0.547) task;
[resource.labels.containerName: global-worker] [1:0x7a736db93000] 108824468 ms: Mark-Compact (reduce) 1203.4 (1234.6) -> 1203.1 (1226.1) MB, pooled: 0 MB, 552.17 / 0.00 ms (+ 1390.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 2440 ms) (average mu = 0.544
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] <--- JS stacktrace --->
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
[resource.labels.containerName: global-worker] ----- Native stack trace -----

Cloud link

@josephjclark josephjclark self-assigned this Mar 3, 2025
@github-project-automation github-project-automation bot moved this to New Issues in v2 Mar 3, 2025
@doc-han
Collaborator

doc-han commented Mar 3, 2025

[resource.labels.containerName: global-worker] [R/T] ❯ Final memory usage: [step 482mb] [system 739mb]

step 482mb 🤔
system 739mb

@josephjclark
Collaborator Author

We have a bunch of mitigations against this:

  • the child process should have its own memory limit
  • when it OOMs, the worker should catch it and safely abort the run

This isn't foolproof though: synchronous operations can consume memory way above the limit before the thread is killed - operations like stringifying large objects, which the runtime does.
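For context, the per-child memory cap in the first bullet boils down to something like this - a minimal sketch only, with a made-up path and limit, not the worker's actual bootstrap code:

```ts
import { fork } from 'node:child_process';

// Sketch: give the child process its own V8 heap cap so a runaway run
// OOMs the child, not the worker itself. Path and limit are illustrative.
const child = fork('./child-runner.js', [], {
  execArgv: ['--max-old-space-size=512'],
});

child.on('exit', (code, signal) => {
  if (code !== 0) {
    // An OOM inside the child surfaces here, so the worker can abort the
    // run and report a failure to lightning without going down itself
    console.error(`child exited: code=${code}, signal=${signal}`);
  }
});
```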

What I think we're seeing is that the parent process - the actual worker - is the one blowing up. Unfortunately, when that happens there's not much we can do: because the worker has no persistence layer, it can't send any updates to lightning.

Some thoughts:

  • We might be able to use a streaming serialiser to handle large objects. I worry about how this will play with fast-safe-stringify.
  • Can we use a supervisor process to a) track work and b) send errors back to lightning?
  • Or is there some other way to do this? Even a temporary file per active run? On OOM, look for any active files and send errors for them back to lightning (rough sketch below).
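To make that last idea concrete, here's a rough sketch - purely hypothetical, none of these paths or helpers exist in the worker today:

```ts
import * as fs from 'node:fs/promises';
import * as path from 'node:path';

// Hypothetical marker directory - not part of the real worker
const ACTIVE_DIR = '/tmp/worker-active-runs';

// Drop a marker file when a run starts, remove it when the run settles
export const markRunActive = async (runId: string) => {
  await fs.mkdir(ACTIVE_DIR, { recursive: true });
  await fs.writeFile(
    path.join(ACTIVE_DIR, runId),
    JSON.stringify({ startedAt: Date.now() })
  );
};

export const markRunComplete = (runId: string) =>
  fs.rm(path.join(ACTIVE_DIR, runId), { force: true });

// On restart after an OOM crash, anything left in the directory is a run
// that was lost - report each one back to lightning via a (hypothetical)
// sendRunError callback
export const reportLostRuns = async (
  sendRunError: (runId: string) => Promise<void>
) => {
  const leftovers = await fs.readdir(ACTIVE_DIR).catch(() => [] as string[]);
  await Promise.all(leftovers.map((runId) => sendRunError(runId)));
};
```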

@josephjclark
Collaborator Author

josephjclark commented Mar 3, 2025

There are several possible points of memory failure:

  1. The VM (the job code) could use too much working memory. This should be caught by the child process and safely handled (it feels fine in my local testing), but the worst-case scenario, I think, is that the child process dies. Again, this should be handled well.
  2. The runtime inside the child process could also use too much memory, e.g. when serializing strings to send out to the main worker process. Again, as per 1, I think this is well handled.
  3. The worker - which also needs to parse large JSON payloads - could exceed its memory limit. This will cause any concurrent runs on the worker to be lost.

It's this third one that's happening now, I think. This is the absolute worst case, because if the worker itself blows up, all bets are off.

I have a repro case here with a 20mb state object and a low 50mb limit on the main worker thread (roughly sketched after the list below).

  • The VM had enough memory to build it
  • The child process had enough memory to serialize it
  • The worker appears to have blown up trying to send it out to lightning
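For reference, the shape of the repro is roughly this - the job expression and start command are illustrative, not the actual test case:

```ts
// Illustrative repro sketch.
// The job builds a state object of roughly 20mb:
const job = `
  fn((state) => {
    state.data = Array.from({ length: 20 }, () => 'x'.repeat(1024 * 1024));
    return state;
  });
`;

// The main worker process is then started with a deliberately tiny heap,
// e.g. node --max-old-space-size=50 <worker entrypoint>, so that merely
// receiving/forwarding the returned state is enough to push it over the limit.
```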

A couple of good questions to ask are:

  • If this state object is 20mb, it shouldn't be getting sent to lightning at all, should it? I can't see a "withholding payload" message (though I could be missing it).
  • What is a reasonable default payload size that we expect a worker, with the recommended memory allowance and capacity, to be able to support? It should be ironclad: if a worker has N runs, each returning a Ymb payload at the same time, the worker should be able to handle this safely.

EDIT: Oh, but hang on - the payload size is enforced by the worker, right at the last minute. The engine sends all payloads up to the worker; it doesn't enforce the limit at all. And, thinking about it, this is what's putting the worker in jeopardy: it blew up processing the payload before it had even decided whether to send it back to lightning.

@josephjclark
Collaborator Author

Yep, OK, zeroing in on this thing now.

I've confirmed that if the payload is too big, it's possible to trigger an OOM exception in the main worker thread while receiving the data. I'm not sure whether it's the worker thread or the child process that's causing the problem.

So if I postMessage a payload that's too big, we get a blow-up, and there's no defence against this.

Options as I see them right now:

  1. Push the payload limit logic from the worker down into the engine. Then the engine won't even emit an event from the worker thread if we think the payload is too big. I'd need to think a little about exactly where we do this and how we recognise that it's happened (because the worker still needs to report it to the user). There's a rough sketch after this list.
  2. Work out some way to stream the payload across the thread boundary more efficiently. I'm not sure what sort of API support I'll get for this in node, because I think it's quite a limited interface.
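A rough sketch of what 1) could look like - the names and the default limit are made up, and this isn't how the engine is structured today:

```ts
// Hypothetical sketch of option 1: check the serialized size in the engine,
// before the event ever crosses the thread boundary
const PAYLOAD_LIMIT_MB = 10; // made-up default

export const emitWithLimit = (emit: (event: any) => void, event: any) => {
  const sizeBytes = Buffer.byteLength(JSON.stringify(event.state ?? ''), 'utf8');
  if (sizeBytes > PAYLOAD_LIMIT_MB * 1024 * 1024) {
    // Strip the payload and emit a marker instead, so the worker can still
    // tell the user it was withheld without ever holding the full string
    emit({ ...event, state: undefined, payloadWithheld: true, sizeBytes });
    return;
  }
  emit(event);
};
```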

For 2, I know I can use a shared memory buffer between the worker thread and its parent child process. I'm more worried about the cross-process messaging. I do think I can send my own socket or server there; presumably I can then use streaming to send messages more efficiently.
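On the shared-memory side, the thread-to-parent half of that looks roughly like this (buffer size and path are invented, and the thread-side writing code is only described in comments):

```ts
import { Worker } from 'node:worker_threads';

// Sketch: a SharedArrayBuffer is visible to both the parent and the worker
// thread, so a large serialized result can be written into it directly
// instead of being copied through postMessage.
const shared = new SharedArrayBuffer(32 * 1024 * 1024);
const thread = new Worker('./run-job.js', { workerData: { shared } });

thread.on('message', ({ byteLength }) => {
  // The thread writes its serialized result into the buffer and posts only
  // the byte length; the parent decodes it from shared memory
  const result = Buffer.from(shared, 0, byteLength).toString('utf8');
  console.log('received', result.length, 'chars via shared memory');
});
```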

Tricky one. I think we probably should explore 2) because it results in a more robust architecture. But also there's no point in emitting data if it's not going to be used downstream, so that suggests 1) is the correct approach.

@josephjclark
Collaborator Author

josephjclark commented Mar 3, 2025

Had a little look at 2. You can apparently pass your own socket to handle the IPC messaging (as I understand it). See the nodejs docs.
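For reference, the mechanism here (as I understand it) is child_process's send(message, sendHandle), which can carry a net.Server or net.Socket across the IPC channel - something like this sketch, with an illustrative path:

```ts
import { fork } from 'node:child_process';
import * as net from 'node:net';

const child = fork('./child-runner.js');

// The parent opens a local server and hands the handle to the child over
// IPC; the child can then accept connections / stream data on it directly
const server = net.createServer();
server.listen(0, () => {
  // net.Server and net.Socket handles can be passed as the second argument
  child.send({ type: 'server' }, server);
});
```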

Perhaps I'm naive, but I hoped this would stream messages more efficiently and let me do more with less memory. Maybe I'm just missing a step and I don't get this for free. But I've got an implementation and it doesn't seem to help at all :( I think it's using the socket - I don't really get much feedback - but I still get the same sorts of blow-ups.

This is on branch engine-sockets

So I'm gonna pause this and switch to approach 1, which to be fair we do need anyway.

Trying this out makes me think that instead of using IPC for inter-process messaging, I should just create a local websocket server and have worker threads call out to it directly. That still assumes we can stream JSON payloads with a much lower memory overhead, which I think is reasonable? A rough sketch is below.
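To sketch what that might look like (plain net rather than websockets, just to keep the sketch dependency-free; names and the socket path are invented, and this only shows the wiring - actually reducing peak memory would also need a streaming JSON parse on the receiving side):

```ts
import * as net from 'node:net';

// Invented handler - forward the event to lightning, enforce limits, etc.
const handleEvent = (event: unknown) => {
  console.log('event received', event);
};

// A local server owned by the worker; worker threads / child processes
// write newline-delimited JSON to it in chunks rather than posting one
// giant message over IPC
const server = net.createServer((socket) => {
  let buffer = '';
  socket.on('data', (chunk) => {
    buffer += chunk.toString('utf8');
    let idx: number;
    while ((idx = buffer.indexOf('\n')) >= 0) {
      const line = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 1);
      if (line) handleEvent(JSON.parse(line));
    }
  });
});

server.listen('/tmp/worker-events.sock'); // hypothetical unix socket path
```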

@theroinaochieng theroinaochieng moved this from New Issues to DevX Backlog in v2 Mar 4, 2025
@github-project-automation github-project-automation bot moved this from DevX Backlog to Done in v2 Mar 6, 2025