OOM blow up in the worker #888

Closed
josephjclark opened this issue Mar 3, 2025 · 5 comments · Fixed by #890
@josephjclark
Collaborator

We've caught a case of a worker blowing up with an OOM error:

[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[resource.labels.containerName: global-worker] [R/T] ❯ Final memory usage: [step 482mb] [system 739mb]
[SRV] ℹ 1c87ffb1-e05f-4489-8ee3-9e9e3045108f :: run:log :: OK
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] <--- Last few GCs --->
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] [1:0x7a736db93000] 108822028 ms: Scavenge (interleaved) 1177.3 (1206.6) -> 1176.5 (1214.6) MB, pooled: 0 MB, 89.14 / 0.00 ms (average mu = 0.675, current mu = 0.547) task;
[resource.labels.containerName: global-worker] [1:0x7a736db93000] 108824468 ms: Mark-Compact (reduce) 1203.4 (1234.6) -> 1203.1 (1226.1) MB, pooled: 0 MB, 552.17 / 0.00 ms (+ 1390.0 ms in 0 steps since start of marking, biggest step 0.0 ms, walltime since start of marking 2440 ms) (average mu = 0.544
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] <--- JS stacktrace --->
[resource.labels.containerName: global-worker] {}
[resource.labels.containerName: global-worker] FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
[resource.labels.containerName: global-worker] ----- Native stack trace -----

Cloud link

@josephjclark josephjclark self-assigned this Mar 3, 2025
@github-project-automation github-project-automation bot moved this to New Issues in v2 Mar 3, 2025
@doc-han
Collaborator

doc-han commented Mar 3, 2025

[resource.labels.containerName: global-worker] [R/T] ❯ Final memory usage: [step 482mb] [system 739mb]

step 482mb 🤔
system 739mb

@josephjclark
Collaborator Author

We have a bunch of mitigations against this:

  • the child process should have its own memory limit
  • when it OOMs, the worker should catch it and safely abort the run

This isn't foolproof though: synchronous operations can consume memory way above the limit before the thread is killed - operations like stringifying large objects, which the runtime does.
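For context, the per-child memory cap in the first bullet boils down to something like this - a minimal sketch only, with a made-up path and limit, not the worker's actual bootstrap code:

```ts
import { fork } from 'node:child_process';

// Sketch: give the child process its own V8 heap cap so a runaway run
// OOMs the child, not the worker itself. Path and limit are illustrative.
const child = fork('./child-runner.js', [], {
  execArgv: ['--max-old-space-size=512'],
});

child.on('exit', (code, signal) => {
  if (code !== 0) {
    // An OOM inside the child surfaces here, so the worker can abort the
    // run and report a failure to lightning without going down itself
    console.error(`child exited: code=${code}, signal=${signal}`);
  }
});
```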

What I think we're seeing is that the parent process - the actual worker - is the one blowing up. Unfortunately, when that happens there's not much we can do: because the worker has no persistence layer, it can't send any updates to lightning.

Some thoughts:

  • We might be able to use a streaming serialiser to handle large objects. I worry about how this will play with fast-safe-stringify.
  • Can we use a supervisor process to a) track work and b) send errors back to lightning?
  • Or is there some other way to do this? Even a temporary file per active run? On OOM, look for any active files and send errors for them back to lightning (rough sketch below).
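To make that last idea concrete, here's a rough sketch - purely hypothetical, none of these paths or helpers exist in the worker today:

```ts
import * as fs from 'node:fs/promises';
import * as path from 'node:path';

// Hypothetical marker directory - not part of the real worker
const ACTIVE_DIR = '/tmp/worker-active-runs';

// Drop a marker file when a run starts, remove it when the run settles
export const markRunActive = async (runId: string) => {
  await fs.mkdir(ACTIVE_DIR, { recursive: true });
  await fs.writeFile(
    path.join(ACTIVE_DIR, runId),
    JSON.stringify({ startedAt: Date.now() })
  );
};

export const markRunComplete = (runId: string) =>
  fs.rm(path.join(ACTIVE_DIR, runId), { force: true });

// On restart after an OOM crash, anything left in the directory is a run
// that was lost - report each one back to lightning via a (hypothetical)
// sendRunError callback
export const reportLostRuns = async (
  sendRunError: (runId: string) => Promise<void>
) => {
  const leftovers = await fs.readdir(ACTIVE_DIR).catch(() => [] as string[]);
  await Promise.all(leftovers.map((runId) => sendRunError(runId)));
};
```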

@josephjclark
Collaborator Author

josephjclark commented Mar 3, 2025

There are several possible points of memory failure:

  1. The VM (the job code) could use too much working memory. This should be caught by the child process and safely handled (it feels fine in my local testing), but the worst-case scenario, I think, is that the child process dies. Again, this should be handled well.
  2. The runtime inside the child process could also use too much memory, e.g. when serializing strings to send out to the main worker process. Again, as per 1, I think this is well handled.
  3. The worker - which also needs to parse large JSON payloads - could exceed its memory limit. This will cause any concurrent runs on the worker to be lost.

It's this third one that's happening now, I think. This is the absolute worst case, because if the worker itself blows up, all bets are off.

I have a repro case here with a 20mb state object and a low 50mb limit on the main worker thread (roughly sketched after the list below).

  • The VM had enough memory to build it
  • The child process had enough memory to serialize it
  • The worker appears to have blown up trying to send it out to lightning
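For reference, the shape of the repro is roughly this - the job expression and start command are illustrative, not the actual test case:

```ts
// Illustrative repro sketch.
// The job builds a state object of roughly 20mb:
const job = `
  fn((state) => {
    state.data = Array.from({ length: 20 }, () => 'x'.repeat(1024 * 1024));
    return state;
  });
`;

// The main worker process is then started with a deliberately tiny heap,
// e.g. node --max-old-space-size=50 <worker entrypoint>, so that merely
// receiving/forwarding the returned state is enough to push it over the limit.
```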

A couple of good questions to ask are:

  • If this state object is 20mb, it shouldn't be getting sent to lightning at all, should it? I can't see a "withholding payload" message (though I could be missing it).
  • What is a reasonable default payload size that we expect a worker, with the recommended memory allowance and capacity, to be able to support? It should be ironclad: if a worker has N runs, each returning a Ymb payload at the same time, the worker should be able to handle this safely.

EDIT: Oh, but hang on - the payload size is enforced by the worker, right at the last minute. The engine sends all payloads up to the worker; it doesn't enforce the limit at all. And, thinking about it, this is what's putting the worker in jeopardy: it blew up processing the payload before it had even decided whether to send it back to lightning.

@josephjclark
Collaborator Author

Yep, OK, zeroing in on this thing now.

I've confirmed that if the payload is too big, it's possible to trigger an OOM exception in the main worker thread while receiving the data. I'm not sure whether it's the worker thread or the child process that's causing the problem.

So if I postMessage a payload that's too big, we get a blow-up, and there's no defence against this.

Options as I see them right now:

  1. Push the payload limit logic from the worker down into the engine. Then the engine won't even emit an event from the worker thread if we think the payload is too big. I'd need to think a little about exactly where we do this and how we recognise that it's happened (because the worker still needs to report it to the user). There's a rough sketch after this list.
  2. Work out some way to stream the payload across the thread boundary more efficiently. I'm not sure what sort of API support I'll get for this in node, because I think it's quite a limited interface.
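A rough sketch of what 1) could look like - the names and the default limit are made up, and this isn't how the engine is structured today:

```ts
// Hypothetical sketch of option 1: check the serialized size in the engine,
// before the event ever crosses the thread boundary
const PAYLOAD_LIMIT_MB = 10; // made-up default

export const emitWithLimit = (emit: (event: any) => void, event: any) => {
  const sizeBytes = Buffer.byteLength(JSON.stringify(event.state ?? ''), 'utf8');
  if (sizeBytes > PAYLOAD_LIMIT_MB * 1024 * 1024) {
    // Strip the payload and emit a marker instead, so the worker can still
    // tell the user it was withheld without ever holding the full string
    emit({ ...event, state: undefined, payloadWithheld: true, sizeBytes });
    return;
  }
  emit(event);
};
```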

For 2, I know I can use a shared memory buffer between the worker thread and its parent child process. I'm more worried about the cross-process messaging. I do think I can send my own socket or server there; presumably I can then use streaming to send messages more efficiently.
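On the shared-memory side, the thread-to-parent half of that looks roughly like this (buffer size and path are invented, and the thread-side writing code is only described in comments):

```ts
import { Worker } from 'node:worker_threads';

// Sketch: a SharedArrayBuffer is visible to both the parent and the worker
// thread, so a large serialized result can be written into it directly
// instead of being copied through postMessage.
const shared = new SharedArrayBuffer(32 * 1024 * 1024);
const thread = new Worker('./run-job.js', { workerData: { shared } });

thread.on('message', ({ byteLength }) => {
  // The thread writes its serialized result into the buffer and posts only
  // the byte length; the parent decodes it from shared memory
  const result = Buffer.from(shared, 0, byteLength).toString('utf8');
  console.log('received', result.length, 'chars via shared memory');
});
```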

Tricky one. I think we probably should explore 2) because it results in a more robust architecture. But also there's no point in emitting data if it's not going to be used downstream, so that suggests 1) is the correct approach.

@josephjclark
Collaborator Author

josephjclark commented Mar 3, 2025

Had a little look at 2. You can apparently pass your own socket to handle the IPC messaging (as I understand it). See the nodejs docs.
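For reference, the mechanism here (as I understand it) is child_process's send(message, sendHandle), which can carry a net.Server or net.Socket across the IPC channel - something like this sketch, with an illustrative path:

```ts
import { fork } from 'node:child_process';
import * as net from 'node:net';

const child = fork('./child-runner.js');

// The parent opens a local server and hands the handle to the child over
// IPC; the child can then accept connections / stream data on it directly
const server = net.createServer();
server.listen(0, () => {
  // net.Server and net.Socket handles can be passed as the second argument
  child.send({ type: 'server' }, server);
});
```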

Perhaps I'm naive, but I hoped this would stream messages more efficiently and let me do more with less memory. Maybe I'm just missing a step and I don't get this for free. But I've got an implementation and it doesn't seem to help at all :( I think it's using the socket - I don't really get much feedback - but I still get the same sorts of blow-ups.

This is on branch engine-sockets

So I'm gonna pause this and switch to approach 1, which to be fair we do need anyway.

Trying this out makes me think that instead of using IPC for inter-process messaging, I should just create a local websocket server and have worker threads call out to it directly. That still assumes we can stream JSON payloads with a much lower memory overhead, which I think is reasonable? A rough sketch is below.
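To sketch what that might look like (plain net rather than websockets, just to keep the sketch dependency-free; names and the socket path are invented, and this only shows the wiring - actually reducing peak memory would also need a streaming JSON parse on the receiving side):

```ts
import * as net from 'node:net';

// Invented handler - forward the event to lightning, enforce limits, etc.
const handleEvent = (event: unknown) => {
  console.log('event received', event);
};

// A local server owned by the worker; worker threads / child processes
// write newline-delimited JSON to it in chunks rather than posting one
// giant message over IPC
const server = net.createServer((socket) => {
  let buffer = '';
  socket.on('data', (chunk) => {
    buffer += chunk.toString('utf8');
    let idx: number;
    while ((idx = buffer.indexOf('\n')) >= 0) {
      const line = buffer.slice(0, idx);
      buffer = buffer.slice(idx + 1);
      if (line) handleEvent(JSON.parse(line));
    }
  });
});

server.listen('/tmp/worker-events.sock'); // hypothetical unix socket path
```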

@theroinaochieng theroinaochieng moved this from New Issues to DevX Backlog in v2 Mar 4, 2025
@github-project-automation github-project-automation bot moved this from DevX Backlog to Done in v2 Mar 6, 2025