
"Failed to create the main Isolate" error - Large Memory Req #46

Closed
muffato opened this issue Dec 17, 2023 · 16 comments · Fixed by #63

@muffato

muffato commented Dec 17, 2023

Following #29 (comment), I went to an Ubuntu 22.04 machine and got a different error:

$ ./wave-1.1.1-linux-x86_64 
Fatal error: Failed to create the main Isolate. (code 801)

(the binary is the one attached to the v1.1.1 release https://github.com/seqeralabs/wave-cli/releases/tag/v1.1.1 )

@pditommaso
Contributor

Works in my test

$ cat /etc/issue
Ubuntu 22.04.3 LTS \n \l

$ ./wave-1.1.1-linux-x86_64 --info
Client:
 Version   : 1.1.1
 System    : Linux
Server:
 Version   : 1.1.6
 Endpoint  : https://wave.seqera.io

Are you using a Linux VM on macOS?

@muffato
Author

muffato commented Dec 18, 2023

Thanks for taking a look, @pditommaso. That pushed me to dig deeper, and I found this when running strace:

mmap(NULL, 34360786944, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = -1 ENOMEM (Cannot allocate memory)

Our head node doesn't allow "large" memory allocations.
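As a sanity check, that failed reservation is just over 32 GiB of address space. The mapping is PROT_NONE and MAP_NORESERVE, so it reserves address space without committing physical memory, but it still counts against the process's virtual memory limit:

$ echo $((34360786944 / 1024 / 1024))   # size of the mmap above, in MiB
32769

32769 MiB is 32 GiB + 1 MiB.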

When submitted to the farm, it works:

$ bsub -M1000 -R"select[mem>1000] rusage[mem=1000] span[hosts=1]" -n 1 -Is ./wave-1.1.1-linux-x86_64 --info
Job <104> is submitted to default queue <normal>.
<<Waiting for dispatch ...>>
<<Starting on tol22-os0000001>>
Client:
 Version   : 1.1.1
 System    : Linux
Server:
 Version   : 1.1.6
 Endpoint  : https://wave.seqera.io

@pditommaso
Contributor

Interesting. So it's a memory issue. It turns out that even when compiled to a native binary image, the usual Java heap logic applies:

If no maximum Java heap size is specified, a native image that uses the Serial GC will set its maximum Java heap size to 80% of the physical memory size.

https://www.graalvm.org/latest/reference-manual/native-image/optimizations-and-performance/MemoryManagement/

But this does not make sense for a CLI app. We need to make a patch to constrain the max heap to a few hundred MBs.
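For context, GraalVM lets that default be overridden either at build time or per run. A minimal sketch of the flags only (the jar and image names below are hypothetical, not the actual wave-cli build command):

$ native-image -R:MaxHeapSize=512m -jar wave-cli.jar wave   # bake a 512 MB default into the image
$ ./wave -Xmx512m --version                                 # or cap the heap at run time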

Tagging @munishchouhan for visibility

@pditommaso
Contributor

This should have been addressed in version 1.1.2

@muffato
Author

muffato commented Dec 20, 2023

Unfortunately it doesn't :( I'm still getting the error.
The difference is that the strace output is much shorter and there is no obvious mmap failure:

$ strace ./wave-1.1.2-linux-x86_64 
execve("./wave-1.1.2-linux-x86_64", ["./wave-1.1.2-linux-x86_64"], 0x7ffdfd37e020 /* 65 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x3579998)      = 0
set_tid_address(0x3579ad0)              = 1757312
rt_sigprocmask(SIG_UNBLOCK, [RT_1 RT_2], NULL, 8) = 0
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa7efba7000
mprotect(0x7fa7efba9000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
clone(child_stack=0x7fa7efbc9ef8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tid=[1757313], tls=0x7fa7efbc9f38, child_tidptr=0x3579ad0) = 1757313
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
futex(0x7fa7efbc9f70, FUTEX_WAIT_PRIVATE, 2, NULLFatal error: Failed to create the main Isolate. (code 801)
) = ?
+++ exited with 33 +++

@pditommaso reopened this Dec 20, 2023
@muffato
Author

muffato commented Dec 20, 2023

Oki. I've run some more tests, and the error only shows up on certain machines; the wave CLI seems to work fine on other Ubuntu 22.04 machines here. Let me raise a ticket internally to see if there are known differences between those machines.

@pditommaso
Contributor

The kernel version can be useful.

@muffato
Author

muffato commented Dec 20, 2023

Thanks for the pointer.

Failing on these machines

Linux tol-head2 4.15.0-216-generic #227-Ubuntu SMP Fri Aug 18 01:34:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux tol-head1 4.15.0-216-generic #227-Ubuntu SMP Fri Aug 18 01:34:03 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux tol22-head1 5.15.0-83-generic #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux tol22-head2 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Working on these

Linux farm5-head1 4.15.0-214-generic #225-Ubuntu SMP Thu Jul 13 09:22:47 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux farm5-head2 4.15.0-219-generic #230-Ubuntu SMP Thu Oct 5 20:25:05 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux farm22-head1 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux farm22-head2 5.15.0-79-generic #86-Ubuntu SMP Mon Jul 10 16:07:21 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Linux pipelines-web 5.15.0-82-generic #91-Ubuntu SMP Mon Aug 14 14:14:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

No correlation with the kernel versions, but clear differences between clusters/VMs. There must be a difference in the way they've been set up.
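For anyone comparing such hosts, the per-process limits and the kernel overcommit policy are likely culprits for an mmap ENOMEM (a suggestion of where to look, not a confirmed diagnosis):

$ ulimit -a                      # per-process resource limits, including 'virtual memory'
$ sysctl vm.overcommit_memory    # kernel overcommit policy: 0, 1 or 2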

@pditommaso
Contributor

There must be a difference in the way they've been set up

Yeah, I'd ask you to investigate a bit more on your side.

@muffato
Author

muffato commented Feb 23, 2024

Hello !

We made some progress on this. We found that the CLI needs at least 33,603,492 kbytes (~32 GB) of virtual memory to function, as demonstrated in the example below:

$ ulimit -a | grep virtual
virtual memory              (kbytes, -v) unlimited
$ ./wave-1.2.0-linux-x86_64 --version
1.2.0_b47f31f
$ (ulimit -v 33603492; ./wave-1.2.0-linux-x86_64 --version)
1.2.0_b47f31f
$ (ulimit -v 33603491; ./wave-1.2.0-linux-x86_64 --version)
Fatal error: Failed to create the main Isolate. (code 801)

even though the "Maximum resident set size (kbytes)" is reported by /usr/bin/time as 25,136 (24 MB).

Something in the CLI is requesting vast amounts of virtual memory but not using it.
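One way to pin down such a threshold is to bisect over ulimit -v, applying the limit in a subshell so each attempt starts from the unrestricted parent shell (a sketch; not necessarily how the number above was found):

lo=0; hi=67108864   # search window in kbytes (0 to 64 GB)
while [ $((hi - lo)) -gt 1 ]; do
    mid=$(( (lo + hi) / 2 ))
    if (ulimit -v "$mid"; ./wave-1.2.0-linux-x86_64 --version) >/dev/null 2>&1; then
        hi=$mid   # ran fine: the minimum is at or below mid
    else
        lo=$mid   # failed: the minimum is above mid
    fi
done
echo "minimum virtual memory: $hi kbytes"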

@pditommaso
Contributor

Thanks for reporting. We'll investigate this

@marcodelapierre changed the title "Failed to create the main Isolate" error → "Failed to create the main Isolate" error - Memory Req Issue on Feb 28, 2024
@marcodelapierre changed the title "Failed to create the main Isolate" error - Memory Req Issue → "Failed to create the main Isolate" error - Large Memory Req on Feb 28, 2024
@munishchouhan
Member

@muffato thanks for the reproducer. I can reproduce this on an AWS EC2 free-tier instance, with the failure threshold at a slightly lower limit:

~$ (ulimit -v 33603487; ./wave-1.2.0-linux-x86_64 --version)
Fatal error: Failed to create the main Isolate. (code 801)
~$ (ulimit -v 33603488; ./wave-1.2.0-linux-x86_64 --version)
1.2.0_b47f31f

I will investigate further and post updates here.

@munishchouhan
Member

I have also raised this issue with the GraalVM native-image team:
oracle/graal#8476

@munishchouhan
Member

@muffato I received a reply from the GraalVM native-image team: this is a known problem when using the serial GC. They have provided a couple of options; we will try them and let you know.
oracle/graal#8476 (comment)

@munishchouhan
Member

I have tried with G1 GC and it is working:

ubuntu@ip-172-31-39-213:~$ (ulimit -v 33603487; ./wave-1.2.0-linux-x86_64 --version)
Fatal error: Failed to create the main Isolate. (code 801)
ubuntu@ip-172-31-39-213:~$ (ulimit -v 33603487; ./wave --version)
1.2.0_60b9c7a

@muffato you can download the binary from the artifacts section of this action run https://github.com/seqeralabs/wave-cli/actions/runs/8570788276 and give it a try.
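For reference, the collector switch is a single flag on the GraalVM native-image command line. A sketch only, not the exact wave-cli build invocation (the jar and image names are hypothetical); note that G1 is currently supported only for Linux native images:

$ native-image --gc=G1 -jar wave-cli.jar wave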

@muffato
Author

muffato commented Apr 5, 2024

It works 🎉!
