Node 22 version bump with ABI possibly causing severe timeout problem in buildbot #26078

Open
hnyman opened this issue Mar 2, 2025 · 21 comments

Comments

@hnyman
Contributor

hnyman commented Mar 2, 2025

cc @robimarko @Ansuel @nxhack @ianchi

We have a frequent timeout/stall problem in the packages buildbot. The timeout is destroying quite a few builds through a hang-up that is reported as: failed 'make -j12 ...' (failure) (timed out)

 make[3] -C feeds/packages/admin/zabbix compile
command timed out: 3600 seconds without output running [b'make', b'-j12', b'IGNORE_ERRORS=n m y', b'BUILD_LOG=1', b'CONFIG_AUTOREMOVE=y', b'CONFIG_SIGNED_PACKAGES='], attempting to kill
process killed by signal 9
program finished with exit code -1
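
The "3600 seconds without output" rule in the log above means the step is killed when it stays silent for too long, not when its total run time exceeds a limit. A minimal Python sketch of that kind of watchdog follows, as an illustration only: this is not buildbot's implementation, and `run_with_silence_timeout` is a hypothetical helper name.

```python
import queue
import subprocess
import threading


def run_with_silence_timeout(cmd, silence_seconds):
    """Run cmd and kill it if it prints nothing for silence_seconds.

    Returns (timed_out, lines), where lines is the output collected so far.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    q = queue.Queue()

    def pump():
        # Forward each output line to the queue; None marks end of output.
        for line in proc.stdout:
            q.put(line)
        q.put(None)

    threading.Thread(target=pump, daemon=True).start()

    lines = []
    while True:
        try:
            item = q.get(timeout=silence_seconds)
        except queue.Empty:
            # No output within the window: kill (SIGKILL on POSIX).
            proc.kill()
            proc.wait()
            return True, lines
        if item is None:
            # The child closed its output normally.
            proc.wait()
            return False, lines
        lines.append(item.rstrip("\n"))
```

With this kind of rule, a build that keeps printing can run for many hours, while a single silent step (for example a long link phase or a hung process) is enough to trigger the kill.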

About 1/3 of the builds on the affected targets end in a timeout. As it does not affect every build of the same target architecture, and the build succeeds 2 times out of 3, it is likely some kind of concurrency/race problem: for example, the build order of the packages (or submodules) affects the features detected by a second package, causing a config prompt, or something like that.

That seems to have started approx 3 months ago.
The oldest failures that I have spotted are from around November 23, 2024

The failures happen on aarch64, arm, i386, x86
But not on armeb, arm_xscale, mips, powerpc, loongarch

It is hard to figure out what is happening, as the buildbot compile step logs available to casual users show just the launch of each package's compilation. And due to concurrent building, the packages are built in a slightly different order each time, so the 4000+ line logs cannot be diffed directly.

However, I did some debugging by sorting the compile step output of a timed-out build and comparing it with a recent OK build of the same target.
I noticed that for both analysed targets (x86_64, arm8vfpv3), the exact same package lines were missing from the timed-out build:

*** 1164,1183 ****
   make[3] -C feeds/packages/lang/node clean-build
   make[3] -C feeds/packages/lang/node compile
   make[3] -C feeds/packages/lang/node host-compile
-  make[3] -C feeds/packages/lang/node-arduino-firmata clean-build
-  make[3] -C feeds/packages/lang/node-arduino-firmata compile
-  make[3] -C feeds/packages/lang/node-cylon clean-build
-  make[3] -C feeds/packages/lang/node-cylon compile
-  make[3] -C feeds/packages/lang/node-hid clean-build
-  make[3] -C feeds/packages/lang/node-hid compile
-  make[3] -C feeds/packages/lang/node-homebridge clean-build
-  make[3] -C feeds/packages/lang/node-homebridge compile
-  make[3] -C feeds/packages/lang/node-javascript-obfuscator clean-build
-  make[3] -C feeds/packages/lang/node-javascript-obfuscator compile
-  make[3] -C feeds/packages/lang/node-serialport clean-build
-  make[3] -C feeds/packages/lang/node-serialport compile
-  make[3] -C feeds/packages/lang/node-serialport-bindings clean-build
-  make[3] -C feeds/packages/lang/node-serialport-bindings compile
   make[3] -C feeds/packages/lang/node-yarn host-compile
   make[3] -C feeds/packages/lang/perl clean-build
   make[3] -C feeds/packages/lang/perl compile

The node main compilation is started and also node-yarn host-compile gets started (as the first node module?). But then there is no trace that compiling other modules ever starts, until a timeout kills the whole buildbot build round.
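
The sort-and-compare step described above can be sketched roughly as follows. This is a minimal Python sketch, not the exact commands used for the attached logs; the function names and the sample lines are illustrative assumptions. It keeps only the `make[N] -C ...` launch lines and reports those present in the OK build but absent from the timed-out build.

```python
def compile_steps(log_text):
    """Return the set of 'make[N] -C <dir> <target>' lines found in a build log."""
    steps = set()
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("make[") and " -C " in line:
            steps.add(line)
    return steps


def missing_steps(ok_log, failed_log):
    """Sorted list of steps that the OK build launched but the failed build did not."""
    return sorted(compile_steps(ok_log) - compile_steps(failed_log))


# Tiny illustrative excerpts; the real logs are 4000+ lines long.
ok_example = """\
 make[3] -C feeds/packages/lang/node compile
 make[3] -C feeds/packages/lang/node-hid compile
 make[3] -C feeds/packages/lang/perl compile
"""
failed_example = """\
 make[3] -C feeds/packages/lang/perl compile
 make[3] -C feeds/packages/lang/node compile
"""
for step in missing_steps(ok_example, failed_example):
    print("-", step)
```

Because the steps are compared as sets, the differing build order between runs does not matter.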

So, my guess for the reason is #25435 : node: upgrade to 22.11.0 LTS on 23 Nov 2024. In addition to the major version bump, that commit also added ABI versioning to node modules.

Node is restricted with DEPENDS:=@HAS_FPU @(i386||x86_64||arm||aarch64) to build only on the affected targets, which further points to node being the reason for the timeouts.

So for some reason, the node build likely fails 1/3 of the time, but succeeds 2/3.
Curious.

Sorted logs:
sort x86 ok stdio.txt
sort x86 error stdio.txt

Original:
x86 ok stdio.txt
x86 error stdio.txt

@hnyman
Contributor Author

hnyman commented Mar 2, 2025

@nxhack

Do you have any idea what might make node to occasionally react badly to extreme concurrency (-j 12 or 14) in building? Are some node modules dependent on each other without declaring that explicitly?

Is the new ABI versioning you implemented with that version bump really mandatory?

Unless a fix is found rather soon, we might need to test my debugging results by either

  • removing the new ABI versioning,
  • disabling parallel builds for node, or
  • disabling node entirely, to test the hypothesis that it is the culprit for the timeouts. Maybe remove just some architectures, e.g. the arm ones, to see if that fixes the buildbot runs for those targets.

@hnyman
Contributor Author

hnyman commented Mar 2, 2025

Alternatively, we could disable node subpackages like node-yarn, which seems to be the one that gets built first (and maybe hangs). Looking at its Makefile, it has not been updated along with the main node package. We seem to be using the really ancient yarn version 1.22. It is quite possible that it is not in sync with the much newer main node.

https://github.com/yarnpkg/yarn#readme

> This repository holds the sources for Yarn 1.x (latest version at the time of this writing being 1.22). New releases (at this time the 3.2.3, although we're currently working on our next major) are tracked on the yarnpkg/berry repository, this one here being mostly kept for historical purposes and the occasional hotfix we publish to make the migration from 1.x to later releases easier.
>
> If you hit bugs or issues with Yarn 1.x, we strongly suggest you migrate to the latest release

@robimarko
Contributor

@hnyman I tried building node-yarn multiple times with 32 threads locally in the snapshot SDK but I cannot get it to fail

@nxhack
Contributor

nxhack commented Mar 2, 2025

@hnyman

> Do you have any idea what might make node to occasionally react badly to extreme concurrency (-j 12 or 14) in building? Are some node modules dependent on each other without declaring that explicitly?

In my experience, there is no problem in building node.js itself, but I am aware of an extreme increase in npm cli threads when building node packages.

@nxhack
Contributor

nxhack commented Mar 3, 2025

I also tried testing it on -j32, and it built without any problems. (4 cores, VT-x 8 threads, 32GB memory)

When building node packages, the number of threads increases to over 100, but it built without any problems.
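
For anyone who wants to reproduce that thread count: on Linux, the `Threads:` field of `/proc/<pid>/status` gives the number of threads of a process. A minimal Python sketch follows (Linux only). The helper names are illustrative, and how the numbers above were actually measured is not stated in this thread.

```python
import os


def threads_from_status(status_text):
    """Extract the thread count from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return 0


def process_name(status_text):
    """Extract the (possibly truncated) process name from /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            return line.split(":", 1)[1].strip()
    return ""


def total_threads(name_fragment):
    """Sum thread counts over all processes whose name contains name_fragment."""
    total = 0
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/status") as f:
                text = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        if name_fragment in process_name(text):
            total += threads_from_status(text)
    return total
```

Calling `total_threads("node")` repeatedly while a node package is building would show how far the thread count climbs, assuming the npm processes show up under that name.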

@nxhack
Contributor

nxhack commented Mar 3, 2025

@hnyman
Would you be able to test this?

diff --git a/lang/node-arduino-firmata/Makefile b/lang/node-arduino-firmata/Makefile
index 90c1c5b34..6c0e94eb0 100644
--- a/lang/node-arduino-firmata/Makefile
+++ b/lang/node-arduino-firmata/Makefile
@@ -18,6 +18,7 @@ PKG_HASH:=d7157e02867eae82887cb5e17b90c963fe7489bacd464110bfd20c672b8d5a98
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT
diff --git a/lang/node-cylon/Makefile b/lang/node-cylon/Makefile
index 28b3c635b..3bb1c16d0 100644
--- a/lang/node-cylon/Makefile
+++ b/lang/node-cylon/Makefile
@@ -20,6 +20,7 @@ PKG_SOURCE_SUBDIR:=$(PKG_SRC_NAME)-$(PKG_VERSION)
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=Apache-2.0
diff --git a/lang/node-hid/Makefile b/lang/node-hid/Makefile
index 575f9d579..0437fb63d 100644
--- a/lang/node-hid/Makefile
+++ b/lang/node-hid/Makefile
@@ -18,6 +18,7 @@ PKG_HASH:=6c1f05935215feed4e8d2f4aecf31abbad8fa783d252b0bd6041ed2f2e96e9ba
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT or X11
diff --git a/lang/node-homebridge/Makefile b/lang/node-homebridge/Makefile
index 7c6d124bc..d638a2fdc 100644
--- a/lang/node-homebridge/Makefile
+++ b/lang/node-homebridge/Makefile
@@ -15,6 +15,7 @@ PKG_HASH:=f91ab0058707a0498d97d87f45f19682065f80660fac942e0985caf9bb205f2a
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=ISC Apache-2.0
diff --git a/lang/node-javascript-obfuscator/Makefile b/lang/node-javascript-obfuscator/Makefile
index 281656331..fc2b3c5f4 100644
--- a/lang/node-javascript-obfuscator/Makefile
+++ b/lang/node-javascript-obfuscator/Makefile
@@ -14,10 +14,10 @@ PKG_SOURCE_URL:=https://registry.npmjs.org/$(PKG_NPM_NAME)/-/
 PKG_HASH:=9bc89b04c78277130bc6f699563871d211f6fc85803c874f6114a632d9456f7b
 
 PKG_BUILD_DEPENDS:=node/host
-HOST_BUILD_PARALLEL:=1
+HOST_BUILD_PARALLEL:=0
 
 HOST_BUILD_DEPENDS:=node/host
-PKG_BUILD_PARALLEL:=1
+PKG_BUILD_PARALLEL:=0
 PKG_BUILD_FLAGS:=no-mips16
 
 PKG_MAINTAINER:=Zbynek Kocur <[email protected]>
diff --git a/lang/node-serialport-bindings/Makefile b/lang/node-serialport-bindings/Makefile
index e6352781f..d0daa9b39 100644
--- a/lang/node-serialport-bindings/Makefile
+++ b/lang/node-serialport-bindings/Makefile
@@ -16,6 +16,7 @@ PKG_HASH:=aec200860bd175e4b14b4ab1aa56a5f750172b6c8e20ccb234846206395848d4
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT
diff --git a/lang/node-serialport/Makefile b/lang/node-serialport/Makefile
index 336d4b2e7..4c0f4af02 100644
--- a/lang/node-serialport/Makefile
+++ b/lang/node-serialport/Makefile
@@ -18,6 +18,7 @@ PKG_HASH:=e19fe993ad16ae0e03fc42e24cfe4babf8fd90f8358e1885d5e216277dda1086
 
 PKG_BUILD_DEPENDS:=node/host
 PKG_BUILD_FLAGS:=no-mips16
+PKG_BUILD_PARALLEL:=0
 
 PKG_MAINTAINER:=Hirokazu MORIKAWA <[email protected]>
 PKG_LICENSE:=MIT
diff --git a/lang/node-yarn/Makefile b/lang/node-yarn/Makefile
index 47c7112f2..b5527189f 100644
--- a/lang/node-yarn/Makefile
+++ b/lang/node-yarn/Makefile
@@ -19,7 +19,7 @@ PKG_LICENSE_FILES:=LICENSE
 
 PKG_HOST_ONLY:=1
 HOST_BUILD_DEPENDS:=node/host
-HOST_BUILD_PARALLEL:=1
+HOST_BUILD_PARALLEL:=0
 
 include $(INCLUDE_DIR)/host-build.mk
 include $(INCLUDE_DIR)/package.mk

hnyman added a commit to hnyman/packages that referenced this issue Mar 3, 2025
Disable parallel builds for node downstream packages, as the
buildbot is showing frequent timeout problems
for aarch64, arm, i386 and x86, and node & node packages
are the primary suspect.

Based on discussion in
openwrt#26078

Signed-off-by: Hannu Nyman <[email protected]>
@hnyman
Contributor Author

hnyman commented Mar 3, 2025

Thanks. I applied that to the master branch to test whether it is enough to fix things.
The next few days will show. If there are still timeouts (so that this is not enough), the next test step might be to temporarily disable node for aarch64 and i386 to verify whether node really is the culprit.

Ps.
Note that the node itself still has parallel build enabled.
I wonder how heavy and long the node build itself is? Maybe there is just a genuine timeout if it is among the last packages to be compiled and the compilation takes over an hour.

@robimarko
Contributor

And it timed out again on a couple of archs

@hnyman
Contributor Author

hnyman commented Mar 5, 2025

I think that we should temporarily mark the node package itself as BROKEN just to verify that it really is the reason for the frequent hangups.

@robimarko
Contributor

Sounds fine to me

@ynezz
Member

ynezz commented Mar 5, 2025

@hnyman thanks a lot for looking into this!

> Maybe there is just a genuine timeout if it is among the last packages to be compiled and the compilation takes over an hour.

I just bumped it from 1 to 2 hours, let's see.

@hnyman
Contributor Author

hnyman commented Mar 5, 2025

I marked node BROKEN an hour ago, so let's see if that takes care of the timeouts.

Having the timeout period lengthened to two hours might help in case the node package really is that hard to compile. But then the question raised is if it is wise to spend that much resources for the probably really rarely used package. Node.js is not something typically installed into an OpenWrt home router, I think.

@ynezz
Member

ynezz commented Mar 6, 2025

> When building node packages, the number of threads increases to over 100, but it built without any problems.

Maybe there is some issue, where node's build system doesn't honor the build concurrency constraints? Or there is some deadlock/race somewhere, being exhibited only on build systems with lower I/O throughput? If it was changed in that update, maybe the diff between those two versions could show the culprit? How was the previous node version (one which built fine on buildbots) behaving?

> But then the question raised is if it is wise to spend that much resources for the probably really rarely used package.

Indeed, but it seems to be actively maintained, so there are users.

> Node.js is not something typically installed into an OpenWrt home router, I think.

We could say this about a lot of other packages as well :)

@hnyman
Contributor Author

hnyman commented Mar 6, 2025

I suspect that heavy memory requirements of node compilation might be one reason. Maybe the compilation worker process just crashes?

I am using Ubuntu 24.10 in VirtualBox 7.1 running on top of Windows 11, in a branch system with a new Intel Core Ultra 7 265 processor, pure SSDs, 16 GB RAM, etc. I allocate 10 CPU cores and 7.1 GB RAM for Ubuntu/VirtualBox, and that is quite enough to compile my whole OpenWrt build in about 20 minutes without any resource constraint trouble.

I have now twice managed to crash the node compilation in Ubuntu, when trying to compile only the node package with "-j 9". The latest time I tried to monitor the memory consumption, and after some 6 minutes of node compilation with -j 9 (no other load), the memory consumption started to go up rapidly until the compilation shell crashed and closed.
Monitoring from another window showed that rapidly increasing memory consumption:

perus@ub2410:~$ uptime ; free
 19:19:38 up  1:10,  2 users,  load average: 6.96, 4.60, 2.52
               total        used        free      shared  buff/cache   available
Mem:         6565508     4193416     1819536       14744      866760     2372092
Swap:        4194300      763456     3430844
perus@ub2410:~$ uptime ; free
 19:19:59 up  1:10,  2 users,  load average: 7.11, 4.79, 2.62
               total        used        free      shared  buff/cache   available
Mem:         6565508     5157920      742588       14748      979208     1407588
Swap:        4194300      762688     3431612
perus@ub2410:~$ uptime ; free
 19:20:35 up  1:11,  2 users,  load average: 8.01, 5.24, 2.86
               total        used        free      shared  buff/cache   available
Mem:         6565508     6210252      224088       13768      536848      355256
Swap:        4194300     1474904     2719396

And then the shell crashed, and dmesg shows:

[ 4275.949490] systemd-journald[366]: Under memory pressure, flushing caches.
[ 4304.341047] systemd-journald[366]: Under memory pressure, flushing caches.

Not quite sure how to debug that, or how to prevent that.
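
One low-effort way to capture this kind of growth is to log `MemAvailable` from `/proc/meminfo` at a fixed interval while the build runs, so that the last samples before a crash survive in a file. A minimal Python sketch follows (Linux only). The helper names, the log file name and the 5-second interval are assumptions, not something used in this thread.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: value}.

    Values are in kiB for the fields that carry a 'kB' unit.
    """
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def log_memory(path="meminfo.log", interval=5.0, samples=None):
    """Append timestamped MemAvailable/SwapFree samples to path.

    Runs until interrupted when samples is None.
    """
    count = 0
    while samples is None or count < samples:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        # Reopen per sample so each line reaches the file before a crash.
        with open(path, "a") as out:
            out.write(f"{time.strftime('%H:%M:%S')} "
                      f"MemAvailable={info.get('MemAvailable')} "
                      f"SwapFree={info.get('SwapFree')}\n")
        count += 1
        time.sleep(interval)

# Example: start log_memory() in a second terminal before launching the
# node build, and stop it with Ctrl-C once the build ends or crashes.
```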

But compared to compiling OpenWrt itself (the full Linux etc.) this node seems much heavier. I have never had any problems with memory resources in connection with OpenWrt compilation. First time for me.

EDIT:
Just for reference: during a normal "-j 10" compilation of my OpenWrt build with add-on packages, the memory usage mainly hovered between 1.9 and 2.1 GB, while peaking above 3 GB. Compared to that, the observed usage of over 6 GB by the node compilation alone is quite a lot.

@ynezz
Member

ynezz commented Mar 6, 2025

> I suspect that heavy memory requirements of node compilation might be one reason. Maybe the compilation worker process just crashes?

Not sure, for example https://buildbot.openwrt.org/main/packages/#/builders/1/builds/97 is on a quite beefy build host, sharing E5-2680 based 56 VCPUs/threads and 440GiB of RAM between 4 build workers, with the following assignment:

  • ffffm-dock-01 - images/snapshot
  • ffffm-dock-02 - packages/snapshot
  • ffffm-dock-03 - packages/snapshot
  • ffffm-dock-04 - packages/openwrt-24.10

@robimarko
Contributor

I mean, I tried compiling Node with 32 threads and I have 64G of RAM and have not really felt like I am close to OOM

@hnyman
Contributor Author

hnyman commented Mar 7, 2025

This might be related:

nodejs/node#45949

> If I build with make -j 16 the compile stage takes about 30 min. But during link 5 executables are being linked simultaneously, each taking 4GB RAM. Disk thrashing starts and the link phase takes an extreme long time to complete. In the mean while keyboard/mouse stop responding.

and one solution proposal to enable parallel processing for compilation, but disabling it for the apparently problematic linking phase:
nodejs/node#45949 (comment)
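
The numbers in that report give a rough feel for the problem: five concurrent link jobs at about 4 GB each need roughly 20 GB of RAM, far more than the 7.1 GB VM mentioned earlier in this thread. A small Python sketch of that arithmetic follows. The 4 GB per link job is taken from the quoted report and is only an estimate, the 1 GB reserve is an assumption, and `max_parallel_links` is a hypothetical helper, not part of node's build system.

```python
def max_parallel_links(ram_gb, per_link_gb=4.0, reserve_gb=1.0):
    """Estimate how many link jobs fit in RAM without swapping.

    reserve_gb is an assumed allowance for the OS and other processes.
    Always returns at least 1, since one link job has to run eventually.
    """
    usable = max(ram_gb - reserve_gb, 0.0)
    return max(int(usable // per_link_gb), 1)


for ram in (7.1, 16, 64):
    print(f"{ram} GB RAM -> about {max_parallel_links(ram)} parallel link job(s)")
```

Under these assumptions a 7.1 GB machine can afford only a single link job at a time, which fits the idea of keeping compilation parallel while serializing the link phase.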

EDIT:
Other related discussion:
nodejs/node#43370 (comment)

@robimarko
Contributor

Well, now with node disabled none of the targets timed out

@nxhack
Contributor

nxhack commented Mar 9, 2025

I understand that the node.js v8 engine is a huge system, and that it would disrupt the openwrt ecosystem.
At the very least, I will create a package that installs only the pre-built binaries for node/host, in order to rescue the packages that use node-yarn/host. I will withdraw all node packages except node-yarn.
Just to confirm, are all the host builds for buildbot linux x86_64?

@ynezz
Member

ynezz commented Mar 9, 2025

> Just to confirm, are all the host builds for buildbot linux x86_64?

Yes

> I will withdraw all node packages except node-yarn.

Ok. Something to consider, the other option is to make it build as previous node.js version, thus not causing havoc on the builders. IMO nodejs/node#45949 (comment) looks like a root cause/workaround, probably worth exploring if we want to keep all node.js packages.

@nxhack
Contributor

nxhack commented Mar 9, 2025

> Ok. Something to consider, the other option is to make it build as previous node.js version, thus not causing havoc on the builders.

Because node.js is a large system, it is frequently updated, and the latest version is the stable version.

> IMO nodejs/node#45949 (comment) looks like a root cause/workaround,

I haven't tested it, but I think the build time will be extremely slow.

> probably worth exploring if we want to keep all node.js packages.

I have a custom repository that is small in scale, but I am ready to distribute pre-built packages.

There is certainly a need for OpenWrt and node.js in the area of Home IoT. However, even in my environment, building consumes resources.

nxhack added a commit to nxhack/packages that referenced this issue Mar 9, 2025
openwrt#26078

As a result of the discussion in this thread, the node.js package was changed to hostpkg only.
In addition, this fix uses the pre-built version distributed on nodejs.
This process was suggested by @artynet.

The packages in the node module are successfully built, but the target node.js itself cannot be provided, so it cannot be used.

Yarn, which is used in packages for web front ends, etc., can be used without any problems.

Signed-off-by: Hirokazu MORIKAWA <[email protected]>
nxhack added a commit to nxhack/packages that referenced this issue Mar 10, 2025
openwrt#26078
As a result of the discussion in this thread, the node.js package was changed to hostpkg only.
In addition, this fix uses the pre-built version distributed on nodejs.
The use of pre-build is based on the suggestion of @artynet.

The packages in the node module are successfully built, but the target node.js itself cannot be provided, so it cannot be used.

Yarn, which is used in packages for web front ends, etc., can be used without any problems.

Signed-off-by: Hirokazu MORIKAWA <[email protected]>
nxhack added a commit to nxhack/packages that referenced this issue Mar 11, 2025
openwrt#26078
As a result of the discussion in this thread, the node.js package was changed to hostpkg only.
In addition, this fix uses the pre-built version distributed on nodejs.
The use of pre-build is based on the suggestion of @artynet.

The packages in the node module are successfully built, but the target node.js itself cannot be provided, so it cannot be used.

Yarn, which is used in packages for web front ends, etc., can be used without any problems.

Support for host builds other than linux x86_64.

Signed-off-by: Hirokazu MORIKAWA <[email protected]>