Bloop server sometimes becomes unresponsive #2594

Open
harpocrates opened this issue Jan 31, 2025 · 20 comments

@harpocrates

Sorry for the very generic title - I'm just a bit at a loss as to what information would be most helpful to collect.

The failure

The failure mode is weird. The Bloop server becomes very slow to respond to requests, and at some point it seems to stop responding entirely:

  • Metals eventually catches on (warns you that your compile has been taking a really long time), and marks the Bloop icon in the corner as not responding anymore
  • My custom Bloop clients also just stop getting responses
  • Even just running bloop projects from the CLI becomes extremely slow or hangs indefinitely!

My setup that runs into this...

Some possibly useful context:

  • There is often more than one project open
  • There are usually several Bloop clients per project
    • One of those is Metals
    • Others are coming from some custom code. They:
      • list out build targets
      • trigger compilation
      • sometimes hotload some of those compiled outputs
      • these are periodically restarted and probably sometimes drop the Bloop connection without a proper shutdown. That's something I plan on fixing, but I would expect Bloop as a whole to be resilient to this sort of thing...
    • I would expect these clients to be requesting compilation at around the same time (they are all indirectly triggered by file system events); a rough sketch of what one of these custom clients does appears right after this list
  • I am programmatically changing Bloop JSON configs. As an extra sanity check, I also restart Metals at the same time.
  • Issues start only after a number of connections (or re-connections) have occurred
  • This all runs without internet access, though I'm doubtful this is the issue (if I stick to a single project and don't run any custom stuff, Bloop works just fine!)
  • Anecdotally, I only noticed this after bumping to the latest Bloop version 2.0.8 and Metals 1.5.0 at the beginning of last week
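
For reference, here's a rough sketch of what one of those custom clients does, using the bsp4j bindings. This is a simplified illustration rather than my actual code: the class/function names and root URI are made up, and it assumes in/out are already wired up to a bloop bsp session.

import java.io.{InputStream, OutputStream}
import java.util.concurrent.Executors
import scala.collection.JavaConverters._
import ch.epfl.scala.bsp4j._
import org.eclipse.lsp4j.jsonrpc.Launcher

// Client that ignores all notifications coming back from the server
class NoopBuildClient extends BuildClient {
  def onBuildShowMessage(p: ShowMessageParams): Unit = ()
  def onBuildLogMessage(p: LogMessageParams): Unit = ()
  def onBuildTaskStart(p: TaskStartParams): Unit = ()
  def onBuildTaskProgress(p: TaskProgressParams): Unit = ()
  def onBuildTaskFinish(p: TaskFinishParams): Unit = ()
  def onBuildPublishDiagnostics(p: PublishDiagnosticsParams): Unit = ()
  def onBuildTargetDidChange(p: DidChangeBuildTarget): Unit = ()
}

def listAndCompile(in: InputStream, out: OutputStream): Unit = {
  val launcher = new Launcher.Builder[BuildServer]()
    .setRemoteInterface(classOf[BuildServer])
    .setInput(in)
    .setOutput(out)
    .setLocalService(new NoopBuildClient)
    .setExecutorService(Executors.newCachedThreadPool())
    .create()
  launcher.startListening()
  val server = launcher.getRemoteProxy

  server.buildInitialize(new InitializeBuildParams(
    "my-custom-client", "0.1.0", "2.1.0", "file:///home/alec/projects/MyProject1/",
    new BuildClientCapabilities(List("scala").asJava))).get()
  server.onBuildInitialized()

  // List build targets, then trigger compilation of all of them
  val targets = server.workspaceBuildTargets().get().getTargets.asScala.map(_.getId)
  server.buildTargetCompile(new CompileParams(targets.asJava)).get()

  // The step the restarted clients sometimes skip today: a clean shutdown
  server.buildShutdown().get()
  server.onBuildExit()
}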

Some logs

(Anything else worth collecting here?)

Metals logs when I try to reconnect to Bloop as prompted
2025.01.31 09:31:49 ERROR Failed to connect with build server, no functionality will work.
java.lang.RuntimeException: Timeout (10 seconds) while waiting for BSP socket to be created in /home/alec/.local/share/scalacli/bloop/bsp/8637-1
	at scala.sys.package$.error(package.scala:27)
	at bloop.rifle.internal.Operations$$anon$1.openSocket(Operations.scala:422)
	at scala.meta.internal.metals.BloopServers.create$1(BloopServers.scala:384)
	at scala.meta.internal.metals.BloopServers.openConnection$1(BloopServers.scala:400)
	at scala.meta.internal.metals.BloopServers.$anonfun$connect$5(BloopServers.scala:413)
	at scala.concurrent.Future$.$anonfun$apply$1(Future.scala:687)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:467)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.lang.Thread.run(Thread.java:840)
Bloop daemon logs
Bloop server PID: 2638
Ignoring SIGINT
Will truncate output file /home/alec/.local/share/scalacli/bloop/daemon/output every 5 minutes
Unable to load nailgun-version.properties.
NGServer [UNKNOWN] started on local socket /home/alec/.local/share/scalacli/bloop/daemon/socket.
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
/home/alec/projects/MyProject1
/home/alec/projects/MyProject1
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
09:15:17.873 [NGSession 10: (idle) (NGCommunicator orchestrator)] ERROR c.martiansoftware.nailgun.NGServer - Nailgun client read future raised an exception
java.net.SocketException: Connection reset
	at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
	at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
	at libdaemonjvm.Util$$anon$1$$anon$2.read(Util.scala:19)
	at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
	at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:107)
	at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:101)
	at java.base/java.io.InputStream.read(InputStream.java:218)
	at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
	at java.base/java.io.DataInputStream.readInt(DataInputStream.java:381)
	at com.martiansoftware.nailgun.NGCommunicator.readChunkImpl(NGCommunicator.java:497)
	at com.martiansoftware.nailgun.NGCommunicator.readChunk(NGCommunicator.java:480)
	at com.martiansoftware.nailgun.NGCommunicator.lambda$null$0(NGCommunicator.java:209)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:840)
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/projects/MyProject2
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
tgodzik commented Jan 31, 2025

That looks like we might have a memory leak. Were you able to check the Bloop JVM with VisualVM or similar?

That behaviour is usually connected to heavy GC activity taking up most of the JVM's processing time.
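
For example (assuming the JDK tools are on the PATH), something like this shows whether GC is eating most of the time:

jps | grep Bloop            # find the Bloop server PID
jstat -gcutil <pid> 1000    # GC utilization columns, sampled every second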

tgodzik commented Feb 3, 2025

sometimes hotload some of those compiled outputs

Maybe this is causing the memory leak; we might never have optimized it properly.

Anecdotally, I only noticed this after bumping to the latest Bloop version 2.0.8 and Metals 1.5.0 at the beginning of last week

I don't see any changes that could cause it in Bloop. Is it possible that copying resources to output dirs could have caused it? 🤔

@harpocrates

I'm continuing to poke at this. Just now I managed to reproduce it again.

Attaching something like VisualVM is a bit awkward because of the environment in which this is running, but I was able to get a JFR profile with:

BLOOP_JAVA_OPTS='-XX:+FlightRecorder -XX:StartFlightRecording=filename=bloop.jfr' bloop start

Looking into that, memory use does not seem problematic; it actually seems pretty low.
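
For anyone else poking at a recording like this, the JDK's jfr tool (shipped with recent JDKs) can inspect it offline:

jfr summary bloop.jfr
jfr print --events jdk.GCHeapSummary bloop.jfr

The first gives an event-count overview; the second prints heap summaries over time, which is how I'd sanity-check memory use.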

However, the last time I managed to reproduce this I got a bit more info: I noticed all interactions with the Bloop server started hanging around 8:55. Metals got stuck importing a build, bloop projects was hanging at the CLI, and I presume my other custom Bloop clients were hanging too. Around 9:00 everything started working again: the bloop projects call that had been hanging completed, and after reloading the VS Code window, Metals came back in a good state. Checking /home/alec/.local/share/scalacli/bloop/daemon/output, I see the following:

...
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject
/home/alec/.local/share/scalacli/bloop
/home/alec/.local/share/scalacli/bloop
08:59:18.190 [NGSession 1: (idle) (NGCommunicator orchestrator)] ERROR c.martiansoftware.nailgun.NGServer - Nailgun client read future raised an exception
java.net.SocketException: Connection reset
        at java.base/sun.nio.ch.SocketChannelImpl.throwConnectionReset(SocketChannelImpl.java:394)
        at java.base/sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:426)
        at libdaemonjvm.Util$$anon$1$$anon$2.read(Util.scala:19)
        at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65)
        at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:107)
        at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:101)
        at java.base/java.io.InputStream.read(InputStream.java:218)
        at java.base/sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:81)
        at java.base/java.io.DataInputStream.readInt(DataInputStream.java:381)
        at com.martiansoftware.nailgun.NGCommunicator.readChunkImpl(NGCommunicator.java:497)
        at com.martiansoftware.nailgun.NGCommunicator.readChunk(NGCommunicator.java:480)
        at com.martiansoftware.nailgun.NGCommunicator.lambda$null$0(NGCommunicator.java:209)
        at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject
/home/alec/lightspeed/MyProject

That error lines up with the time at which bloop was not responding to requests.

Is there a way to increase the verbosity of this log file? (Is there anywhere else I can get better logs?)

tgodzik commented Feb 6, 2025

So it looks like the server connection somehow stopped working, and after the reset everything went back to normal 🤔 I wonder why it took so long for that exception to happen.

Were you able to get any stack traces of Bloop during that time? It might be possible to spot some problematic threads.

@voidcontext

I think I ran into the same issue... Bloop compilation is very slow (both from Metals and from the CLI) and the process's memory usage goes up to 12GB (I guess that's all it can get; it's a 16GB machine). Eventually, when it finishes, memory usage goes down to 6GB. Clean + compile doesn't solve the issue. There are some compilation errors though (ambiguous implicits).

Bloop v2.0.8
Metals v1.5.1

The output log is:

Bloop server PID: 96801
Ignoring SIGINT
Will truncate output file /Users/gaborpihaj/Library/Caches/ScalaCli/bloop/daemon/output every 5 minutes
Unable to load nailgun-version.properties.
NGServer [UNKNOWN] started on local socket /Users/gaborpihaj/Library/Caches/ScalaCli/bloop/daemon/socket.
/Users/gaborpihaj/Library/Caches/ScalaCli/bloop
/Users/gaborpihaj/Library/Caches/ScalaCli/bloop
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame
/Users/gaborpihaj/workspace/work/repos/reponame

voidcontext commented Feb 12, 2025

Hm, I turned off best-effort compilation in Metals, killed all Java processes, and now things seem much quicker; memory usage went back to normal after the compilation finished. I'll keep monitoring the situation.

tgodzik commented Feb 12, 2025

@voidcontext let us know if the issue comes up again. Maybe there is some memory leak for best effort 🤔

@voidcontext

@voidcontext let us know if the issue comes up again. Maybe there is some memory leak for best effort 🤔

@tgodzik I haven't seen this issue since best-effort compilation was turned off. I forgot to mention that I've seen this issue in Scala 3 projects (I don't work on any Scala 2 projects at the moment, so those might be affected too).

@harpocrates

Just flagging that I'm still hitting this constantly. I've taken to programmatically detecting the bad state by polling bloop projects; if it doesn't exit within 5 seconds, I assume Bloop is hung (a minimal version of that check is sketched below).
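
A minimal version of that check (assuming GNU coreutils' timeout is available; the message is just illustrative):

if ! timeout 5 bloop projects > /dev/null 2>&1; then
  echo "bloop appears hung" >&2
fi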

I'm only seeing this with Scala 2.12.17 at the moment.

tgodzik commented Feb 25, 2025

Did you manage to get stack traces of Bloop when the hang occurs? I wasn't able to reproduce the issue.

@harpocrates

@tgodzik Yeah, here's a jstack dump. It is very large - not sure if you have a way to cut it down to the interesting stuff.

  • I noticed the hanging behaviour, confirmed by bloop projects also hanging
  • ran jps to find the Bloop server PID and then ran jstack $PID (roughly the commands sketched below)
  • I pulled the jstack output from the daemon logs
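
Roughly, the commands were along these lines:

BLOOP_PID=$(jps | awk '/BloopServer/ {print $1}')
jstack "$BLOOP_PID"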

jstack_dump.txt

tgodzik commented Feb 26, 2025

One thing I noticed is that there are 18 CLI commands waiting, which is unexpected since there is rarely more than one. Is it possible that they are not cancelled when you cancel the command-line request? I also have a few on my local instance, but that's only 4 after a couple of days of running 🤔

There are also 12 processes waiting on Forker.scala, which is also unexpected because that's usually tests or main classes being run that way. So that would make it 12 processes not finishing. Any idea whether there are any child processes there?

@harpocrates

Is there a way to list out those active requests? The only CLI invocations that should be made are:

  • bloop projects (my poor man's healthcheck - runs every 5 seconds)
  • bloop bsp (runs once per connection)

The processes are probably mostly my Scala programs, which themselves set up Bloop clients, but the programs are spawned with bloop run. Is using bloop run this way for long-running processes a bad idea?

@harpocrates

Here's another example with fewer processes running. Taken at the same time, here's the jps output:

35523 Jps
34469 MyLanguageServer
2666 BloopServer
2874 MyLanguageServer
2875 MyLanguageServer
34203 Main
34411 MyLanguageServer

Here, I expect every MyLanguageServer process to have a BSP client connected to BloopServer. All of them would be started with bloop run, on a Bloop config that has no sources, just a classpath and JVM options (roughly the shape sketched below). Main is Metals.
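
For reference, such a sources-less config looks roughly like this (simplified sketch, paths and names made up; field names are from memory, so check them against the Bloop config schema rather than trusting this verbatim):

{
  "version": "1.4.0",
  "project": {
    "name": "my-language-server-runner",
    "directory": "/home/alec/projects/MyProject1",
    "workspaceDir": "/home/alec/projects/MyProject1",
    "sources": [],
    "dependencies": [],
    "classpath": ["/path/to/my-language-server.jar", "/path/to/some-dependency.jar"],
    "out": "/home/alec/projects/MyProject1/.bloop/my-language-server-runner",
    "classesDir": "/home/alec/projects/MyProject1/.bloop/my-language-server-runner/classes",
    "platform": {
      "name": "jvm",
      "config": { "home": "", "options": ["-Xmx2g"] },
      "mainClass": ["mycompany.MyLanguageServer"]
    }
  }
}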

jstack_dump2.txt

tgodzik commented Feb 27, 2025

The processes are probably mostly my Scala programs, which themselves set up Bloop clients, but the programs are spawned with bloop run. Is using bloop run this way for long-running processes a bad idea?

Not entirely sure, but it seems there is a lot of IO activity forwarding output from those programs. It might be worth not running them that way and seeing whether that is responsible for the freezes.

Is there a way to list out those active requests? The only CLI invocations that should be made are:
bloop projects (my poor man's healthcheck - runs every 5 seconds)

That would explain the additional entries; they will most likely hang as well when there is an issue. Does the issue happen if that health check is not being run? Metals already sends requests to make sure Bloop is running and will change its status in case of any problems.

Both of the above could just be symptoms though; I can't find anything else going on aside from the BSP connection.

harpocrates commented Feb 27, 2025

Not entirely sure, but it seems there is a lot of IO activity forwarding output from those programs. It might be worth not running them that way and seeing whether that is responsible for the freezes.

I think that does avoid the issue. However, it is quite annoying in that I now need to recreate a potentially very long java command line.

Given that, plus your comment about the IO activity: is there a good way to get Bloop to just dump the invocation it would make instead of actually running it? Then I could run it directly.

Does the issue happen if that health check is not being run?

Yes, it does.

tgodzik commented Feb 28, 2025

I think that does avoid the issue. However, it is quite annoying in that I now need to recreate a potentially very long java command line.

Ok, I wonder whether we can fix that or whether the only option is not to run multiple apps. I will try to figure it out.

Metals actually runs plain Java whenever only the run option is selected. We generate classpath JARs in the .metals directory to avoid issues with overly long commands. I don't think it's easy to do from the command line.

@harpocrates

I ended up generating manifest JARs (much like I think Metals does): a short-lived Bloop client makes a couple of queries to determine the runtime classpath and JVM options, then crafts the manifest JAR and the command line for running Java. That MyLanguageServer then gets spawned directly instead of going via bloop run.
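
The manifest-JAR step itself is small. A sketch of it (simplified, not the exact code; the main class name is made up) using java.util.jar:

import java.nio.file.{Files, Path}
import java.util.jar.{Attributes, JarOutputStream, Manifest}

// Writes a "pathing" JAR whose manifest Class-Path points at the real classpath,
// which keeps the eventual java command line short.
def writeManifestJar(classpath: Seq[Path], out: Path): Path = {
  val manifest = new Manifest()
  val attrs = manifest.getMainAttributes
  attrs.put(Attributes.Name.MANIFEST_VERSION, "1.0")
  // Class-Path entries are space-separated URLs; absolute file URIs are used here
  attrs.put(Attributes.Name.CLASS_PATH, classpath.map(_.toUri.toString).mkString(" "))
  val jarStream = new JarOutputStream(Files.newOutputStream(out), manifest)
  jarStream.close()
  out
}

// The server is then launched directly, bypassing bloop run, roughly:
//   java <jvm options> -cp manifest.jar mycompany.MyLanguageServer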

Now, I no longer get hanging behaviour.


Prior to this, I also confirmed that the hang occurred in the bloop run-launched MyLanguageServer while it was waiting to read input from bloop bsp (it waits until it sees the message about the server being ready to connect).


Fingers crossed, this will put the issue to bed. If I don't run into problems in the next two weeks, I'll come back and close this (or feel free to close it for me). I do think there's something buggy somewhere (probably in bloop run, since it also seems to be one of the less widely used parts of Bloop), but I'm unlikely to have time to chase it further here.

tgodzik commented Mar 5, 2025

I do think there's something buggy somewhere (probably in bloop run, since it also seems to be one of the less widely used parts of Bloop), but I'm unlikely to have time to chase it further here.

Looks that way, but at least this should be somewhat reproducible and we have a stack trace. I wonder whether there is maybe a limit on the number of threads somewhere?

tgodzik commented Mar 5, 2025

It doesn't seem to be bounded, so it should not cause any issues as far as I can see: https://github.com/scalacenter/bloop/blob/main/frontend/src/main/scala/bloop/engine/ExecutionContext.scala#L50
