perf: context deadline exceeded using Go driver on bblfshd #209
Comments
Did not get to reproduce this yet, but it would be interesting to see the output of the …
@erizocosmico could you give @bzz the requested info, please?
Sure @ajnavarro
This was just after it started happening. If you wait a little bit longer, not even those commands are responsive. After this happens, not even … /cc @bzz
@erizocosmico Could you do …
Sure @smola
@smola
I can reproduce this by sending N (= NumCPU) concurrent requests for huge files that take forever to parse; all subsequent requests then fail with context deadline exceeded. The same thing might be happening to gitbase: huge files accumulate in parse requests and end up clogging all driver instances. The unexpected timeouts were supposed to be fixed here: #227 I think the other part would be to limit the file size we send from gitbase to bblfsh. That is, any file above X size immediately returns NULL UAST, with the limit being configurable. I already discussed this with @ajnavarro; I'm not sure if it's done.
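As a rough illustration of that idea, here is a minimal sketch of a size gate on the sending side; maxParseSize, uastOrNil, and the parse callback are hypothetical names for illustration, not gitbase's actual API:

package main

// maxParseSize is a hypothetical, configurable limit on what would be sent
// to bblfsh; anything larger is skipped instead of being parsed.
const maxParseSize = 1 << 20 // 1 MiB, an arbitrary example value

// uastOrNil returns nil (a NULL UAST) for oversized files instead of
// forwarding them to bblfshd, where they could clog the driver pool.
func uastOrNil(content []byte, parse func([]byte) ([]byte, error)) ([]byte, error) {
	if len(content) > maxParseSize {
		return nil, nil // NULL UAST, no error: the result is deterministic
	}
	return parse(content)
}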
EDIT: I'm using bblfshd 2.11.0, which was supposed to fix the timeouts problem, so either the issue is not fixed or this is a different one. By the way, for quick reproduction, you can use this code (warning: it will saturate all your CPUs!):

package main
import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"runtime"
	"sync"
	"sync/atomic"

	bblfsh "gopkg.in/bblfsh/client-go.v3"
)

func main() {
	// Fetch a very large Go source file to use as the parse payload.
	url := `https://raw.githubusercontent.com/src-d/enry/master/data/frequencies.go`
	resp, err := http.DefaultClient.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	bs, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	code := string(bs)

	client, err := bblfsh.NewClient("0.0.0.0:9432")
	if err != nil {
		panic(err)
	}

	wg := sync.WaitGroup{}
	var count int32
	// Spawn 2x NumCPU workers, each hammering bblfshd with parse requests forever.
	for i := 0; i < runtime.NumCPU()*2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // unreachable in practice: the loop below never exits
			for {
				// Use a per-iteration err to avoid a data race on a shared variable.
				_, _, err := client.
					NewParseRequest().
					Context(context.Background()).
					Language("go").
					Content(code).
					UAST()
				if err != nil {
					fmt.Println(err)
				}
				newCount := atomic.AddInt32(&count, 1)
				if newCount%100 == 0 {
					fmt.Printf("Requests: %d\n", newCount)
				}
			}
		}()
	}
	wg.Wait()
}
CPU saturation can cause structural problems for gRPC even if there are no explicit deadlines. If the machine is so CPU-starved that it can't keep up with channel maintenance (e.g., health checks), the channel can be closed. I haven't tried it in this case (it's very difficult to arrange in isolation), but this is why it's important not to let the client demand unbounded work, and instead to shed load, proxy, or queue. The fact that this happens under heavy CPU demand suggests it is something we should be thinking about in the design of our servers.
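As a sketch of what shedding load could look like at the gRPC layer (this is not bblfshd's actual code, just a generic unary interceptor with a bounded pool of concurrency slots):

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shedLoad rejects requests immediately once maxConcurrent handlers are busy,
// instead of queueing unbounded work while the CPU is already saturated.
func shedLoad(maxConcurrent int) grpc.UnaryServerInterceptor {
	sem := make(chan struct{}, maxConcurrent)
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			return handler(ctx, req)
		default:
			// An identifiable, immediate error the client can react to.
			return nil, status.Error(codes.ResourceExhausted, "server at capacity, retry later")
		}
	}
}

// Usage: grpc.NewServer(grpc.UnaryInterceptor(shedLoad(runtime.NumCPU())))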
Actually, if the CPU is saturated, we may want to set up bblfshd with fewer driver instances in the pool.

Other than that, I'd consider it reasonable that if bblfshd cannot accept more work (e.g. a fixed-size request queue is full), the request returns an error right away, and that the error is easily identifiable by the client, so clients can decide whether to retry with backoff, reduce parallelism, etc.

I recommended that the gitbase team limit file size on their side for UAST parsing. This can mitigate some problems in a predictable way for the user: the user controls the limit and knows what to expect. It may also be the case that some files, such as minified or obfuscated code, are too expensive to process even within the size limit. I think it would be ideal if bblfshd errors in this situation were deterministic, not based on time. For example, if we consider that processing a UAST with more than 10k nodes is too much, then once we have processed 10k nodes we could stop and return an error; see the sketch below.

My point is that for applications like gitbase it's important that results are deterministic. If, given a certain setup, bblfshd always returns an error for a given file, that's OK. If it depends on the load, it's a mess: the only way for gitbase to provide determinism would be to fail the full query whenever there is a timeout after a few retries, and timeouts are happening often enough for that not to be usable.
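A deterministic cutoff could look roughly like this; Node is a stand-in type (the real UAST node type lives in the bblfsh SDK) and the 10k figure is just the example from above:

package main

import "errors"

// Node is a stand-in for a UAST node; the real type lives in the bblfsh SDK.
type Node struct {
	Children []*Node
}

var ErrTooManyNodes = errors.New("uast exceeds the configured node limit")

// checkNodeLimit walks the tree and aborts as soon as the limit is crossed.
// The outcome depends only on the input and the limit, never on server load
// or wall-clock time, so a given file either always parses or always errors.
func checkNodeLimit(n *Node, limit int, seen *int) error {
	if n == nil {
		return nil
	}
	*seen++
	if *seen > limit {
		return ErrTooManyNodes
	}
	for _, c := range n.Children {
		if err := checkNodeLimit(c, limit, seen); err != nil {
			return err
		}
	}
	return nil
}

// Usage: var seen int; err := checkNodeLimit(root, 10000, &seen)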
Agreed. I think there are two pieces we need: one is for the server to push back when there is too much work, so that the client can tell it's requesting too much and back off; the other is for the client to actually make use of that signal. For the short term we can probably work around this by forcing the clients to rate-limit themselves (e.g., to some fixed QPS). Making the server behaviour more deterministic will be an important OKR for the LA team in Q1, and then we should be able to do something less hacky.
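For the short-term client-side workaround, a fixed-QPS limiter in front of the parse calls could be as simple as the sketch below, using golang.org/x/time/rate; the 10 QPS figure and the parse callback are arbitrary examples, not an agreed value or an existing API:

package main

import (
	"context"

	"golang.org/x/time/rate"
)

// limiter allows a fixed number of parse requests per second with a small burst.
var limiter = rate.NewLimiter(rate.Limit(10), 5) // 10 QPS, burst of 5

// parseWithLimit blocks until the limiter grants a slot (or ctx expires),
// then runs the actual parse call supplied by the caller.
func parseWithLimit(ctx context.Context, parse func(context.Context) error) error {
	if err := limiter.Wait(ctx); err != nil {
		return err
	}
	return parse(ctx)
}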
Yes, limiting file size is probably a good start; generated or transpiled code often has weird properties. But I think we should also consider how to make the API more forgiving. As far as I know, this limit is not about Babelfish running out of RAM, but about the artificial limits gRPC places on message size in the channel. I think the "right" solution is an API that allows chunked requests/responses, so that large requests can be passed without having to worry about gRPC. However, that will take time, so for now I think @smola's suggestion is the right practical step.
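Until a chunked API exists, the gRPC per-message limits can at least be raised on the raw connection, roughly as below; whether client-go exposes a way to pass these dial options through is an assumption, and the 100 MiB values are arbitrary examples:

package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Raise the default 4 MiB receive limit so large UAST responses fit in one message.
	conn, err := grpc.Dial("0.0.0.0:9432",
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(100*1024*1024), // 100 MiB responses
			grpc.MaxCallSendMsgSize(100*1024*1024), // 100 MiB requests
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... use conn with the bblfsh protocol client here.
}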
I agree completely. I'd rather have results that are deterministic and wrong than non-deterministic ones. It's hard to program clients when you have to port backoff and timing heuristics from one to another. I feel like we should iron out these issues ASAP.
We're using bblfshd extensively to process files in gitbase, which is a multi-threaded process. After a while of processing, bblfshd becomes unresponsive and starts returning "context deadline exceeded" errors, and it does not recover from there. Sometimes it even makes Docker itself hang.
Using a single core does not fully mitigate the problem, but bblfshd does recover after a few errors and continues.
How to reproduce this:
Run the following script: