perf: context deadline exceeded using Go driver on bblfshd #209
Comments
Did not get to reproduce this yet, but it would be interesting to see the output of the …
@erizocosmico could you give @bzz the requested info, please?
Sure @ajnavarro
This was just after it started happening. If you wait a little bit longer, not even those commands are responsive. After this happens, not even … /cc @bzz
@erizocosmico Could you do …
Sure @smola
@smola
I can reproduce this by sending N (= NumCPU) concurrent requests for huge files that take forever to parse; all subsequent requests then fail with context deadline exceeded. The same thing might be happening to gitbase: huge files accumulate in parse requests and end up clogging all driver instances. The unexpected timeouts were supposed to be fixed here: #227 I think the other part would be to limit the file size we send from gitbase to bblfsh. That is, any file above X size immediately returns NULL UAST, with the limit being configurable. I already discussed this with @ajnavarro; I'm not sure if it's done.
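As a rough illustration of that idea, here is a minimal sketch of a size gate on the sending side; maxParseSize, uastOrNil, and the parse callback are hypothetical names for illustration, not gitbase's actual API:

package main

// maxParseSize is a hypothetical, configurable limit on what would be sent
// to bblfsh; anything larger is skipped instead of being parsed.
const maxParseSize = 1 << 20 // 1 MiB, an arbitrary example value

// uastOrNil returns nil (a NULL UAST) for oversized files instead of
// forwarding them to bblfshd, where they could clog the driver pool.
func uastOrNil(content []byte, parse func([]byte) ([]byte, error)) ([]byte, error) {
	if len(content) > maxParseSize {
		return nil, nil // NULL UAST, no error: the result is deterministic
	}
	return parse(content)
}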
EDIT: I'm using bblfshd 2.11.0, which was supposed to fix the timeouts problem, so either the issue is not fixed or this is a different one. By the way, for quick reproduction, you can use this code (warning: it will saturate all your CPUs!):

package main
import (
	"context"
	"fmt"
	"io/ioutil"
	"net/http"
	"runtime"
	"sync"
	"sync/atomic"

	bblfsh "gopkg.in/bblfsh/client-go.v3"
)

func main() {
	// Fetch a very large Go source file to use as the parse payload.
	url := `https://raw.githubusercontent.com/src-d/enry/master/data/frequencies.go`
	resp, err := http.DefaultClient.Get(url)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	bs, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	code := string(bs)

	client, err := bblfsh.NewClient("0.0.0.0:9432")
	if err != nil {
		panic(err)
	}

	wg := sync.WaitGroup{}
	var count int32
	// Spawn 2x NumCPU workers, each hammering bblfshd with parse requests forever.
	for i := 0; i < runtime.NumCPU()*2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done() // unreachable in practice: the loop below never exits
			for {
				// Use a per-iteration err to avoid a data race on a shared variable.
				_, _, err := client.
					NewParseRequest().
					Context(context.Background()).
					Language("go").
					Content(code).
					UAST()
				if err != nil {
					fmt.Println(err)
				}
				newCount := atomic.AddInt32(&count, 1)
				if newCount%100 == 0 {
					fmt.Printf("Requests: %d\n", newCount)
				}
			}
		}()
	}
	wg.Wait()
}
CPU saturation can cause structural problems for gRPC even if there are no explicit deadlines. If the machine is so CPU-starved that it can't keep up with channel maintenance (e.g., health checks), the channel can be closed. I haven't tried it in this case (it's very difficult to arrange in isolation), but this is why it's important not to let the client demand unbounded work, and instead to shed load, proxy, or queue. The fact that this happens under heavy CPU demand suggests it is something we should be thinking about in the design of our servers.
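As a sketch of what shedding load could look like at the gRPC layer (this is not bblfshd's actual code, just a generic unary interceptor with a bounded pool of concurrency slots):

package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shedLoad rejects requests immediately once maxConcurrent handlers are busy,
// instead of queueing unbounded work while the CPU is already saturated.
func shedLoad(maxConcurrent int) grpc.UnaryServerInterceptor {
	sem := make(chan struct{}, maxConcurrent)
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			return handler(ctx, req)
		default:
			// An identifiable, immediate error the client can react to.
			return nil, status.Error(codes.ResourceExhausted, "server at capacity, retry later")
		}
	}
}

// Usage: grpc.NewServer(grpc.UnaryInterceptor(shedLoad(runtime.NumCPU())))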
Actually, if the CPU is saturated, we may want to set up bblfshd with fewer driver instances in the pool.

Other than that, I'd consider it reasonable that if bblfshd cannot accept more work (e.g. a fixed-size request queue is full), the request returns an error right away, and that the error is easily identifiable by the client, so clients can decide whether to retry with backoff, reduce parallelism, etc.

I recommended that the gitbase team limit file size on their side for UAST parsing. This can mitigate some problems in a predictable way for the user: the user controls the limit and knows what to expect. It may also be the case that some files, such as minified or obfuscated code, are too expensive to process even within the size limit. I think it would be ideal if bblfshd errors in this situation were deterministic, not based on time. For example, if we consider that processing a UAST with more than 10k nodes is too much, then once we have processed 10k nodes we could stop and return an error; see the sketch below.

My point is that for applications like gitbase it's important that results are deterministic. If, given a certain setup, bblfshd always returns an error for a given file, that's OK. If it depends on the load, it's a mess: the only way for gitbase to provide determinism would be to fail the full query whenever there is a timeout after a few retries, and timeouts are happening often enough for that not to be usable.
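A deterministic cutoff could look roughly like this; Node is a stand-in type (the real UAST node type lives in the bblfsh SDK) and the 10k figure is just the example from above:

package main

import "errors"

// Node is a stand-in for a UAST node; the real type lives in the bblfsh SDK.
type Node struct {
	Children []*Node
}

var ErrTooManyNodes = errors.New("uast exceeds the configured node limit")

// checkNodeLimit walks the tree and aborts as soon as the limit is crossed.
// The outcome depends only on the input and the limit, never on server load
// or wall-clock time, so a given file either always parses or always errors.
func checkNodeLimit(n *Node, limit int, seen *int) error {
	if n == nil {
		return nil
	}
	*seen++
	if *seen > limit {
		return ErrTooManyNodes
	}
	for _, c := range n.Children {
		if err := checkNodeLimit(c, limit, seen); err != nil {
			return err
		}
	}
	return nil
}

// Usage: var seen int; err := checkNodeLimit(root, 10000, &seen)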
Agreed. I think there are two pieces we need: one is for the server to push back when there is too much work, so that the client can tell it's requesting too much and back off; the other is for the client to actually make use of that signal. For the short term we can probably work around this by forcing the clients to rate-limit themselves (e.g., to some fixed QPS). Making the server behaviour more deterministic will be an important OKR for the LA team in Q1, and then we should be able to do something less hacky.
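For the short-term client-side workaround, a fixed-QPS limiter in front of the parse calls could be as simple as the sketch below, using golang.org/x/time/rate; the 10 QPS figure and the parse callback are arbitrary examples, not an agreed value or an existing API:

package main

import (
	"context"

	"golang.org/x/time/rate"
)

// limiter allows a fixed number of parse requests per second with a small burst.
var limiter = rate.NewLimiter(rate.Limit(10), 5) // 10 QPS, burst of 5

// parseWithLimit blocks until the limiter grants a slot (or ctx expires),
// then runs the actual parse call supplied by the caller.
func parseWithLimit(ctx context.Context, parse func(context.Context) error) error {
	if err := limiter.Wait(ctx); err != nil {
		return err
	}
	return parse(ctx)
}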
Yes, limiting file size is probably a good start; generated or transpiled code often has weird properties. But I think we should also consider how to make the API more forgiving. As far as I know, this limit is not about Babelfish running out of RAM, but about the artificial limits gRPC places on message size in the channel. I think the "right" solution is an API that allows chunked requests/responses, so that large requests can be passed without having to worry about gRPC. However, that will take time, so for now I think @smola's suggestion is the right practical step.
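Until a chunked API exists, the gRPC per-message limits can at least be raised on the raw connection, roughly as below; whether client-go exposes a way to pass these dial options through is an assumption, and the 100 MiB values are arbitrary examples:

package main

import (
	"log"

	"google.golang.org/grpc"
)

func main() {
	// Raise the default 4 MiB receive limit so large UAST responses fit in one message.
	conn, err := grpc.Dial("0.0.0.0:9432",
		grpc.WithInsecure(),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallRecvMsgSize(100*1024*1024), // 100 MiB responses
			grpc.MaxCallSendMsgSize(100*1024*1024), // 100 MiB requests
		),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// ... use conn with the bblfsh protocol client here.
}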
I agree completely. I'd rather have results that are deterministic and wrong than non-deterministic ones. It's hard to program clients when you have to port backoff and timing heuristics from one to another. I feel like we should iron out these issues ASAP.
We're using bblfshd extensively to process files in gitbase, which is a multi-threaded process. After a while of processing, bblfshd becomes unresponsive and starts returning "context deadline exceeded" errors, and it does not recover from there. Sometimes it even makes Docker itself hang.
Using a single core does not fully mitigate the problem, but bblfshd does recover after a few errors and continues.
How to reproduce this:
Run the following script: