Very long query time with many topics and consumer groups #41

Open · panda87 opened this issue Oct 30, 2017 · 19 comments
@panda87 commented Oct 30, 2017

Hi

I've been using this plugin for a while, and it worked pretty well while I had a small number of consumers.
Today I added many more consumers and new topics, and two things started to happen:

  1. I started to get failures due to long consumer_ids (I used pznamensky's branch and this fixed it) - can you please merge it, btw?
  2. The HTTP query time increased from 2-3 seconds to 20 seconds, even when I changed max-concurrent-group-queries to 10 - it just affected my CPU cores and increased the load to 500%.

Do you know why this happens?

D.

@zot42 commented Nov 1, 2017

I am seeing the same behavior. It's so bad that Prometheus is skipping it because the scrape of the /metrics endpoint is taking too long. It seems to be related to the number of partitions.

@JensRantil (Collaborator)

pznamensky's branch and this fixed it

@panda87 Would you mind sharing a link to his fix/branch and I can look into bringing it into master here? If you could submit a PR that would be even better! 🙏

@JensRantil (Collaborator)

I am seeing the same behavior. It's so bad that Prometheus is skipping it because the scrape of the /metrics endpoint is taking too long.

@zot42 FYI, you can increase that timeout in Prometheus, though.
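
For reference, here is a minimal sketch of what that could look like in prometheus.yml; the job name, target address, and exact values are assumptions, but scrape_timeout (which must not exceed scrape_interval) is the setting that controls how long Prometheus waits for the /metrics response:

    scrape_configs:
      - job_name: 'kafka-consumer-group-exporter'    # job name is an assumption
        scrape_interval: 60s
        scrape_timeout: 55s                          # raise this if scrapes take ~20s or more
        static_configs:
          - targets: ['localhost:9208']              # replace with the exporter's actual host:port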

@JensRantil (Collaborator)

Do you know why this happens?

@panda87 Would you mind measuring how long it takes to list your topics using kafka-consumer-groups.sh, as well as querying the lag? I've also created #47, which would help diagnose issues like yours.
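
In case it helps, a rough sketch of that measurement using the stock Kafka CLI (the broker address and group name are placeholders; on older Kafka versions you may also need to pass --new-consumer). The describe call is roughly what the exporter appears to run once per consumer group:

    # Time listing all consumer groups
    time kafka-consumer-groups.sh --bootstrap-server broker:9092 --list

    # Time describing a single group, which prints per-partition offsets and lag
    time kafka-consumer-groups.sh --bootstrap-server broker:9092 --describe --group my-group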

@JensRantil (Collaborator)

Would you mind sharing a link to his fix/branch and I can look into bringing it into master here? If you could submit a PR that would be even better!

@panda87 Never mind. Please ignore my comment; I just saw his PR. ;)

@panda87 (Author) commented Nov 14, 2017

@JensRantil I used his PR, but I still get errors like this:

goroutine 1290463 [running]:
panic(0x7f2880, 0xc420012080)
	/usr/local/go/src/runtime/panic.go:500 +0x1a1
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).parseLine(0xc4200acee0, 0xc4207680b1, 0xe1, 0xc4200ef400, 0x0, 0x10)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:134 +0x4d2
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*regexpParser).Parse(0xc4200acee0, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x0, 0x0, 0x0, 0xa51ce0, 0xc420552550)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:74 +0x189
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*DelegatingParser).Parse(0xc4200fb120, 0xc420768000, 0xed1, 0xc420126e80, 0x77, 0x3, 0xc420768000, 0xed1, 0xc420126e80, 0x77)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/parsing.go:211 +0x8f
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka.(*ConsumerGroupsCommandClient).DescribeGroup(0xc4200e8c60, 0xa58460, 0xc42011d080, 0xc4200777a0, 0xc, 0x1, 0x13bb, 0x0, 0xc420502757, 0xc)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/kafka/collector.go:79 +0x105
github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop.func1(0xc420011780, 0xc4200777a0, 0xc, 0xc42011d0e0, 0xa58460, 0xc42011d080, 0xc42011d140)
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:159 +0x71
created by github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync.(*FanInConsumerGroupInfoClient).describeLoop
	/go/src/github.com/kawamuray/prometheus-kafka-consumer-group-exporter/sync/metrics.go:164 +0x178
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. line: counter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:110"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for lag. line: %scounter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:118"
time="2017-11-13T07:35:37Z" level=warning msg="unable to find current offset field. Line: counter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:125"
time="2017-11-13T07:35:37Z" level=warning msg="unable to parse int for current offset. Line: %scounter_raw_events             0          1               1               0          event-stream-StreamThread-2-consumer-11d0f14e-4b96-41a7-ac49-eb389ca75e58/172.40.102.56                 event-stream-StreamThread-2-consumer" source="parsing.go:130"
panic: runtime error: index out of range

@JensRantil (Collaborator)

@panda87 That looks like a different issue than this. Please open a new issue (and specify which version/commit you are running).

@panda87 (Author) commented Nov 14, 2017

Ok, I will create a new issue.

panda87 closed this as completed Nov 14, 2017
panda87 reopened this Nov 14, 2017
@panda87 (Author) commented Nov 14, 2017

It seems that after I pulled the repo with the latest changes I no longer get the errors above, so thanks! Now it's only the response time, which is still high.

@JensRantil (Collaborator)

@panda87 Good! I know I saw that error when I was recently revamping some of the parsing logic.

@cl0udgeek

I noticed the response time being high too... I wonder if it's actually Kafka that is taking a long time to run, rather than Prometheus...

@JensRantil (Collaborator)

I noticed the response time being high too... I wonder if it's actually Kafka that is taking a long time to run, rather than Prometheus...

I'm pretty sure it is. #47 will help us tell whether that's the case.

@cl0udgeek

Any update on this one?

@JensRantil (Collaborator)

Unfortunately not. Pull requests to fix #47 are welcome. I've been pretty busy lately and haven't had time to get back to this 😥

@cl0udgeek

Any update on this?

@JensRantil (Collaborator)

Unfortunately not.

@JensRantil (Collaborator)

Might be worth mentioning that I had a colleague who claimed lag is now exposed through JMX. A workaround might be to have a look at using jmx_exporter instead of this exporter.
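
In case it's useful, a rough sketch of that workaround: the consumer-side lag metric lives in each consumer application's own JVM, so jmx_exporter would typically be attached there as a javaagent. The jar path, port, and rules file below are assumptions:

    # Attach jmx_exporter to the consumer application's JVM (paths and port are placeholders)
    java -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/consumer-jmx-rules.yml \
         -jar my-consumer-app.jar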

@cl0udgeek commented Feb 14, 2018

Wait, what? That'd be awesome if it is! Do you know which Kafka version? To clarify... it's always been there on the consumer side, but not on the server side as far as I know.

@raindev commented Feb 14, 2018

@k1ng87,

If you're interested in consumer lag, it's published via JMX by the consumer:

Old consumer: kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+)

New consumer: kafka.consumer:type=consumer-fetch-manager-metrics,client-id={client-id} Attribute: records-lag-max

Replication lag is published by the broker:

kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)

See the official Kafka documentation for more details: https://kafka.apache.org/documentation/#monitoring. I checked only version 1.0, the latest one as of now. Hope this helps.
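
As an illustration, a jmx_exporter rule along these lines could turn the new-consumer metric above into a Prometheus gauge. This is a sketch based on jmx_exporter's usual MBean-matching pattern syntax, not a verified config; the Prometheus metric and label names are my own choice:

    rules:
      # kafka.consumer:type=consumer-fetch-manager-metrics,client-id=...  attribute: records-lag-max
      - pattern: 'kafka.consumer<type=consumer-fetch-manager-metrics, client-id=(.+)><>records-lag-max'
        name: kafka_consumer_records_lag_max
        labels:
          client_id: "$1"
        type: GAUGE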
