Improve cost of creating AttributeSets #1379
Conversation
@cijothomas When you get a chance, what is the best way to fix the test failure this PR causes? Specifically, the prometheus sanitation test uses duplicate attributes. Based on previous conversations, the C++ and .NET libraries do not perform de-duplication, and as far as I'm aware the OpenTelemetry spec does not say they should be de-duplicated, which means this behavior might surprise users. Should I modify the test to expect them to be concatenated?
prometheus won't be able to scrape successfully if there is duplication, so maybe this is something to be handled at the prometheus exporter level. @jtescher Do you have suggestions to unblock? Could we just modify the prometheus test to prevent the issue and document that the SDK/Prometheus exporter does not de-dup, and that it is the user's responsibility?
Seems like there are a few places to investigate here, if the instrument
Are we sure about this? The prometheus test seems pretty explicit that it will concatenate values if the keys are duplicated due to sanitation. That is tested and still works after my change. So it seems like prometheus already supports some level of duplication; this test is only making the explicit decision that duplicate keys due to sanitation are allowed but exact duplicate keys are not.
That only addresses performance concerns for users that have a stable set of attributes they use a lot. This PR significantly improves performance for users who do not have a predictable set of attribute key/value combinations.
Excellent improvement! Thanks for the contribution. Assuming other approvers/maintainers are okay with the behavior change, I wholeheartedly support it from a performance perspective.
I tested OTel .NET with actual Prometheus to see how Prometheus would react if de-dup is not done. Below are my findings. The following is output that Prometheus is happily scraping successfully:

This was generated using:

Counter.Add(9.9, new("name", "apple"), new("color", "red"));
Counter.Add(99.9, new("name", "lemon"), new("color", "yellow"));

Now the following is something Prometheus is failing to scrape, with error message "label name "name" is not unique: invalid sample":

This was generated using:

Counter.Add(9.9, new("name", "apple"), new("name", "orange"), new("color", "red")); // no dedup done in sdk..
Counter.Add(99.9, new("name", "lemon"), new("color", "yellow"));

In the case of .NET, neither the SDK nor the prometheus exporter is deduping, so the Prometheus scrape is failing. My personal vote is to not do dedup in prometheus now, and get back to this based on feedback. If there are users affected by this, we should fix the prom exporter. Main point being - let's remove dedup from the SDK and work towards the goals!
Can prometheus scrape this correctly?

Duplicate Label Names: In the first metric, myCounter_total{color="red",name="orange",name="apple"}, you have used the label name "name" twice. Prometheus expects each label name within a metric to be unique. Having duplicate label names is not valid in Prometheus and would likely cause an error during scraping.

Should have just asked AI 🤖
.NET has a bad Prometheus exporter implementation? I wouldn't think we should replicate that 🤕. I agree we want to get good performance, but we should also ensure proper exporter output. This could be offloaded to scrape time.
It's not clear to me why deduping should be opentelemetry's concern rather than a concern of the calling application. What happens when deduping occurs? Should the last value be kept? The first? What is last and first in different scenarios, since all attributes are passed in at the same time? If the calling application passes in the duplicate attributes in a different order in different scenarios, then deduplication logic will split them into different time series, causing hard-to-debug and wrong metrics. Only the consuming application knows how deduping should accurately be done, imo.
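To make the ambiguity concrete, here is a minimal Rust sketch (the instrument and attribute names are made up, and the call shapes are only approximate for the 0.21-era opentelemetry crates):

```rust
use opentelemetry::{global, KeyValue};

fn main() {
    let meter = global::meter("example");
    let counter = meter.u64_counter("my_counter").init();

    // The key "name" is supplied twice. If the SDK de-duplicates, does
    // "apple" or "orange" win? If it does not de-duplicate, the exporter
    // (e.g. Prometheus) has to decide what to emit.
    counter.add(
        10,
        &[
            KeyValue::new("name", "apple"),
            KeyValue::new("name", "orange"),
            KeyValue::new("color", "red"),
        ],
    );
}
```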
While we don't have to follow the limits, there are clear guidelines for how to deal with attributes: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#attribute-limits For metrics, the identity is very important to how values are aggregated.
Also I do want to say @KallDrexx thank you for looking into this; it's very much appreciated. I apologize if I came off a bit strong to start.
In a curiosity vein, for your specific use case do you have a skeleton you could provide? Would it be possible to have a cached set of attributes, or is it entirely dynamic for your network counter?
Those guidelines are far from clear in this scenario as far as I can tell. Not only do they state that those guidelines aren't really applicable to metrics, they dictate how truncation should happen when there is a finite number of allowed attributes, which isn't the case here.
Which is 100% why I think the library doing arbitrary deduplication is bad. If the consumer calls So as a concrete example calling
In my metrics UI I should see 10 for So what happens if the user calls So truncating now means that This is not that crazy to me, because that's exactly how the prometheus exporter test works today. If you look at the test you see that it's assuming that if This leads me to believe that the current prometheus code is correct. If you call That's why I believe that the caller of the SDK should do any de-duplication that involves complete removal of attributes, because they are the only ones who know the valid way to do so. So my next question is: how do other OpenTelemetry SDKs handle this?
So every language library seems to have different considerations, and no one follows that guideline you posted specifically. Maybe that's a bad thing and Rust's current implementation is doing things right and everyone else is wrong, but is it ok to be the only one doing things "right" if it causes a performance hit?
Our application is working with metrics for network I/O on a per-packet basis, which is why we encountered the
You are right! It could be treated as a bug in OTel .NET's Prometheus Exporter. OTel .NET's Prometheus is an experimental component, and there is a chance it'll never be stable released at all (in favor of native OTLP support in Prometheus). And I tried to see how Prometheus's native OTLP ingestion deals with duplication! Surprised with the result! I send it @hdost Any ideas on how the OTel Rust prometheus exporter should handle duplicate attributes? Should we send "name=orange", with a policy of "last-one-wins", OR should we do "name=orange;apple", just like Prometheus itself is doing?
Well, I would agree that arbitrary dedup is bad, and I would repeat what I said before: we could have the exporter (the client from my perspective) have settings which control how dedup is handled. It also wouldn't put the performance hit on the writer but on the subscriber. That's what I was trying to say here:
Whatever we do for resolving this, Prometheus specifically requires uniqueness of labels, so whether to do it is not the question, but how. And there are Processors and Views which can be utilized to make a determination as to how this data is processed. It's the SDK which enables it. And to provide flexibility to users we could certainly expand that.
I'm not going to die on the hill of right vs performance, but I think this comes from me not explaining my position well enough, as that's not what I was arguing. I would also say whatever "right" is should be consistent between implementations, otherwise what's the point of having a standard, but that's going to be decided here. If we decide to go against spec then that's why we have #1297
I will also mention that if we really want duplication logic as part of I'm not necessarily opposed to doing deduplication of keys on the exporting side (which I think may be what you are aiming for). However, to me it seems like:
I haven't looked at the exporter code in these projects yet, but it seems like there's no consistency between implementations currently. So I'm not sure it's actually a deviation from the spec.
I guess "preserving the order in which they are added" is an interesting concept. For example
This is actually technically an edge case in my PR as well. But should these be in the same time series or different time series? If the same, should order be preserved at export time?
I went ahead and updated this PR, because while precomputing the hash is new, I still did it in the vein of storing the attributes in a Vec. So the new changes, which include last-wins deduplication logic, now pass all tests, as can be seen in CI. I still want to do some manual verification since we don't have good test coverage over attributes. The current stress test values are:
The first two runs and the final run are from the commit in this PR as of now. The middle two runs are from the main branch (with the stress test modified to match the out-of-order sorting my PR changes). Averaging these iteration/sec values gives main branch To be sure where the performance difference was coming from, precomputing the hashes with a I also tested this change against our application, which encountered metrics being a bottleneck for us. One function ends up particularly heavy for us: it increments two counters for every network packet we get. This brought the total CPU usage for that function from 11.93% down to 8.24%. While I need to look a bit more at it, a significant amount of the performance gain ended up coming from improved call times to
Thank you again very much for working on this!
… staying under the 2k cardinality.
Ok, I've double-checked everything I wanted to, and things look good. I had to modify the stress test a tiny bit (lower the number of attribute values) since 4 attributes was causing us to go over the 2k cardinality limit. I've updated the PR description with the final benchmarks from my machine. Would be nice if someone is able to verify similar numbers. Interestingly enough, I did an experiment removing the deduplication logic and ended up with consistently lower throughput (only by ~100k-200k iterations per second). That doesn't quite make sense to me, especially since it's not actually detecting duplicates in the stress test. So from my perspective this PR is ready for review. Again, sorry for all the back and forth that went on while I did exploration :)
My laptop is giving a different result:
> Would be nice if someone is able to verify similar numbers.

This is what I see (8 cores, WSL2, 32 GB):
Current PR: ~6,286,000 iterations/sec
Another machine: Throughput: 2,257,200 iterations/sec vs Throughput: 5,180,000 iterations/sec
One last issue and a nit; other than that, looks good!
I see otherwise! Removing the de-dup caused another 10% boost for me... (I cannot think of any reason why it showed lower throughput for you!) Could you try re-running?
```diff
-#[derive(Clone, Default, Debug, Hash, PartialEq, Eq)]
-pub struct AttributeSet(BTreeSet<HashKeyValue>);
+#[derive(Clone, Default, Debug, PartialEq, Eq)]
+pub struct AttributeSet(Vec<HashKeyValue>, u64);
```
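Roughly, the new representation can be pictured with the following self-contained sketch; it is not the PR's actual code (HashKeyValue here is a plain stand-in, and the sort/dedup details in the real implementation may differ):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the SDK's HashKeyValue wrapper around KeyValue.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct HashKeyValue(String, String);

// Sorted Vec of attributes plus the hash computed once at construction.
#[derive(Clone, Default, Debug, PartialEq, Eq)]
pub struct AttributeSet(Vec<HashKeyValue>, u64);

impl AttributeSet {
    fn new(values: &[(&str, &str)]) -> Self {
        let mut vec: Vec<HashKeyValue> = values
            .iter()
            .map(|(k, v)| HashKeyValue(k.to_string(), v.to_string()))
            .collect();

        // Stable sort by key so equal attribute sets compare equal no
        // matter what order the caller supplied them in.
        vec.sort_by(|a, b| a.0.cmp(&b.0));

        // "Last one wins" de-duplication: reverse so the last-provided
        // entry for each key comes first, keep the first of each run of
        // equal keys, then restore ascending key order.
        vec.reverse();
        vec.dedup_by(|a, b| a.0 == b.0);
        vec.reverse();

        // Pre-compute the hash once, at construction time.
        let mut hasher = DefaultHasher::new();
        vec.hash(&mut hasher);
        let hash = hasher.finish();

        AttributeSet(vec, hash)
    }
}

fn main() {
    let set = AttributeSet::new(&[("name", "apple"), ("name", "orange"), ("color", "red")]);
    // "name" resolves to "orange" (last wins); the trailing u64 is the cached hash.
    println!("{set:?}");
}
```

The key point is that sorting, de-duplication, and hashing all happen once, at construction time.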
Trying to see how caching the hash improves speed... I believe this is because it can now be re-used when we look up the timeseries in here?
It'd help if we needed the hash calculation more than once, but we don't. So in this PR, the hash calculation is simply moved from here to the AttributeSet creation, so I am not able to see how that would have helped with the perf improvement!
Or are the perf gains purely from the BTree->Vec switch?
In other words, we should not make this change, as taking up more memory to store/cache the hash only benefits us if lookup/insert is needed more than once, and from my understanding, we don't have the lookup-more-than-once requirement.
Putting a breakpoint on Hash shows there are several hidden hash calls within the HashMap. For example:
So when the hash map grows, it has to re-hash all the time series in it. This may be why we see a 500% increase in stress test performance but a more muted (though still impactful) improvement in real-life applications. Precaching hashes improves this test case because we have 1k+ time series that sometimes need to be rehashed.
Eventually that should stabilize, though; after a few thousand iterations I no longer see reserve calls requesting hashes, but I also don't see a significant change in throughput.
I did the following two runs:
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust$ cargo run --release --package stress --bin metrics
Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
Finished release [optimized] target(s) in 5.88s
Running `target/release/metrics`
Number of threads: 6
Throughput: 5,525,800 iterations/sec
Throughput: 5,796,800 iterations/sec
Throughput: 5,553,600 iterations/sec
Throughput: 5,823,200 iterations/sec
Throughput: 5,942,600 iterations/sec
Throughput: 5,609,600 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust$ cargo run --release --package stress --bin metrics
Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
Finished release [optimized] target(s) in 5.86s
Running `target/release/metrics`
Number of threads: 6
Throughput: 1,304,400 iterations/sec
Throughput: 1,314,800 iterations/sec
Throughput: 1,314,600 iterations/sec
Throughput: 1,158,400 iterations/sec
Throughput: 1,300,200 iterations/sec
Throughput: 1,332,600 iterations/sec
Throughput: 1,333,200 iterations/sec
The first run is with this PR as-is; the 2nd run is with the impl Hash block commented out and Hash added to the derive() attribute.
So pre-hashing is definitely providing an improvement on its own, as far as I can tell. This is backed up by profiling our actual application, as we see Measure::call() percentages go from 4.66% of our CPU time down to 2.30%. I'm also pretty sure we saw a reduction in hash_one percentages, but I didn't write them down in my notes.
So I'm not totally sure where the extra hashing is coming from, but it does seem to add some benefits in real use cases.
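One way to see where extra hashing can come from is a small standalone experiment (not SDK code) that counts hash computations while filling a std HashMap; every time the map grows it re-hashes the keys it already holds, so the count ends up well above the number of inserts:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::sync::atomic::{AtomicUsize, Ordering};

static HASH_CALLS: AtomicUsize = AtomicUsize::new(0);

// Wraps the default hasher and counts completed hash computations.
#[derive(Default)]
struct CountingHasher(DefaultHasher);

impl Hasher for CountingHasher {
    fn finish(&self) -> u64 {
        HASH_CALLS.fetch_add(1, Ordering::Relaxed);
        self.0.finish()
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes);
    }
}

fn main() {
    let mut map: HashMap<u64, u64, BuildHasherDefault<CountingHasher>> = HashMap::default();
    for i in 0..2_000u64 {
        map.insert(i, i);
    }
    // Prints a number noticeably larger than 2000: each resize re-hashed
    // the entries already in the map.
    println!("hash computations: {}", HASH_CALLS.load(Ordering::Relaxed));
}
```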
Good point. Looking into the code now, I don't see how caching is going to help here. Even for the slight improvement, the implication of storing the cache could be considerable. We have a slightly different design in otel-cpp which helps, as the lookup is not purely O(1) in the hot path.
Ok, ignore that; the above comment by @KallDrexx was added around the same time as this one.
My further comment based on the reply: it seems that the pre-computed hash cache is useful in the case of rehashing. If the HashMap is pre-allocated with a certain capacity (based on the supported upper cardinality limit) and does not need to be resized, will that avoid re-hashing, and therefore the need for the cache?
> So when the hash map grows, it has to re-hash all the time series in it. This may be why we see a 500% increase in stress test performance but a more muted (though still impactful) improvement in real-life applications. Precaching hashes improves this test case because we have 1k+ time series that sometimes need to be rehashed.

We need to prevent this nonetheless. We can ensure the HashMap has enough size by initializing it correctly, so rehashing would be avoided. We know the max size (default 2000), so this should be relatively easy to achieve! Let us address this separately, as that is another nice improvement from avoiding reallocations!
My stress test shows no change when I removed the custom Hash implementation!
As another data point, I changed ValueMap::new() to use HashMap::with_capacity(2000) to reduce resizing and rehashing. That made the first iteration report slightly better than vanilla main, but not by much. Every throughput report afterwards was within the same variance I get with main.
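For reference, the pre-allocation being tried here is roughly the following (ValueMap internals are approximated, and 2000 is just the default cardinality limit mentioned above):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Default cardinality limit mentioned in this thread.
const MAX_TIME_SERIES: usize = 2000;

struct ValueMap<K, V> {
    series: HashMap<K, V>,
}

impl<K: Hash + Eq, V> ValueMap<K, V> {
    fn new() -> Self {
        ValueMap {
            // Reserve space for the maximum number of time series up front
            // so the map never grows (and never re-hashes existing entries)
            // on the hot path.
            series: HashMap::with_capacity(MAX_TIME_SERIES),
        }
    }
}

fn main() {
    let map: ValueMap<String, u64> = ValueMap::new();
    assert!(map.series.capacity() >= MAX_TIME_SERIES);
}
```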
Rehashing impact cannot be tested using the current stress test, as there is no collect() triggered, so the hashmap is never emptied; after the initial fill-up, it won't need to re-hash!
Here's what I suggest:
- We know from the tests that BTree -> Vec is definitely a perf booster, so let's scope this PR to do just that.
- For rehashing prevention, cache-the-hash, avoiding de-duplication logic, avoiding sorting, and more - let's do them separately, one after another, so each PR is very focused and demonstrates clean perf gains.
(This PR has already kick-started a ton of ideas to improve perf!)
Thoughts?
I'm not opposed to splitting it out. I can do that tomorrow.
Ok, it's 12:30am, I went to go to sleep and it finally dawned on me why precalculating the hash improves my throughput on multiple machines more than others. So I had to get out my laptop to verify, and yes, I know what's going on now.
You are correct that we are only computing the hash once, but where we compute the hash is important! Precomputing the hash the way I am in this PR (which I'm going to put in its own PR) incurs the cost at AttributeSet::from() time. Not precomputing the hash incurs the cost at Measure::call() time.
This is important because incurring the cost during the hash lookup means we are doing more work inside the mutex lock, and thus delaying all other Measure::call() operations. Precomputing the hash allows each counter to compute the hash in parallel, and thus do less work inside the lock.
This is why we were seeing such variance in whether it was helping or not. It all came down to thread counts for the throughput stress test. I verified this by testing with and without precomputed hashes with 6 threads on my PC, then with and without with 2 threads (fyi, 1 thread is broken).
On my personal laptop (and on battery):
- 6 threads precomputed: 3.5M
- 6 threads non-precomputed: 2.3M
- 2 threads precomputed: 2.5M
- 2 threads non-precomputed: 2.5M
So precomputing the hash is a concurrency optimization.
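A sketch of that locking argument, using stand-in types rather than the SDK's actual ValueMap/Measure code: the string hashing happens on the calling thread before the mutex is taken, so concurrent add() calls can hash in parallel, and the critical section only consumes the cached u64.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq, Eq)]
struct AttributeSet(Vec<(String, String)>, u64);

impl AttributeSet {
    fn new(attrs: &[(&str, &str)]) -> Self {
        let vec: Vec<(String, String)> = attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        // Expensive part: hash every key/value string. This runs on the
        // caller's thread, outside any lock, so concurrent add() calls can
        // all do it in parallel.
        let mut hasher = DefaultHasher::new();
        vec.hash(&mut hasher);
        AttributeSet(vec, hasher.finish())
    }
}

impl Hash for AttributeSet {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Cheap part: inside the lock, the map lookup just consumes the
        // pre-computed hash.
        state.write_u64(self.1);
    }
}

struct ValueMap {
    series: Mutex<HashMap<AttributeSet, u64>>,
}

impl ValueMap {
    fn add(&self, attrs: &[(&str, &str)], value: u64) {
        let set = AttributeSet::new(attrs); // hashing happens before locking
        let mut series = self.series.lock().unwrap();
        *series.entry(set).or_insert(0) += value; // minimal work under the lock
    }
}

fn main() {
    let map = ValueMap {
        series: Mutex::new(HashMap::new()),
    };
    map.add(&[("color", "red"), ("name", "apple")], 1);
    map.add(&[("color", "red"), ("name", "apple")], 1);
}
```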
Awesome! It makes sense. Let's tackle it in the follow-up PR.
If I learned anything over the last few years working in OTel - it is that there is no magic! Everything should be explainable. Thanks a lot!
LGTM. Suggestion to remove the cache for the hash, and take it as a separate enhancement. There needs to be a tradeoff between perf improvement and storage allocation, and it probably needs more investigation.
stress/src/metrics.rs
Outdated
@@ -33,14 +31,17 @@ fn test_counter() {
let index_first_attribute = rng.gen_range(0..len);
let index_second_attribute = rng.gen_range(0..len);
let index_third_attribute = rng.gen_range(0..len);
let index_fourth_attribute = rng.gen_range(0..len); |
Let's not make changes to the stress test itself in this PR. We can drive changes to the stress test in a separate PR if we believe it should use unsorted order. (I do not believe it should, but I want to avoid adding noise to this PR).
Agree with the replacement of BTree with Vec! This is moving us in the right direction. Many thanks for doing that, and for all the other productive discussions in this PR. I hope the next set of PRs will take us even closer to the perf goals!
Requesting changes only to make sure the following is done before merge:
- Move the hash caching to a separate PR, after we demonstrate it is needed.
- Revert changes to stress test.
I opened #1389 to get a more focused benchmark. Could you run that and update the PR description with the results? I also suggest re-running the stress tests, as we don't see the 5x improvement on other machines!
Updated the PR to remove the changes to the stress test, and to remove caching of the hash. I will open new PRs for these separately. Will update the PR description with updated benchmarks and throughput values.
LGTM! Thank you!
Changes
When attempting to call Counter<T>::add() or UpDownCounter<T>::add() on a frequent basis (e.g. per network packet), a significant amount of CPU time was being used by AttributeSet::from() calls.

Replacing the AttributeSet's inner type from a BTreeSet to a Vec (plus sort() and deduplication) improves metric throughput (presumably due to more efficient handling of smaller slices). Likewise, pre-caching the hash for the AttributeSet made lookups much faster.

This PR also changes the metrics test to have attributes out of order, to ensure that throughput considers sorting cost.
Stress tests
main branch
Number of threads: 8
Throughput: 4,236,400 iterations/sec
Throughput: 4,253,800 iterations/sec
Throughput: 4,261,400 iterations/sec
Throughput: 4,308,400 iterations/sec
Throughput: 4,389,600 iterations/sec
Throughput: 4,440,800 iterations/sec
Throughput: 4,474,400 iterations/sec
Throughput: 4,460,200 iterations/sec
This PR
Number of threads: 8
Throughput: 5,490,600 iterations/sec
Throughput: 5,496,800 iterations/sec
Throughput: 5,891,800 iterations/sec
Throughput: 5,909,600 iterations/sec
Throughput: 5,830,800 iterations/sec
This ends up being a 31% throughput improvement on my machine.
Benchmark result
Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes