Improve cost of creating AttributeSets #1379
Conversation
@cijothomas When you get a chance, what is the best way to fix the test failure this PR causes? Specifically, the prometheus sanitation test uses duplicate attributes. Based on previous conversations, the C++ and .NET libraries do not perform de-duplication, and as far as I'm aware the OpenTelemetry spec does not say they should be de-duplicated, which means this behavior might surprise users. Should I modify the test to expect them to be concatenated?
prometheus won't be able to scrape successfully if there is duplication, so maybe this is something to be handled at the prometheus exporter level. @jtescher Do you have suggestions to unblock? Could we just modify the prometheus test to prevent the issue and document that the SDK/Prometheus exporter does not de-dup, and that it is the user's responsibility?
Seems like there are a few places to investigate here, if the instrument
Are we sure about this? The prometheus test seems pretty explicit that it will concatenate values if the keys are duplicated due to sanitation. That is tested and still works after my change. So it seems like prometheus already supports some level of duplication; this test is only making the explicit decision that duplicate keys due to sanitation are allowed but exact duplicate keys are not.
That only addresses performance concerns for users that have a stable set of attributes they use a lot. This PR significantly improves performance for users who do not have a predictable set of attribute key/value combinations.
Excellent improvement! Thanks for the contribution. Assuming other approvers/maintainers are okay with the behavior change, I wholeheartedly support it from a performance perspective.
I tested OTel .NET with actual Prometheus to see how Prometheus would react if de-dup is not done. Below are my findings. The following is output that Prometheus is happily scraping successfully:

This was generated using:

Counter.Add(9.9, new("name", "apple"), new("color", "red"));
Counter.Add(99.9, new("name", "lemon"), new("color", "yellow"));

Now the following is something Prometheus is failing to scrape, with error message "label name "name" is not unique: invalid sample":

This was generated using:

Counter.Add(9.9, new("name", "apple"), new("name", "orange"), new("color", "red")); // no dedup done in sdk..
Counter.Add(99.9, new("name", "lemon"), new("color", "yellow"));

In the case of .NET, neither the SDK nor the prometheus exporter is deduping, so the Prometheus scrape is failing. My personal vote is to not do dedup in prometheus now, and get back to this based on feedback. If there are users affected by this, we should fix the prom exporter. Main point being - let's remove dedup from the SDK and work towards the goals!
Can prometheus scrape this correctly?

Duplicate Label Names: In the first metric, myCounter_total{color="red",name="orange",name="apple"}, you have used the label name "name" twice. Prometheus expects each label name within a metric to be unique. Having duplicate label names is not valid in Prometheus and would likely cause an error during scraping.

Should have just asked AI 🤖
.NET has a bad Prometheus exporter implementation? I wouldn't think we should replicate that 🤕. I agree we want to get good performance, but we should also ensure proper exporter output. This could be offloaded to scrape time.
It's not clear to me why deduping should be opentelemetry's concern rather than a concern of the calling application. What happens when deduping occurs? Should the last value be kept? The first? What is last and first in different scenarios, since all attributes are passed in at the same time? If the calling application passes in the duplicate attributes in a different order in different scenarios, then deduplication logic will split them into different time series, causing hard-to-debug and wrong metrics. Only the consuming application knows how deduping should accurately be done, imo.
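To make the ambiguity concrete, here is a minimal Rust sketch (the instrument and attribute names are made up, and the call shapes are only approximate for the 0.21-era opentelemetry crates):

```rust
use opentelemetry::{global, KeyValue};

fn main() {
    let meter = global::meter("example");
    let counter = meter.u64_counter("my_counter").init();

    // The key "name" is supplied twice. If the SDK de-duplicates, does
    // "apple" or "orange" win? If it does not de-duplicate, the exporter
    // (e.g. Prometheus) has to decide what to emit.
    counter.add(
        10,
        &[
            KeyValue::new("name", "apple"),
            KeyValue::new("name", "orange"),
            KeyValue::new("color", "red"),
        ],
    );
}
```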
While we don't have to follow the limits, there are clear guidelines for how to deal with attributes: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#attribute-limits For metrics, the identity is very important to how values are aggregated.
Also I do want to say @KallDrexx thank you for looking into this; it's very much appreciated. I apologize if I came off a bit strong to start.
In a curiosity vein, for your specific use case do you have a skeleton you could provide? Would it be possible to have a cached set of attributes, or is it entirely dynamic for your network counter?
Those guidelines are far from clear in this scenario as far as I can tell. Not only do they state that those guidelines aren't really applicable to metrics, they dictate how truncation should happen when there is a finite number of allowed attributes, which isn't the case here.
Which is 100% why I think the library doing arbitrary deduplication is bad. If the consumer calls So as a concrete example calling
In my metrics UI I should see 10 for So what happens if the user calls So truncating now means that This is not that crazy to me, because that's exactly how the prometheus exporter test works today. If you look at the test you see that it's assuming that if This leads me to believe that the current prometheus code is correct. If you call That's why I believe that the caller of the SDK should do any de-duplication that involves complete removal of attributes, because they are the only ones who know the valid way to do so. So my next question is: how do other OpenTelemetry SDKs handle this?
So every language library seems to have different considerations, and no one follows that guideline you posted specifically. Maybe that's a bad thing and Rust's current implementation is doing things right and everyone else is wrong, but is it ok to be the only one doing things "right" if it causes a performance hit?
Our application is working with metrics for network I/O on a per-packet basis, which is why we encountered the
You are right! It could be treated as a bug in OTel .NET's Prometheus Exporter. OTel .NET's Prometheus is an experimental component, and there is a chance it'll never be stable released at all (in favor of native OTLP support in Prometheus). And I tried to see how Prometheus's native OTLP ingestion deals with duplication! Surprised with the result! I send it @hdost Any ideas on how the OTel Rust prometheus exporter should handle duplicate attributes? Should we send "name=orange", with a policy of "last-one-wins", OR should we do "name=orange;apple", just like Prometheus itself is doing?
Well, I would agree that arbitrary dedup is bad, and I would repeat what I said before: we could have the exporter (the client from my perspective) have settings which control how dedup is handled. It also wouldn't put the performance hit on the writer but on the subscriber. That's what I was trying to say here:
Whatever we do for resolving this, Prometheus specifically requires uniqueness of labels, so whether to do it is not the question, but how. And there are Processors and Views which can be utilized to make a determination as to how this data is processed. It's the SDK which enables it. And to provide flexibility to users we could certainly expand that.
I'm not going to die on the hill of right vs performance, but I think this comes from me not explaining my position well enough, as that's not what I was arguing. I would also say whatever "right" is should be consistent between implementations, otherwise what's the point of having a standard, but that's going to be decided here. If we decide to go against spec then that's why we have #1297
I will also mention that if we really want duplication logic as part of I'm not necessarily opposed to doing deduplication of keys on the exporting side (which I think may be what you are aiming for). However, to me it seems like:
I haven't looked at the exporter code in these projects yet, but it seems like there's no consistency between implementations currently. So I'm not sure it's actually a deviation from the spec.
I guess "preserving the order in which they are added" is an interesting concept. For example
This is actually technically an edge case in my PR as well. But should these be in the same time series or different time series? If the same, should order be preserved at export time?
I went ahead and updated this PR, because while precomputing the hash is new, I still did it in the vein of storing the attributes in a Vec. So the new changes, which include last-wins deduplication logic, now pass all tests, as can be seen in CI. I still want to do some manual verification since we don't have good test coverage over attributes. The current stress test values are:
The first two runs and the final run are from the commit in this PR as of now. The middle two runs are from the main branch (with the stress test modified to match the out-of-order sorting my PR changes). Averaging these iteration/sec values gives main branch To be sure where the performance difference was coming from, precomputing the hashes with a I also tested this change against our application, which encountered metrics being a bottleneck for us. One function ends up particularly heavy for us: it increments two counters for every network packet we get. This brought the total CPU usage for that function from 11.93% down to 8.24%. While I need to look a bit more at it, a significant amount of the performance gain ended up coming from improved call times to
Thank you again very much for working on this!
… staying under the 2k cardinality.
Ok, I've double-checked everything I wanted to, and things look good. I had to modify the stress test a tiny bit (lower the number of attribute values) since 4 attributes was causing us to go over the 2k cardinality limit. I've updated the PR description with the final benchmarks from my machine. Would be nice if someone is able to verify similar numbers. Interestingly enough, I did an experiment removing the deduplication logic and ended up with consistently lower throughput (only by ~100k-200k iterations per second). That doesn't quite make sense to me, especially since it's not actually detecting duplicates in the stress test. So from my perspective this PR is ready for review. Again, sorry for all the back and forth that went on while I did exploration :)
My laptop is giving a different result:
> Would be nice if someone is able to verify similar numbers.

This is what I see (8 cores, WSL2, 32 GB):
Current PR: ~6,286,000 iterations/sec
Another machine: Throughput: 2,257,200 iterations/sec vs Throughput: 5,180,000 iterations/sec
One last issue and a nit; other than that, looks good!
I see otherwise! Removing the de-dup caused another 10% boost for me... (I cannot think of any reason why it showed lower throughput for you!) Could you try re-running?
```diff
-#[derive(Clone, Default, Debug, Hash, PartialEq, Eq)]
-pub struct AttributeSet(BTreeSet<HashKeyValue>);
+#[derive(Clone, Default, Debug, PartialEq, Eq)]
+pub struct AttributeSet(Vec<HashKeyValue>, u64);
```
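Roughly, the new representation can be pictured with the following self-contained sketch; it is not the PR's actual code (HashKeyValue here is a plain stand-in, and the sort/dedup details in the real implementation may differ):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Stand-in for the SDK's HashKeyValue wrapper around KeyValue.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct HashKeyValue(String, String);

// Sorted Vec of attributes plus the hash computed once at construction.
#[derive(Clone, Default, Debug, PartialEq, Eq)]
pub struct AttributeSet(Vec<HashKeyValue>, u64);

impl AttributeSet {
    fn new(values: &[(&str, &str)]) -> Self {
        let mut vec: Vec<HashKeyValue> = values
            .iter()
            .map(|(k, v)| HashKeyValue(k.to_string(), v.to_string()))
            .collect();

        // Stable sort by key so equal attribute sets compare equal no
        // matter what order the caller supplied them in.
        vec.sort_by(|a, b| a.0.cmp(&b.0));

        // "Last one wins" de-duplication: reverse so the last-provided
        // entry for each key comes first, keep the first of each run of
        // equal keys, then restore ascending key order.
        vec.reverse();
        vec.dedup_by(|a, b| a.0 == b.0);
        vec.reverse();

        // Pre-compute the hash once, at construction time.
        let mut hasher = DefaultHasher::new();
        vec.hash(&mut hasher);
        let hash = hasher.finish();

        AttributeSet(vec, hash)
    }
}

fn main() {
    let set = AttributeSet::new(&[("name", "apple"), ("name", "orange"), ("color", "red")]);
    // "name" resolves to "orange" (last wins); the trailing u64 is the cached hash.
    println!("{set:?}");
}
```

The key point is that sorting, de-duplication, and hashing all happen once, at construction time.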
Trying to see how caching the hash improves speed... I believe this is because it can now be re-used when we look up the timeseries in here?
It'd help if we needed the hash calculation more than once, but we don't. So in this PR, the hash calculation is simply moved from here to the AttributeSet creation, so I am not able to see how that would have helped with the perf improvement!
Or are the perf gains purely from the BTree->Vec switch?
In other words, we should not make this change, as taking up more memory to store/cache the hash only benefits us if lookup/insert is needed more than once, and from my understanding, we don't have the lookup-more-than-once requirement.
Putting a breakpoint on Hash shows there are several hidden hash calls within the HashMap. For example:
So when the hash map grows, it has to re-hash all the time series in it. This may be why we see a 500% increase in stress test performance but a more muted (though still impactful) improvement in real-life applications. Precaching hashes improves this test case because we have 1k+ time series that sometimes need to be rehashed.
Eventually that should stabilize, though; after a few thousand iterations I no longer see reserve calls requesting hashes, but I also don't see a significant change in throughput.
I did the following two runs:
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust$ cargo run --release --package stress --bin metrics
Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
Finished release [optimized] target(s) in 5.88s
Running `target/release/metrics`
Number of threads: 6
Throughput: 5,525,800 iterations/sec
Throughput: 5,796,800 iterations/sec
Throughput: 5,553,600 iterations/sec
Throughput: 5,823,200 iterations/sec
Throughput: 5,942,600 iterations/sec
Throughput: 5,609,600 iterations/sec
^C
mshapiro@matt-ms-hpz:~/code/opentelemetry-rust$ cargo run --release --package stress --bin metrics
Compiling opentelemetry_sdk v0.21.1 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-sdk)
Compiling opentelemetry-appender-tracing v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-appender-tracing)
Compiling opentelemetry-stdout v0.2.0 (/home/mshapiro/code/opentelemetry-rust/opentelemetry-stdout)
Compiling stress v0.1.0 (/home/mshapiro/code/opentelemetry-rust/stress)
Finished release [optimized] target(s) in 5.86s
Running `target/release/metrics`
Number of threads: 6
Throughput: 1,304,400 iterations/sec
Throughput: 1,314,800 iterations/sec
Throughput: 1,314,600 iterations/sec
Throughput: 1,158,400 iterations/sec
Throughput: 1,300,200 iterations/sec
Throughput: 1,332,600 iterations/sec
Throughput: 1,333,200 iterations/sec
The first run is with this PR as-is; the 2nd run is with the impl Hash block commented out and Hash added to the derive() attribute.
So pre-hashing is definitely providing an improvement on its own, as far as I can tell. This is backed up by profiling our actual application, as we see Measure::call() percentages go from 4.66% of our CPU time down to 2.30%. I'm also pretty sure we saw a reduction in hash_one percentages, but I didn't write them down in my notes.
So I'm not totally sure where the extra hashing is coming from, but it does seem to add some benefits in real use cases.
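One way to see where extra hashing can come from is a small standalone experiment (not SDK code) that counts hash computations while filling a std HashMap; every time the map grows it re-hashes the keys it already holds, so the count ends up well above the number of inserts:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};
use std::sync::atomic::{AtomicUsize, Ordering};

static HASH_CALLS: AtomicUsize = AtomicUsize::new(0);

// Wraps the default hasher and counts completed hash computations.
#[derive(Default)]
struct CountingHasher(DefaultHasher);

impl Hasher for CountingHasher {
    fn finish(&self) -> u64 {
        HASH_CALLS.fetch_add(1, Ordering::Relaxed);
        self.0.finish()
    }
    fn write(&mut self, bytes: &[u8]) {
        self.0.write(bytes);
    }
}

fn main() {
    let mut map: HashMap<u64, u64, BuildHasherDefault<CountingHasher>> = HashMap::default();
    for i in 0..2_000u64 {
        map.insert(i, i);
    }
    // Prints a number noticeably larger than 2000: each resize re-hashed
    // the entries already in the map.
    println!("hash computations: {}", HASH_CALLS.load(Ordering::Relaxed));
}
```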
Good point. Looking into the code now, I don't see how caching is going to help here. Even for the slight improvement, the implication of storing the cache could be considerable. We have a slightly different design in otel-cpp which helps, as the lookup is not purely O(1) in the hot path.
Ok, ignore that; the above comment by @KallDrexx was added around the same time as this one.
My further comment based on the reply: it seems that the pre-computed hash cache is useful in the case of rehashing. If the HashMap is pre-allocated with a certain capacity (based on the supported upper cardinality limit) and does not need to be resized, will that avoid re-hashing, and therefore the need for the cache?
> So when the hash map grows, it has to re-hash all the time series in it. This may be why we see a 500% increase in stress test performance but a more muted (though still impactful) improvement in real-life applications. Precaching hashes improves this test case because we have 1k+ time series that sometimes need to be rehashed.

We need to prevent this nonetheless. We can ensure the HashMap has enough size by initializing it correctly, so rehashing would be avoided. We know the max size (default 2000), so this should be relatively easy to achieve! Let us address this separately, as that is another nice improvement from avoiding reallocations!
My stress test shows no change when I removed the custom Hash implementation!
As another data point, I changed ValueMap::new() to use HashMap::with_capacity(2000) to reduce resizing and rehashing. That made the first iteration report slightly better than vanilla main, but not by much. Every throughput report afterwards was within the same variance I get with main.
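For reference, the pre-allocation being tried here is roughly the following (ValueMap internals are approximated, and 2000 is just the default cardinality limit mentioned above):

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Default cardinality limit mentioned in this thread.
const MAX_TIME_SERIES: usize = 2000;

struct ValueMap<K, V> {
    series: HashMap<K, V>,
}

impl<K: Hash + Eq, V> ValueMap<K, V> {
    fn new() -> Self {
        ValueMap {
            // Reserve space for the maximum number of time series up front
            // so the map never grows (and never re-hashes existing entries)
            // on the hot path.
            series: HashMap::with_capacity(MAX_TIME_SERIES),
        }
    }
}

fn main() {
    let map: ValueMap<String, u64> = ValueMap::new();
    assert!(map.series.capacity() >= MAX_TIME_SERIES);
}
```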
Rehashing impact cannot be tested using the current stress test, as there is no collect() triggered, so the hashmap is never emptied; after the initial fill-up, it won't need to re-hash!
Here's what I suggest:
- We know from the tests that BTree -> Vec is definitely a perf booster, so let's scope this PR to do just that.
- For rehashing prevention, cache-the-hash, avoiding de-duplication logic, avoiding sorting, and more - let's do them separately, one after another, so each PR is very focused and demonstrates clean perf gains.
(This PR has already kick-started a ton of ideas to improve perf!)
Thoughts?
I'm not opposed to splitting it out. I can do that tomorrow.
Ok, it's 12:30am, I went to go to sleep and it finally dawned on me why precalculating the hash improves my throughput on multiple machines more than others. So I had to get out my laptop to verify, and yes, I know what's going on now.
You are correct that we are only computing the hash once, but where we compute the hash is important! Precomputing the hash the way I am in this PR (which I'm going to put in its own PR) incurs the cost at AttributeSet::from() time. Not precomputing the hash incurs the cost at Measure::call() time.
This is important because incurring the cost during the hash lookup means we are doing more work inside the mutex lock, and thus delaying all other Measure::call() operations. Precomputing the hash allows each counter to compute the hash in parallel, and thus do less work inside the lock.
This is why we were seeing such variance in whether it was helping or not. It all came down to thread counts for the throughput stress test. I verified this by testing with and without precomputed hashes with 6 threads on my PC, then with and without with 2 threads (fyi, 1 thread is broken).
On my personal laptop (and on battery):
- 6 threads precomputed: 3.5M
- 6 threads non-precomputed: 2.3M
- 2 threads precomputed: 2.5M
- 2 threads non-precomputed: 2.5M
So precomputing the hash is a concurrency optimization.
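A sketch of that locking argument, using stand-in types rather than the SDK's actual ValueMap/Measure code: the string hashing happens on the calling thread before the mutex is taken, so concurrent add() calls can hash in parallel, and the critical section only consumes the cached u64.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq, Eq)]
struct AttributeSet(Vec<(String, String)>, u64);

impl AttributeSet {
    fn new(attrs: &[(&str, &str)]) -> Self {
        let vec: Vec<(String, String)> = attrs
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        // Expensive part: hash every key/value string. This runs on the
        // caller's thread, outside any lock, so concurrent add() calls can
        // all do it in parallel.
        let mut hasher = DefaultHasher::new();
        vec.hash(&mut hasher);
        AttributeSet(vec, hasher.finish())
    }
}

impl Hash for AttributeSet {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // Cheap part: inside the lock, the map lookup just consumes the
        // pre-computed hash.
        state.write_u64(self.1);
    }
}

struct ValueMap {
    series: Mutex<HashMap<AttributeSet, u64>>,
}

impl ValueMap {
    fn add(&self, attrs: &[(&str, &str)], value: u64) {
        let set = AttributeSet::new(attrs); // hashing happens before locking
        let mut series = self.series.lock().unwrap();
        *series.entry(set).or_insert(0) += value; // minimal work under the lock
    }
}

fn main() {
    let map = ValueMap {
        series: Mutex::new(HashMap::new()),
    };
    map.add(&[("color", "red"), ("name", "apple")], 1);
    map.add(&[("color", "red"), ("name", "apple")], 1);
}
```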
Awesome! It makes sense. Let's tackle it in the follow-up PR.
If I learned anything over the last few years working in OTel - it is that there is no magic! Everything should be explainable. Thanks a lot!
LGTM. Suggestion to remove the cache for the hash, and take it as a separate enhancement. There needs to be a tradeoff between perf improvement and storage allocation, and it probably needs more investigation.
stress/src/metrics.rs
Outdated
@@ -33,14 +31,17 @@ fn test_counter() {
let index_first_attribute = rng.gen_range(0..len);
let index_second_attribute = rng.gen_range(0..len);
let index_third_attribute = rng.gen_range(0..len);
let index_fourth_attribute = rng.gen_range(0..len); |
Let's not make changes to the stress test itself in this PR. We can drive changes to the stress test in a separate PR if we believe it should use unsorted order. (I do not believe it should, but I want to avoid adding noise to this PR).
Agree with the replacement of BTree with Vec! This is moving us in the right direction. Many thanks for doing that, and for all the other productive discussions in this PR. I hope the next set of PRs will take us even closer to the perf goals!
Requesting changes only to make sure the following is done before merge:
- Move the hash caching to a separate PR, after we demonstrate it is needed.
- Revert changes to stress test.
I opened #1389 to get a more focused benchmark. Could you run that and update the PR description with the results? I also suggest re-running the stress tests, as we don't see the 5x improvement on other machines!
Updated the PR to remove the changes to the stress test, and to remove caching of the hash. I will open new PRs for these separately. Will update the PR description with updated benchmarks and throughput values.
LGTM! Thank you!
Changes
When attempting to call Counter<T>::add() or UpDownCounter<T>::add() on a frequent basis (e.g. per network packet), a significant amount of CPU time was being used by AttributeSet::from() calls.

Replacing the AttributeSet's inner type from a BTreeSet to a Vec (plus sort() and deduplication) improves metric throughput (presumably due to more efficient handling of smaller slices). Likewise, pre-caching the hash for the AttributeSet made lookups much faster.

This PR also changes the metrics test to have attributes out of order, to ensure that throughput considers sorting cost.
Stress tests
main branch
Number of threads: 8
Throughput: 4,236,400 iterations/sec
Throughput: 4,253,800 iterations/sec
Throughput: 4,261,400 iterations/sec
Throughput: 4,308,400 iterations/sec
Throughput: 4,389,600 iterations/sec
Throughput: 4,440,800 iterations/sec
Throughput: 4,474,400 iterations/sec
Throughput: 4,460,200 iterations/sec
This PR
Number of threads: 8
Throughput: 5,490,600 iterations/sec
Throughput: 5,496,800 iterations/sec
Throughput: 5,891,800 iterations/sec
Throughput: 5,909,600 iterations/sec
Throughput: 5,830,800 iterations/sec
This ends up being a 31% throughput improvement on my machine.
Benchmark result
Merge requirement checklist
CHANGELOG.md files updated for non-trivial, user-facing changes