Investigate bad TSDB NumericDocValues compression efficiency on real world dataset #138763

@JonasKunz

Description

While testing the new exponential_histogram feature, I ingested a large data set of real-world response times from our monitoring.

During that, I noticed that the min, max, and sum values, which we store as doubles in NumericDocValues for the histograms, don't compress well: we end up with roughly 6.8 bytes per double value.

We should investigate why our compression performs suboptimally on this data and whether we can find a way to improve it.

Here are 128 sample values of the min values for one of the series, which reproduce the inefficient compression behaviour when used in the ES87TSDBDocValuesEncoderTests:

0.001286,4.0E-4,2.93E-4,3.36E-4,3.06E-4,2.8E-4,0.100443,0.001489,2.84E-4,3.12E-4,2.42E-4,2.82E-4,3.15E-4,2.88E-4,2.88E-4,3.27E-4,0.001258,2.86E-4,3.05E-4,2.69E-4,2.77E-4,0.095372,3.4E-4,3.04E-4,0.001295,0.001294,2.89E-4,2.46E-4,3.02E-4,2.89E-4,2.82E-4,2.42E-4,0.001291,0.001287,3.31E-4,0.100412,3.14E-4,2.6E-4,3.05E-4,3.2E-4,2.9E-4,0.001314,0.100419,3.08E-4,2.89E-4,2.83E-4,3.01E-4,2.9E-4,2.84E-4,0.001351,2.66E-4,2.74E-4,2.71E-4,2.87E-4,0.001288,3.16E-4,3.24E-4,2.8E-4,0.001291,2.84E-4,2.45E-4,4.25E-4,3.08E-4,3.0E-4,2.62E-4,2.84E-4,2.91E-4,0.001319,2.75E-4,2.75E-4,0.001299,4.3E-4,2.41E-4,4.18E-4,3.15E-4,3.12E-4,2.99E-4,2.83E-4,0.001287,4.97E-4,2.46E-4,2.81E-4,0.001324,3.06E-4,0.001255,3.07E-4,2.74E-4,2.73E-4,3.48E-4,0.001264,2.73E-4,3.14E-4,2.34E-4,2.74E-4,3.07E-4,2.82E-4,0.001249,2.94E-4,3.27E-4,0.00127,3.87E-4,2.63E-4,2.91E-4,4.69E-4,3.05E-4,2.96E-4,2.59E-4,0.001299,2.82E-4,2.75E-4,3.01E-4,2.69E-4,3.01E-4,2.82E-4,2.48E-4,2.37E-4,3.04E-4,4.17E-4,2.87E-4,2.99E-4,2.74E-4,0.001284,2.92E-4,0.099382,0.001297,2.97E-4,3.6E-4,4.0E-4

These values are seemingly similar to each other, yet they still end up at roughly 7 bytes per value.
They are response times in milliseconds (yes, those are very fast endpoints). If we instead use integers with nanosecond precision (i.e. `Math.round(doubleValue * 1_000_000)` as input for the codec), we get a little over 2 bytes per value instead:

1286,400,293,336,306,280,100443,1489,284,312,242,282,315,288,288,327,1258,286,305,269,277,95372,340,304,1295,1294,289,246,302,289,282,242,1291,1287,331,100412,314,260,305,320,290,1314,100419,308,289,283,301,290,284,1351,266,274,271,287,1288,316,324,280,1291,284,245,425,308,300,262,284,291,1319,275,275,1299,430,241,418,315,312,299,283,1287,497,246,281,1324,306,1255,307,274,273,348,1264,273,314,234,274,307,282,1249,294,327,1270,387,263,291,469,305,296,259,1299,282,275,301,269,301,282,248,237,304,417,287,299,274,1284,292,99382,1297,297,360,400,
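The conversion described above can be sketched as follows. This is a minimal illustration, not the actual codec integration; the class and method names (`NanosEncoding`, `toNanos`, `toMillis`) are hypothetical, and it assumes millisecond doubles are rounded to whole nanoseconds before encoding:

```java
// Sketch: convert double millisecond response times to long nanosecond
// integers before handing them to the doc-values codec. The conversion is
// lossy beyond nanosecond precision, but it turns near-identical doubles
// (whose bit patterns differ widely) into small longs that integer-oriented
// encodings handle much better.
public final class NanosEncoding {

    /** Milliseconds (double) -> nanoseconds (long), rounded to nearest. */
    static long toNanos(double millis) {
        return Math.round(millis * 1_000_000);
    }

    /** Nanoseconds (long) -> milliseconds (double), for decoding. */
    static double toMillis(long nanos) {
        return nanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        double[] samples = {0.001286, 4.0E-4, 2.93E-4, 0.100443};
        for (double d : samples) {
            // e.g. 0.001286 ms -> 1286 ns
            System.out.println(d + " -> " + toNanos(d));
        }
    }
}
```

Note that the round trip is only exact while the original values carry at most nanosecond precision, so this would be a deliberate precision trade-off rather than a drop-in replacement.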
