Investigate bad TSDB NumericDocValues compression efficiency on real world dataset #138763

@JonasKunz

Description

While testing the new exponential_histogram feature, I ingested a large data set of real-world response times from our monitoring.

During that, I noticed that the min, max, and sum values, which we store as doubles in NumericDocValues for the histograms, don't compress well: we end up with roughly 6.8 bytes per double value.

We should investigate why our compression performs suboptimally on this data and whether we can find a way to improve it.

Here are 128 sample values of the min values for one of the series, which reproduce the inefficient compression behaviour when used in the ES87TSDBDocValuesEncoderTests:

0.001286,4.0E-4,2.93E-4,3.36E-4,3.06E-4,2.8E-4,0.100443,0.001489,2.84E-4,3.12E-4,2.42E-4,2.82E-4,3.15E-4,2.88E-4,2.88E-4,3.27E-4,0.001258,2.86E-4,3.05E-4,2.69E-4,2.77E-4,0.095372,3.4E-4,3.04E-4,0.001295,0.001294,2.89E-4,2.46E-4,3.02E-4,2.89E-4,2.82E-4,2.42E-4,0.001291,0.001287,3.31E-4,0.100412,3.14E-4,2.6E-4,3.05E-4,3.2E-4,2.9E-4,0.001314,0.100419,3.08E-4,2.89E-4,2.83E-4,3.01E-4,2.9E-4,2.84E-4,0.001351,2.66E-4,2.74E-4,2.71E-4,2.87E-4,0.001288,3.16E-4,3.24E-4,2.8E-4,0.001291,2.84E-4,2.45E-4,4.25E-4,3.08E-4,3.0E-4,2.62E-4,2.84E-4,2.91E-4,0.001319,2.75E-4,2.75E-4,0.001299,4.3E-4,2.41E-4,4.18E-4,3.15E-4,3.12E-4,2.99E-4,2.83E-4,0.001287,4.97E-4,2.46E-4,2.81E-4,0.001324,3.06E-4,0.001255,3.07E-4,2.74E-4,2.73E-4,3.48E-4,0.001264,2.73E-4,3.14E-4,2.34E-4,2.74E-4,3.07E-4,2.82E-4,0.001249,2.94E-4,3.27E-4,0.00127,3.87E-4,2.63E-4,2.91E-4,4.69E-4,3.05E-4,2.96E-4,2.59E-4,0.001299,2.82E-4,2.75E-4,3.01E-4,2.69E-4,3.01E-4,2.82E-4,2.48E-4,2.37E-4,3.04E-4,4.17E-4,2.87E-4,2.99E-4,2.74E-4,0.001284,2.92E-4,0.099382,0.001297,2.97E-4,3.6E-4,4.0E-4

These values are seemingly similar to each other, yet they still end up at roughly 7 bytes per value.
They are response times in milliseconds (yes, those are very fast endpoints). If we instead use integers with nanosecond precision (i.e. `Math.round(doubleValue * 1_000_000)` as input for the codec), we get a little over 2 bytes per value instead:

1286,400,293,336,306,280,100443,1489,284,312,242,282,315,288,288,327,1258,286,305,269,277,95372,340,304,1295,1294,289,246,302,289,282,242,1291,1287,331,100412,314,260,305,320,290,1314,100419,308,289,283,301,290,284,1351,266,274,271,287,1288,316,324,280,1291,284,245,425,308,300,262,284,291,1319,275,275,1299,430,241,418,315,312,299,283,1287,497,246,281,1324,306,1255,307,274,273,348,1264,273,314,234,274,307,282,1249,294,327,1270,387,263,291,469,305,296,259,1299,282,275,301,269,301,282,248,237,304,417,287,299,274,1284,292,99382,1297,297,360,400,
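The conversion described above can be sketched as follows. This is a minimal illustration, not the actual codec integration; the class and method names (`NanosEncoding`, `toNanos`, `toMillis`) are hypothetical, and it assumes millisecond doubles are rounded to whole nanoseconds before encoding:

```java
// Sketch: convert double millisecond response times to long nanosecond
// integers before handing them to the doc-values codec. The conversion is
// lossy beyond nanosecond precision, but it turns near-identical doubles
// (whose bit patterns differ widely) into small longs that integer-oriented
// encodings handle much better.
public final class NanosEncoding {

    /** Milliseconds (double) -> nanoseconds (long), rounded to nearest. */
    static long toNanos(double millis) {
        return Math.round(millis * 1_000_000);
    }

    /** Nanoseconds (long) -> milliseconds (double), for decoding. */
    static double toMillis(long nanos) {
        return nanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        double[] samples = {0.001286, 4.0E-4, 2.93E-4, 0.100443};
        for (double d : samples) {
            // e.g. 0.001286 ms -> 1286 ns
            System.out.println(d + " -> " + toNanos(d));
        }
    }
}
```

Note that the round trip is only exact while the original values carry at most nanosecond precision, so this would be a deliberate precision trade-off rather than a drop-in replacement.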
