Description
While testing the new exponential_histogram feature, I ingested a large data set of real-world response times from our monitoring.
While doing so, I noticed that the min, max and sum, which we store as doubles in NumericDocValues for the histograms, don't compress well: we end up with roughly 6.8 bytes per double value.
We should investigate why our compression performs suboptimally on this data and whether we can improve it.
Here are 128 sample values of the min values for one of the series, which reproduce the inefficient compression behaviour when used in the ES87TSDBDocValuesEncoderTests:
0.001286,4.0E-4,2.93E-4,3.36E-4,3.06E-4,2.8E-4,0.100443,0.001489,2.84E-4,3.12E-4,2.42E-4,2.82E-4,3.15E-4,2.88E-4,2.88E-4,3.27E-4,0.001258,2.86E-4,3.05E-4,2.69E-4,2.77E-4,0.095372,3.4E-4,3.04E-4,0.001295,0.001294,2.89E-4,2.46E-4,3.02E-4,2.89E-4,2.82E-4,2.42E-4,0.001291,0.001287,3.31E-4,0.100412,3.14E-4,2.6E-4,3.05E-4,3.2E-4,2.9E-4,0.001314,0.100419,3.08E-4,2.89E-4,2.83E-4,3.01E-4,2.9E-4,2.84E-4,0.001351,2.66E-4,2.74E-4,2.71E-4,2.87E-4,0.001288,3.16E-4,3.24E-4,2.8E-4,0.001291,2.84E-4,2.45E-4,4.25E-4,3.08E-4,3.0E-4,2.62E-4,2.84E-4,2.91E-4,0.001319,2.75E-4,2.75E-4,0.001299,4.3E-4,2.41E-4,4.18E-4,3.15E-4,3.12E-4,2.99E-4,2.83E-4,0.001287,4.97E-4,2.46E-4,2.81E-4,0.001324,3.06E-4,0.001255,3.07E-4,2.74E-4,2.73E-4,3.48E-4,0.001264,2.73E-4,3.14E-4,2.34E-4,2.74E-4,3.07E-4,2.82E-4,0.001249,2.94E-4,3.27E-4,0.00127,3.87E-4,2.63E-4,2.91E-4,4.69E-4,3.05E-4,2.96E-4,2.59E-4,0.001299,2.82E-4,2.75E-4,3.01E-4,2.69E-4,3.01E-4,2.82E-4,2.48E-4,2.37E-4,3.04E-4,4.17E-4,2.87E-4,2.99E-4,2.74E-4,0.001284,2.92E-4,0.099382,0.001297,2.97E-4,3.6E-4,4.0E-4
These values are seemingly similar to each other, yet they still end up with roughly 7 bytes per value.
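To illustrate why these values may be harder to compress than they look, here is a minimal sketch that inspects the raw IEEE 754 bit patterns of a few of the samples. The class and method names (`DoubleBitsSketch`, `xorBits`, `trailingZeroBits`) are made up for this illustration and are not part of the codec. It only assumes that an integer-oriented encoder benefits from small differences between neighbouring values or from common trailing zero bits, which decimal fractions like 2.93E-4 typically don't offer because they are not exactly representable in binary.

```java
public class DoubleBitsSketch {

    // Number of bits needed to store v as an unsigned value (0 for v == 0).
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    // Bits needed to store the XOR of the raw bit patterns of two doubles.
    static int xorBits(double a, double b) {
        return bitsRequired(Double.doubleToRawLongBits(a) ^ Double.doubleToRawLongBits(b));
    }

    // Trailing zero bits in the raw bit pattern of a double.
    static int trailingZeroBits(double d) {
        return Long.numberOfTrailingZeros(Double.doubleToRawLongBits(d));
    }

    public static void main(String[] args) {
        // First few min values from the sample above.
        double[] sample = {0.001286, 4.0E-4, 2.93E-4, 3.36E-4, 3.06E-4, 2.8E-4};
        for (int i = 1; i < sample.length; i++) {
            System.out.printf(
                "%s -> 0x%016x, xor with previous needs %d bits, %d trailing zero bits%n",
                sample[i],
                Double.doubleToRawLongBits(sample[i]),
                xorBits(sample[i - 1], sample[i]),
                trailingZeroBits(sample[i])
            );
        }
    }
}
```

Even when two neighbouring values share the same exponent, their mantissas differ in high-order bits, so the difference between the raw longs stays wide.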
They are response times in milliseconds (yes, those are very fast endpoints). If we instead use integers with nanosecond precision (so doubleValue -> Math.round(doubleValue * 1_000_000)) as input for the codec, we get a little over 2 bytes per value instead:
1286,400,293,336,306,280,100443,1489,284,312,242,282,315,288,288,327,1258,286,305,269,277,95372,340,304,1295,1294,289,246,302,289,282,242,1291,1287,331,100412,314,260,305,320,290,1314,100419,308,289,283,301,290,284,1351,266,274,271,287,1288,316,324,280,1291,284,245,425,308,300,262,284,291,1319,275,275,1299,430,241,418,315,312,299,283,1287,497,246,281,1324,306,1255,307,274,273,348,1264,273,314,234,274,307,282,1249,294,327,1270,387,263,291,469,305,296,259,1299,282,275,301,269,301,282,248,237,304,417,287,299,274,1284,292,99382,1297,297,360,400,
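The conversion used for the integer list can be sketched as follows. This is only an illustration under the assumption that the values carry at most nanosecond precision, as the min samples above do; the names `ScaledLongSketch`, `toNanos`, `toMillis` and `roundTrips` are hypothetical. Values with more digits, which may well be the case for sum, would not round-trip this way.

```java
public class ScaledLongSketch {

    // Milliseconds to nanoseconds.
    static final double SCALE = 1_000_000d;

    // Same conversion as in the description: Math.round(doubleValue * 1_000_000).
    static long toNanos(double millis) {
        return Math.round(millis * SCALE);
    }

    // Inverse conversion back to a millisecond double.
    static double toMillis(long nanos) {
        return nanos / SCALE;
    }

    // True if the double survives the conversion to nanoseconds and back unchanged.
    static boolean roundTrips(double millis) {
        return toMillis(toNanos(millis)) == millis;
    }

    // Number of bits needed to store v as an unsigned value (0 for v == 0).
    static int bitsRequired(long v) {
        return 64 - Long.numberOfLeadingZeros(v);
    }

    public static void main(String[] args) {
        // A few min values from the sample, including one of the large outliers.
        double[] sample = {0.001286, 4.0E-4, 2.93E-4, 0.100443};
        for (double millis : sample) {
            long nanos = toNanos(millis);
            System.out.printf(
                "%s ms -> %d ns (%d bits), round-trips: %b%n",
                millis, nanos, bitsRequired(nanos), roundTrips(millis)
            );
        }
    }
}
```

The scaled values are small integers, so even the largest sample fits in far fewer bits than a raw double.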