non-ambiguous internal aggregations #5715

trinity-1686a · 2025-03-17T11:06:08Z

use a non-ambiguous format for aggregation results internally

the api is unchanged, but this makes it easier to manipulate results from inside quickwit

before this change, re-parsing the aggregation results (json) could yield a different ast than the one that was serialized, which made it hard to apply any transformation without also traversing the aggregation request ast at the same time

rdettai · 2025-03-17T11:38:18Z

quickwit/quickwit-proto/protos/quickwit/search.proto

-  // Serialized aggregation response
-  optional string aggregation = 5;
+  // used to be json-encoded aggregation
+  reserved 5;
+
+  // Postcard-encoded aggregation response
+  optional bytes aggregation_postcard = 9;


this is not retro-compatible, right? it means we cannot perform a rolling upgrade of searchers with this change

you need the node which serves the rest api and the root-searcher for the request to be the same, otherwise the aggregation result just disappears. you can still have the root-searcher and leaf-searchers on different versions, and that's alright
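To make the "just disappears" failure mode concrete, here is a minimal std-only sketch. It is a toy model, not prost or Quickwit code: `WireMessage`, `encode_old`, `encode_new`, `decode_old` and `decode_new` are hypothetical names, and the only protobuf behaviour it assumes is that a decoder skips field numbers it does not know and reports an absent optional field as `None`. The field numbers 5 and 9 come from the diff above.

```rust
/// Toy wire message: (field number, payload) pairs. Real protobuf framing is
/// more involved; only the field-number lookup matters for this sketch.
type WireMessage = Vec<(u32, Vec<u8>)>;

const AGGREGATION_JSON_FIELD: u32 = 5; // `aggregation`, now reserved
const AGGREGATION_POSTCARD_FIELD: u32 = 9; // `aggregation_postcard`

/// What a node running the old version would send (hypothetical helper).
fn encode_old(json: &str) -> WireMessage {
    vec![(AGGREGATION_JSON_FIELD, json.as_bytes().to_vec())]
}

/// What a node running the new version would send (hypothetical helper).
fn encode_new(postcard_bytes: &[u8]) -> WireMessage {
    vec![(AGGREGATION_POSTCARD_FIELD, postcard_bytes.to_vec())]
}

/// Returns the payload of `field`, silently skipping every other field, the
/// way a protobuf decoder skips field numbers it does not know.
fn read_field(message: &WireMessage, field: u32) -> Option<&[u8]> {
    message
        .iter()
        .find(|(number, _)| *number == field)
        .map(|(_, payload)| payload.as_slice())
}

/// An old reader only knows about field 5.
fn decode_old(message: &WireMessage) -> Option<&[u8]> {
    read_field(message, AGGREGATION_JSON_FIELD)
}

/// A new reader only knows about field 9.
fn decode_new(message: &WireMessage) -> Option<&[u8]> {
    read_field(message, AGGREGATION_POSTCARD_FIELD)
}
```

In this model, `decode_old(&encode_new(..))` and `decode_new(&encode_old(..))` both return `None` without any error, which matches the "aggregation result just disappears" description for a mixed-version pair.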

you need the node which serves the rest api and the root-searcher for the request to be the same

From what I read in the code, this is always the case: the REST API uses SearchService that only has one concrete implementation (SearchServiceImpl).

Now, this raises a new question. We could also solve this by getting rid of the serialization entirely. If the SearchService trait was never re-implemented so far, we can assume it was a premature extra abstraction layer.

let's keep that abstraction

rdettai · 2025-03-17T11:38:43Z

quickwit/quickwit-search/src/search_response_rest.rs

+// TODO previously, we were using zero-copy when possible, which we are no longer doing:
+// is that problematic? How can we return to zero/low-copy without it being painful?


The reason I switched to serde_json_borrow was a very large allocation for this intermediate representation (up to 5x the actual response payload). I assume postcard should be a lot better than serde_json::Value, even if it's not borrowing. A quick check with heaptrack should confirm this.

the largest footprint was probably the BTreeMap in serde_json. Here at least we are using a hashmap (which I believe could be Vec<(K,V)> or something more frugal in terms of number of allocations. Typically one big String, and stuff referring to it).
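One possible reading of "one big String, and stuff referring to it" is sketched below. This is an assumption about the idea, not code from the PR: `PackedEntries` is a hypothetical type that keeps every key in a single `String` buffer and stores only a byte range per entry, so adding an entry does not allocate a new `String` for its key.

```rust
use std::ops::Range;

/// Hypothetical frugal container: all keys live in one `String`, and each
/// entry stores the byte range of its key next to its value.
struct PackedEntries<T> {
    keys: String,
    entries: Vec<(Range<usize>, T)>,
}

impl<T> PackedEntries<T> {
    fn new() -> Self {
        PackedEntries {
            keys: String::new(),
            entries: Vec::new(),
        }
    }

    /// Appends the key to the shared buffer and records where it sits.
    fn push(&mut self, key: &str, value: T) {
        let start = self.keys.len();
        self.keys.push_str(key);
        self.entries.push((start..self.keys.len(), value));
    }

    /// Iterates over (key, value) pairs in insertion order, borrowing the
    /// keys from the shared buffer.
    fn iter(&self) -> impl Iterator<Item = (&str, &T)> + '_ {
        self.entries
            .iter()
            .map(move |(range, value)| (&self.keys[range.clone()], value))
    }
}
```

The trade-off is that lookups by key become linear scans, which may be acceptable if the structure is only built once and then serialized.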

rdettai · 2025-03-17T12:03:28Z

quickwit/quickwit-query/src/aggregations.rs

Is removing #[serde(untagged)] from Tantivy aggregation result types the only reason for this "forked" intermediate representation to exist?

postcard also doesn't like skip_serializing_if, but yeah, that's the main reason

when parsing back something like a BucketResult, you end up parsing the wrong variant most of the time. That's an issue when converting the aggregation result to anything that's not strictly an ES lookalike
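The wrong-variant problem can be shown without serde at all. The sketch below is std-only and purely illustrative: `BucketKind` and the four functions are hypothetical, not Tantivy or Quickwit types. The untagged encoder drops the variant name, much like `#[serde(untagged)]` does, so the decoder has to guess from the shape of the data. The tagged encoder writes a discriminant first, which is similar in spirit to how a non-self-describing format such as postcard encodes enums.

```rust
/// Hypothetical stand-in for a result type whose variants look alike.
#[derive(Debug, Clone, PartialEq)]
enum BucketKind {
    Range(Vec<u64>),
    Histogram(Vec<u64>),
}

fn join_counts(counts: &[u64]) -> String {
    counts.iter().map(|count| count.to_string()).collect::<Vec<_>>().join(",")
}

fn parse_counts(text: &str) -> Option<Vec<u64>> {
    if text.is_empty() {
        return Some(Vec::new());
    }
    text.split(',').map(|part| part.parse::<u64>().ok()).collect()
}

/// Untagged encoding: only the payload is written, the variant name is lost.
fn encode_untagged(kind: &BucketKind) -> String {
    match kind {
        BucketKind::Range(counts) | BucketKind::Histogram(counts) => join_counts(counts),
    }
}

/// Untagged decoding takes the first variant whose shape matches, which for
/// this enum is always `Range`.
fn decode_untagged(text: &str) -> Option<BucketKind> {
    parse_counts(text).map(BucketKind::Range)
}

/// Tagged encoding: a discriminant is written before the payload.
fn encode_tagged(kind: &BucketKind) -> String {
    let (tag, counts) = match kind {
        BucketKind::Range(counts) => (0, counts),
        BucketKind::Histogram(counts) => (1, counts),
    };
    format!("{tag}:{}", join_counts(counts))
}

/// Tagged decoding reads the discriminant and never has to guess.
fn decode_tagged(text: &str) -> Option<BucketKind> {
    let (tag, payload) = text.split_once(':')?;
    let counts = parse_counts(payload)?;
    match tag {
        "0" => Some(BucketKind::Range(counts)),
        "1" => Some(BucketKind::Histogram(counts)),
        _ => None,
    }
}
```

With these definitions, a `Histogram` that goes through `encode_untagged` and `decode_untagged` comes back as a `Range`, while the tagged round trip preserves the variant.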

I think using https://serde.rs/remote-derive.html (either in Tantivy or Quickwit) could help readability a lot (example)

they have an untagged enum inside

fulmicoton-dd · 2025-03-19T10:19:31Z

quickwit/quickwit-query/src/aggregations.rs

+    /// Vector format bucket entries
+    Vec(Vec<T>),
+    /// HashMap format bucket entries
+    HashMap(FxHashMap<String, T>),


Could it have been Vec<(String, T)>?

i was about to say that we'd need to use something like serde_with::Map, but actually, we could just switch all hashmaps to Vec of tuples. this is our own format, and as long as we can convert losslessly from/to tantivy aggregations, we can do anything we want, including not storing things in the most obvious format
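A minimal sketch of such a lossless conversion, assuming keys are unique (which holds for anything that started life as a map). It uses `std::collections::HashMap` rather than `FxHashMap` to stay dependency-free, and `map_to_entries` / `entries_to_map` are hypothetical helper names, not functions from the PR.

```rust
use std::collections::HashMap;

/// Converts a map into a Vec of (key, value) tuples for the internal format.
fn map_to_entries<T>(map: HashMap<String, T>) -> Vec<(String, T)> {
    let mut entries: Vec<(String, T)> = map.into_iter().collect();
    // HashMap iteration order is unspecified, so sort by key to get a
    // deterministic encoded form.
    entries.sort_by(|left, right| left.0.cmp(&right.0));
    entries
}

/// Rebuilds the map when converting back to the map-based representation.
/// Lossless as long as the keys in `entries` are unique.
fn entries_to_map<T>(entries: Vec<(String, T)>) -> HashMap<String, T> {
    entries.into_iter().collect()
}
```

Sorting is optional for correctness of the round trip; it only makes the serialized bytes reproducible.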

trinity-1686a added 2 commits February 28, 2025 11:13

add proxy struct for aggregations

791efcd

use postcard for aggregations internally

26f79c2

trinity-1686a requested a review from fulmicoton-dd March 17, 2025 11:06

add license header

e785fe1

rdettai reviewed Mar 17, 2025

also fork percentile

e9cab79

fulmicoton-dd reviewed Mar 19, 2025

fulmicoton-dd approved these changes Mar 19, 2025

use vec instead of hashmaps

5b23c88
