Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

non-ambiguous internal aggregations #5715

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

trinity-1686a
Copy link
Contributor

use a non-ambiguous format for aggregation results internally

the api is unchanged, but it makes it easier to manipulate results from inside quickwit

before that, trying to re-parse the aggregation results (json) may lead to a different ast than expected, which made it hard to apply any transformation without also traversing the aggregation request ast at the same time

Comment on lines -292 to +296
// Serialized aggregation response
optional string aggregation = 5;
// used to be json-encoded aggregation
reserved 5;

// Postcard-encoded aggregation response
optional bytes aggregation_postcard = 9;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not retro-compatible, right? it means we cannot perform a rolling upgrade of searchers with this change

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need the node which serves the rest api and the root-searcher for the request to be the same, otherwise the aggregation result just disappear, but you can have root-searcher and leaf-searcher on different versions and that's alright

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need the node which serves the rest api and the root-searcher for the request to be the same

From what I read in the code, this is always the case: the REST API uses SearchService that only has one concrete implementation (SearchServiceImpl).

Now, this raises a new question. We could also solve this by getting rid of the serialization entirely. If the SearchService trait was never re-implemented so far, we can assume it was a premature extra abstraction layer.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's keep that abstraction

Comment on lines +27 to +28
// TODO previously, we were using zero-copy when possible, which we are no longer doing:
// is that problematic? How can we return to zero/low-copy without it being painful?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I switched to serde_json_borrow was a very large allocation for this intermediate representation (up to 5x the actual response payload). I assume postcard should be a lot better than serde_json::Value, even if it's not borrowing. A quick check with heaptrack should confirm this.

Copy link
Collaborator

@fulmicoton-dd fulmicoton-dd Mar 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the largest footprint was probably the BTreeMap in serde_json. Here at least we are using hashmap (which I believe could be Vec<(K,V)> or something more frugal in term of number of allocation. Typically one String big string, and stuff referring to it).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is removing #[serde(untagged)] from Tantivy aggregation result types the only reason for this "forked" intermediate representation to exist?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postcard also doesn't like skip_serialize_if, but yeah, that's the main reason

when parsing back something like a BucketResult, you end up parsing the wrong variant most of the time. That's an issue when converting the aggregation result to anything that's not strictly an ES lookalike

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using https://serde.rs/remote-derive.html (either in Tantivy or Quickwit) could help readability a lot (example)

they have an untagged enum inside
/// Vector format bucket entries
Vec(Vec<T>),
/// HashMap format bucket entries
HashMap(FxHashMap<String, T>),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it have been Vec<(String, T)>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was about to say that we'd need to use something like serde_with::Map, but actually, we could just switch all hashmaps to Vec of tuples, this is our own format, and as long as we can convert losslessly from/to tantivy aggregations, we can do anything we want, including not storing things in the most evident format

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants