-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
non-ambiguous internal aggregations #5715
base: main
Are you sure you want to change the base?
Conversation
// Serialized aggregation response | ||
optional string aggregation = 5; | ||
// used to be json-encoded aggregation | ||
reserved 5; | ||
|
||
// Postcard-encoded aggregation response | ||
optional bytes aggregation_postcard = 9; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not retro-compatible, right? it means we cannot perform a rolling upgrade of searchers with this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need the node which serves the rest api and the root-searcher for the request to be the same, otherwise the aggregation result just disappear, but you can have root-searcher and leaf-searcher on different versions and that's alright
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need the node which serves the rest api and the root-searcher for the request to be the same
From what I read in the code, this is always the case: the REST API uses SearchService
that only has one concrete implementation (SearchServiceImpl
).
Now, this raises a new question. We could also solve this by getting rid of the serialization entirely. If the SearchService
trait was never re-implemented so far, we can assume it was a premature extra abstraction layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's keep that abstraction
// TODO previously, we were using zero-copy when possible, which we are no longer doing: | ||
// is that problematic? How can we return to zero/low-copy without it being painful? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I switched to serde_json_borrow
was a very large allocation for this intermediate representation (up to 5x the actual response payload). I assume postcard should be a lot better than serde_json::Value
, even if it's not borrowing. A quick check with heaptrack should confirm this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the largest footprint was probably the BTreeMap in serde_json. Here at least we are using hashmap (which I believe could be Vec<(K,V)>
or something more frugal in term of number of allocation. Typically one String big string, and stuff referring to it).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is removing #[serde(untagged)]
from Tantivy aggregation result types the only reason for this "forked" intermediate representation to exist?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postcard also doesn't like skip_serialize_if
, but yeah, that's the main reason
when parsing back something like a BucketResult
, you end up parsing the wrong variant most of the time. That's an issue when converting the aggregation result to anything that's not strictly an ES lookalike
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using https://serde.rs/remote-derive.html (either in Tantivy or Quickwit) could help readability a lot (example)
they have an untagged enum inside
/// Vector format bucket entries | ||
Vec(Vec<T>), | ||
/// HashMap format bucket entries | ||
HashMap(FxHashMap<String, T>), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it have been Vec<(String, T)>
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i was about to say that we'd need to use something like serde_with::Map
, but actually, we could just switch all hashmaps to Vec of tuples, this is our own format, and as long as we can convert losslessly from/to tantivy aggregations, we can do anything we want, including not storing things in the most evident format
use a non-ambiguous format for aggregation results internally
the api is unchanged, but it makes it easier to manipulate results from inside quickwit
before that, trying to re-parse the aggregation results (json) may lead to a different ast than expected, which made it hard to apply any transformation without also traversing the aggregation request ast at the same time