fix: make random iteration over ComparisonIndexer fair across leaf indexers#2431
Open
mvanhorn wants to merge 2 commits into
Open
fix: make random iteration over ComparisonIndexer fair across leaf indexers#2431mvanhorn wants to merge 2 commits into
mvanhorn wants to merge 2 commits into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Makes random iteration over a
ComparisonIndexerfair across all elements in the query range. When the comparison map holds several buckets (leaf sub-indexers), a tuple now has the same probability of being visited next regardless of which bucket it lives in.Why this matters
ComparisonIndexer.RandomIteratorextendedDefaultIterator, which walks the in-range buckets sequentially and only randomizes within each leaf indexer's ownrandomIterator. As noted in #2325, the iterator picks buckets without regard to how many items each leaf holds, so a tuple in a small bucket is over- or under-represented relative to one in a large bucket. For random move selection this skews the distribution away from uniform-over-elements.core/.../bavet/common/index/ComparisonIndexer.javanow draws each in-range leaf indexer with probability proportional to itssize(queryCompositeKey), so every element within the range is equally likely to be visited next. The orderediterator()/forEach()/size()paths and the single-bucket and empty paths are unchanged.For the filtered overload, selection draws from each leaf's unfiltered iterator and applies the predicate during selection, removing rejected tuples as they are drawn. This keeps the bucket weights exact, so the result is fair over the surviving elements rather than over the raw bucket sizes.
remove()and itsIllegalStateExceptionsemantics are preserved.Testing
Added tests in
ComparisonIndexerTestexercising the random path directly (constructing the indexer with a random-access leaf backend):randomIteratorIsFairAcrossLeafIndexersOfDifferentSizes: one 100-element bucket plus three single-element buckets, sampled 200k times; every element's selection frequency is approximately uniform (the small-bucket tuples were never selected under the old behavior).randomIteratorWithFilterIsFairOverSurvivingElements: a 100-tuple bucket with a single match against a one-tuple bucket; both survivors are selected about equally often.remove()beforenext(), and predicate-respecting completeness for the filtered overload../mvnw -pl core -am test -Dtest=ComparisonIndexerTest-> Tests run: 10, Failures: 0, Errors: 0.Fixes #2325