This is part 3 of 3
This PR replays an earlier attempt without the two bugs it introduced
(fanout problem for byte `00`; dropping large offsets). The previous PRs
protect us from writing a bad MIDX if other bugs are introduced, and
test the MIDX feature more rigorously. If we ever do see a problematic
MIDX in the wild, we can use `git midx --verify` to inspect it and see
what is wrong.
With the creation of this PR, I will run the gamut of GVFS tests and
add a `git midx --verify` step to the OS Repo Tests (once the
GitForWindows update is in master).
---
The multi-pack index requires a sorted list of objects to
create a binary-searchable index of objects across multiple
pack-files. If an object appears in multiple packs, then we
select a single copy based on the most-recent mtime among
packs containing that object.
In the case of many duplicate objects, or simply many objects,
we can speed up the de-duplication by processing objects in
batches. Using the first byte of the object ID is a natural
way to batch because the MIDX and IDX files have a fanout table
based on the first byte. This gives us a way to navigate directly
to the objects from each batch from each source.
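As a rough illustration of navigating by fanout, the sketch below computes the range of entries for one first-byte value. It assumes `fanout[i]` holds the number of objects whose first OID byte is at most `i`, already converted to host byte order; the helper name `fanout_range` is hypothetical and not taken from the patch. Byte `00` has no preceding fanout entry, so it needs its own case.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: compute the half-open
 * range [*start, *end) of entries whose OID begins with first_byte.
 * Assumes fanout[] holds cumulative counts in host byte order.
 */
void fanout_range(const uint32_t fanout[256], unsigned char first_byte,
                  uint32_t *start, uint32_t *end)
{
    /* Byte 0x00 has no predecessor entry, so its batch starts at 0. */
    *start = first_byte ? fanout[first_byte - 1] : 0;
    *end = fanout[first_byte];
}
```

A batch is empty whenever `*start == *end`, so callers can skip it without touching the object list.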
To process a batch, create an array of MIDX entries for each object
matching that first byte value, then sort by OID (breaking ties by
recently-modified packs first). Then copy the first instance of an
object to the final object list.
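The batch step described above might look roughly like the following. This is a minimal sketch, not the code in builtin/midx.c: the `midx_entry` struct, its field names, the 20-byte OID length, and `dedup_batch()` are all assumptions made for illustration. The offset is kept as a 64-bit value so that large offsets survive the copy.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define OID_LEN 20 /* assumes SHA-1 object IDs */

/* Hypothetical entry type; field names are illustrative only. */
struct midx_entry {
    unsigned char oid[OID_LEN];
    uint32_t pack_id;
    uint64_t offset;    /* 64-bit so large offsets are not truncated */
    int64_t pack_mtime; /* mtime of the pack containing this copy */
};

/* Order by OID; for equal OIDs, the more recently modified pack first. */
static int entry_cmp(const void *va, const void *vb)
{
    const struct midx_entry *a = va, *b = vb;
    int cmp = memcmp(a->oid, b->oid, OID_LEN);

    if (cmp)
        return cmp;
    if (a->pack_mtime != b->pack_mtime)
        return a->pack_mtime > b->pack_mtime ? -1 : 1;
    return 0;
}

/*
 * Sort one batch in place, then copy the first instance of each OID
 * into out[], which must have room for nr entries. Returns the number
 * of entries written.
 */
size_t dedup_batch(struct midx_entry *batch, size_t nr,
                   struct midx_entry *out)
{
    size_t i, nr_out = 0;

    if (!nr)
        return 0;
    qsort(batch, nr, sizeof(*batch), entry_cmp);
    for (i = 0; i < nr; i++) {
        /* Skip later copies of an OID that was already emitted. */
        if (i && !memcmp(batch[i].oid, batch[i - 1].oid, OID_LEN))
            continue;
        out[nr_out++] = batch[i];
    }
    return nr_out;
}
```

Calling this once per first-byte value, in increasing byte order, and appending each result to the same output array would yield a list that is sorted and de-duplicated overall, since every OID in a later batch is greater than every OID in an earlier one.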
Since the pack-by-pack loading happens in builtin/midx.c, move this
de-duplication to that file and add an expectation to write_midx_file()
that the input object list is sorted and de-duplicated.
Note: this commit includes fixes for the bugs introduced by a
previous version ("midx: batch object sort by first byte").
Signed-off-by: Derrick Stolee <[email protected]>