Commit 311d552

derrickstolee authored and Git for Windows Build Agent committed
Add path walk API and its use in 'git pack-objects' (#5171)
This is a follow-up to #5157, and is also motivated by the RFC in gitgitgadget#1786.

We have ways of walking all objects, but they focus on visiting a single commit and then expanding the new trees and blobs reachable from that commit that have not been visited yet. This means that objects arrive without any locality based on their path.

Add a new "path walk API" that focuses on walking objects in batches according to their type and path. This will walk all annotated tags, all commits, all root trees, and then start a depth-first search among all paths in the repo to collect trees and blobs in batches.

The most important application for this is being fast-tracked to Git for Windows: `git pack-objects --path-walk`. This application of the path walk API discovers the objects to pack via this batched walk, and automatically groups objects that appear at a common path so they can be checked for delta comparisons.

This use completely avoids any name-hash collisions (even the collisions that sometimes occur with the new `--full-name-hash` option) and can be much faster to compute, since the first pass of delta calculations does not waste time on objects that are unlikely to be diffable.

Some statistics are available in the commit messages.
2 parents ee98f23 + aa6239c commit 311d552

30 files changed, +1466 -41 lines changed

Diff for: Documentation/config/feature.txt (+4)

@@ -20,6 +20,10 @@ walking fewer objects.
 +
 * `pack.allowPackReuse=multi` may improve the time it takes to create a pack by
 reusing objects from multiple packs instead of just one.
++
+* `pack.usePathWalk` may speed up packfile creation and make the packfiles be
+significantly smaller in the presence of certain filename collisions with Git's
+default name-hash.
 
 feature.manyFiles::
 Enable config options that optimize for repos with many files in the
Diff for: Documentation/config/pack.txt (+8)

@@ -155,6 +155,14 @@ pack.useSparse::
 commits contain certain types of direct renames. Default is
 `true`.
 
+pack.usePathWalk::
+When true, git will default to using the '--path-walk' option in
+'git pack-objects' when the '--revs' option is present. This
+algorithm groups objects by path to maximize the ability to
+compute delta chains across historical versions of the same
+object. This may disable other options, such as using bitmaps to
+enumerate objects.
+
 pack.preferBitmapTips::
 When selecting which commits will receive bitmaps, prefer a
 commit at the tip of any reference that is a suffix of any value

Diff for: Documentation/git-pack-objects.txt (+11 -1)

@@ -16,7 +16,7 @@ SYNOPSIS
 [--cruft] [--cruft-expiration=<time>]
 [--stdout [--filter=<filter-spec>] | <base-name>]
 [--shallow] [--keep-true-parents] [--[no-]sparse]
-[--full-name-hash] < <object-list>
+[--full-name-hash] [--path-walk] < <object-list>
 
 
 DESCRIPTION
@@ -346,6 +346,16 @@ raise an error.
 Restrict delta matches based on "islands". See DELTA ISLANDS
 below.
 
+--path-walk::
+By default, `git pack-objects` walks objects in an order that
+presents trees and blobs in an order unrelated to the path they
+appear relative to a commit's root tree. The `--path-walk` option
+enables a different walking algorithm that organizes trees and
+blobs by path. This has the potential to improve delta compression
+especially in the presence of filenames that cause collisions in
+Git's default name-hash algorithm. Due to changing how the objects
+are walked, this option is not compatible with `--delta-islands`,
+`--shallow`, or `--filter`.
 
 DELTA ISLANDS
 -------------

Diff for: Documentation/git-repack.txt (+14 -1)

@@ -11,7 +11,7 @@ SYNOPSIS
 [verse]
 'git repack' [-a] [-A] [-d] [-f] [-F] [-l] [-n] [-q] [-b] [-m]
 [--window=<n>] [--depth=<n>] [--threads=<n>] [--keep-pack=<pack-name>]
-[--write-midx] [--full-name-hash]
+[--write-midx] [--full-name-hash] [--path-walk]
 
 DESCRIPTION
 -----------
@@ -251,6 +251,19 @@ linkgit:git-multi-pack-index[1]).
 Write a multi-pack index (see linkgit:git-multi-pack-index[1])
 containing the non-redundant packs.
 
+--path-walk::
+This option passes the `--path-walk` option to the underlying
+`git pack-objects` process (see linkgit:git-pack-objects[1]).
+By default, `git pack-objects` walks objects in an order that
+presents trees and blobs in an order unrelated to the path they
+appear relative to a commit's root tree. The `--path-walk` option
+enables a different walking algorithm that organizes trees and
+blobs by path. This has the potential to improve delta compression
+especially in the presence of filenames that cause collisions in
+Git's default name-hash algorithm. Due to changing how the objects
+are walked, this option is not compatible with `--delta-islands`
+or `--filter`.
+
 CONFIGURATION
 -------------
 

Diff for: Documentation/technical/api-path-walk.txt (+73)

@@ -0,0 +1,73 @@
+Path-Walk API
+=============
+
+The path-walk API is used to walk reachable objects, but to visit objects
+in batches based on a common path they appear in, or by type.
+
+For example, all reachable commits are visited in a group. All tags are
+visited in a group. Then, all root trees are visited. At some point, all
+blobs reachable via a path `my/dir/to/A` are visited. When there are
+multiple paths possible to reach the same object, then only one of those
+paths is used to visit the object.
+
+When walking a range of commits with some `UNINTERESTING` objects, the
+objects with the `UNINTERESTING` flag are included in these batches. In
+order to walk `UNINTERESTING` objects, the `--boundary` option must be
+used in the commit walk in order to visit `UNINTERESTING` commits.
+
+Basics
+------
+
+To use the path-walk API, include `path-walk.h` and call
+`walk_objects_by_path()` with a customized `path_walk_info` struct. The
+struct is used to set all of the options for how the walk should proceed.
+Let's dig into the different options and their use.
+
+`path_fn` and `path_fn_data`::
+The most important option is the `path_fn` option, which is a
+function pointer to the callback that can execute logic on the
+object IDs for objects grouped by type and path. This function
+also receives a `data` value that corresponds to the
+`path_fn_data` member, for providing custom data structures to
+this callback function.
+
+`revs`::
+To configure the exact details of the reachable set of objects,
+use the `revs` member and initialize it using the revision
+machinery in `revision.h`. Initialize `revs` using calls such as
+`setup_revisions()` or `parse_revision_opt()`. Do not call
+`prepare_revision_walk()`, as that will be called within
+`walk_objects_by_path()`.
++
+It is also important that you do not specify the `--objects` flag for the
+`revs` struct. The revision walk should only be used to walk commits, and
+the objects will be walked in a separate way based on those starting
+commits.
++
+If you want the path-walk API to emit `UNINTERESTING` objects based on the
+commit walk's boundary, be sure to set `revs.boundary` so the boundary
+commits are emitted.
+
+`commits`, `blobs`, `trees`, `tags`::
+By default, these members are enabled and signal that the path-walk
+API should call the `path_fn` on objects of these types. Specialized
+applications could disable some options to make it simpler to walk
+the objects or to have fewer calls to `path_fn`.
++
+While it is possible to walk only commits in this way, consumers would be
+better off using the revision walk API instead.
+
+`prune_all_uninteresting`::
+By default, all reachable paths are emitted by the path-walk API.
+This option allows consumers to declare that they are not
+interested in paths where all included objects are marked with the
+`UNINTERESTING` flag. This requires using the `boundary` option in
+the revision walk so that the walk emits commits marked with the
+`UNINTERESTING` flag.
+
+Examples
+--------
+
+See example usages in:
+`t/helper/test-path-walk.c`,
+`builtin/pack-objects.c`

Diff for: Makefile (+2)

@@ -828,6 +828,7 @@ TEST_BUILTINS_OBJS += test-parse-options.o
 TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
 TEST_BUILTINS_OBJS += test-partial-clone.o
 TEST_BUILTINS_OBJS += test-path-utils.o
+TEST_BUILTINS_OBJS += test-path-walk.o
 TEST_BUILTINS_OBJS += test-pcre2-config.o
 TEST_BUILTINS_OBJS += test-pkt-line.o
 TEST_BUILTINS_OBJS += test-proc-receive.o
@@ -1104,6 +1105,7 @@ LIB_OBJS += parse-options.o
 LIB_OBJS += patch-delta.o
 LIB_OBJS += patch-ids.o
 LIB_OBJS += path.o
+LIB_OBJS += path-walk.o
 LIB_OBJS += pathspec.o
 LIB_OBJS += pkt-line.o
 LIB_OBJS += preload-index.o
