Discussion:
[PATCH v2 0/25] prune-safety
Jeff King
2014-10-15 22:32:44 UTC
Permalink
Here's a re-roll of the patch series I posted earlier to make "git
prune" keep more contiguous chunks of the object graph. The cleanups to
t5304 were spun off into their own series, and are dropped here.
However, the other patches seem to have multiplied in number (I must
have fed them after midnight).

Here are the changes since the first round (thanks everybody for your
comments):

- fix bogus return values from freshen_file, foreach_alt_odb, and
for_each_packed_object

- make for_each_object_in_pack static

- clarify commit message for "keep objects reachable from recent
objects" patch (this was the one that confused Junio, and I
elaborated based on our discussion)

- clarify the definition of "loose object dirs" in the comment above
for_each_loose_file_in_object_dir

- in for_each_loose_file, traverse hashed loose object directories in
numeric order, and pass the number to the subdir callback (this is
used by prune-packed for its progress updates); as a side effect,
this fixes the bugs Michael noticed with the subdir callback.

- prune-packed now reuses the for_each_loose_file interface

- use revs->ignore_missing_links so we don't barf on already-missing
unreferenced objects

- convert reachable.c to use traverse_commit_list instead of its own
custom walk; this gives support for ignore_missing_links above, and
saves us a fair bit of code.

- while in the area, I noticed that reachable.c's reflog handling is
the same as rev-list's --reflog option; it now builds on what's in
revision.c.

That takes us up to patch 17. While working in reachable.c, I noticed an
oddity: we consider objects in the index to be reachable during prune
(which is good), but we do not when dropping them during a repack that
uses --unpack-unreachable=<expiration>. The remaining patches fix that,
which needed a fair bit of preparatory cleanup.

I'm really beginning to question whether the "just drop objects that are
about to be pruned" optimization done in 7e52f56 (gc: do not explode
objects which will be immediately pruned, 2012-04-07). It really
complicates things as pack-objects and prune need to have the exact same
rules (and implementing it naively, by having pack-objects run the same
code as prune, is not desirable because pack-objects has _already_ done
a complete expensive traversal to generate the packing list). And I fear
it will get even worse if we implement some of the race-condition fixes
that Michael suggested earlier.

On the other hand, the loosening behavior without 7e52f56 has some
severe pathological cases. A repository which has had a chunk of history
deleted can easily increase in size several orders of magnitude due to
loosening (since we lose the benefit of all deltas in the loosened
objects).

Finally, there are a few things that were discussed that I didn't
address/fix. I don't think any of them is a critical blocker, but I
did want to summarize the state:

- when refreshing, we may update a pack's mtime multiple times. It
probably wouldn't be too hard to cache this and only update once per
program run, but I also don't think it's that big a deal in
practice.

- We will munge mtimes of objects found in alternates. If we don't
have write access to the alternate, we'll create a local duplicate
of the object. This is the safer thing, but I'm not sure if there
are cases where we might try to write out a large number of objects
which exist in an alternate (OTOH, we will eventually drop them at
the next repack).

- I didn't implement the "sort by inode" trick that fsck does when
traversing the loose objects. It wouldn't be too hard, but I'm not
convinced it's actually important.

- I didn't convert fsck to the for_each_loose_file interface (mostly
because I didn't do the inode-sorting trick, and while I don't think
it matters, I didn't go to the work to show that it _doesn't_).

Here are the patches:

[01/25]: foreach_alt_odb: propagate return value from callback
[02/25]: isxdigit: cast input to unsigned char
[03/25]: object_array: factor out slopbuf-freeing logic
[04/25]: object_array: add a "clear" function
[05/25]: clean up name allocation in prepare_revision_walk
[06/25]: reachable: use traverse_commit_list instead of custom walk
[07/25]: reachable: reuse revision.c "add all reflogs" code
[08/25]: prune: factor out loose-object directory traversal
[09/25]: reachable: mark index blobs as SEEN
[10/25]: prune-packed: use for_each_loose_file_in_objdir
[11/25]: count-objects: do not use xsize_t when counting object size
[12/25]: count-objects: use for_each_loose_file_in_objdir
[13/25]: sha1_file: add for_each iterators for loose and packed objects
[14/25]: prune: keep objects reachable from recent objects
[15/25]: pack-objects: refactor unpack-unreachable expiration check
[16/25]: pack-objects: match prune logic for discarding objects
[17/25]: write_sha1_file: freshen existing objects
[18/25]: make add_object_array_with_context interface more sane
[19/25]: traverse_commit_list: support pending blobs/trees with paths
[20/25]: rev-list: document --reflog option
[21/25]: rev-list: add --index-objects option
[22/25]: reachable: use revision machinery's --index-objects code
[23/25]: pack-objects: use argv_array
[24/25]: repack: pack objects mentioned by the index
[25/25]: pack-objects: double-check options before discarding objects
Jeff King
2014-10-15 22:33:13 UTC
Permalink
We check the return value of the callback and stop iterating
if it is non-zero. However, we do not make the non-zero
return value available to the caller, so they have no way of
knowing whether the operation succeeded or not (technically
they can keep their own error flag in the callback data, but
that is unlike our other for_each functions).

Signed-off-by: Jeff King <***@peff.net>
---
cache.h | 2 +-
sha1_file.c | 12 ++++++++----
2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/cache.h b/cache.h
index 5b86065..13fadb6 100644
--- a/cache.h
+++ b/cache.h
@@ -1125,7 +1125,7 @@ extern void prepare_alt_odb(void);
extern void read_info_alternates(const char * relative_base, int depth);
extern void add_to_alternates_file(const char *reference);
typedef int alt_odb_fn(struct alternate_object_database *, void *);
-extern void foreach_alt_odb(alt_odb_fn, void*);
+extern int foreach_alt_odb(alt_odb_fn, void*);

struct pack_window {
struct pack_window *next;
diff --git a/sha1_file.c b/sha1_file.c
index 83f77f0..fa881bf 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -413,14 +413,18 @@ void add_to_alternates_file(const char *reference)
link_alt_odb_entries(alt, strlen(alt), '\n', NULL, 0);
}

-void foreach_alt_odb(alt_odb_fn fn, void *cb)
+int foreach_alt_odb(alt_odb_fn fn, void *cb)
{
struct alternate_object_database *ent;
+ int r = 0;

prepare_alt_odb();
- for (ent = alt_odb_list; ent; ent = ent->next)
- if (fn(ent, cb))
- return;
+ for (ent = alt_odb_list; ent; ent = ent->next) {
+ r = fn(ent, cb);
+ if (r)
+ break;
+ }
+ return r;
}

void prepare_alt_odb(void)
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:34:05 UTC
Permalink
Otherwise, callers must do so or risk triggering warnings
-Wchar-subscript (and rightfully so; a signed char might
cause us to use a bogus negative index into the
hexval_table).

While we are dropping the now-unnecessary casts from the
caller in urlmatch.c, we can get rid of similar casts in
actually parsing the hex by using the hexval() helper, which
implicitly casts to unsigned (but note that we cannot
implement isxdigit in terms of hexval(), as it also casts
its return value to unsigned).

Signed-off-by: Jeff King <***@peff.net>
---
The patch that added more calls to isxdigit later in the series actually
got reworked. So this is purely a cleanup and can be dropped if need be,
though I still think it is an improvement.

git-compat-util.h | 2 +-
urlmatch.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index fb41118..44890d5 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -677,7 +677,7 @@ extern const unsigned char sane_ctype[256];
#define iscntrl(x) (sane_istest(x,GIT_CNTRL))
#define ispunct(x) sane_istest(x, GIT_PUNCT | GIT_REGEX_SPECIAL | \
GIT_GLOB_SPECIAL | GIT_PATHSPEC_MAGIC)
-#define isxdigit(x) (hexval_table[x] != -1)
+#define isxdigit(x) (hexval_table[(unsigned char)(x)] != -1)
#define tolower(x) sane_case((unsigned char)(x), 0x20)
#define toupper(x) sane_case((unsigned char)(x), 0)
#define is_pathspec_magic(x) sane_istest(x,GIT_PATHSPEC_MAGIC)
diff --git a/urlmatch.c b/urlmatch.c
index 3d4c54b..618d216 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -43,11 +43,11 @@ static int append_normalized_escapes(struct strbuf *buf,
from_len--;
if (ch == '%') {
if (from_len < 2 ||
- !isxdigit((unsigned char)from[0]) ||
- !isxdigit((unsigned char)from[1]))
+ !isxdigit(from[0]) ||
+ !isxdigit(from[1]))
return 0;
- ch = hexval_table[(unsigned char)*from++] << 4;
- ch |= hexval_table[(unsigned char)*from++];
+ ch = hexval(*from++) << 4;
+ ch |= hexval(*from++);
from_len -= 2;
was_esc = 1;
}
--
2.1.2.596.g7379948
Junio C Hamano
2014-10-16 17:16:10 UTC
Permalink
Post by Jeff King
Otherwise, callers must do so or risk triggering warnings
-Wchar-subscript (and rightfully so; a signed char might
cause us to use a bogus negative index into the
hexval_table).
While we are dropping the now-unnecessary casts from the
caller in urlmatch.c, we can get rid of similar casts in
actually parsing the hex by using the hexval() helper, which
implicitly casts to unsigned (but note that we cannot
implement isxdigit in terms of hexval(), as it also casts
its return value to unsigned).
---
The patch that added more calls to isxdigit later in the series actually
got reworked. So this is purely a cleanup and can be dropped if need be,
though I still think it is an improvement.
Yes, thanks.
Post by Jeff King
git-compat-util.h | 2 +-
urlmatch.c | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/git-compat-util.h b/git-compat-util.h
index fb41118..44890d5 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -677,7 +677,7 @@ extern const unsigned char sane_ctype[256];
#define iscntrl(x) (sane_istest(x,GIT_CNTRL))
#define ispunct(x) sane_istest(x, GIT_PUNCT | GIT_REGEX_SPECIAL | \
GIT_GLOB_SPECIAL | GIT_PATHSPEC_MAGIC)
-#define isxdigit(x) (hexval_table[x] != -1)
+#define isxdigit(x) (hexval_table[(unsigned char)(x)] != -1)
#define tolower(x) sane_case((unsigned char)(x), 0x20)
#define toupper(x) sane_case((unsigned char)(x), 0)
#define is_pathspec_magic(x) sane_istest(x,GIT_PATHSPEC_MAGIC)
diff --git a/urlmatch.c b/urlmatch.c
index 3d4c54b..618d216 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -43,11 +43,11 @@ static int append_normalized_escapes(struct strbuf *buf,
from_len--;
if (ch == '%') {
if (from_len < 2 ||
- !isxdigit((unsigned char)from[0]) ||
- !isxdigit((unsigned char)from[1]))
+ !isxdigit(from[0]) ||
+ !isxdigit(from[1]))
return 0;
- ch = hexval_table[(unsigned char)*from++] << 4;
- ch |= hexval_table[(unsigned char)*from++];
+ ch = hexval(*from++) << 4;
+ ch |= hexval(*from++);
from_len -= 2;
was_esc = 1;
}
Jeff King
2014-10-15 22:34:19 UTC
Permalink
This is not a lot of code, but it's a logical construct that
should not need to be repeated (and we are about to add a
third repetition).

Signed-off-by: Jeff King <***@peff.net>
---
object.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/object.c b/object.c
index ca9d790..60f4864 100644
--- a/object.c
+++ b/object.c
@@ -355,6 +355,16 @@ void add_object_array_with_context(struct object *obj, const char *name, struct
add_object_array_with_mode_context(obj, name, array, S_IFINVALID, context);
}

+/*
+ * Free all memory associated with an entry; the result is
+ * in an unspecified state and should not be examined.
+ */
+static void object_array_release_entry(struct object_array_entry *ent)
+{
+ if (ent->name != object_array_slopbuf)
+ free(ent->name);
+}
+
void object_array_filter(struct object_array *array,
object_array_each_func_t want, void *cb_data)
{
@@ -367,8 +377,7 @@ void object_array_filter(struct object_array *array,
objects[dst] = objects[src];
dst++;
} else {
- if (objects[src].name != object_array_slopbuf)
- free(objects[src].name);
+ object_array_release_entry(&objects[src]);
}
}
array->nr = dst;
@@ -400,8 +409,7 @@ void object_array_remove_duplicates(struct object_array *array)
objects[array->nr] = objects[src];
array->nr++;
} else {
- if (objects[src].name != object_array_slopbuf)
- free(objects[src].name);
+ object_array_release_entry(&objects[src]);
}
}
}
--
2.1.2.596.g7379948
Junio C Hamano
2014-10-16 17:39:54 UTC
Permalink
Post by Jeff King
This is not a lot of code, but it's a logical construct that
should not need to be repeated (and we are about to add a
third repetition).
Good, but I have two and a half tangential comments about the
context that appears in this patch ;-)
Post by Jeff King
void object_array_filter(struct object_array *array,
object_array_each_func_t want, void *cb_data)
{
@@ -367,8 +377,7 @@ void object_array_filter(struct object_array *array,
objects[dst] = objects[src];
dst++;
} else {
- if (objects[src].name != object_array_slopbuf)
- free(objects[src].name);
+ object_array_release_entry(&objects[src]);
}
}
array->nr = dst;
@@ -400,8 +409,7 @@ void object_array_remove_duplicates(struct object_array *array)
objects[array->nr] = objects[src];
array->nr++;
} else {
- if (objects[src].name != object_array_slopbuf)
- free(objects[src].name);
+ object_array_release_entry(&objects[src]);
}
}
}
1. These two functions both remove elements from a given array
in-place, the former being in a more generic form that takes a
caller-specified criterion while the latter uses a hardcoded
condition to decide what to filter. aeb4a51e (object_array:
add function object_array_filter(), 2013-05-25) and later
1506510c (object_array_remove_duplicates(): rewrite to reduce
copying, 2013-05-25) should have refactored the latter further
to implement it in terms of the former, perhaps?

1.5 I would have expected a function to "remove duplicates from an
array" to remove duplicates from the array by comparing the objects
contained in the array, not entries that may (or may not) point at
different objects but happens to share the same name; I think this
function is misnamed.

2. We use object_array_remove_duplicates() to de-dup "git bundle
create x master master", which came from b2a6d1c6 (bundle:
allow the same ref to be given more than once, 2009-01-17),
which is still the sole caller of the function, and I think
this is bogus. Comparing .name would not de-dup "git bundle
create x master refs/heads/master".

I think the right way to fix these two and a half problems is to do
the following:

- object_array_remove_duplicates() (and contains_name() helper it
uses) should be removed from object.c;

- create_bundle() in bundle.c should implement a helper that is
similar to contains_name() but knows about ref dwimming and use
it to call object_array_filter() to replace its call to
object_array_remove_duplicates().

I am not doing this myself, and I do not expect either you or
Michael to do so, either. I am just writing this down to point out
a low hanging fruit to aspiring new contributors (hint, hint).

Thanks.
Jeff King
2014-10-17 00:33:56 UTC
Permalink
[subject tweaked as we have veered quite far off the original, and
this might get more attention from interested people]
Post by Junio C Hamano
2. We use object_array_remove_duplicates() to de-dup "git bundle
allow the same ref to be given more than once, 2009-01-17),
which is still the sole caller of the function, and I think
this is bogus. Comparing .name would not de-dup "git bundle
create x master refs/heads/master".
Hmm. We also doesn't respect the UNINTERESTING flag, so it will dedup
"foo" and "^foo" into a single entry. I _think_ this doesn't have any
bad side effects, because they are equivalent (i.e., the flag is carried
on the object, not the pending entry). I would expect this:

git bundle create ... foo ^foo

to barf (and it does) because the bundle is empty. But with this:

git bundle create ... another-ref foo ^foo

I would expect the resulting bundle to contain the objects from
another-ref, but still mention "foo" as an update (we didn't need to
send any objects, since presumably the other side already had them). It
doesn't, but that is not because of the dedup. It is because we generate
the list of refs to send based on the pending list, and we skip all
UNINTERESTING objects (we must, because otherwise "^foo" would make an
entry). I.e., it is the same "the flag is on the object, not the pending
entry" that saved us above now causing its own bug. Obviously the
example above is sort of silly, but if you have:

# two branches point at the same object
git branch foo master

# the other side already has master. Let's send them foo.
# this will fail because the bundle is empty. That's mildly
# annoying because we really want to tell them "hey, update
# your foo branch". But at least we get an error.
git bundle create tmp.bundle foo ^master

# now the same thing, but we're sending them some other objects.
# This one succeeds, but silently omits foo from the bundle!
git bundle create tmp.bundle foo another-ref ^master

I have a feeling that the list needs to be generated from revs.cmdline,
not revs.pending, and the de-duplication needs to happen there (with the
ref resolution that you mention).

I also have the feeling that fast-export had to deal with this exact
same issue. It looks like we use revs.cmdline there. I seem to recall
there was some ongoing work in that area, but I stopped paying close
attention due to some personality conflicts, and I don't know if all of
the issues were ever resolved.
Post by Junio C Hamano
I think the right way to fix these two and a half problems is to do
- object_array_remove_duplicates() (and contains_name() helper it
uses) should be removed from object.c;
- create_bundle() in bundle.c should implement a helper that is
similar to contains_name() but knows about ref dwimming and use
it to call object_array_filter() to replace its call to
object_array_remove_duplicates().
Agreed. The loop in create_bundle right after the de-dup does the
dwimming. Probably it would be simple to just skip duplicates there
using a hashmap or sorted list.
Post by Junio C Hamano
I am not doing this myself, and I do not expect either you or
Michael to do so, either. I am just writing this down to point out
a low hanging fruit to aspiring new contributors (hint, hint).
I am also not planning on working on it soon, but now we have hopefully
fed plenty of possibilities to anybody who wants to. :)

-Peff
Philip Oakley
2014-10-17 21:03:14 UTC
Permalink
Post by Jeff King
[subject tweaked as we have veered quite far off the original, and
this might get more attention from interested people]
Post by Junio C Hamano
2. We use object_array_remove_duplicates() to de-dup "git bundle
allow the same ref to be given more than once, 2009-01-17),
which is still the sole caller of the function, and I think
this is bogus. Comparing .name would not de-dup "git bundle
create x master refs/heads/master".
Hmm. We also doesn't respect the UNINTERESTING flag, so it will dedup
"foo" and "^foo" into a single entry. I _think_ this doesn't have any
bad side effects, because they are equivalent (i.e., the flag is carried
git bundle create ... foo ^foo
git bundle create ... another-ref foo ^foo
I would expect the resulting bundle to contain the objects from
another-ref, but still mention "foo" as an update (we didn't need to
send any objects, since presumably the other side already had them). It
doesn't, but that is not because of the dedup. It is because we generate
the list of refs to send based on the pending list, and we skip all
UNINTERESTING objects (we must, because otherwise "^foo" would make an
entry). I.e., it is the same "the flag is on the object, not the pending
entry" that saved us above now causing its own bug. Obviously the
# two branches point at the same object
git branch foo master
# the other side already has master. Let's send them foo.
# this will fail because the bundle is empty. That's mildly
# annoying because we really want to tell them "hey, update
# your foo branch". But at least we get an error.
git bundle create tmp.bundle foo ^master
Isn't this kindof what happens as an issue when we want the right HEAD
to be included explicitly in a bundle. Though
ttp://thread.gmane.org/gmane.comp.version-control.git/234053 suggests
its more complicated than that.
Post by Jeff King
# now the same thing, but we're sending them some other objects.
# This one succeeds, but silently omits foo from the bundle!
git bundle create tmp.bundle foo another-ref ^master
I have a feeling that the list needs to be generated from
revs.cmdline,
not revs.pending, and the de-duplication needs to happen there (with the
ref resolution that you mention).
I also have the feeling that fast-export had to deal with this exact
same issue. It looks like we use revs.cmdline there. I seem to recall
there was some ongoing work in that area, but I stopped paying close
attention due to some personality conflicts, and I don't know if all of
the issues were ever resolved.
Post by Junio C Hamano
I think the right way to fix these two and a half problems is to do
- object_array_remove_duplicates() (and contains_name() helper it
uses) should be removed from object.c;
- create_bundle() in bundle.c should implement a helper that is
similar to contains_name() but knows about ref dwimming and use
it to call object_array_filter() to replace its call to
object_array_remove_duplicates().
Agreed. The loop in create_bundle right after the de-dup does the
dwimming. Probably it would be simple to just skip duplicates there
using a hashmap or sorted list.
Post by Junio C Hamano
I am not doing this myself, and I do not expect either you or
Michael to do so, either. I am just writing this down to point out
a low hanging fruit to aspiring new contributors (hint, hint).
I am also not planning on working on it soon, but now we have
hopefully
fed plenty of possibilities to anybody who wants to. :)
--
Philip
Junio C Hamano
2014-10-17 22:41:22 UTC
Permalink
Post by Philip Oakley
Post by Jeff King
# two branches point at the same object
git branch foo master
# the other side already has master. Let's send them foo.
# this will fail because the bundle is empty. That's mildly
# annoying because we really want to tell them "hey, update
# your foo branch". But at least we get an error.
git bundle create tmp.bundle foo ^master
Isn't this kindof what happens as an issue when we want the right HEAD
to be included explicitly in a bundle. Though
What we are discussing here is "we tell from the command line where
the histories end, but do we correctly record all these end points
as fetchable refs in the resulting bundle?"

It does not have anything to do with "bundle that does not record
its HEAD cannot be cloned", which happens when you do not mention
HEAD when creating the bundle in the first place, which is a totally
different thing.
Post by Philip Oakley
http://thread.gmane.org/gmane.comp.version-control.git/234053 suggests
its more complicated than that.
The main topic of discussion does not have much to what bundle
records and what a reader of a bundle guesses. It is about what
goes on the wire and mention of bundle was just a tangent brought up
by those who do not know what was being discussed, I think.

I think the right fix to the "git bundle" issue is to make it easier
on the "git bundle create" side to have the resulting bundle record
its HEAD, even when the user did not mention HEAD on the command
line. For example, when there is only one end point, e.g. "git
bundle create x next", record refs/heads/next _and_ HEAD pointing at
the same commit, because there is no other seneible choice.

"git bundle create y master next" may record master, next and HEAD
while HEAD is likely pointing at the same commit as master (because
'master' is special). Or we could give a warning and even go
interactive to ask which ref to record as HEAD.

But the above three paragraphs are tangent so I'd stop here.
Jeff King
2014-10-15 22:34:34 UTC
Permalink
There's currently no easy way to free the memory associated
with an object_array (and in most cases, we simply leak the
memory in a rev_info's pending array). Let's provide a
helper to make this easier to handle.

We can make use of it in list-objects.c, which does the same
thing by hand (but fails to free the "name" field of each
entry, potentially leaking memory).

Signed-off-by: Jeff King <***@peff.net>
---
list-objects.c | 7 +------
object.c | 10 ++++++++++
object.h | 6 ++++++
3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index 3595ee7..fad6808 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -228,11 +228,6 @@ void traverse_commit_list(struct rev_info *revs,
die("unknown pending object %s (%s)",
sha1_to_hex(obj->sha1), name);
}
- if (revs->pending.nr) {
- free(revs->pending.objects);
- revs->pending.nr = 0;
- revs->pending.alloc = 0;
- revs->pending.objects = NULL;
- }
+ object_array_clear(&revs->pending);
strbuf_release(&base);
}
diff --git a/object.c b/object.c
index 60f4864..6aeb1bb 100644
--- a/object.c
+++ b/object.c
@@ -383,6 +383,16 @@ void object_array_filter(struct object_array *array,
array->nr = dst;
}

+void object_array_clear(struct object_array *array)
+{
+ int i;
+ for (i = 0; i < array->nr; i++)
+ object_array_release_entry(&array->objects[i]);
+ free(array->objects);
+ array->objects = NULL;
+ array->nr = array->alloc = 0;
+}
+
/*
* Return true iff array already contains an entry with name.
*/
diff --git a/object.h b/object.h
index e028ced..2a755a2 100644
--- a/object.h
+++ b/object.h
@@ -133,6 +133,12 @@ void object_array_filter(struct object_array *array,
*/
void object_array_remove_duplicates(struct object_array *array);

+/*
+ * Remove any objects from the array, freeing all used memory; afterwards
+ * the array is ready to store more objects with add_object_array().
+ */
+void object_array_clear(struct object_array *array);
+
void clear_object_flags(unsigned flags);

#endif /* OBJECT_H */
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:35:12 UTC
Permalink
When we enter prepare_revision_walk, we have zero or more
entries in our "pending" array. We disconnect that array
from the rev_info, and then process each entry:

1. If the entry is a commit and the --source option is in
effect, we keep a pointer to the object name.

2. Otherwise, we re-add the item to the pending list with
a blank name.

We then throw away the old array by freeing the array
itself, but do not touch the "name" field of each entry. For
any items of type (2), we leak the memory associated with
the name. This commit fixes that by calling object_array_clear,
which handles the cleanup for us.

That breaks (1), though, because it depends on the memory
pointed to by the name to last forever. We can solve that by
making a copy of the name. This is slightly less efficient,
but it shouldn't matter in practice, as we do it only for
the tip commits of the traversal.

Signed-off-by: Jeff King <***@peff.net>
---
revision.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/revision.c b/revision.c
index e498b7c..01cc276 100644
--- a/revision.c
+++ b/revision.c
@@ -300,7 +300,7 @@ static struct commit *handle_commit(struct rev_info *revs,
revs->limited = 1;
}
if (revs->show_source && !commit->util)
- commit->util = (void *) name;
+ commit->util = xstrdup(name);
return commit;
}

@@ -2656,15 +2656,16 @@ void reset_revision_walk(void)

int prepare_revision_walk(struct rev_info *revs)
{
- int nr = revs->pending.nr;
- struct object_array_entry *e, *list;
+ int i;
+ struct object_array old_pending;
struct commit_list **next = &revs->commits;

- e = list = revs->pending.objects;
+ memcpy(&old_pending, &revs->pending, sizeof(old_pending));
revs->pending.nr = 0;
revs->pending.alloc = 0;
revs->pending.objects = NULL;
- while (--nr >= 0) {
+ for (i = 0; i < old_pending.nr; i++) {
+ struct object_array_entry *e = old_pending.objects + i;
struct commit *commit = handle_commit(revs, e->item, e->name);
if (commit) {
if (!(commit->object.flags & SEEN)) {
@@ -2672,10 +2673,9 @@ int prepare_revision_walk(struct rev_info *revs)
next = commit_list_append(commit, next);
}
}
- e++;
}
if (!revs->leak_pending)
- free(list);
+ object_array_clear(&old_pending);

/* Signal whether we need per-parent treesame decoration */
if (revs->simplify_merges ||
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:37:28 UTC
Permalink
To find the set of reachable objects, we add a bunch of
possible sources to our rev_info, call prepare_revision_walk,
and then launch into a custom walker that handles each
object top. This is a subset of what traverse_commit_list
does, so we can just reuse that code (it can also handle
more complex cases like UNINTERESTING commits and pathspecs,
but we don't use those features).

Signed-off-by: Jeff King <***@peff.net>
---
I was concerned this would be slower because traverse_commit_list is
more featureful. To my surprise, it was consistently about 3-4% faster!
The major difference is that traverse_commit_list will hit all of the
commits first, and then the trees. For reachability that doesn't matter
either way, but I suspect the new way has slightly better cache
locality, leading to the minor speedup.

reachable.c | 130 ++++++++----------------------------------------------------
1 file changed, 17 insertions(+), 113 deletions(-)

diff --git a/reachable.c b/reachable.c
index 6f6835b..02bf6c2 100644
--- a/reachable.c
+++ b/reachable.c
@@ -8,6 +8,7 @@
#include "reachable.h"
#include "cache-tree.h"
#include "progress.h"
+#include "list-objects.h"

struct connectivity_progress {
struct progress *progress;
@@ -21,118 +22,6 @@ static void update_progress(struct connectivity_progress *cp)
display_progress(cp->progress, cp->count);
}

-static void process_blob(struct blob *blob,
- struct object_array *p,
- struct name_path *path,
- const char *name,
- struct connectivity_progress *cp)
-{
- struct object *obj = &blob->object;
-
- if (!blob)
- die("bad blob object");
- if (obj->flags & SEEN)
- return;
- obj->flags |= SEEN;
- update_progress(cp);
- /* Nothing to do, really .. The blob lookup was the important part */
-}
-
-static void process_gitlink(const unsigned char *sha1,
- struct object_array *p,
- struct name_path *path,
- const char *name)
-{
- /* I don't think we want to recurse into this, really. */
-}
-
-static void process_tree(struct tree *tree,
- struct object_array *p,
- struct name_path *path,
- const char *name,
- struct connectivity_progress *cp)
-{
- struct object *obj = &tree->object;
- struct tree_desc desc;
- struct name_entry entry;
- struct name_path me;
-
- if (!tree)
- die("bad tree object");
- if (obj->flags & SEEN)
- return;
- obj->flags |= SEEN;
- update_progress(cp);
- if (parse_tree(tree) < 0)
- die("bad tree object %s", sha1_to_hex(obj->sha1));
- add_object(obj, p, path, name);
- me.up = path;
- me.elem = name;
- me.elem_len = strlen(name);
-
- init_tree_desc(&desc, tree->buffer, tree->size);
-
- while (tree_entry(&desc, &entry)) {
- if (S_ISDIR(entry.mode))
- process_tree(lookup_tree(entry.sha1), p, &me, entry.path, cp);
- else if (S_ISGITLINK(entry.mode))
- process_gitlink(entry.sha1, p, &me, entry.path);
- else
- process_blob(lookup_blob(entry.sha1), p, &me, entry.path, cp);
- }
- free_tree_buffer(tree);
-}
-
-static void process_tag(struct tag *tag, struct object_array *p,
- const char *name, struct connectivity_progress *cp)
-{
- struct object *obj = &tag->object;
-
- if (obj->flags & SEEN)
- return;
- obj->flags |= SEEN;
- update_progress(cp);
-
- if (parse_tag(tag) < 0)
- die("bad tag object %s", sha1_to_hex(obj->sha1));
- if (tag->tagged)
- add_object(tag->tagged, p, NULL, name);
-}
-
-static void walk_commit_list(struct rev_info *revs,
- struct connectivity_progress *cp)
-{
- int i;
- struct commit *commit;
- struct object_array objects = OBJECT_ARRAY_INIT;
-
- /* Walk all commits, process their trees */
- while ((commit = get_revision(revs)) != NULL) {
- process_tree(commit->tree, &objects, NULL, "", cp);
- update_progress(cp);
- }
-
- /* Then walk all the pending objects, recursively processing them too */
- for (i = 0; i < revs->pending.nr; i++) {
- struct object_array_entry *pending = revs->pending.objects + i;
- struct object *obj = pending->item;
- const char *name = pending->name;
- if (obj->type == OBJ_TAG) {
- process_tag((struct tag *) obj, &objects, name, cp);
- continue;
- }
- if (obj->type == OBJ_TREE) {
- process_tree((struct tree *)obj, &objects, NULL, name, cp);
- continue;
- }
- if (obj->type == OBJ_BLOB) {
- process_blob((struct blob *)obj, &objects, NULL, name, cp);
- continue;
- }
- die("unknown pending object %s (%s)", sha1_to_hex(obj->sha1), name);
- }
-}
-
static int add_one_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
const char *email, unsigned long timestamp, int tz,
const char *message, void *cb_data)
@@ -210,6 +99,21 @@ static void add_cache_refs(struct rev_info *revs)
add_cache_tree(active_cache_tree, revs);
}

+/*
+ * The traversal will have already marked us as SEEN, so we
+ * only need to handle any progress reporting here.
+ */
+static void mark_object(struct object *obj, const struct name_path *path,
+ const char *name, void *data)
+{
+ update_progress(data);
+}
+
+static void mark_commit(struct commit *c, void *data)
+{
+ mark_object(&c->object, NULL, NULL, data);
+}
+
void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
struct progress *progress)
{
@@ -245,6 +149,6 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
*/
if (prepare_revision_walk(revs))
die("revision walk setup failed");
- walk_commit_list(revs, &cp);
+ traverse_commit_list(revs, mark_commit, mark_object, &cp);
display_progress(cp.progress, cp.count);
}
--
2.1.2.596.g7379948
Junio C Hamano
2014-10-16 17:53:52 UTC
Permalink
Post by Jeff King
To find the set of reachable objects, we add a bunch of
possible sources to our rev_info, call prepare_revision_walk,
and then launch into a custom walker that handles each
object top. This is a subset of what traverse_commit_list
does, so we can just reuse that code (it can also handle
more complex cases like UNINTERESTING commits and pathspecs,
but we don't use those features).
---
I was concerned this would be slower because traverse_commit_list is
more featureful. To my surprise, it was consistently about 3-4% faster!
The major difference is that traverse_commit_list will hit all of the
commits first, and then the trees. For reachability that doesn't matter
either way, but I suspect the new way has slightly better cache
locality, leading to the minor speedup.
I am not very surprised, as "custom walk" hasn't changed much ever
since it was done in ba84a797 (builtin "git prune", 2006-07-06),
while the generic traversal code has been worked heavily while it
was still in builtin-rev-list.c and then later moved to
list-objects.c.
Post by Jeff King
reachable.c | 130 ++++++++----------------------------------------------------
1 file changed, 17 insertions(+), 113 deletions(-)
;-) ;-) ;-) ;-) ;-)
Jeff King
2014-10-15 22:38:31 UTC
Permalink
We want to add all reflog entries as tips for finding
reachable objects. The revision machinery can already do
this (to support "rev-list --reflog"); we can reuse that
code.

Signed-off-by: Jeff King <***@peff.net>
---
This one is not strictly necessary, but it seems like a nice cleanup.
Note that the big difference in the revision.c code is that it will
print a warning for broken reflogs, but I think that's fine (and perhaps
even desirable) here.

reachable.c | 24 +-----------------------
revision.c | 4 ++--
revision.h | 1 +
3 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/reachable.c b/reachable.c
index 02bf6c2..4e68cfa 100644
--- a/reachable.c
+++ b/reachable.c
@@ -22,22 +22,6 @@ static void update_progress(struct connectivity_progress *cp)
display_progress(cp->progress, cp->count);
}

-static int add_one_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
- const char *email, unsigned long timestamp, int tz,
- const char *message, void *cb_data)
-{
- struct object *object;
- struct rev_info *revs = (struct rev_info *)cb_data;
-
- object = parse_object(osha1);
- if (object)
- add_pending_object(revs, object, "");
- object = parse_object(nsha1);
- if (object)
- add_pending_object(revs, object, "");
- return 0;
-}
-
static int add_one_ref(const char *path, const unsigned char *sha1, int flag, void *cb_data)
{
struct object *object = parse_object_or_die(sha1, path);
@@ -48,12 +32,6 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
return 0;
}

-static int add_one_reflog(const char *path, const unsigned char *sha1, int flag, void *cb_data)
-{
- for_each_reflog_ent(path, add_one_reflog_ent, cb_data);
- return 0;
-}
-
static void add_one_tree(const unsigned char *sha1, struct rev_info *revs)
{
struct tree *tree = lookup_tree(sha1);
@@ -138,7 +116,7 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,

/* Add all reflog info */
if (mark_reflog)
- for_each_reflog(add_one_reflog, revs);
+ add_reflogs_to_pending(revs, 0);

cp.progress = progress;
cp.count = 0;
diff --git a/revision.c b/revision.c
index 01cc276..b8e02e2 100644
--- a/revision.c
+++ b/revision.c
@@ -1275,7 +1275,7 @@ static int handle_one_reflog(const char *path, const unsigned char *sha1, int fl
return 0;
}

-static void handle_reflog(struct rev_info *revs, unsigned flags)
+void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
{
struct all_refs_cb cb;
cb.all_revs = revs;
@@ -2061,7 +2061,7 @@ static int handle_revision_pseudo_opt(const char *submodule,
for_each_glob_ref_in(handle_one_ref, arg + 10, "refs/remotes/", &cb);
clear_ref_exclusion(&revs->ref_excludes);
} else if (!strcmp(arg, "--reflog")) {
- handle_reflog(revs, *flags);
+ add_reflogs_to_pending(revs, *flags);
} else if (!strcmp(arg, "--not")) {
*flags ^= UNINTERESTING | BOTTOM;
} else if (!strcmp(arg, "--no-walk")) {
diff --git a/revision.h b/revision.h
index a620530..e644044 100644
--- a/revision.h
+++ b/revision.h
@@ -276,6 +276,7 @@ extern void add_pending_sha1(struct rev_info *revs,
unsigned int flags);

extern void add_head_to_pending(struct rev_info *);
+extern void add_reflogs_to_pending(struct rev_info *, unsigned int flags);

enum commit_action {
commit_ignore,
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:38:55 UTC
Permalink
Prune has to walk $GIT_DIR/objects/?? in order to find the
set of loose objects to prune. Other parts of the code
(e.g., count-objects) want to do the same. Let's factor it
out into a reusable for_each-style function.

Note that this is not quite a straight code movement. The
original code had strange behavior when it found a file of
the form "[0-9a-f]{2}/.{38}" that did _not_ contain all hex
digits. It executed a "break" from the loop, meaning that we
stopped pruning in that directory (but still pruned other
directories!). This was probably a bug; we do not want to
process the file as an object, but we should keep going
otherwise (and that is how the new code handles it).

We are also a little more careful with loose object
directories which fail to open. The original code silently
ignored any failures, but the new code will complain about
any problems besides ENOENT.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/prune.c | 87 +++++++++++++++++----------------------------------------
cache.h | 33 ++++++++++++++++++++++
sha1_file.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 143 insertions(+), 61 deletions(-)

diff --git a/builtin/prune.c b/builtin/prune.c
index 144a3bd..763f53e 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -31,11 +31,23 @@ static int prune_tmp_file(const char *fullpath)
return 0;
}

-static int prune_object(const char *fullpath, const unsigned char *sha1)
+static int prune_object(const unsigned char *sha1, const char *fullpath,
+ void *data)
{
struct stat st;
- if (lstat(fullpath, &st))
- return error("Could not stat '%s'", fullpath);
+
+ /*
+ * Do we know about this object?
+ * It must have been reachable
+ */
+ if (lookup_object(sha1))
+ return 0;
+
+ if (lstat(fullpath, &st)) {
+ /* report errors, but do not stop pruning */
+ error("Could not stat '%s'", fullpath);
+ return 0;
+ }
if (st.st_mtime > expire)
return 0;
if (show_only || verbose) {
@@ -48,68 +60,20 @@ static int prune_object(const char *fullpath, const unsigned char *sha1)
return 0;
}

-static int prune_dir(int i, struct strbuf *path)
+static int prune_cruft(const char *basename, const char *path, void *data)
{
- size_t baselen = path->len;
- DIR *dir = opendir(path->buf);
- struct dirent *de;
-
- if (!dir)
- return 0;
-
- while ((de = readdir(dir)) != NULL) {
- char name[100];
- unsigned char sha1[20];
-
- if (is_dot_or_dotdot(de->d_name))
- continue;
- if (strlen(de->d_name) == 38) {
- sprintf(name, "%02x", i);
- memcpy(name+2, de->d_name, 39);
- if (get_sha1_hex(name, sha1) < 0)
- break;
-
- /*
- * Do we know about this object?
- * It must have been reachable
- */
- if (lookup_object(sha1))
- continue;
-
- strbuf_addf(path, "/%s", de->d_name);
- prune_object(path->buf, sha1);
- strbuf_setlen(path, baselen);
- continue;
- }
- if (starts_with(de->d_name, "tmp_obj_")) {
- strbuf_addf(path, "/%s", de->d_name);
- prune_tmp_file(path->buf);
- strbuf_setlen(path, baselen);
- continue;
- }
- fprintf(stderr, "bad sha1 file: %s/%s\n", path->buf, de->d_name);
- }
- closedir(dir);
- if (!show_only)
- rmdir(path->buf);
+ if (starts_with(basename, "tmp_obj_"))
+ prune_tmp_file(path);
+ else
+ fprintf(stderr, "bad sha1 file: %s\n", path);
return 0;
}

-static void prune_object_dir(const char *path)
+static int prune_subdir(int nr, const char *path, void *data)
{
- struct strbuf buf = STRBUF_INIT;
- size_t baselen;
- int i;
-
- strbuf_addstr(&buf, path);
- strbuf_addch(&buf, '/');
- baselen = buf.len;
-
- for (i = 0; i < 256; i++) {
- strbuf_addf(&buf, "%02x", i);
- prune_dir(i, &buf);
- strbuf_setlen(&buf, baselen);
- }
+ if (!show_only)
+ rmdir(path);
+ return 0;
}

/*
@@ -173,7 +137,8 @@ int cmd_prune(int argc, const char **argv, const char *prefix)

mark_reachable_objects(&revs, 1, progress);
stop_progress(&progress);
- prune_object_dir(get_object_directory());
+ for_each_loose_file_in_objdir(get_object_directory(), prune_object,
+ prune_cruft, prune_subdir, NULL);

prune_packed_objects(show_only ? PRUNE_PACKED_DRY_RUN : 0);
remove_temporary_files(get_object_directory());
diff --git a/cache.h b/cache.h
index 13fadb6..8ffefaa 100644
--- a/cache.h
+++ b/cache.h
@@ -1221,6 +1221,39 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);

+/*
+ * Iterate over the files in the loose-object parts of the object
+ * directory "path", triggering the following callbacks:
+ *
+ * - loose_object is called for each loose object we find.
+ *
+ * - loose_cruft is called for any files that do not appear to be
+ * loose objects. Note that we only look in the loose object
+ * directories "objects/[0-9a-f]{2}/", so we will not report
+ * "objects/foobar" as cruft.
+ *
+ * - loose_subdir is called for each top-level hashed subdirectory
+ * of the object directory (e.g., "$OBJDIR/f0"). It is called
+ * after the objects in the directory are processed.
+ *
+ * Any callback that is NULL will be ignored. Callbacks returning non-zero
+ * will end the iteration.
+ */
+typedef int each_loose_object_fn(const unsigned char *sha1,
+ const char *path,
+ void *data);
+typedef int each_loose_cruft_fn(const char *basename,
+ const char *path,
+ void *data);
+typedef int each_loose_subdir_fn(int nr,
+ const char *path,
+ void *data);
+int for_each_loose_file_in_objdir(const char *path,
+ each_loose_object_fn obj_cb,
+ each_loose_cruft_fn cruft_cb,
+ each_loose_subdir_fn subdir_cb,
+ void *data);
+
struct object_info {
/* Request */
enum object_type *typep;
diff --git a/sha1_file.c b/sha1_file.c
index fa881bf..a20240b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3265,3 +3265,87 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
die("%s is not a valid '%s' object", sha1_to_hex(sha1),
typename(expect));
}
+
+static int for_each_file_in_obj_subdir(int subdir_nr,
+ struct strbuf *path,
+ each_loose_object_fn obj_cb,
+ each_loose_cruft_fn cruft_cb,
+ each_loose_subdir_fn subdir_cb,
+ void *data)
+{
+ size_t baselen = path->len;
+ DIR *dir = opendir(path->buf);
+ struct dirent *de;
+ int r = 0;
+
+ if (!dir) {
+ if (errno == ENOENT)
+ return 0;
+ return error("unable to open %s: %s", path->buf, strerror(errno));
+ }
+
+ while ((de = readdir(dir))) {
+ if (is_dot_or_dotdot(de->d_name))
+ continue;
+
+ strbuf_setlen(path, baselen);
+ strbuf_addf(path, "/%s", de->d_name);
+
+ if (strlen(de->d_name) == 38) {
+ char hex[41];
+ unsigned char sha1[20];
+
+ snprintf(hex, sizeof(hex), "%02x%s",
+ subdir_nr, de->d_name);
+ if (!get_sha1_hex(hex, sha1)) {
+ if (obj_cb) {
+ r = obj_cb(sha1, path->buf, data);
+ if (r)
+ break;
+ }
+ continue;
+ }
+ }
+
+ if (cruft_cb) {
+ r = cruft_cb(de->d_name, path->buf, data);
+ if (r)
+ break;
+ }
+ }
+ strbuf_setlen(path, baselen);
+
+ if (!r && subdir_cb)
+ r = subdir_cb(subdir_nr, path->buf, data);
+
+ closedir(dir);
+ return r;
+}
+
+int for_each_loose_file_in_objdir(const char *path,
+ each_loose_object_fn obj_cb,
+ each_loose_cruft_fn cruft_cb,
+ each_loose_subdir_fn subdir_cb,
+ void *data)
+{
+ struct strbuf buf = STRBUF_INIT;
+ size_t baselen;
+ int r = 0;
+ int i;
+
+ strbuf_addstr(&buf, path);
+ strbuf_addch(&buf, '/');
+ baselen = buf.len;
+
+ for (i = 0; i < 256; i++) {
+ strbuf_addf(&buf, "%02x", i);
+ r = for_each_file_in_obj_subdir(i, &buf, obj_cb, cruft_cb,
+ subdir_cb, data);
+ strbuf_setlen(&buf, baselen);
+ if (r)
+ break;
+ }
+
+ strbuf_release(&buf);
+ return r;
+}
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:40:37 UTC
Permalink
When we mark all reachable objects for pruning, that
includes blobs mentioned by the index. However, we do not
mark these with the SEEN flag, as we do for objects that we
find by traversing (we also do not add them to the pending
list, but that is because there is nothing further to
traverse with them).

This doesn't cause any problems with prune, because it
checks only that the object exists in the global object
hash, and not its flags. However, let's mark these objects
to be consistent and avoid any later surprises.

Signed-off-by: Jeff King <***@peff.net>
---
This code actually gets dropped later on in the series (when we teach
the revision machinery to look at index objects), so this could
technically be omitted. But it also keeps the minor behavior change
here by itself, where it is explained, instead of as a side effect of
the movement.

reachable.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 4e68cfa..d03f829 100644
--- a/reachable.c
+++ b/reachable.c
@@ -55,6 +55,8 @@ static void add_cache_refs(struct rev_info *revs)

read_cache();
for (i = 0; i < active_nr; i++) {
+ struct blob *blob;
+
/*
* The index can contain blobs and GITLINKs, GITLINKs are hashes
* that don't actually point to objects in the repository, it's
@@ -65,7 +67,10 @@ static void add_cache_refs(struct rev_info *revs)
if (S_ISGITLINK(active_cache[i]->ce_mode))
continue;

- lookup_blob(active_cache[i]->sha1);
+ blob = lookup_blob(active_cache[i]->sha1);
+ if (blob)
+ blob->object.flags |= SEEN;
+
/*
* We could add the blobs to the pending list, but quite
* frankly, we don't care. Once we've looked them up, and
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:40:53 UTC
Permalink
This saves us from manually traversing the directory
structure ourselves.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/prune-packed.c | 69 +++++++++++++++++---------------------------------
1 file changed, 23 insertions(+), 46 deletions(-)

diff --git a/builtin/prune-packed.c b/builtin/prune-packed.c
index d430731..f24a2c2 100644
--- a/builtin/prune-packed.c
+++ b/builtin/prune-packed.c
@@ -10,65 +10,42 @@ static const char * const prune_packed_usage[] = {

static struct progress *progress;

-static void prune_dir(int i, DIR *dir, struct strbuf *pathname, int opts)
+static int prune_subdir(int nr, const char *path, void *data)
{
- struct dirent *de;
- char hex[40];
- int top_len = pathname->len;
+ int *opts = data;
+ display_progress(progress, nr + 1);
+ if (!(*opts & PRUNE_PACKED_DRY_RUN))
+ rmdir(path);
+ return 0;
+}
+
+static int prune_object(const unsigned char *sha1, const char *path,
+ void *data)
+{
+ int *opts = data;

- sprintf(hex, "%02x", i);
- while ((de = readdir(dir)) != NULL) {
- unsigned char sha1[20];
- if (strlen(de->d_name) != 38)
- continue;
- memcpy(hex + 2, de->d_name, 38);
- if (get_sha1_hex(hex, sha1))
- continue;
- if (!has_sha1_pack(sha1))
- continue;
+ if (!has_sha1_pack(sha1))
+ return 0;

- strbuf_add(pathname, de->d_name, 38);
- if (opts & PRUNE_PACKED_DRY_RUN)
- printf("rm -f %s\n", pathname->buf);
- else
- unlink_or_warn(pathname->buf);
- display_progress(progress, i + 1);
- strbuf_setlen(pathname, top_len);
- }
+ if (*opts & PRUNE_PACKED_DRY_RUN)
+ printf("rm -f %s\n", path);
+ else
+ unlink_or_warn(path);
+ return 0;
}

void prune_packed_objects(int opts)
{
- int i;
- const char *dir = get_object_directory();
- struct strbuf pathname = STRBUF_INIT;
- int top_len;
-
- strbuf_addstr(&pathname, dir);
if (opts & PRUNE_PACKED_VERBOSE)
progress = start_progress_delay(_("Removing duplicate objects"),
256, 95, 2);

- if (pathname.len && pathname.buf[pathname.len - 1] != '/')
- strbuf_addch(&pathname, '/');
-
- top_len = pathname.len;
- for (i = 0; i < 256; i++) {
- DIR *d;
+ for_each_loose_file_in_objdir(get_object_directory(),
+ prune_object, NULL, prune_subdir, &opts);

- display_progress(progress, i + 1);
- strbuf_setlen(&pathname, top_len);
- strbuf_addf(&pathname, "%02x/", i);
- d = opendir(pathname.buf);
- if (!d)
- continue;
- prune_dir(i, d, &pathname, opts);
- closedir(d);
- strbuf_setlen(&pathname, top_len + 2);
- rmdir(pathname.buf);
- }
+ /* Ensure we show 100% before finishing progress */
+ display_progress(progress, 256);
stop_progress(&progress);
- strbuf_release(&pathname);
}

int cmd_prune_packed(int argc, const char **argv, const char *prefix)
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:40:58 UTC
Permalink
The point of xsize_t is to safely cast an off_t into a size_t
(because we are about to mmap). But in count-objects, we are
summing the sizes in an off_t. Using xsize_t means that
count-objects could fail on a 32-bit system with a 4G
object (not likely, as other parts of git would fail, but
we should at least be correct here).

Signed-off-by: Jeff King <***@peff.net>
---
builtin/count-objects.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index a7f70cb..316a805 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -53,7 +53,7 @@ static void count_objects(DIR *d, char *path, int len, int verbose,
if (lstat(path, &st) || !S_ISREG(st.st_mode))
bad = 1;
else
- (*loose_size) += xsize_t(on_disk_bytes(st));
+ (*loose_size) += on_disk_bytes(st);
}
if (bad) {
if (verbose) {
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:41:11 UTC
Permalink
This drops our line count considerably, and should make
things more readable by keeping the counting logic separate
from the traversal.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/count-objects.c | 101 ++++++++++++++----------------------------------
1 file changed, 30 insertions(+), 71 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index 316a805..e47ef0b 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -11,6 +11,9 @@

static unsigned long garbage;
static off_t size_garbage;
+static int verbose;
+static unsigned long loose, packed, packed_loose;
+static off_t loose_size;

static void real_report_garbage(const char *desc, const char *path)
{
@@ -21,61 +24,31 @@ static void real_report_garbage(const char *desc, const char *path)
garbage++;
}

-static void count_objects(DIR *d, char *path, int len, int verbose,
- unsigned long *loose,
- off_t *loose_size,
- unsigned long *packed_loose)
+static void loose_garbage(const char *path)
{
- struct dirent *ent;
- while ((ent = readdir(d)) != NULL) {
- char hex[41];
- unsigned char sha1[20];
- const char *cp;
- int bad = 0;
+ if (verbose)
+ report_garbage("garbage found", path);
+}

- if (is_dot_or_dotdot(ent->d_name))
- continue;
- for (cp = ent->d_name; *cp; cp++) {
- int ch = *cp;
- if (('0' <= ch && ch <= '9') ||
- ('a' <= ch && ch <= 'f'))
- continue;
- bad = 1;
- break;
- }
- if (cp - ent->d_name != 38)
- bad = 1;
- else {
- struct stat st;
- memcpy(path + len + 3, ent->d_name, 38);
- path[len + 2] = '/';
- path[len + 41] = 0;
- if (lstat(path, &st) || !S_ISREG(st.st_mode))
- bad = 1;
- else
- (*loose_size) += on_disk_bytes(st);
- }
- if (bad) {
- if (verbose) {
- struct strbuf sb = STRBUF_INIT;
- strbuf_addf(&sb, "%.*s/%s",
- len + 2, path, ent->d_name);
- report_garbage("garbage found", sb.buf);
- strbuf_release(&sb);
- }
- continue;
- }
- (*loose)++;
- if (!verbose)
- continue;
- memcpy(hex, path+len, 2);
- memcpy(hex+2, ent->d_name, 38);
- hex[40] = 0;
- if (get_sha1_hex(hex, sha1))
- die("internal error");
- if (has_sha1_pack(sha1))
- (*packed_loose)++;
+static int count_loose(const unsigned char *sha1, const char *path, void *data)
+{
+ struct stat st;
+
+ if (lstat(path, &st) || !S_ISREG(st.st_mode))
+ loose_garbage(path);
+ else {
+ loose_size += on_disk_bytes(st);
+ loose++;
+ if (verbose && has_sha1_pack(sha1))
+ packed_loose++;
}
+ return 0;
+}
+
+static int count_cruft(const char *basename, const char *path, void *data)
+{
+ loose_garbage(path);
+ return 0;
}

static char const * const count_objects_usage[] = {
@@ -85,12 +58,7 @@ static char const * const count_objects_usage[] = {

int cmd_count_objects(int argc, const char **argv, const char *prefix)
{
- int i, verbose = 0, human_readable = 0;
- const char *objdir = get_object_directory();
- int len = strlen(objdir);
- char *path = xmalloc(len + 50);
- unsigned long loose = 0, packed = 0, packed_loose = 0;
- off_t loose_size = 0;
+ int human_readable = 0;
struct option opts[] = {
OPT__VERBOSE(&verbose, N_("be verbose")),
OPT_BOOL('H', "human-readable", &human_readable,
@@ -104,19 +72,10 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix)
usage_with_options(count_objects_usage, opts);
if (verbose)
report_garbage = real_report_garbage;
- memcpy(path, objdir, len);
- if (len && objdir[len-1] != '/')
- path[len++] = '/';
- for (i = 0; i < 256; i++) {
- DIR *d;
- sprintf(path + len, "%02x", i);
- d = opendir(path);
- if (!d)
- continue;
- count_objects(d, path, len, verbose,
- &loose, &loose_size, &packed_loose);
- closedir(d);
- }
+
+ for_each_loose_file_in_objdir(get_object_directory(),
+ count_loose, count_cruft, NULL, NULL);
+
if (verbose) {
struct packed_git *p;
unsigned long num_pack = 0;
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:41:21 UTC
Permalink
We typically iterate over the reachable objects in a
repository by starting at the tips and walking the graph.
There's no easy way to iterate over all of the objects,
including unreachable ones. Let's provide a way of doing so.

Signed-off-by: Jeff King <***@peff.net>
---
cache.h | 11 +++++++++++
sha1_file.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+)

diff --git a/cache.h b/cache.h
index 8ffefaa..d24dd32 100644
--- a/cache.h
+++ b/cache.h
@@ -1254,6 +1254,17 @@ int for_each_loose_file_in_objdir(const char *path,
each_loose_subdir_fn subdir_cb,
void *data);

+/*
+ * Iterate over loose and packed objects in both the local
+ * repository and any alternates repositories.
+ */
+typedef int each_packed_object_fn(const unsigned char *sha1,
+ struct packed_git *pack,
+ uint32_t pos,
+ void *data);
+extern int for_each_loose_object(each_loose_object_fn, void *);
+extern int for_each_packed_object(each_packed_object_fn, void *);
+
struct object_info {
/* Request */
enum object_type *typep;
diff --git a/sha1_file.c b/sha1_file.c
index a20240b..eb410a2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3349,3 +3349,65 @@ int for_each_loose_file_in_objdir(const char *path,
strbuf_release(&buf);
return r;
}
+
+struct loose_alt_odb_data {
+ each_loose_object_fn *cb;
+ void *data;
+};
+
+static int loose_from_alt_odb(struct alternate_object_database *alt,
+ void *vdata)
+{
+ struct loose_alt_odb_data *data = vdata;
+ return for_each_loose_file_in_objdir(alt->base,
+ data->cb, NULL, NULL,
+ data->data);
+}
+
+int for_each_loose_object(each_loose_object_fn cb, void *data)
+{
+ struct loose_alt_odb_data alt;
+ int r;
+
+ r = for_each_loose_file_in_objdir(get_object_directory(),
+ cb, NULL, NULL, data);
+ if (r)
+ return r;
+
+ alt.cb = cb;
+ alt.data = data;
+ return foreach_alt_odb(loose_from_alt_odb, &alt);
+}
+
+static int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
+{
+ uint32_t i;
+ int r = 0;
+
+ for (i = 0; i < p->num_objects; i++) {
+ const unsigned char *sha1 = nth_packed_object_sha1(p, i);
+
+ if (!sha1)
+ return error("unable to get sha1 of object %u in %s",
+ i, p->pack_name);
+
+ r = cb(sha1, p, i, data);
+ if (r)
+ break;
+ }
+ return r;
+}
+
+int for_each_packed_object(each_packed_object_fn cb, void *data)
+{
+ struct packed_git *p;
+ int r = 0;
+
+ prepare_packed_git();
+ for (p = packed_git; p; p = p->next) {
+ r = for_each_object_in_pack(p, cb, data);
+ if (r)
+ break;
+ }
+ return r;
+}
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:41:35 UTC
Permalink
Our current strategy with prune is that an object falls into
one of three categories:

1. Reachable (from ref tips, reflogs, index, etc).

2. Not reachable, but recent (based on the --expire time).

3. Not reachable and not recent.

We keep objects from (1) and (2), but prune objects in (3).
The point of (2) is that these objects may be part of an
in-progress operation that has not yet updated any refs.

However, it is not always the case that objects for an
in-progress operation will have a recent mtime. For example,
the object database may have an old copy of a blob (from an
abandoned operation, a branch that was deleted, etc). If we
create a new tree that points to it, a simultaneous prune
will leave our tree, but delete the blob. Referencing that
tree with a commit will then work (we check that the tree is
in the object database, but not that all of its referred
objects are), as will mentioning the commit in a ref. But
the resulting repo is corrupt; we are missing the blob
reachable from a ref.

One way to solve this is to be more thorough when
referencing a sha1: make sure that not only do we have that
sha1, but that we have objects it refers to, and so forth
recursively. The problem is that this is very expensive.
Creating a parent link would require traversing the entire
object graph!

Instead, this patch pushes the extra work onto prune, which
runs less frequently (and has to look at the whole object
graph anyway). It creates a new category of objects: objects
which are not recent, but which are reachable from a recent
object. We do not prune these objects, just like the
reachable and recent ones.

This lets us avoid the recursive check above, because if we
have an object, even if it is unreachable, we should have
its referent. We can make a simple inductive argument that
with this patch, this property holds (that there are no
objects with missing referents in the repository):

0. When we have no objects, we have nothing to refer or be
referred to, so the property holds.

1. If we add objects to the repository, their direct
referents must generally exist (e.g., if you create a
tree, the blobs it references must exist; if you create
a commit to point at the tree, the tree must exist).
This is already the case before this patch. And it is
not 100% foolproof (you can make bogus objects using
`git hash-object`, for example), but it should be the
case for normal usage.

Therefore for any sequence of object additions, the
property will continue to hold.

2. If we remove objects from the repository, then we will
not remove a child object (like a blob) if an object
that refers to it is being kept. That is the part
implemented by this patch.

Note, however, that our reachability check and the
actual pruning are not atomic. So it _is_ still
possible to violate the property (e.g., an object
becomes referenced just as we are deleting it). This
patch is shooting for eliminating problems where the
mtimes of dependent objects differ by hours or days,
and one is dropped without the other. It does nothing
to help with short races.

Naively, the simplest way to implement this would be to add
all recent objects as tips to the reachability traversal.
However, this does not perform well. In a recently-packed
repository, all reachable objects will also be recent, and
therefore we have to look at each object twice. This patch
instead performs the reachability traversal, then follows up
with a second traversal for recent objects, skipping any
that have already been marked.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/prune.c | 2 +-
builtin/reflog.c | 2 +-
reachable.c | 112 +++++++++++++++++++++++++++++++++++++++++++++
reachable.h | 3 +-
t/t6501-freshen-objects.sh | 88 +++++++++++++++++++++++++++++++++++
5 files changed, 204 insertions(+), 3 deletions(-)
create mode 100755 t/t6501-freshen-objects.sh

diff --git a/builtin/prune.c b/builtin/prune.c
index 763f53e..04d3b12 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -135,7 +135,7 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
if (show_progress)
progress = start_progress_delay(_("Checking connectivity"), 0, 0, 2);

- mark_reachable_objects(&revs, 1, progress);
+ mark_reachable_objects(&revs, 1, expire, progress);
stop_progress(&progress);
for_each_loose_file_in_objdir(get_object_directory(), prune_object,
prune_cruft, prune_subdir, NULL);
diff --git a/builtin/reflog.c b/builtin/reflog.c
index b6388f7..2d85d26 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -649,7 +649,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
init_revisions(&cb.revs, prefix);
if (cb.verbose)
printf("Marking reachable objects...");
- mark_reachable_objects(&cb.revs, 0, NULL);
+ mark_reachable_objects(&cb.revs, 0, 0, NULL);
if (cb.verbose)
putchar('\n');
}
diff --git a/reachable.c b/reachable.c
index d03f829..55589a0 100644
--- a/reachable.c
+++ b/reachable.c
@@ -97,7 +97,109 @@ static void mark_commit(struct commit *c, void *data)
mark_object(&c->object, NULL, NULL, data);
}

+struct recent_data {
+ struct rev_info *revs;
+ unsigned long timestamp;
+};
+
+static void add_recent_object(const unsigned char *sha1,
+ unsigned long mtime,
+ struct recent_data *data)
+{
+ struct object *obj;
+ enum object_type type;
+
+ if (mtime <= data->timestamp)
+ return;
+
+ /*
+ * We do not want to call parse_object here, because
+ * inflating blobs and trees could be very expensive.
+ * However, we do need to know the correct type for
+ * later processing, and the revision machinery expects
+ * commits and tags to have been parsed.
+ */
+ type = sha1_object_info(sha1, NULL);
+ if (type < 0)
+ die("unable to get object info for %s", sha1_to_hex(sha1));
+
+ switch (type) {
+ case OBJ_TAG:
+ case OBJ_COMMIT:
+ obj = parse_object_or_die(sha1, NULL);
+ break;
+ case OBJ_TREE:
+ obj = (struct object *)lookup_tree(sha1);
+ break;
+ case OBJ_BLOB:
+ obj = (struct object *)lookup_blob(sha1);
+ break;
+ default:
+ die("unknown object type for %s: %s",
+ sha1_to_hex(sha1), typename(type));
+ }
+
+ if (!obj)
+ die("unable to lookup %s", sha1_to_hex(sha1));
+
+ add_pending_object(data->revs, obj, "");
+}
+
+static int add_recent_loose(const unsigned char *sha1,
+ const char *path, void *data)
+{
+ struct stat st;
+ struct object *obj = lookup_object(sha1);
+
+ if (obj && obj->flags & SEEN)
+ return 0;
+
+ if (stat(path, &st) < 0) {
+ /*
+ * It's OK if an object went away during our iteration; this
+ * could be due to a simultaneous repack. But anything else
+ * we should abort, since we might then fail to mark objects
+ * which should not be pruned.
+ */
+ if (errno == ENOENT)
+ return 0;
+ return error("unable to stat %s: %s",
+ sha1_to_hex(sha1), strerror(errno));
+ }
+
+ add_recent_object(sha1, st.st_mtime, data);
+ return 0;
+}
+
+static int add_recent_packed(const unsigned char *sha1,
+ struct packed_git *p, uint32_t pos,
+ void *data)
+{
+ struct object *obj = lookup_object(sha1);
+
+ if (obj && obj->flags & SEEN)
+ return 0;
+ add_recent_object(sha1, p->mtime, data);
+ return 0;
+}
+
+static int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
+ unsigned long timestamp)
+{
+ struct recent_data data;
+ int r;
+
+ data.revs = revs;
+ data.timestamp = timestamp;
+
+ r = for_each_loose_object(add_recent_loose, &data);
+ if (r)
+ return r;
+ return for_each_packed_object(add_recent_packed, &data);
+}
+
void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
+ unsigned long mark_recent,
struct progress *progress)
{
struct connectivity_progress cp;
@@ -133,5 +235,15 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
if (prepare_revision_walk(revs))
die("revision walk setup failed");
traverse_commit_list(revs, mark_commit, mark_object, &cp);
+
+ if (mark_recent) {
+ revs->ignore_missing_links = 1;
+ if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+ die("unable to mark recent objects");
+ if (prepare_revision_walk(revs))
+ die("revision walk setup failed");
+ traverse_commit_list(revs, mark_commit, mark_object, &cp);
+ }
+
display_progress(cp.progress, cp.count);
}
diff --git a/reachable.h b/reachable.h
index 5d082ad..141fe30 100644
--- a/reachable.h
+++ b/reachable.h
@@ -2,6 +2,7 @@
#define REACHEABLE_H

struct progress;
-extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog, struct progress *);
+extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
+ unsigned long mark_recent, struct progress *);

#endif
diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
new file mode 100755
index 0000000..de941c2
--- /dev/null
+++ b/t/t6501-freshen-objects.sh
@@ -0,0 +1,88 @@
+#!/bin/sh
+#
+# This test covers the handling of objects which might have old
+# mtimes in the filesystem (because they were used previously)
+# and are just now becoming referenced again.
+#
+# We're going to do two things that are a little bit "fake" to
+# help make our simulation easier:
+#
+# 1. We'll turn off reflogs. You can still run into
+# problems with reflogs on, but your objects
+# don't get pruned until both the reflog expiration
+# has passed on their references, _and_ they are out
+# of prune's expiration period. Dropping reflogs
+# means we only have to deal with one variable in our tests,
+# but the results generalize.
+#
+# 2. We'll use a temporary index file to create our
+# works-in-progress. Most workflows would mention
+# referenced objects in the index, which prune takes
+# into account. However, many operations don't. For
+# example, a partial commit with "git commit foo"
+# will use a temporary index. Or they may not need
+# an index at all (e.g., creating a new commit
+# to refer to an existing tree).
+
+test_description='check pruning of dependent objects'
+. ./test-lib.sh
+
+# We care about reachability, so we do not want to use
+# the normal test_commit, which creates extra tags.
+add () {
+ echo "$1" >"$1" &&
+ git add "$1"
+}
+commit () {
+ test_tick &&
+ add "$1" &&
+ git commit -m "$1"
+}
+
+test_expect_success 'disable reflogs' '
+ git config core.logallrefupdates false &&
+ rm -rf .git/logs
+'
+
+test_expect_success 'setup basic history' '
+ commit base
+'
+
+test_expect_success 'create and abandon some objects' '
+ git checkout -b experiment &&
+ commit abandon &&
+ git checkout master &&
+ git branch -D experiment
+'
+
+test_expect_success 'simulate time passing' '
+ find .git/objects -type f |
+ xargs test-chmtime -v -86400
+'
+
+test_expect_success 'start writing new commit with old blob' '
+ tree=$(
+ GIT_INDEX_FILE=index.tmp &&
+ export GIT_INDEX_FILE &&
+ git read-tree HEAD &&
+ add unrelated &&
+ add abandon &&
+ git write-tree
+ )
+'
+
+test_expect_success 'simultaneous gc' '
+ git gc --prune=12.hours.ago
+'
+
+test_expect_success 'finish writing out commit' '
+ commit=$(echo foo | git commit-tree -p HEAD $tree) &&
+ git update-ref HEAD $commit
+'
+
+# "abandon" blob should have been rescued by reference from new tree
+test_expect_success 'repository passes fsck' '
+ git fsck
+'
+
+test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:41:53 UTC
Permalink
When we are loosening unreachable packed objects, we do not
bother to process objects that would simply be pruned
immediately anyway. The "would be pruned" check is a simple
comparison, but is about to get more complicated. Let's pull
it out into a separate function.

Note that this is slightly less efficient than the original,
which avoided even opening old packs, since no object in
them could pass the current check, which cares only about
the pack mtime. But the new rules will depend on the exact
object, so we need to perform the check even for old packs.

Note also that we fix a minor buglet when the pack mtime is
exactly the same as the expiration time. The prune code
considers that worth pruning, whereas our check here
considered it worth keeping. This wasn't a big deal. Besides
being unlikely to happen, the result was simply that the
object was loosened and then pruned, missing the
optimization. Still, we can easily fix it while we are here.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/pack-objects.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d391934..2fe2ab0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2407,6 +2407,16 @@ static int has_sha1_pack_kept_or_nonlocal(const unsigned char *sha1)
return 0;
}

+static int loosened_object_can_be_discarded(const unsigned char *sha1,
+ unsigned long mtime)
+{
+ if (!unpack_unreachable_expiration)
+ return 0;
+ if (mtime > unpack_unreachable_expiration)
+ return 0;
+ return 1;
+}
+
static void loosen_unused_packed_objects(struct rev_info *revs)
{
struct packed_git *p;
@@ -2417,17 +2427,14 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
if (!p->pack_local || p->pack_keep)
continue;

- if (unpack_unreachable_expiration &&
- p->mtime < unpack_unreachable_expiration)
- continue;
-
if (open_pack_index(p))
die("cannot open pack index");

for (i = 0; i < p->num_objects; i++) {
sha1 = nth_packed_object_sha1(p, i);
if (!packlist_find(&to_pack, sha1, NULL) &&
- !has_sha1_pack_kept_or_nonlocal(sha1))
+ !has_sha1_pack_kept_or_nonlocal(sha1) &&
+ !loosened_object_can_be_discarded(sha1, p->mtime))
if (force_object_loose(sha1, p->mtime))
die("unable to force loose object");
}
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:42:09 UTC
Permalink
A recent commit taught git-prune to keep non-recent objects
that are reachable from recent ones. However, pack-objects,
when loosening unreachable objects, tries to optimize out
the write in the case that the object will be immediately
pruned. It now gets this wrong, since its rule does not
reflect the new prune code (and this can be seen by running
t6501 with a strategically placed repack).

Let's teach pack-objects similar logic.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/pack-objects.c | 39 +++++++++++++++++++
reachable.c | 4 +-
reachable.h | 2 +
t/t6501-freshen-objects.sh | 93 +++++++++++++++++++++++++++-------------------
4 files changed, 98 insertions(+), 40 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2fe2ab0..4df9499 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -20,6 +20,8 @@
#include "streaming.h"
#include "thread-utils.h"
#include "pack-bitmap.h"
+#include "reachable.h"
+#include "sha1-array.h"

static const char *pack_usage[] = {
N_("git pack-objects --stdout [options...] [< ref-list | < object-list]"),
@@ -2407,6 +2409,15 @@ static int has_sha1_pack_kept_or_nonlocal(const unsigned char *sha1)
return 0;
}

+/*
+ * Store a list of sha1s that are should not be discarded
+ * because they are either written too recently, or are
+ * reachable from another object that was.
+ *
+ * This is filled by get_object_list.
+ */
+static struct sha1_array recent_objects;
+
static int loosened_object_can_be_discarded(const unsigned char *sha1,
unsigned long mtime)
{
@@ -2414,6 +2425,8 @@ static int loosened_object_can_be_discarded(const unsigned char *sha1,
return 0;
if (mtime > unpack_unreachable_expiration)
return 0;
+ if (sha1_array_lookup(&recent_objects, sha1) >= 0)
+ return 0;
return 1;
}

@@ -2470,6 +2483,19 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
return 0;
}

+static void record_recent_object(struct object *obj,
+ const struct name_path *path,
+ const char *last,
+ void *data)
+{
+ sha1_array_append(&recent_objects, obj->sha1);
+}
+
+static void record_recent_commit(struct commit *commit, void *data)
+{
+ sha1_array_append(&recent_objects, commit->object.sha1);
+}
+
static void get_object_list(int ac, const char **av)
{
struct rev_info revs;
@@ -2517,10 +2543,23 @@ static void get_object_list(int ac, const char **av)
mark_edges_uninteresting(&revs, show_edge);
traverse_commit_list(&revs, show_commit, show_object, NULL);

+ if (unpack_unreachable_expiration) {
+ revs.ignore_missing_links = 1;
+ if (add_unseen_recent_objects_to_traversal(&revs,
+ unpack_unreachable_expiration))
+ die("unable to add recent objects");
+ if (prepare_revision_walk(&revs))
+ die("revision walk setup failed");
+ traverse_commit_list(&revs, record_recent_commit,
+ record_recent_object, NULL);
+ }
+
if (keep_unreachable)
add_objects_in_unpacked_packs(&revs);
if (unpack_unreachable)
loosen_unused_packed_objects(&revs);
+
+ sha1_array_clear(&recent_objects);
}

static int option_parse_index_version(const struct option *opt,
diff --git a/reachable.c b/reachable.c
index 55589a0..0176a88 100644
--- a/reachable.c
+++ b/reachable.c
@@ -183,8 +183,8 @@ static int add_recent_packed(const unsigned char *sha1,
return 0;
}

-static int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
- unsigned long timestamp)
+int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
+ unsigned long timestamp)
{
struct recent_data data;
int r;
diff --git a/reachable.h b/reachable.h
index 141fe30..d23efc3 100644
--- a/reachable.h
+++ b/reachable.h
@@ -2,6 +2,8 @@
#define REACHEABLE_H

struct progress;
+extern int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
+ unsigned long timestamp);
extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
unsigned long mark_recent, struct progress *);

diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
index de941c2..e25c47d 100755
--- a/t/t6501-freshen-objects.sh
+++ b/t/t6501-freshen-objects.sh
@@ -39,50 +39,67 @@ commit () {
git commit -m "$1"
}

-test_expect_success 'disable reflogs' '
- git config core.logallrefupdates false &&
- rm -rf .git/logs
-'
+maybe_repack () {
+ if test -n "$repack"; then
+ git repack -ad
+ fi
+}
+
+for repack in '' true; do
+ title=${repack:+repack}
+ title=${title:-loose}
+
+ test_expect_success "make repo completely empty ($title)" '
+ rm -rf .git &&
+ git init
+ '
+
+ test_expect_success "disable reflogs ($title)" '
+ git config core.logallrefupdates false &&
+ rm -rf .git/logs
+ '

-test_expect_success 'setup basic history' '
- commit base
-'
+ test_expect_success "setup basic history ($title)" '
+ commit base
+ '

-test_expect_success 'create and abandon some objects' '
- git checkout -b experiment &&
- commit abandon &&
- git checkout master &&
- git branch -D experiment
-'
+ test_expect_success "create and abandon some objects ($title)" '
+ git checkout -b experiment &&
+ commit abandon &&
+ maybe_repack &&
+ git checkout master &&
+ git branch -D experiment
+ '

-test_expect_success 'simulate time passing' '
- find .git/objects -type f |
- xargs test-chmtime -v -86400
-'
+ test_expect_success "simulate time passing ($title)" '
+ find .git/objects -type f |
+ xargs test-chmtime -v -86400
+ '

-test_expect_success 'start writing new commit with old blob' '
- tree=$(
- GIT_INDEX_FILE=index.tmp &&
- export GIT_INDEX_FILE &&
- git read-tree HEAD &&
- add unrelated &&
- add abandon &&
- git write-tree
- )
-'
+ test_expect_success "start writing new commit with old blob ($title)" '
+ tree=$(
+ GIT_INDEX_FILE=index.tmp &&
+ export GIT_INDEX_FILE &&
+ git read-tree HEAD &&
+ add unrelated &&
+ add abandon &&
+ git write-tree
+ )
+ '

-test_expect_success 'simultaneous gc' '
- git gc --prune=12.hours.ago
-'
+ test_expect_success "simultaneous gc ($title)" '
+ git gc --prune=12.hours.ago
+ '

-test_expect_success 'finish writing out commit' '
- commit=$(echo foo | git commit-tree -p HEAD $tree) &&
- git update-ref HEAD $commit
-'
+ test_expect_success "finish writing out commit ($title)" '
+ commit=$(echo foo | git commit-tree -p HEAD $tree) &&
+ git update-ref HEAD $commit
+ '

-# "abandon" blob should have been rescued by reference from new tree
-test_expect_success 'repository passes fsck' '
- git fsck
-'
+ # "abandon" blob should have been rescued by reference from new tree
+ test_expect_success "repository passes fsck ($title)" '
+ git fsck
+ '
+done

test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:42:22 UTC
Permalink
When we try to write a loose object file, we first check
whether that object already exists. If so, we skip the
write as an optimization. However, this can interfere with
prune's strategy of using mtimes to mark files in progress.

For example, if a branch contains a particular tree object
and is deleted, that tree object may become unreachable, and
have an old mtime. If a new operation then tries to write
the same tree, this ends up as a noop; we notice we
already have the object and do nothing. A prune running
simultaneously with this operation will see the object as
old, and may delete it.

We can solve this by "freshening" objects that we avoid
writing by updating their mtime. The algorithm for doing so
is essentially the same as that of has_sha1_file. Therefore
we provide a new (static) interface "check_and_freshen",
which finds and optionally freshens the object. It's trivial
to implement freshening and simple checking by tweaking a
single parameter.

Signed-off-by: Jeff King <***@peff.net>
---
sha1_file.c | 51 +++++++++++++++++++++++++++++++++++++++-------
t/t6501-freshen-objects.sh | 27 ++++++++++++++++++++++++
2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index eb410a2..d7f1838 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -443,27 +443,53 @@ void prepare_alt_odb(void)
read_info_alternates(get_object_directory(), 0);
}

-static int has_loose_object_local(const unsigned char *sha1)
+static int freshen_file(const char *fn)
{
- return !access(sha1_file_name(sha1), F_OK);
+ struct utimbuf t;
+ t.actime = t.modtime = time(NULL);
+ return !utime(fn, &t);
}

-int has_loose_object_nonlocal(const unsigned char *sha1)
+static int check_and_freshen_file(const char *fn, int freshen)
+{
+ if (access(fn, F_OK))
+ return 0;
+ if (freshen && freshen_file(fn))
+ return 0;
+ return 1;
+}
+
+static int check_and_freshen_local(const unsigned char *sha1, int freshen)
+{
+ return check_and_freshen_file(sha1_file_name(sha1), freshen);
+}
+
+static int check_and_freshen_nonlocal(const unsigned char *sha1, int freshen)
{
struct alternate_object_database *alt;
prepare_alt_odb();
for (alt = alt_odb_list; alt; alt = alt->next) {
fill_sha1_path(alt->name, sha1);
- if (!access(alt->base, F_OK))
+ if (check_and_freshen_file(alt->base, freshen))
return 1;
}
return 0;
}

+static int check_and_freshen(const unsigned char *sha1, int freshen)
+{
+ return check_and_freshen_local(sha1, freshen) ||
+ check_and_freshen_nonlocal(sha1, freshen);
+}
+
+int has_loose_object_nonlocal(const unsigned char *sha1)
+{
+ return check_and_freshen_nonlocal(sha1, 0);
+}
+
static int has_loose_object(const unsigned char *sha1)
{
- return has_loose_object_local(sha1) ||
- has_loose_object_nonlocal(sha1);
+ return check_and_freshen(sha1, 0);
}

static unsigned int pack_used_ctr;
@@ -2966,6 +2992,17 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
return move_temp_to_file(tmp_file, filename);
}

+static int freshen_loose_object(const unsigned char *sha1)
+{
+ return check_and_freshen(sha1, 1);
+}
+
+static int freshen_packed_object(const unsigned char *sha1)
+{
+ struct pack_entry e;
+ return find_pack_entry(sha1, &e) && freshen_file(e.p->pack_name);
+}
+
int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
{
unsigned char sha1[20];
@@ -2978,7 +3015,7 @@ int write_sha1_file(const void *buf, unsigned long len, const char *type, unsign
write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
if (returnsha1)
hashcpy(returnsha1, sha1);
- if (has_sha1_file(sha1))
+ if (freshen_loose_object(sha1) || freshen_packed_object(sha1))
return 0;
return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
}
diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
index e25c47d..157f3f9 100755
--- a/t/t6501-freshen-objects.sh
+++ b/t/t6501-freshen-objects.sh
@@ -100,6 +100,33 @@ for repack in '' true; do
test_expect_success "repository passes fsck ($title)" '
git fsck
'
+
+ test_expect_success "abandon objects again ($title)" '
+ git reset --hard HEAD^ &&
+ find .git/objects -type f |
+ xargs test-chmtime -v -86400
+ '
+
+ test_expect_success "start writing new commit with same tree ($title)" '
+ tree=$(
+ GIT_INDEX_FILE=index.tmp &&
+ export GIT_INDEX_FILE &&
+ git read-tree HEAD &&
+ add abandon &&
+ add unrelated &&
+ git write-tree
+ )
+ '
+
+ test_expect_success "simultaneous gc ($title)" '
+ git gc --prune=12.hours.ago
+ '
+
+ # tree should have been refreshed by write-tree
+ test_expect_success "finish writing out commit ($title)" '
+ commit=$(echo foo | git commit-tree -p HEAD $tree) &&
+ git update-ref HEAD $commit
+ '
done

test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:42:57 UTC
Permalink
When you resolve a sha1, you can optionally keep any context
found during the resolution, including the path and mode of
a tree entry (e.g., when looking up "HEAD:subdir/file.c").

The add_object_array_with_context function lets you then
attach that context to an entry in a list. Unfortunately,
the interface for doing so is horrible. The object_context
structure is large and most object_array users do not use
it. Therefore we keep a pointer to the structure to avoid
burdening other users too much. But that means when we do
use it that we must allocate the struct ourselves. And the
struct contains a fixed PATH_MAX-sized buffer, which makes
this wholly unsuitable for any large arrays.

We can observe that there is only a single user of the
"with_context" variant: builtin/grep.c. And in that use
case, the only element we care about is the path. We can
therefore store only the path as a pointer (the context's
mode field was redundant with the object_array_entry itself,
and nobody actually cared about the surrounding tree). This
still requires a strdup of the pathname, but at least we are
only consuming the minimum amount of memory for each string.

We can also handle the copying ourselves in
add_object_array_*, and free it as appropriate in
object_array_release_entry.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/grep.c | 8 ++++----
object.c | 23 +++++++++--------------
object.h | 4 ++--
3 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index c86a142..4063882 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -456,10 +456,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
}

static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
- struct object *obj, const char *name, struct object_context *oc)
+ struct object *obj, const char *name, const char *path)
{
if (obj->type == OBJ_BLOB)
- return grep_sha1(opt, obj->sha1, name, 0, oc ? oc->path : NULL);
+ return grep_sha1(opt, obj->sha1, name, 0, path);
if (obj->type == OBJ_COMMIT || obj->type == OBJ_TREE) {
struct tree_desc tree;
void *data;
@@ -501,7 +501,7 @@ static int grep_objects(struct grep_opt *opt, const struct pathspec *pathspec,
for (i = 0; i < nr; i++) {
struct object *real_obj;
real_obj = deref_tag(list->objects[i].item, NULL, 0);
- if (grep_object(opt, pathspec, real_obj, list->objects[i].name, list->objects[i].context)) {
+ if (grep_object(opt, pathspec, real_obj, list->objects[i].name, list->objects[i].path)) {
hit = 1;
if (opt->status_only)
break;
@@ -821,7 +821,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
struct object *object = parse_object_or_die(sha1, arg);
if (!seen_dashdash)
verify_non_filename(prefix, arg);
- add_object_array_with_context(object, arg, &list, xmemdupz(&oc, sizeof(struct object_context)));
+ add_object_array_with_path(object, arg, &list, oc.mode, oc.path);
continue;
}
if (!strcmp(arg, "--")) {
diff --git a/object.c b/object.c
index 6aeb1bb..df86bdd 100644
--- a/object.c
+++ b/object.c
@@ -307,10 +307,9 @@ int object_list_contains(struct object_list *list, struct object *obj)
*/
static char object_array_slopbuf[1];

-static void add_object_array_with_mode_context(struct object *obj, const char *name,
- struct object_array *array,
- unsigned mode,
- struct object_context *context)
+void add_object_array_with_path(struct object *obj, const char *name,
+ struct object_array *array,
+ unsigned mode, const char *path)
{
unsigned nr = array->nr;
unsigned alloc = array->alloc;
@@ -333,7 +332,10 @@ static void add_object_array_with_mode_context(struct object *obj, const char *n
else
entry->name = xstrdup(name);
entry->mode = mode;
- entry->context = context;
+ if (path)
+ entry->path = xstrdup(path);
+ else
+ entry->path = NULL;
array->nr = ++nr;
}

@@ -344,15 +346,7 @@ void add_object_array(struct object *obj, const char *name, struct object_array

void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode)
{
- add_object_array_with_mode_context(obj, name, array, mode, NULL);
-}
-
-void add_object_array_with_context(struct object *obj, const char *name, struct object_array *array, struct object_context *context)
-{
- if (context)
- add_object_array_with_mode_context(obj, name, array, context->mode, context);
- else
- add_object_array_with_mode_context(obj, name, array, S_IFINVALID, context);
+ add_object_array_with_path(obj, name, array, mode, NULL);
}

/*
@@ -363,6 +357,7 @@ static void object_array_release_entry(struct object_array_entry *ent)
{
if (ent->name != object_array_slopbuf)
free(ent->name);
+ free(ent->path);
}

void object_array_filter(struct object_array *array,
diff --git a/object.h b/object.h
index 2a755a2..e5178a5 100644
--- a/object.h
+++ b/object.h
@@ -18,8 +18,8 @@ struct object_array {
* empty string.
*/
char *name;
+ char *path;
unsigned mode;
- struct object_context *context;
} *objects;
};

@@ -115,7 +115,7 @@ int object_list_contains(struct object_list *list, struct object *obj);
/* Object array handling .. */
void add_object_array(struct object *obj, const char *name, struct object_array *array);
void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode);
-void add_object_array_with_context(struct object *obj, const char *name, struct object_array *array, struct object_context *context);
+void add_object_array_with_path(struct object *obj, const char *name, struct object_array *array, unsigned mode, const char *path);

typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:43:19 UTC
Permalink
When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).

When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.

However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:

1. Use the "path" field as the path instead of the "name"
field in traverse_commit_list. If the path is not set,
default to "" (which is what we always ended up with in
the current code, because of prepare_revision_walk).

2. In prepare_revision_walk, make a complete copy of the
entry. This makes the path field available to the
walker (if there is one), solving our problem.
Leaving the name field intact is now OK, as we do not
use it as a path due to point (1) above (and we can use
it to make more meaningful error messages if we want).
We also make the original "mode" field available to the
walker, though it does not actually use it.

Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.

Signed-off-by: Jeff King <***@peff.net>
---
list-objects.c | 7 +++++--
revision.c | 30 +++++++++++++++++++++++-------
2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index fad6808..2910bec 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -208,6 +208,7 @@ void traverse_commit_list(struct rev_info *revs,
struct object_array_entry *pending = revs->pending.objects + i;
struct object *obj = pending->item;
const char *name = pending->name;
+ const char *path = pending->path;
if (obj->flags & (UNINTERESTING | SEEN))
continue;
if (obj->type == OBJ_TAG) {
@@ -215,14 +216,16 @@ void traverse_commit_list(struct rev_info *revs,
show_object(obj, NULL, name, data);
continue;
}
+ if (!path)
+ path = "";
if (obj->type == OBJ_TREE) {
process_tree(revs, (struct tree *)obj, show_object,
- NULL, &base, name, data);
+ NULL, &base, path, data);
continue;
}
if (obj->type == OBJ_BLOB) {
process_blob(revs, (struct blob *)obj, show_object,
- NULL, name, data);
+ NULL, path, data);
continue;
}
die("unknown pending object %s (%s)",
diff --git a/revision.c b/revision.c
index b8e02e2..9a0f99a 100644
--- a/revision.c
+++ b/revision.c
@@ -198,9 +198,10 @@ void mark_parents_uninteresting(struct commit *commit)
}
}

-static void add_pending_object_with_mode(struct rev_info *revs,
+static void add_pending_object_with_path(struct rev_info *revs,
struct object *obj,
- const char *name, unsigned mode)
+ const char *name, unsigned mode,
+ const char *path)
{
if (!obj)
return;
@@ -220,7 +221,20 @@ static void add_pending_object_with_mode(struct rev_info *revs,
if (st)
return;
}
- add_object_array_with_mode(obj, name, &revs->pending, mode);
+ add_object_array_with_path(obj, name, &revs->pending, mode, path);
+}
+
+static void add_pending_object_with_mode(struct rev_info *revs,
+ struct object *obj,
+ const char *name, unsigned mode)
+{
+ add_pending_object_with_path(revs, obj, name, mode, NULL);
+}
+
+static void add_pending_object_from_entry(struct rev_info *revs,
+ struct object_array_entry *e)
+{
+ add_pending_object_with_path(revs, e->item, e->name, e->mode, e->path);
}

void add_pending_object(struct rev_info *revs,
@@ -265,8 +279,10 @@ void add_pending_sha1(struct rev_info *revs, const char *name,
}

static struct commit *handle_commit(struct rev_info *revs,
- struct object *object, const char *name)
+ struct object_array_entry *entry)
{
+ struct object *object = entry->item;
+ const char *name = entry->name;
unsigned long flags = object->flags;

/*
@@ -316,7 +332,7 @@ static struct commit *handle_commit(struct rev_info *revs,
mark_tree_contents_uninteresting(tree);
return NULL;
}
- add_pending_object(revs, object, "");
+ add_pending_object_from_entry(revs, entry);
return NULL;
}

@@ -328,7 +344,7 @@ static struct commit *handle_commit(struct rev_info *revs,
return NULL;
if (flags & UNINTERESTING)
return NULL;
- add_pending_object(revs, object, "");
+ add_pending_object_from_entry(revs, entry);
return NULL;
}
die("%s is unknown object", name);
@@ -2666,7 +2682,7 @@ int prepare_revision_walk(struct rev_info *revs)
revs->pending.objects = NULL;
for (i = 0; i < old_pending.nr; i++) {
struct object_array_entry *e = old_pending.objects + i;
- struct commit *commit = handle_commit(revs, e->item, e->name);
+ struct commit *commit = handle_commit(revs, e);
if (commit) {
if (!(commit->object.flags & SEEN)) {
commit->object.flags |= SEEN;
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:43:28 UTC
Permalink
This is mostly used internally, but it does not hurt to
explain it.

Signed-off-by: Jeff King <***@peff.net>
---
Documentation/rev-list-options.txt | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 5d311b8..4cf94c6 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -168,6 +168,10 @@ respectively, and they must begin with `refs/` when applied to `--glob`
or `--all`. If a trailing '/{asterisk}' is intended, it must be given
explicitly.

+--reflog::
+ Pretend as if all objects mentioned by reflogs are listed on the
+ command line as `<commit>`.
+
--ignore-missing::
Upon seeing an invalid object name in the input, pretend as if
the bad input was not given.
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:44:30 UTC
Permalink
There is currently no easy way to ask the revision traversal
machinery to include objects reachable from the index (e.g.,
blobs and trees that have not yet been committed). This
patch adds an option to do so.

Signed-off-by: Jeff King <***@peff.net>
---
I was tempted to call this just "--index", because I could not think of
what else "--index" would mean in the context of rev-list. But I also
worried about weird interactions with other commands that take revision
arguments. And since this is mostly for internal use anyway, I figured
the more verbose name is not too bad. I could be convinced otherwise,
though.

Documentation/rev-list-options.txt | 7 ++++++
revision.c | 51 ++++++++++++++++++++++++++++++++++++++
revision.h | 1 +
t/t6000-rev-list-misc.sh | 23 +++++++++++++++++
4 files changed, 82 insertions(+)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 4cf94c6..03ab343 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,6 +172,13 @@ explicitly.
Pretend as if all objects mentioned by reflogs are listed on the
command line as `<commit>`.

+--index-objects::
+ Pretend as if all objects used by the index (any blobs, and any
+ trees which are mentioned by the index's cache-tree extension)
+ ad listed on the command line. Note that you probably want to
+ use `--objects`, too, as there are by definition no commits in
+ the index.
+
--ignore-missing::
Upon seeing an invalid object name in the input, pretend as if
the bad input was not given.
diff --git a/revision.c b/revision.c
index 9a0f99a..caabaf1 100644
--- a/revision.c
+++ b/revision.c
@@ -17,6 +17,7 @@
#include "mailmap.h"
#include "commit-slab.h"
#include "dir.h"
+#include "cache-tree.h"

volatile show_early_output_fn_t show_early_output;

@@ -1299,6 +1300,53 @@ void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
for_each_reflog(handle_one_reflog, &cb);
}

+static void add_cache_tree(struct cache_tree *it, struct rev_info *revs,
+ struct strbuf *path)
+{
+ size_t baselen = path->len;
+ int i;
+
+ if (it->entry_count >= 0) {
+ struct tree *tree = lookup_tree(it->sha1);
+ add_pending_object_with_path(revs, &tree->object, "",
+ 040000, path->buf);
+ }
+
+ for (i = 0; i < it->subtree_nr; i++) {
+ struct cache_tree_sub *sub = it->down[i];
+ strbuf_addf(path, "%s%s", baselen ? "/" : "", sub->name);
+ add_cache_tree(sub->cache_tree, revs, path);
+ strbuf_setlen(path, baselen);
+ }
+
+}
+
+void add_index_objects_to_pending(struct rev_info *revs, unsigned flags)
+{
+ int i;
+
+ read_cache();
+ for (i = 0; i < active_nr; i++) {
+ struct cache_entry *ce = active_cache[i];
+ struct blob *blob;
+
+ if (S_ISGITLINK(ce->ce_mode))
+ continue;
+
+ blob = lookup_blob(ce->sha1);
+ if (!blob)
+ die("unable to add index blob to traversal");
+ add_pending_object_with_path(revs, &blob->object, "",
+ ce->ce_mode, ce->name);
+ }
+
+ if (active_cache_tree) {
+ struct strbuf path = STRBUF_INIT;
+ add_cache_tree(active_cache_tree, revs, &path);
+ strbuf_release(&path);
+ }
+}
+
static int add_parents_only(struct rev_info *revs, const char *arg_, int flags)
{
unsigned char sha1[20];
@@ -1649,6 +1697,7 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
!strcmp(arg, "--reflog") || !strcmp(arg, "--not") ||
!strcmp(arg, "--no-walk") || !strcmp(arg, "--do-walk") ||
!strcmp(arg, "--bisect") || starts_with(arg, "--glob=") ||
+ !strcmp(arg, "--index-objects") ||
starts_with(arg, "--exclude=") ||
starts_with(arg, "--branches=") || starts_with(arg, "--tags=") ||
starts_with(arg, "--remotes=") || starts_with(arg, "--no-walk="))
@@ -2078,6 +2127,8 @@ static int handle_revision_pseudo_opt(const char *submodule,
clear_ref_exclusion(&revs->ref_excludes);
} else if (!strcmp(arg, "--reflog")) {
add_reflogs_to_pending(revs, *flags);
+ } else if (!strcmp(arg, "--index-objects")) {
+ add_index_objects_to_pending(revs, *flags);
} else if (!strcmp(arg, "--not")) {
*flags ^= UNINTERESTING | BOTTOM;
} else if (!strcmp(arg, "--no-walk")) {
diff --git a/revision.h b/revision.h
index e644044..e6dcd5d 100644
--- a/revision.h
+++ b/revision.h
@@ -277,6 +277,7 @@ extern void add_pending_sha1(struct rev_info *revs,

extern void add_head_to_pending(struct rev_info *);
extern void add_reflogs_to_pending(struct rev_info *, unsigned int flags);
+extern void add_index_objects_to_pending(struct rev_info *, unsigned int flags);

enum commit_action {
commit_ignore,
diff --git a/t/t6000-rev-list-misc.sh b/t/t6000-rev-list-misc.sh
index 3794e4c..c2fb09c 100755
--- a/t/t6000-rev-list-misc.sh
+++ b/t/t6000-rev-list-misc.sh
@@ -73,4 +73,27 @@ test_expect_success 'symleft flag bit is propagated down from tag' '
test_cmp expect actual
'

+test_expect_success 'rev-list can show index objects' '
+ # Of the blobs and trees in the index, note:
+ #
+ # - we do not show two/three, because it is the
+ # same blob as "one", and we show objects only once
+ #
+ # - we do show the tree "two", because it has a valid cache tree
+ # from the last commit
+ #
+ # - we do not show the root tree; since we updated the index, it
+ # does not have a valid cache tree
+ #
+ cat >expect <<-\EOF
+ 8e4020bb5a8d8c873b25de15933e75cc0fc275df one
+ d9d3a7417b9605cfd88ee6306b28dadc29e6ab08 only-in-index
+ 9200b628cf9dc883a85a7abc8d6e6730baee589c two
+ EOF
+ echo only-in-index >only-in-index &&
+ git add only-in-index &&
+ git rev-list --objects --index-objects >actual &&
+ test_cmp expect actual
+'
+
test_done
--
2.1.2.596.g7379948
Junio C Hamano
2014-10-16 18:41:35 UTC
Permalink
Post by Jeff King
There is currently no easy way to ask the revision traversal
machinery to include objects reachable from the index (e.g.,
blobs and trees that have not yet been committed). This
patch adds an option to do so.
---
I was tempted to call this just "--index", because I could not think of
what else "--index" would mean in the context of rev-list. But I also
worried about weird interactions with other commands that take revision
arguments. And since this is mostly for internal use anyway, I figured
the more verbose name is not too bad. I could be convinced otherwise,
though.
I agree that "--index" is a bad name as it usually is used in a
particular context: the command can work on various combination of
working tree and the index, and I am asking it to work on both
(e.g. "apply --index" as opposed to "apply --cached").
Post by Jeff King
diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 4cf94c6..03ab343 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,6 +172,13 @@ explicitly.
Pretend as if all objects mentioned by reflogs are listed on the
command line as `<commit>`.
This risks "index" getting misunderstood as a verb, e.g. "please
enumerate the objects and assign labels to later refer to them",
doesn't it?

"--indexed-objects" (short for "--show-objects-in-the-index") or
something?
Post by Jeff King
+ Pretend as if all objects used by the index (any blobs, and any
+ trees which are mentioned by the index's cache-tree extension)
+ ad listed on the command line. Note that you probably want to
s/ad/are/, probably?
Post by Jeff King
+ use `--objects`, too, as there are by definition no commits in
+ the index.
For gitlinks/submodules, the index records names of the commit
objects, they are not listed, and that is the right behaviour, but
this description invites some confusion.
Jeff King
2014-10-17 00:12:31 UTC
Permalink
Post by Junio C Hamano
I agree that "--index" is a bad name as it usually is used in a
particular context: the command can work on various combination of
working tree and the index, and I am asking it to work on both
(e.g. "apply --index" as opposed to "apply --cached").
Thanks. I wasn't sure if I was just being paranoid or not, but now there
are at least two of us.
Post by Junio C Hamano
This risks "index" getting misunderstood as a verb, e.g. "please
enumerate the objects and assign labels to later refer to them",
doesn't it?
"--indexed-objects" (short for "--show-objects-in-the-index") or
something?
That sounds reasonable. We could technically do `--indexed` as that is
different from `--index`, but maybe they are still confusingly close.
Post by Junio C Hamano
Post by Jeff King
+ Pretend as if all objects used by the index (any blobs, and any
+ trees which are mentioned by the index's cache-tree extension)
+ ad listed on the command line. Note that you probably want to
s/ad/are/, probably?
Yeah, sorry, vi cruft I think (at least I didn't accidentally insert
"C-a C-k" ;) ).
Post by Junio C Hamano
Post by Jeff King
+ use `--objects`, too, as there are by definition no commits in
+ the index.
For gitlinks/submodules, the index records names of the commit
objects, they are not listed, and that is the right behaviour, but
this description invites some confusion.
Good point. How about this:

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 03ab343..3301fde 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,12 +172,10 @@ explicitly.
Pretend as if all objects mentioned by reflogs are listed on the
command line as `<commit>`.

---index-objects::
- Pretend as if all objects used by the index (any blobs, and any
- trees which are mentioned by the index's cache-tree extension)
- ad listed on the command line. Note that you probably want to
- use `--objects`, too, as there are by definition no commits in
- the index.
+--indexed-objects::
+ Pretend as if all trees and blobs used by the index are listed
+ on the command line. Note that you probably want to use
+ `--objects`, too.

--ignore-missing::
Upon seeing an invalid object name in the input, pretend as if

I was tempted to not document this at all (nor to add documentation for
--reflog), as I think these are really only going to be used internally.
But it's nice to have documentation even for this internal stuff (if
anything, we should probably be making sure they are just limited to
rev-list plumbing, and not included in the log manpage).

-Peff
Jeff King
2014-10-17 00:43:46 UTC
Permalink
Post by Jeff King
Post by Junio C Hamano
"--indexed-objects" (short for "--show-objects-in-the-index") or
something?
That sounds reasonable. We could technically do `--indexed` as that is
different from `--index`, but maybe they are still confusingly close.
Here's a re-roll of the final 5 patches of the series with the updated
name (`--indexed-objects`). The name change cascades from patch 22 to
patches 23 and 25 (and I renamed the matching pack-objects option as
well). 24 and 26 are unchanged, but I figured it is easier on you to
replace the whole block of patches at once.

[22/26]: rev-list: add --indexed-objects option
[23/26]: reachable: use revision machinery's --indexed-objects code
[24/26]: pack-objects: use argv_array
[25/26]: repack: pack objects mentioned by the index
[26/26]: pack-objects: double-check options before discarding objects
Jeff King
2014-10-17 00:44:23 UTC
Permalink
There is currently no easy way to ask the revision traversal
machinery to include objects reachable from the index (e.g.,
blobs and trees that have not yet been committed). This
patch adds an option to do so.

Signed-off-by: Jeff King <***@peff.net>
---
Documentation/rev-list-options.txt | 5 ++++
revision.c | 51 ++++++++++++++++++++++++++++++++++++++
revision.h | 1 +
t/t6000-rev-list-misc.sh | 23 +++++++++++++++++
4 files changed, 80 insertions(+)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 4cf94c6..3301fde 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,6 +172,11 @@ explicitly.
Pretend as if all objects mentioned by reflogs are listed on the
command line as `<commit>`.

+--indexed-objects::
+ Pretend as if all trees and blobs used by the index are listed
+ on the command line. Note that you probably want to use
+ `--objects`, too.
+
--ignore-missing::
Upon seeing an invalid object name in the input, pretend as if
the bad input was not given.
diff --git a/revision.c b/revision.c
index 8030fc8..a6c5dc3 100644
--- a/revision.c
+++ b/revision.c
@@ -17,6 +17,7 @@
#include "mailmap.h"
#include "commit-slab.h"
#include "dir.h"
+#include "cache-tree.h"

volatile show_early_output_fn_t show_early_output;

@@ -1303,6 +1304,53 @@ void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
for_each_reflog(handle_one_reflog, &cb);
}

+static void add_cache_tree(struct cache_tree *it, struct rev_info *revs,
+ struct strbuf *path)
+{
+ size_t baselen = path->len;
+ int i;
+
+ if (it->entry_count >= 0) {
+ struct tree *tree = lookup_tree(it->sha1);
+ add_pending_object_with_path(revs, &tree->object, "",
+ 040000, path->buf);
+ }
+
+ for (i = 0; i < it->subtree_nr; i++) {
+ struct cache_tree_sub *sub = it->down[i];
+ strbuf_addf(path, "%s%s", baselen ? "/" : "", sub->name);
+ add_cache_tree(sub->cache_tree, revs, path);
+ strbuf_setlen(path, baselen);
+ }
+
+}
+
+void add_index_objects_to_pending(struct rev_info *revs, unsigned flags)
+{
+ int i;
+
+ read_cache();
+ for (i = 0; i < active_nr; i++) {
+ struct cache_entry *ce = active_cache[i];
+ struct blob *blob;
+
+ if (S_ISGITLINK(ce->ce_mode))
+ continue;
+
+ blob = lookup_blob(ce->sha1);
+ if (!blob)
+ die("unable to add index blob to traversal");
+ add_pending_object_with_path(revs, &blob->object, "",
+ ce->ce_mode, ce->name);
+ }
+
+ if (active_cache_tree) {
+ struct strbuf path = STRBUF_INIT;
+ add_cache_tree(active_cache_tree, revs, &path);
+ strbuf_release(&path);
+ }
+}
+
static int add_parents_only(struct rev_info *revs, const char *arg_, int flags)
{
unsigned char sha1[20];
@@ -1653,6 +1701,7 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
!strcmp(arg, "--reflog") || !strcmp(arg, "--not") ||
!strcmp(arg, "--no-walk") || !strcmp(arg, "--do-walk") ||
!strcmp(arg, "--bisect") || starts_with(arg, "--glob=") ||
+ !strcmp(arg, "--indexed-objects") ||
starts_with(arg, "--exclude=") ||
starts_with(arg, "--branches=") || starts_with(arg, "--tags=") ||
starts_with(arg, "--remotes=") || starts_with(arg, "--no-walk="))
@@ -2082,6 +2131,8 @@ static int handle_revision_pseudo_opt(const char *submodule,
clear_ref_exclusion(&revs->ref_excludes);
} else if (!strcmp(arg, "--reflog")) {
add_reflogs_to_pending(revs, *flags);
+ } else if (!strcmp(arg, "--indexed-objects")) {
+ add_index_objects_to_pending(revs, *flags);
} else if (!strcmp(arg, "--not")) {
*flags ^= UNINTERESTING | BOTTOM;
} else if (!strcmp(arg, "--no-walk")) {
diff --git a/revision.h b/revision.h
index e644044..e6dcd5d 100644
--- a/revision.h
+++ b/revision.h
@@ -277,6 +277,7 @@ extern void add_pending_sha1(struct rev_info *revs,

extern void add_head_to_pending(struct rev_info *);
extern void add_reflogs_to_pending(struct rev_info *, unsigned int flags);
+extern void add_index_objects_to_pending(struct rev_info *, unsigned int flags);

enum commit_action {
commit_ignore,
diff --git a/t/t6000-rev-list-misc.sh b/t/t6000-rev-list-misc.sh
index 3794e4c..2602086 100755
--- a/t/t6000-rev-list-misc.sh
+++ b/t/t6000-rev-list-misc.sh
@@ -73,4 +73,27 @@ test_expect_success 'symleft flag bit is propagated down from tag' '
test_cmp expect actual
'

+test_expect_success 'rev-list can show index objects' '
+ # Of the blobs and trees in the index, note:
+ #
+ # - we do not show two/three, because it is the
+ # same blob as "one", and we show objects only once
+ #
+ # - we do show the tree "two", because it has a valid cache tree
+ # from the last commit
+ #
+ # - we do not show the root tree; since we updated the index, it
+ # does not have a valid cache tree
+ #
+ cat >expect <<-\EOF
+ 8e4020bb5a8d8c873b25de15933e75cc0fc275df one
+ d9d3a7417b9605cfd88ee6306b28dadc29e6ab08 only-in-index
+ 9200b628cf9dc883a85a7abc8d6e6730baee589c two
+ EOF
+ echo only-in-index >only-in-index &&
+ git add only-in-index &&
+ git rev-list --objects --indexed-objects >actual &&
+ test_cmp expect actual
+'
+
test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-17 00:44:30 UTC
Permalink
This does the same thing as our custom code, so let's not
repeat ourselves.

Signed-off-by: Jeff King <***@peff.net>
---
reachable.c | 52 +---------------------------------------------------
1 file changed, 1 insertion(+), 51 deletions(-)

diff --git a/reachable.c b/reachable.c
index 0176a88..a647267 100644
--- a/reachable.c
+++ b/reachable.c
@@ -32,56 +32,6 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
return 0;
}

-static void add_one_tree(const unsigned char *sha1, struct rev_info *revs)
-{
- struct tree *tree = lookup_tree(sha1);
- if (tree)
- add_pending_object(revs, &tree->object, "");
-}
-
-static void add_cache_tree(struct cache_tree *it, struct rev_info *revs)
-{
- int i;
-
- if (it->entry_count >= 0)
- add_one_tree(it->sha1, revs);
- for (i = 0; i < it->subtree_nr; i++)
- add_cache_tree(it->down[i]->cache_tree, revs);
-}
-
-static void add_cache_refs(struct rev_info *revs)
-{
- int i;
-
- read_cache();
- for (i = 0; i < active_nr; i++) {
- struct blob *blob;
-
- /*
- * The index can contain blobs and GITLINKs, GITLINKs are hashes
- * that don't actually point to objects in the repository, it's
- * almost guaranteed that they are NOT blobs, so we don't call
- * lookup_blob() on them, to avoid populating the hash table
- * with invalid information
- */
- if (S_ISGITLINK(active_cache[i]->ce_mode))
- continue;
-
- blob = lookup_blob(active_cache[i]->sha1);
- if (blob)
- blob->object.flags |= SEEN;
-
- /*
- * We could add the blobs to the pending list, but quite
- * frankly, we don't care. Once we've looked them up, and
- * added them as objects, we've really done everything
- * there is to do for a blob
- */
- }
- if (active_cache_tree)
- add_cache_tree(active_cache_tree, revs);
-}
-
/*
* The traversal will have already marked us as SEEN, so we
* only need to handle any progress reporting here.
@@ -213,7 +163,7 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
revs->tree_objects = 1;

/* Add all refs from the index file */
- add_cache_refs(revs);
+ add_index_objects_to_pending(revs, 0);

/* Add all external refs */
for_each_ref(add_one_ref, revs);
--
2.1.2.596.g7379948
Jeff King
2014-10-17 00:44:35 UTC
Permalink
This saves us from having to bump the rp_av count when we
add new traversal options.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/pack-objects.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4df9499..b26276b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -22,6 +22,7 @@
#include "pack-bitmap.h"
#include "reachable.h"
#include "sha1-array.h"
+#include "argv-array.h"

static const char *pack_usage[] = {
N_("git pack-objects --stdout [options...] [< ref-list | < object-list]"),
@@ -2614,8 +2615,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
int use_internal_rev_list = 0;
int thin = 0;
int all_progress_implied = 0;
- const char *rp_av[6];
- int rp_ac = 0;
+ struct argv_array rp = ARGV_ARRAY_INIT;
int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
struct option pack_objects_options[] = {
OPT_SET_INT('q', "quiet", &progress,
@@ -2705,24 +2705,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (pack_to_stdout != !base_name || argc)
usage_with_options(pack_usage, pack_objects_options);

- rp_av[rp_ac++] = "pack-objects";
+ argv_array_push(&rp, "pack-objects");
if (thin) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--objects-edge";
+ argv_array_push(&rp, "--objects-edge");
} else
- rp_av[rp_ac++] = "--objects";
+ argv_array_push(&rp, "--objects");

if (rev_list_all) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--all";
+ argv_array_push(&rp, "--all");
}
if (rev_list_reflog) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--reflog";
+ argv_array_push(&rp, "--reflog");
}
if (rev_list_unpacked) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--unpacked";
+ argv_array_push(&rp, "--unpacked");
}

if (!reuse_object)
@@ -2766,8 +2766,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (!use_internal_rev_list)
read_object_list_from_stdin();
else {
- rp_av[rp_ac] = NULL;
- get_object_list(rp_ac, rp_av);
+ get_object_list(rp.argc, rp.argv);
+ argv_array_clear(&rp);
}
cleanup_preferred_base();
if (include_tag && nr_result)
--
2.1.2.596.g7379948
Jeff King
2014-10-17 00:44:49 UTC
Permalink
When we pack all objects, we use only the objects reachable
from references and reflogs. This misses any objects which
are reachable from the index, but not yet referenced.

By itself this isn't a big deal; the objects can remain
loose until they are actually used in a commit. However, it
does create a problem when we drop packed but unreachable
objects. We try to optimize out the writing of objects that
we will immediately prune, which means we must follow the
same rules as prune in determining what is reachable. And
prune uses the index for this purpose.

This is rather uncommon in practice, as objects in the index
would not usually have been packed in the first place. But
it could happen in a sequence like:

1. You make a commit on a branch that references blob X.

2. You repack, moving X into the pack.

3. You delete the branch (and its reflog), so that X is
unreferenced.

4. You "git add" blob X so that it is now referenced only
by the index.

5. You repack again with git-gc. The pack-objects we
invoke will see that X is neither referenced nor
recent and not bother loosening it.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/pack-objects.c | 8 ++++++++
builtin/repack.c | 1 +
t/t7701-repack-unpack-unreachable.sh | 13 +++++++++++++
3 files changed, 22 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b26276b..0cf95c9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2617,6 +2617,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
int all_progress_implied = 0;
struct argv_array rp = ARGV_ARRAY_INIT;
int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+ int rev_list_index = 0;
struct option pack_objects_options[] = {
OPT_SET_INT('q', "quiet", &progress,
N_("do not show progress meter"), 0),
@@ -2663,6 +2664,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
{ OPTION_SET_INT, 0, "reflog", &rev_list_reflog, NULL,
N_("include objects referred by reflog entries"),
PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
+ { OPTION_SET_INT, 0, "indexed-objects", &rev_list_index, NULL,
+ N_("include objects referred to by the index"),
+ PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
OPT_BOOL(0, "stdout", &pack_to_stdout,
N_("output pack to stdout")),
OPT_BOOL(0, "include-tag", &include_tag,
@@ -2720,6 +2724,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
use_internal_rev_list = 1;
argv_array_push(&rp, "--reflog");
}
+ if (rev_list_index) {
+ use_internal_rev_list = 1;
+ argv_array_push(&rp, "--indexed-objects");
+ }
if (rev_list_unpacked) {
use_internal_rev_list = 1;
argv_array_push(&rp, "--unpacked");
diff --git a/builtin/repack.c b/builtin/repack.c
index 2aae05d..2845620 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -209,6 +209,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
argv_array_push(&cmd_args, "--non-empty");
argv_array_push(&cmd_args, "--all");
argv_array_push(&cmd_args, "--reflog");
+ argv_array_push(&cmd_args, "--indexed-objects");
if (window)
argv_array_pushf(&cmd_args, "--window=%s", window);
if (window_memory)
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index b8d4cde..aad8a9c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -109,4 +109,17 @@ test_expect_success 'do not bother loosening old objects' '
test_must_fail git cat-file -p $obj2
'

+test_expect_success 'keep packed objects found only in index' '
+ echo my-unique-content >file &&
+ git add file &&
+ git commit -m "make it reachable" &&
+ git gc &&
+ git reset HEAD^ &&
+ git reflog expire --expire=now --all &&
+ git add file &&
+ test-chmtime =-86400 .git/objects/pack/* &&
+ git gc --prune=1.hour.ago &&
+ git cat-file blob :file
+'
+
test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-17 00:44:54 UTC
Permalink
When we are given an expiration time like
--unpack-unreachable=2.weeks.ago, we avoid writing out old,
unreachable loose objects entirely, under the assumption
that running "prune" would simply delete them immediately
anyway. However, this is only valid if we computed the same
set of reachable objects as prune would.

In practice, this is the case, because only git-repack uses
the --unpack-unreachable option with an expiration, and it
always feeds as many objects into the pack as possible. But
we can double-check at runtime just to be sure.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/pack-objects.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0cf95c9..64123d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2757,6 +2757,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)

if (keep_unreachable && unpack_unreachable)
die("--keep-unreachable and --unpack-unreachable are incompatible.");
+ if (!rev_list_all || !rev_list_reflog || !rev_list_index)
+ unpack_unreachable_expiration = 0;

if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
use_bitmap_index = 0;
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:44:51 UTC
Permalink
This does the same thing as our custom code, so let's not
repeat ourselves.

Signed-off-by: Jeff King <***@peff.net>
---
reachable.c | 52 +---------------------------------------------------
1 file changed, 1 insertion(+), 51 deletions(-)

diff --git a/reachable.c b/reachable.c
index 0176a88..a647267 100644
--- a/reachable.c
+++ b/reachable.c
@@ -32,56 +32,6 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
return 0;
}

-static void add_one_tree(const unsigned char *sha1, struct rev_info *revs)
-{
- struct tree *tree = lookup_tree(sha1);
- if (tree)
- add_pending_object(revs, &tree->object, "");
-}
-
-static void add_cache_tree(struct cache_tree *it, struct rev_info *revs)
-{
- int i;
-
- if (it->entry_count >= 0)
- add_one_tree(it->sha1, revs);
- for (i = 0; i < it->subtree_nr; i++)
- add_cache_tree(it->down[i]->cache_tree, revs);
-}
-
-static void add_cache_refs(struct rev_info *revs)
-{
- int i;
-
- read_cache();
- for (i = 0; i < active_nr; i++) {
- struct blob *blob;
-
- /*
- * The index can contain blobs and GITLINKs, GITLINKs are hashes
- * that don't actually point to objects in the repository, it's
- * almost guaranteed that they are NOT blobs, so we don't call
- * lookup_blob() on them, to avoid populating the hash table
- * with invalid information
- */
- if (S_ISGITLINK(active_cache[i]->ce_mode))
- continue;
-
- blob = lookup_blob(active_cache[i]->sha1);
- if (blob)
- blob->object.flags |= SEEN;
-
- /*
- * We could add the blobs to the pending list, but quite
- * frankly, we don't care. Once we've looked them up, and
- * added them as objects, we've really done everything
- * there is to do for a blob
- */
- }
- if (active_cache_tree)
- add_cache_tree(active_cache_tree, revs);
-}
-
/*
* The traversal will have already marked us as SEEN, so we
* only need to handle any progress reporting here.
@@ -213,7 +163,7 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
revs->tree_objects = 1;

/* Add all refs from the index file */
- add_cache_refs(revs);
+ add_index_objects_to_pending(revs, 0);

/* Add all external refs */
for_each_ref(add_one_ref, revs);
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:45:16 UTC
Permalink
This saves us from having to bump the rp_av count when we
add new traversal options.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/pack-objects.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4df9499..b26276b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -22,6 +22,7 @@
#include "pack-bitmap.h"
#include "reachable.h"
#include "sha1-array.h"
+#include "argv-array.h"

static const char *pack_usage[] = {
N_("git pack-objects --stdout [options...] [< ref-list | < object-list]"),
@@ -2614,8 +2615,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
int use_internal_rev_list = 0;
int thin = 0;
int all_progress_implied = 0;
- const char *rp_av[6];
- int rp_ac = 0;
+ struct argv_array rp = ARGV_ARRAY_INIT;
int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
struct option pack_objects_options[] = {
OPT_SET_INT('q', "quiet", &progress,
@@ -2705,24 +2705,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (pack_to_stdout != !base_name || argc)
usage_with_options(pack_usage, pack_objects_options);

- rp_av[rp_ac++] = "pack-objects";
+ argv_array_push(&rp, "pack-objects");
if (thin) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--objects-edge";
+ argv_array_push(&rp, "--objects-edge");
} else
- rp_av[rp_ac++] = "--objects";
+ argv_array_push(&rp, "--objects");

if (rev_list_all) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--all";
+ argv_array_push(&rp, "--all");
}
if (rev_list_reflog) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--reflog";
+ argv_array_push(&rp, "--reflog");
}
if (rev_list_unpacked) {
use_internal_rev_list = 1;
- rp_av[rp_ac++] = "--unpacked";
+ argv_array_push(&rp, "--unpacked");
}

if (!reuse_object)
@@ -2766,8 +2766,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
if (!use_internal_rev_list)
read_object_list_from_stdin();
else {
- rp_av[rp_ac] = NULL;
- get_object_list(rp_ac, rp_av);
+ get_object_list(rp.argc, rp.argv);
+ argv_array_clear(&rp);
}
cleanup_preferred_base();
if (include_tag && nr_result)
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:46:32 UTC
Permalink
When we pack all objects, we use only the objects reachable
from references and reflogs. This misses any objects which
are reachable from the index, but not yet referenced.

By itself this isn't a big deal; the objects can remain
loose until they are actually used in a commit. However, it
does create a problem when we drop packed but unreachable
objects. We try to optimize out the writing of objects that
we will immediately prune, which means we must follow the
same rules as prune in determining what is reachable. And
prune uses the index for this purpose.

This is rather uncommon in practice, as objects in the index
would not usually have been packed in the first place. But
it could happen in a sequence like:

1. You make a commit on a branch that references blob X.

2. You repack, moving X into the pack.

3. You delete the branch (and its reflog), so that X is
unreferenced.

4. You "git add" blob X so that it is now referenced only
by the index.

5. You repack again with git-gc. The pack-objects we
invoke will see that X is neither referenced nor
recent and not bother loosening it.

Signed-off-by: Jeff King <***@peff.net>
---
In addition to fixing the safety, this is actually _packing_ those index
objects. I don't see anything wrong with that, but if would prefer not
to, it should be possible to do a separate traversal over the index
objects (and mark them as "to keep", but not "to pack").

builtin/pack-objects.c | 8 ++++++++
builtin/repack.c | 1 +
t/t7701-repack-unpack-unreachable.sh | 13 +++++++++++++
3 files changed, 22 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b26276b..3dc9108 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2617,6 +2617,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
int all_progress_implied = 0;
struct argv_array rp = ARGV_ARRAY_INIT;
int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+ int rev_list_index = 0;
struct option pack_objects_options[] = {
OPT_SET_INT('q', "quiet", &progress,
N_("do not show progress meter"), 0),
@@ -2663,6 +2664,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
{ OPTION_SET_INT, 0, "reflog", &rev_list_reflog, NULL,
N_("include objects referred by reflog entries"),
PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
+ { OPTION_SET_INT, 0, "index-objects", &rev_list_index, NULL,
+ N_("include objects referred to by the index"),
+ PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
OPT_BOOL(0, "stdout", &pack_to_stdout,
N_("output pack to stdout")),
OPT_BOOL(0, "include-tag", &include_tag,
@@ -2720,6 +2724,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
use_internal_rev_list = 1;
argv_array_push(&rp, "--reflog");
}
+ if (rev_list_index) {
+ use_internal_rev_list = 1;
+ argv_array_push(&rp, "--index-objects");
+ }
if (rev_list_unpacked) {
use_internal_rev_list = 1;
argv_array_push(&rp, "--unpacked");
diff --git a/builtin/repack.c b/builtin/repack.c
index 2aae05d..80b6021 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -209,6 +209,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
argv_array_push(&cmd_args, "--non-empty");
argv_array_push(&cmd_args, "--all");
argv_array_push(&cmd_args, "--reflog");
+ argv_array_push(&cmd_args, "--index-objects");
if (window)
argv_array_pushf(&cmd_args, "--window=%s", window);
if (window_memory)
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index b8d4cde..aad8a9c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -109,4 +109,17 @@ test_expect_success 'do not bother loosening old objects' '
test_must_fail git cat-file -p $obj2
'

+test_expect_success 'keep packed objects found only in index' '
+ echo my-unique-content >file &&
+ git add file &&
+ git commit -m "make it reachable" &&
+ git gc &&
+ git reset HEAD^ &&
+ git reflog expire --expire=now --all &&
+ git add file &&
+ test-chmtime =-86400 .git/objects/pack/* &&
+ git gc --prune=1.hour.ago &&
+ git cat-file blob :file
+'
+
test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-15 22:48:15 UTC
Permalink
When we are given an expiration time like
--unpack-unreachable=2.weeks.ago, we avoid writing out old,
unreachable loose objects entirely, under the assumption
that running "prune" would simply delete them immediately
anyway. However, this is only valid if we computed the same
set of reachable objects as prune would.

In practice, this is the case, because only git-repack uses
the --unpack-unreachable option with an expiration, and it
always feeds as many objects into the pack as possible. But
we can double-check at runtime just to be sure.

Signed-off-by: Jeff King <***@peff.net>
---
No test here, because the potential breakage cannot be seen by running
"git repack" (because it gives sane options), nor by just running "git
pack-objects" (you can convince it not explode a loose object, but then
you would have to reimplement the part of "git repack" where it deletes
all the packs except the one we just created). So really this would only
be protecting in practice against somebody who tried to reimplement
git-repack themselves (and I do not know of anybody who has done that).

builtin/pack-objects.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3dc9108..72589ed 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2757,6 +2757,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)

if (keep_unreachable && unpack_unreachable)
die("--keep-unreachable and --unpack-unreachable are incompatible.");
+ if (!rev_list_all || !rev_list_reflog || !rev_list_index)
+ unpack_unreachable_expiration = 0;

if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
use_bitmap_index = 0;
--
2.1.2.596.g7379948
Junio C Hamano
2014-10-16 21:07:47 UTC
Permalink
Somewhere in this series the object enumeration seems to get
broken. "git clone --no-local" of git.git repository died
with

Cloning into 'victim-7'...
remote: Counting objects: 173727, done.
remote: Compressing objects: 100% (43697/43697), done.
remote: Total 173727 (delta 130908), reused 170009 (delta 128151)
Receiving objects: 100% (173727/173727), 33.45 MiB | 13.69 MiB/s,
done.
Resolving deltas: 100% (130908/130908), done.
fatal: did not receive expected object a74380c3117994efba501df1707418eb6feb9284
fatal: index-pack failed

where a74380c... is such an object.

If you have a clone of https://github.com/git/git.git

$ git rev-list --parents --objects --all | grep a74380c3117994

would be an easy way to reproduce.
Junio C Hamano
2014-10-16 21:10:56 UTC
Permalink
Post by Junio C Hamano
Somewhere in this series the object enumeration seems to get
broken. "git clone --no-local" of git.git repository died
with
Cloning into 'victim-7'...
remote: Counting objects: 173727, done.
remote: Compressing objects: 100% (43697/43697), done.
remote: Total 173727 (delta 130908), reused 170009 (delta 128151)
Receiving objects: 100% (173727/173727), 33.45 MiB | 13.69 MiB/s,
done.
Resolving deltas: 100% (130908/130908), done.
fatal: did not receive expected object a74380c3117994efba501df1707418eb6feb9284
fatal: index-pack failed
where a74380c... is such an object.
Ehh, "such" is a nonsense. It is a blob that directly is pointed by
a tag that is in refs/tags/*.
Post by Junio C Hamano
If you have a clone of https://github.com/git/git.git
$ git rev-list --parents --objects --all | grep a74380c3117994
would be an easy way to reproduce.
Jeff King
2014-10-16 21:21:13 UTC
Permalink
Post by Junio C Hamano
Somewhere in this series the object enumeration seems to get
broken. "git clone --no-local" of git.git repository died
with
Cloning into 'victim-7'...
remote: Counting objects: 173727, done.
remote: Compressing objects: 100% (43697/43697), done.
remote: Total 173727 (delta 130908), reused 170009 (delta 128151)
Receiving objects: 100% (173727/173727), 33.45 MiB | 13.69 MiB/s,
done.
Resolving deltas: 100% (130908/130908), done.
fatal: did not receive expected object a74380c3117994efba501df1707418eb6feb9284
fatal: index-pack failed
where a74380c... is such an object.
If you have a clone of https://github.com/git/git.git
Hrm. I cannot reproduce the clone failure...
Post by Junio C Hamano
$ git rev-list --parents --objects --all | grep a74380c3117994
would be an easy way to reproduce.
But this I can. Digging into it...

-Peff
Jeff King
2014-10-16 21:39:18 UTC
Permalink
Post by Jeff King
Hrm. I cannot reproduce the clone failure...
Oh, because I have bitmaps turned on, and we do not run the list-objects
code at all in that case.

The bug unsurprisingly bisects to "traverse_commit_list: support
pending blobs/trees with paths". The problem is that I didn't notice
that handle_commit actually rewrites the "object" pointer when
unwrapping the tags, and that commit reuses the original pointer from
the entry. So we end up putting two copies of the tag object into the
pending list, rather than the tag and the blob.

The fix is:

diff --git a/revision.c b/revision.c
index 9a0f99a..8030fc8 100644
--- a/revision.c
+++ b/revision.c
@@ -231,12 +231,6 @@ static void add_pending_object_with_mode(struct rev_info *revs,
add_pending_object_with_path(revs, obj, name, mode, NULL);
}

-static void add_pending_object_from_entry(struct rev_info *revs,
- struct object_array_entry *e)
-{
- add_pending_object_with_path(revs, e->item, e->name, e->mode, e->path);
-}
-
void add_pending_object(struct rev_info *revs,
struct object *obj, const char *name)
{
@@ -283,6 +277,8 @@ static struct commit *handle_commit(struct rev_info *revs,
{
struct object *object = entry->item;
const char *name = entry->name;
+ const char *path = entry->path;
+ unsigned int mode = entry->mode;
unsigned long flags = object->flags;

/*
@@ -301,6 +297,14 @@ static struct commit *handle_commit(struct rev_info *revs,
die("bad object %s", sha1_to_hex(tag->tagged->sha1));
}
object->flags |= flags;
+ /*
+ * We'll handle the tagged object by looping or dropping
+ * through to the non-tag handlers below. Do not
+ * propagate data from the tag's pending entry.
+ */
+ name = NULL;
+ path = NULL;
+ mode = 0;
}

/*
@@ -332,7 +336,7 @@ static struct commit *handle_commit(struct rev_info *revs,
mark_tree_contents_uninteresting(tree);
return NULL;
}
- add_pending_object_from_entry(revs, entry);
+ add_pending_object_with_path(revs, object, name, mode, path);
return NULL;
}

@@ -344,7 +348,7 @@ static struct commit *handle_commit(struct rev_info *revs,
return NULL;
if (flags & UNINTERESTING)
return NULL;
- add_pending_object_from_entry(revs, entry);
+ add_pending_object_with_path(revs, object, name, mode, path);
return NULL;
}
die("%s is unknown object", name);

We should probably add a test for cloning a tag of an otherwise
unreferenced object, too.

-Peff
Junio C Hamano
2014-10-16 22:18:34 UTC
Permalink
Post by Jeff King
Post by Jeff King
Hrm. I cannot reproduce the clone failure...
Oh, because I have bitmaps turned on, and we do not run the list-objects
code at all in that case.
;-)
Post by Jeff King
We should probably add a test for cloning a tag of an otherwise
unreferenced object, too.
Yeah, that too ;-)

Thanks for a quick diag.
Jeff King
2014-10-17 00:03:41 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
We should probably add a test for cloning a tag of an otherwise
unreferenced object, too.
Yeah, that too ;-)
Here's that test (after the scissors below). It can be applied totally
separately, though I stuck it in as patch "18.5/25" in the existing
series to confirm that my original series does cause the test to fail,
and that it passes with the patch I posted earlier.

Do you want to just squash the fix I posted earlier (into patch 19, the
"traverse_commit_list: ..." one)? I'm happy to repost the revised patch,
but my impression of your workflow is that squashing is just as easy
than replacing a patch (i.e., you're running "rebase -i" either way).

Or I'm happy to re-post the whole series, but it's rather long. :)
Post by Junio C Hamano
Thanks for a quick diag.
I'm impressed that you found the bug so quickly. :) My biggest fear with
the whole series is that it's disrupting and refactoring some
fundamental parts of the code that might cause regressions. I put a lot
of my faith in the test suite, but that did not work out here (I do
still feel pretty good about the series overall, though I think I'd cook
it longer than usual given the complexity and the areas it touches).

-- >8 --
Subject: t5516: test pushing a tag of an otherwise unreferenced blob

It's not unreasonable to have a tag that points to a blob
that is not part of the normal history. We do this in
git.git to distribute gpg keys. However, we never explicitly
checked in our test suite that this actually works (i.e.,
that pack-objects actually sends the blob because of the tag
mentioning it).

It does in fact work fine, but a recent patch under
discussion broke this, and the test suite didn't notice.
Let's make the test suite more complete.

Signed-off-by: Jeff King <***@peff.net>
---
The "should have" below is belt-and-suspenders. The test actually fails
with my series without the cat-file check, but I'm concerned a bug
that affects pack-objects could also affect the connectivity check on
the receiving side.

t/t5516-fetch-push.sh | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index 67e0ab3..7c8a769 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1277,4 +1277,17 @@ EOF
git push --no-thin --receive-pack="$rcvpck" no-thin/.git refs/heads/master:refs/heads/foo
'

+test_expect_success 'pushing a tag pushes the tagged object' '
+ rm -rf dst.git &&
+ blob=$(echo unreferenced | git hash-object -w --stdin) &&
+ git tag -m foo tag-of-blob $blob &&
+ git init --bare dst.git &&
+ git push dst.git tag-of-blob &&
+ # the receiving index-pack should have noticed
+ # any problems, but we double check
+ echo unreferenced >expect &&
+ git --git-dir=dst.git cat-file blob tag-of-blob >actual &&
+ test_cmp expect actual
+'
+
test_done
--
2.1.2.596.g7379948
Jeff King
2014-10-17 04:49:11 UTC
Permalink
Just being curious, but would the same bug, if allowed to be triggered
cutting repacking of your repository, have corrupted the resulting bitmap?
I didn't test, but yes, almost certainly. The bug was in list-objects.c,
which is used by pack-objects to generate the list of objects to pack,
as well as to build the bitmaps. So not only would it have corrupted the
bitmaps, a `git repack -ad`[1] would have dropped the object completely,
corrupting the repository!

-Peff

[1] git-gc uses `repack -Ad`, of course. So assuming you had packed more
recently than 2 weeks ago, it would have just been ejected to a
loose object. Small comfort. :)

-Peff
Jeff King
2014-10-18 12:31:06 UTC
Permalink
Post by Jeff King
@@ -301,6 +297,14 @@ static struct commit *handle_commit(struct rev_info *revs,
die("bad object %s", sha1_to_hex(tag->tagged->sha1));
}
object->flags |= flags;
+ /*
+ * We'll handle the tagged object by looping or dropping
+ * through to the non-tag handlers below. Do not
+ * propagate data from the tag's pending entry.
+ */
+ name = NULL;
+ path = NULL;
+ mode = 0;
Hmm. On second thought (and after seeing a warning from Coverity), this
should be:

diff --git a/revision.c b/revision.c
index 8030fc8..ebe3e93 100644
--- a/revision.c
+++ b/revision.c
@@ -302,7 +302,7 @@ static struct commit *handle_commit(struct rev_info *revs,
* through to the non-tag handlers below. Do not
* propagate data from the tag's pending entry.
*/
- name = NULL;
+ name = "";
path = NULL;
mode = 0;
}

The rest of the function assumes that name is not NULL (which I'm not
sure is entirely safe, as add_pending_object can take a NULL; presumably
every "add" uses the empty string instead of NULL. But either way,
setting it to NULL here is definite wrong).

The "path" field is explicitly OK to be NULL.

-Peff

Continue reading on narkive:
Loading...