[RFC/PATCH] fsck: do not canonicalize modes in trees we are checking

Discussion:

Jeff King

2014-09-23 15:47:51 UTC

Doing so means that we do not actually get to see bogus
modes; they are converted into one of our known-good modes
by decode_tree_entry. We want to see the raw data so that we
can complain about it.

Signed-off-by: Jeff King <***@peff.net>
---
As far as I can tell, fsck's mode-checking has been totally broken
basically forever. Which makes me a little nervous to fix it. :)
linux.git does have some bogus modes, but they are 100664, which is
specifically ignored here unless "fsck --strict" is in effect.

The implementation feels a bit hacky. The global is ugly, but it avoids
having to pass options all through the call chain of init_tree_entry,
update_tree_entry, etc.

Another option would be to have decode_tree_entry leave the raw mode in
the tree_desc, and only canonicalize it during tree_entry_extract (and
teach fsck to look at the raw data, not _extract). That would mean
reverting 7146e66 (tree-walk: finally switch over tree descriptors to
contain a pre-parsed entry, 2014-02-06). That commit says there's no
real benefit at the time, but that future code might benefit. I don't
know if that future code ever materialized (Kirill cc'd).

I thought at first that commit might have been responsible for this
breakage, but I don't think so. The canonicalization has happened in
tree_entry_extract since at least 2006, and we have always called that
function in fsck.

fsck.c | 2 ++
t/t1450-fsck.sh | 21 +++++++++++++++++++++
tree-walk.c | 4 +++-
tree-walk.h | 3 +++
4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/fsck.c b/fsck.c
index 56156ff..ef79396 100644
--- a/fsck.c
+++ b/fsck.c
@@ -153,6 +153,7 @@ static int fsck_tree(struct tree *item, int strict, fsck_error error_func)
unsigned o_mode;
const char *o_name;

+ decode_tree_raw = 1;
init_tree_desc(&desc, item->buffer, item->size);

o_mode = 0;
@@ -213,6 +214,7 @@ static int fsck_tree(struct tree *item, int strict, fsck_error error_func)
o_name = name;
}

+ decode_tree_raw = 0;
retval = 0;
if (has_null_sha1)
retval += error_func(&item->object, FSCK_WARN, "contains entries pointing to null sha1");
diff --git a/t/t1450-fsck.sh b/t/t1450-fsck.sh
index b52397a..ba40c6f 100755
--- a/t/t1450-fsck.sh
+++ b/t/t1450-fsck.sh
@@ -176,6 +176,27 @@ test_expect_success 'malformatted tree object' '
grep "error in tree .*contains duplicate file entries" out
'

+# Basically an 8-bit clean version of sed 's/$1/$2/g'.
+munge_tree_bytes () {
+ perl -e '
+ binmode(STDIN); binmode(STDOUT);
+ $_ = do { local $/; <STDIN> };
+ s/$ARGV[0]/$ARGV[1]/g;
+ print
+ ' "$@"
+}
+
+test_expect_success 'bogus mode in tree' '
+ test_when_finished "remove_object \$T" &&
+ T=$(
+ git cat-file tree HEAD >good &&
+ munge_tree_bytes 100644 123456 <good >bad &&
+ git hash-object -w -t tree bad
+ ) &&
+ git fsck 2>out &&
+ grep "warning.*contains bad file modes" out
+'
+
test_expect_success 'tag pointing to nonexistent' '
cat >invalid-tag <<-\EOF &&
object ffffffffffffffffffffffffffffffffffffffff
diff --git a/tree-walk.c b/tree-walk.c
index 5dd9a71..baca766 100644
--- a/tree-walk.c
+++ b/tree-walk.c
@@ -5,6 +5,8 @@
#include "tree.h"
#include "pathspec.h"

+int decode_tree_raw;
+
static const char *get_mode(const char *str, unsigned int *modep)
{
unsigned char c;
@@ -37,7 +39,7 @@ static void decode_tree_entry(struct tree_desc *desc, const char *buf, unsigned

/* Initialize the descriptor entry */
desc->entry.path = path;
- desc->entry.mode = canon_mode(mode);
+ desc->entry.mode = decode_tree_raw ? mode : canon_mode(mode);
desc->entry.sha1 = (const unsigned char *)(path + len);
}

diff --git a/tree-walk.h b/tree-walk.h
index ae7fb3a..9f78e85 100644
--- a/tree-walk.h
+++ b/tree-walk.h
@@ -25,6 +25,9 @@ static inline int tree_entry_len(const struct name_entry *ne)
return (const char *)ne->sha1 - ne->path - 1;
}

+/* If set, do not canonicalize modes in tree entries. */
+extern int decode_tree_raw;
+
void update_tree_entry(struct tree_desc *);
void init_tree_desc(struct tree_desc *desc, const void *buf, unsigned long size);

--
2.1.1.496.g1ba8081

Edward Thomson

2014-09-23 16:23:43 UTC

Permalink

Post by Jeff King
As far as I can tell, fsck's mode-checking has been totally broken
basically forever. Which makes me a little nervous to fix it. :)
linux.git does have some bogus modes, but they are 100664, which is
specifically ignored here unless "fsck --strict" is in effect.

I'm in favor of checking the mode in fsck, at least when --strict.
But I would suggest we be lax (by default) about other likely-to-exist
but strictly invalid modes to prevent peoples previously workable
repositories from being now broken.

I have, for example, encountered 100775 in the wild, and would argue that
like 100644, it should probably not fail unless we are in --strict mode.

Thanks-
-ed

Jeff King

2014-09-23 16:30:09 UTC

Permalink

[-cc Kirill, as his address seem out-of-date]

Post by Edward Thomson

I'm in favor of checking the mode in fsck, at least when --strict.
But I would suggest we be lax (by default) about other likely-to-exist
but strictly invalid modes to prevent peoples previously workable
repositories from being now broken.
I have, for example, encountered 100775 in the wild, and would argue that
like 100644, it should probably not fail unless we are in --strict mode.

Yeah, I'd agree with that. The big question is: what breakage have we
seen in the wild? :)

I think treating 100775 the same as 100664 makes sense (want to do a
patch?). Do we know of any others? I guess we can collect them as time
goes on and reports come in. That's not the nicest thing for people with
such repos, but then again, their repos _are_ broken (and it's only
really a showstopper if they are trying to push to somebody with
receive.fsckObjects turned on).

-Peff

Ben Aveling

2014-10-12 22:37:27 UTC

Permalink

Hi,

A question about fsck - is there a reason it doesn't have an option to
delete bad objects?

Regards, Ben

Post by Jeff King
[-cc Kirill, as his address seem out-of-date]

Post by Edward Thomson

I'm in favor of checking the mode in fsck, at least when --strict.
But I would suggest we be lax (by default) about other likely-to-exist
but strictly invalid modes to prevent peoples previously workable
repositories from being now broken.
I have, for example, encountered 100775 in the wild, and would argue that
like 100644, it should probably not fail unless we are in --strict mode.

Yeah, I'd agree with that. The big question is: what breakage have we
seen in the wild? :)
I think treating 100775 the same as 100664 makes sense (want to do a
patch?). Do we know of any others? I guess we can collect them as time
goes on and reports come in. That's not the nicest thing for people with
such repos, but then again, their repos _are_ broken (and it's only
really a showstopper if they are trying to push to somebody with
receive.fsckObjects turned on).
-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
More majordomo info at http://vger.kernel.org/majordomo-info.html

Jeff King

2014-10-14 08:21:21 UTC

Permalink

Post by Ben Aveling
A question about fsck - is there a reason it doesn't have an option to
delete bad objects?

If the objects are reachable, then deleting them would create other big
problems (i.e., we would be breaking the object graph!).

If they are not, then it is probably safest for them to go away via the
normal means (repack/prune via "git gc"). Deleting via fsck would mean
replicating the reachability and deletion code found elsewhere.

-Peff

Jeff King

2014-10-16 00:20:21 UTC

Permalink

Post by Jeff King

Post by Ben Aveling
A question about fsck - is there a reason it doesn't have an option to
delete bad objects?

If the objects are reachable, then deleting them would create other big
problems (i.e., we would be breaking the object graph!).

"Any corrupt objects you will have to find in backups or other
archives (i.e., you can just remove them and do an /rsync/ with some
other site in the hopes that somebody else has the object you have
corrupted)."

I think there are really two different types of corruption worth
considering here.

One is where the data bytes of the object were correct at one point, but
now aren't anymore. The sha1 does not check out, and the object is
broken. You need to go find another copy of the object somewhere else.

The other is where the object was malformed in the first place,
generally due to a software bug (e.g., syntactically bogus email
addresses on committer lines, trees that are not sorted properly, etc).
There is no point in finding another object, as any you find will be
byte-for-byte identical.

For this second type (which is what I thought we were talking about in
this thread), deleting the object would be Very Bad.

For the first type, deleting the object _could_ be a path forward. But
it's not the usual method for fixing corruption. If you copy a good
version of the object into the repository, then a subsequent repack
should throw away the broken copy in favor of the good one. Note that
you need to copy the objects in manually; you can't use the git protocol
to do so. But that is true whether you have the broken object or not
(because git does not communicate with the other side about _every_
object you have; it talks about commits, and assumes that you have the
rest of the object graph that a commit refers to).

So I don't think deleting really helps in that case. And it may hurt if
it later turns out you don't have another copy of the object and need to
recover what you can from the corrupted pieces (which has actually
happened before).

I ask because I have a corrupt repository, and every time I run fsck, it
reports one corrupt object, then stops.

Corrupt how? Bit-corruption, or a malformed object?

I could write a script to repeatedly call fsck and then remove the
next corrupt object, but it raises the question for me; could it make
sense to extend fsck with the option to do to the removes? Or even
better, do the removes and then do the necessary [r]sync, assuming the
user has another repository that has a good copy of the bad objects,
which in this case I do.

If you have a non-corrupted version of the repository, the simplest
thing is to just pack it, copy the resulting packfile to the broken
repository (again, using "cp" or "rsync" and not git), and then repack
there. That won't transfer the minimum amount of data, but it's simple
and only requires one round-trip (bearing in mind that git doesn't know
what out-of-band method you're using, nor whether we can even initiate a
request from the broken repo to the good one, so each round trip
generally does need to be done by hand).

-Peff

Ben Aveling

2014-10-19 12:40:22 UTC

Permalink

I have a corrupt repository, and every time I run fsck, it
reports one corrupt object, then stops.
Corrupt how? Bit-corruption, or a malformed object?

Bit-corruption, in multiple places.

This seems to have worked. I also had to move away the existing .idx and
copy in a new one before it was happy.

I'm not sure that what I've done is so different from simply copying the
other version of the repository - there shouldn't have been anything in
the corrupt version that wasn't also in the good one. But any rate, it
worked.

Thanks, Ben

Jeff King

2014-10-20 09:21:57 UTC

Permalink

Post by Ben Aveling
This seems to have worked. I also had to move away the existing .idx and
copy in a new one before it was happy.

Sorry if I wasn't clear; you do need to copy the .idx files along with
the packfiles (you can regenerate the .idx files from the packfiles on
the destination, but they're not that big; it's probably easier just to
copy them).

Post by Ben Aveling
I'm not sure that what I've done is so different from simply copying the
other version of the repository - there shouldn't have been anything in the
corrupt version that wasn't also in the good one. But any rate, it worked.

Right, a valid technique for repairing corruption is to just blow away
the original repo entirely and replace it with a good copy. But this is a
way of ensuring that no commits are missed, and keeping the original set
of refs, config options, and reflogs.

-Peff