Discussion:
fsck option to remove corrupt objects - why/why not?
Ben Aveling
2014-10-16 00:13:34 UTC
A question about fsck - is there a reason it doesn't have an option to
delete bad objects?
If the objects are reachable, then deleting them would create other big
problems (i.e., we would be breaking the object graph!).
The man page for fsck advises:

"Any corrupt objects you will have to find in backups or other
archives (i.e., you can just remove them and do an rsync with some
other site in the hopes that somebody else has the object you have
corrupted)."


And that seems sensible to me - the object is corrupt, it is unusable,
and the object graph is already broken; we already have big problems.
Removing the corrupt object(s) doesn't create any new problems, and it
leaves open the possibility that the damaged objects can be restored.

I ask because I have a corrupt repository, and every time I run fsck, it
reports one corrupt object, then stops. I could write a script to
repeatedly call fsck and then remove the next corrupt object, but it
raises the question for me: could it make sense to extend fsck with an
option to do the removes? Or, even better, to do the removes and then do
the necessary [r]sync, assuming the user has another repository that has
a good copy of the bad objects - which in this case I do.
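
The script I have in mind would be something like the sketch below. It
assumes the corrupt objects are loose, and that the first object name
fsck complains about is the corrupt one - both of which may well not
hold for other people's repos:

  #!/bin/sh
  # Repeatedly run fsck; each time it stops at a corrupt object,
  # move that object out of the way and try again.
  mkdir -p /tmp/corrupt
  while true
  do
      # pick the first object name out of fsck's complaints
      sha=$(git fsck --full 2>&1 | grep -o '[0-9a-f]\{40\}' | head -n 1)
      [ -n "$sha" ] || break    # fsck is happy, we are done
      mv ".git/objects/$(echo "$sha" | cut -c1-2)/$(echo "$sha" | cut -c3-)" \
         /tmp/corrupt/ || break
  done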

Regards, Ben
Johan Herland
2014-10-16 09:04:04 UTC
Post by Ben Aveling
A question about fsck - is there a reason it doesn't have an option to
delete bad objects?
If the objects are reachable, then deleting them would create other big
problems (i.e., we would be breaking the object graph!).
"Any corrupt objects you will have to find in backups or other
archives (i.e., you can just remove them and do an rsync with some
other site in the hopes that somebody else has the object you have
corrupted)."
And that seems sensible to me - the object is corrupt, it is unusable,
and the object graph is already broken; we already have big problems.
Removing the corrupt object(s) doesn't create any new problems, and it
leaves open the possibility that the damaged objects can be restored.
I ask because I have a corrupt repository, and every time I run fsck, it
reports one corrupt object, then stops. I could write a script to repeatedly
call fsck and then remove the next corrupt object, but it raises the
question for me: could it make sense to extend fsck with an option to do
the removes?
I am positive about this idea. Yesterday a colleague of mine came to me
with a repo containing a single corrupt object (in a 1.2GB packfile).
We were lucky, in that we had a copy of the repo containing a good copy
of the same object. But we were lucky in a couple of other respects as
well:

I simply copied the packfile containing the good copy into the
corrupted repo, and then ran a "git gc", which "happened" to use the
good copy of the corrupted object and complete successfully (instead
of barfing on the bad copy). The GC then removed the old
(now-obsolete) packfiles, and thus the corruption was gone.
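
Concretely, the whole dance was roughly the following (paths and pack
names here are placeholders, not the real ones):

  # Copy the pack (and its matching .idx) that holds a good copy of
  # the object from the healthy clone into the corrupted repo.
  cp /path/to/good/.git/objects/pack/pack-<hash>.pack \
     /path/to/good/.git/objects/pack/pack-<hash>.idx \
     /path/to/corrupt/.git/objects/pack/

  # Repack; gc ended up using the good copy and dropped the old,
  # now-redundant packs together with the bad copy.
  cd /path/to/corrupt
  git gc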

However, exactly _why_ git happened to prefer the good copy in my
copied packfile instead of the bad copy in the existing packfile, I do
not know. I suspect some amount of pure luck was involved. Indeed, I
feared I would have to explode the corrupt pack, then manually replace
the (now-loose) bad copy with a good copy (from a similarly exploded
pristine pack), and then finally repack everything again. That said,
I'm not at all sure that Git would be able to successfully explode a
pack containing corrupt objects...
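
For the record, the fallback I feared would have looked roughly like
this (untested, so take it as a sketch only; "unpack-objects -r" is
supposed to make a best effort on corrupt packs, but I don't know how
well it copes in practice):

  # Move the corrupt pack out of the object store so its objects no
  # longer count as already present.
  cd /path/to/corrupt
  mv .git/objects/pack/pack-<hash>.pack /tmp/bad.pack
  mv .git/objects/pack/pack-<hash>.idx /tmp/bad.idx

  # Explode the pack into loose objects; -r carries on past bad
  # objects instead of dying.
  git unpack-objects -r < /tmp/bad.pack

  # Overwrite the (now-loose) bad object with a good loose copy from
  # a pristine clone, then repack everything into a fresh pack.
  cp /pristine/.git/objects/<aa>/<rest> .git/objects/<aa>/
  git repack -a -d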

I think a better solution would be to tell fsck to remove the corrupt
object(s), as you suggest above, and then copy in the good pack. In
that case, there would be no question that the good copy would be used
in the subsequent GC.
Post by Ben Aveling
Or even better, do the removes and then do the necessary
[r]sync, assuming the user has another repository that has a good copy of
the bad objects, which in this case I do.
Hmm. I am not sure we want to automate the syncing step. First, git
cannot know _which_ remote is likely to have a good copy of the bad
object. Second, we do not necessarily know what caused the corruption
in the first place, or whether syncing with a remote (which will
create a certain amount of write activity on a possibly dying disk
drive) is a good idea at all. Finally, this syncing step would have to
bypass Git's usual reachability analysis (which will happily skip
fetching a corrupt blob in otherwise-reachable history, since we appear
to have it already), and is thus more involved than simply calling out
to "git fetch"...


...Johan
--
Johan Herland, <***@herland.net>
www.herland.net
Jeff King
2014-10-16 12:25:33 UTC
Post by Johan Herland
I simply copied the packfile containing the good copy into the
corrupted repo, and then ran a "git gc", which "happened" to use the
good copy of the corrupted object and complete successfully (instead
of barfing on the bad copy). The GC then removed the old
(now-obsolete) packfiles, and thus the corruption was gone.
However, exactly _why_ git happened to prefer the good copy in my
copied packfile instead of the bad copy in the existing packfile, I do
not know. I suspect some amount of pure luck was involved.
I'm not sure that it is luck, but more like 8eca0b4 (implement some
resilience against pack corruptions, 2008-06-23) working as intended[1].
Generally, git should be able to warn about corrupted objects and look
in other packs for them (both for regular operations, and for
repacking).
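
If you want to check for this explicitly rather than wait for a repack
to stumble over it, something like the following should report bad
objects without touching them:

  # Verify every pack in the repository; corruption is reported
  # per object instead of being silently used.
  git verify-pack .git/objects/pack/pack-*.idx

  # Or check the whole object database, loose objects included.
  git fsck --full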

-Peff

[1] That's just one of the many commits dealing with this. Try running
"git log --author=Nicolas.Pitre --grep=corrupt" for more. :)
Johan Herland
2014-10-16 12:48:15 UTC
Post by Jeff King
Post by Johan Herland
I simply copied the packfile containing the good copy into the
corrupted repo, and then ran a "git gc", which "happened" to use the
good copy of the corrupted object and complete successfully (instead
of barfing on the bad copy). The GC then removed the old
(now-obsolete) packfiles, and thus the corruption was gone.
However, exactly _why_ git happened to prefer the good copy in my
copied packfile instead of the bad copy in the existing packfile, I do
not know. I suspect some amount of pure luck was involved.
I'm not sure that it is luck, but more like 8eca0b4 (implement some
resilience against pack corruptions, 2008-06-23) working as intended[1].
Generally, git should be able to warn about corrupted objects and look
in other packs for them (both for regular operations, and for
repacking).
-Peff
[1] That's just one of the many commits dealing with this. Try running
"git log --author=Nicolas.Pitre --grep=corrupt" for more. :)
Indeed, from reading the logs, it seems that what I assumed was a
stroke of luck was actually carefully designed behavior. With that in mind,
I'm no longer so sure that fsck actually needs an option to remove
corrupt objects. Instead, it's probably better to leave the corrupt
object in place until a good copy can be located and copied into the
repo, at which point Nicolas' brilliant work will make sure a simple
repack takes care of fixing the corruption.

That said, we should consider documenting this strategy for fixing
corruption (a rough sketch of the trickier steps follows below):
- Locate a good copy of the affected objects in another repo
- Copy the relevant pack file or loose object into this repo
- Run "git gc"
- Profit!
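
For the "locate" step, something along these lines works (<sha1> being
the object fsck complained about; the paths are made up):

  # In each candidate clone, check that the object is present and
  # fully readable; cat-file fails loudly on a corrupt copy.
  cd /path/to/other-repo
  git cat-file -p <sha1> >/dev/null && echo "good copy here"

  # Then copy the pack (or loose object) containing it into the
  # broken repo and let gc do the rest.
  cp .git/objects/pack/pack-<hash>.pack \
     .git/objects/pack/pack-<hash>.idx \
     /path/to/corrupt/.git/objects/pack/
  (cd /path/to/corrupt && git gc)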

...Johan
--
Johan Herland, <***@herland.net>
www.herland.net
Junio C Hamano
2014-10-16 16:36:34 UTC
Post by Johan Herland
I simply copied the packfile containing the good copy into the
corrupted repo, and then ran a "git gc", which "happened" to use the
good copy of the corrupted object and complete successfully (instead
of barfing on the bad copy). The GC then removed the old
(now-obsolete) packfiles, and thus the corruption was gone.
However, exactly _why_ git happened to prefer the good copy in my
copied packfile instead of the bad copy in the existing packfile, I do
not know.
By design ;-)

Matthieu Moy
2014-10-16 11:59:13 UTC
Post by Ben Aveling
And that seems sensible to me - the object is corrupt, it is unusable,
and the object graph is already broken; we already have big problems.
Removing the corrupt object(s) doesn't create any new problems, and it
leaves open the possibility that the damaged objects can be restored.
Removing them completely may destroy any chance of restoring the
corrupt object (rather unlikely, but I can imagine fine binary-file
surgery to un-break a broken object file).

But we could move them out of Git's object directory (a bit like
.git/lost-found, we could have .git/corrupt). For unpacked objects,
this is trivial (just mv them into that directory); for packed objects,
I don't know what happens when they are corrupt. That would solve
essentially any problem that you can solve by removing the file, while
keeping the operation reversible.
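
For a loose object, quarantining really is trivial (<aa>/<rest> being
the usual two-character fan-out split of the object name):

  # Quarantine a corrupt loose object instead of deleting it.
  mkdir -p .git/corrupt
  mv .git/objects/<aa>/<rest> .git/corrupt/
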
--
Matthieu Moy
http://www-verimag.imag.fr/~moy/