Discussion:
How should I handle binary file with GIT
moreau francis
2006-04-05 07:30:22 UTC
Hi,

I'd like to use git to keep track of a documentation repository. This repo is
mainly composed of text files, and the documentation is generated by asciidoc.
People who use my repo and update docs, possibly adding new images,
can't send their whole work through patches.

For now they only send me the text updates through patch and attach new images
with the patch email. Then I do:

$ git am < text_only_patch
$ git reset --soft HEAD^
$ git add <new images>
$ git commit -a -C ORIG_HEAD

Now my question: is this the best way to handle this process?

Thanks for your answers.

Francis
Junio C Hamano
2006-04-05 08:14:16 UTC
Post by moreau francis
For now they only send me the text updates through patch and attach new images
with the patch email. Then I do:
$ git am < text_only_patch
$ git reset --soft HEAD^
$ git add <new images>
$ git commit -a -C ORIG_HEAD
Now my question: is it the best way to achieve this process ?
If I were doing that today, I would be doing almost exactly the
above sequence, or:

$ git am patch
$ git add <new images>
$ git commit -a --amend

It _might_ make sense to adopt a well-defined binary patch
format (or if there is no prior art, introduce our own) and
support that format with both git-diff-* brothers and git-apply,
but that would be a bit longer term project.
moreau francis
2006-04-05 12:21:13 UTC
Post by Junio C Hamano
It _might_ make sense to adopt a well-defined binary patch
format (or if there is no prior art, introduce our own) and
support that format with both git-diff-* brothers and git-apply,
but that would be a bit longer term project.
Well, maybe it's just a stupid idea, but why not simply transform binary files
into ASCII files (maybe by using uuencode) before using the git-diff-* brothers
and git-apply?

Francis
Nicolas Pitre
2006-04-05 13:25:42 UTC
Post by moreau francis
Post by Junio C Hamano
It _might_ make sense to adopt a well-defined binary patch
format (or if there is no prior art, introduce our own) and
support that format with both git-diff-* brothers and git-apply,
but that would be a bit longer term project.
well maybe it's just stupid, but why not simply transforming binary files into
ascii files (maybe by using uuencode) before using git-diff-* brothers and
git-apply ?
Imagine if the only difference between two versions of the same file is
a single byte inserted at the very beginning. The uuencode would then
be totally different between the two files.
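
A minimal sketch of that effect in Python, assuming the classic uuencode layout of 45 input bytes per output line. The sample data and the helper names are made up for illustration:

```python
import binascii

def uuencode_lines(data: bytes) -> list:
    """Encode data the way uuencode does: 45 input bytes per output line."""
    return [binascii.b2a_uu(data[i:i + 45]) for i in range(0, len(data), 45)]

def new_lines(old: bytes, new: bytes) -> int:
    """Count encoded lines of `new` that appear nowhere in the encoding of `old`."""
    known = set(uuencode_lines(old))
    return sum(1 for line in uuencode_lines(new) if line not in known)

# 450 bytes of non-repeating sample data: exactly 10 full uuencode lines.
original = bytes((7 * i + 3) % 251 for i in range(450))

print(new_lines(original, b"\x00" + original))  # byte inserted at the start: 11
print(new_lines(original, original + b"\x00"))  # byte appended at the end: 1
```

With the byte inserted at the front, every 3-byte group shifts, so no encoded line survives and a line-based diff of the two encodings is as large as the file. Appending the byte only adds one line.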


Nicolas
moreau francis
2006-04-05 13:35:44 UTC
Post by Nicolas Pitre
Post by moreau francis
well maybe it's just stupid, but why not simply transforming binary files into
ascii files (maybe by using uuencode) before using git-diff-* brothers and
git-apply ?
Imagine if the only difference between two versions of the same file is
a single byte inserted at the very beginning. The uuencode would then
be totally different between the two files.
ok uuencode was just a bad example for encoding...

Francis
Nicolas Pitre
2006-04-05 13:06:48 UTC
Post by Junio C Hamano
It _might_ make sense to adopt a well-defined binary patch
format (or if there is no prior art, introduce our own) and
support that format with both git-diff-* brothers and git-apply,
but that would be a bit longer term project.
What about simply using diff-delta and encoding its output with base64?


Nicolas
moreau francis
2006-04-05 13:18:34 UTC
Post by Junio C Hamano
If I were doing that today, I would be doing almost exactly the
above sequence, or:
$ git am patch
$ git add <new images>
$ git commit -a --amend
BTW, what does the "--amend" option do? It doesn't seem to be documented anywhere.

Francis
Marco Roeland
2006-04-05 19:23:21 UTC
Post by moreau francis
BTW, what does "--amend" option do ? It doesn't seem to be documented anywhere.
This is the original commit text that introduced it:

diff-tree b4019f045646b1770a80394da876b8a7c6b8ca7b (from d320a5437f8304cf9ea3ee1898e49d643e005738)
Author: Junio C Hamano <***@cox.net>
Date: Thu Mar 2 21:04:05 2006 -0800

git-commit --amend

The new flag is used to amend the tip of the current branch. Prepare
the tree object you would want to replace the latest commit as usual
(this includes the usual -i/-o and explicit paths), and the commit log
editor is seeded with the commit message from the tip of the current
branch. The commit you create replaces the current tip -- if it was a
merge, it will have the parents of the current tip as parents -- so the
current top commit is discarded.

It is a rough equivalent for:

$ git reset --soft HEAD^
$ ... do something else to come up with the right tree ...
$ git commit -c ORIG_HEAD

but can be used to amend a merge commit.

Signed-off-by: Junio C Hamano <***@cox.net>

So in the original context you can add separate binaries to a commit
of only text files that you just rescued from CVS or something and then
change the commit to include these binaries as well.

I've sent a separate patch for the documentation for git-commit using
Junio's clear explanation.
--
Marco Roeland
Jakub Narebski
2006-04-05 15:11:43 UTC
Post by Junio C Hamano
It _might_ make sense to adopt a well-defined binary patch
format (or if there is no prior art, introduce our own) and
support that format with both git-diff-* brothers and git-apply,
but that would be a bit longer term project.
bsdiff? http://www.daemonology.net/bsdiff/
EDelta? http://www.diku.dk/~jacobg/edelta/
Xdelta? http://xdelta.blogspot.com/

IIRC bsdiff is used by Firefox to distribute binary software updates.
Xdelta is generic (not optimized for binaries like bsdiff and edelta), but
supposedly offers worse compression (bigger diffs).
--
Jakub Narebski
Warsaw, Poland
Nicolas Pitre
2006-04-05 15:32:21 UTC
Post by Jakub Narebski
Post by Junio C Hamano
It _might_ make sense to adopt a well-defined binary patch
format (or if there is no prior art, introduce our own) and
support that format with both git-diff-* brothers and git-apply,
but that would be a bit longer term project.
bsdiff? http://www.daemonology.net/bsdiff/
EDelta? http://www.diku.dk/~jacobg/edelta/
Xdelta? http://xdelta.blogspot.com/
IIRC bsdiff is used by Firefox to distribute binary software updates.
Xdelta is generic (not optimized for binaries like bsdiff and edelta), but
supposedly offers worse compression (bigger diffs).
We already have our own delta code for pack storage.


Nicolas
Randal L. Schwartz
2006-04-05 15:37:10 UTC
Post by Jakub Narebski
IIRC bsdiff is used by Firefox to distribute binary software updates.
Xdelta is generic (not optimized for binaries like bsdiff and edelta), but
supposedly offers worse compression (bigger diffs).
Nicolas> We already have our own delta code for pack storage.

I think the issue is related to being able to cherry-pick and merge
when binaries are involved. I've been worried about that myself.
How well are binaries supported these days for all the operations
we're taking for granted? When is a "diff" expected to be a real
"diff" and not just "binary files differ"?
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<***@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Shawn Pearce
2006-04-05 15:55:28 UTC
Post by Randal L. Schwartz
Post by Jakub Narebski
IIRC bsdiff is used by Firefox to distribute binary software updates.
Xdelta is generic (not optimized for binaries like bsdiff and edelta), but
supposedly offers worse compression (bigger diffs).
Nicolas> We already have our own delta code for pack storage.
I think the issue is related to being able to cherry-pick and merge
when binaries are involved. I've been worried about that myself.
How well are binaries supported these days for all the operations
we're taking for granted? When is a "diff" expected to be a real
"diff" and not just "binary files differ"?
The clearly safe approach is to include the full SHA1 ID of the
old object the patch was created from and use the xdelta in the
patch only as a means of transporting a compressed form of the new
version of the object. If git-diff starts to export say a base 64
encoding of the xdelta then it should also include the full SHA1
ID for binary files, even if --full-index wasn't given.

git-apply should only apply an xdelta patch to the exact same
old object. If the tree currently has a different object at that
path then reject the patch entirely.
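
A rough sketch of that rule in Python. The blob naming (SHA-1 over a "blob <size>" header, a NUL byte and the content) is how Git names blobs. Everything else is an assumption for illustration: the patch record is a made-up dictionary, and the payload is the whole new file deflated and base64-encoded, standing in for a real xdelta:

```python
import base64
import hashlib
import zlib

def blob_sha1(data: bytes) -> str:
    """Git's object name for a blob: SHA-1 of 'blob <size>', a NUL byte, then the content."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

def make_patch(old: bytes, new: bytes) -> dict:
    # Hypothetical patch record: full object names of both versions plus a
    # mail-safe payload (deflated, base64-encoded new content, not a true delta).
    return {
        "old": blob_sha1(old),
        "new": blob_sha1(new),
        "payload": base64.b64encode(zlib.compress(new)).decode("ascii"),
    }

def apply_patch(current: bytes, patch: dict) -> bytes:
    # Refuse unless the file being patched is exactly the recorded old object.
    if blob_sha1(current) != patch["old"]:
        raise ValueError("preimage mismatch: patch rejected")
    result = zlib.decompress(base64.b64decode(patch["payload"]))
    # Also catch a payload that was damaged in transit.
    if blob_sha1(result) != patch["new"]:
        raise ValueError("postimage mismatch: payload corrupted")
    return result

old_image = b"\x89PNG old pixels"
new_image = b"\x89PNG new pixels"
patch = make_patch(old_image, new_image)
print(apply_patch(old_image, patch) == new_image)
```

Because the full object name is recorded, a file that differs by even one byte from the old version is rejected instead of being patched into something invalid.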

If a path has a different object than the one the patch was based on, then
we can do one of two things to be ``nice'' to the human:

- If the old blob exists in the repository (it just isn't the
current version at that path) then generate a temporary merge
file holding the old blob with the delta applied. The user can
then finish the merge with whatever tool understands that binary
file format, or do the merge by hand.

- Supply a ``do it anyway'' flag to git-apply. If this flag is
given on the command line then the binary file is patched even
though the object versions differ. For some binary file formats
this may actually be a valid thing to do. But it probably isn't
for a very large percentage of known file formats.

I could see some cases where it might be nice to be able to perform
specialized merge handling of binary files via hooks or filters.

For example *.tar.gz, *.zip, *.jar - these files are all just
compressed trees. They should be somewhat mergeable with the same
semantics as other trees in GIT. Of course one could just unpack
these into a directory and let GIT track the directory instead,
but this is rather inconvenient in a Java project. :-)

If I recall correctly OpenOffice document files are XML compressed
into ZIP archives. The XML *might* diff/patch cleanly as plain text.
The other resources in that archive are typically binary graphic
files and the like, which of course wouldn't diff/patch nicely.
But being able to diff/patch the main content might be semi-useful.
--
Shawn.
Nicolas Pitre
2006-04-05 16:25:29 UTC
Post by Shawn Pearce
The clearly safe approach is to include the full SHA1 ID of the
old object the patch was created from and use the xdelta in the
patch only as a means of transporting a compressed form of the new
version of the object. If git-diff starts to export say a base 64
encoding of the xdelta then it should also include the full SHA1
ID for binary files, even if --full-index wasn't given.
git-apply should only apply an xdelta patch to the exact same
old object. If the tree currently has a different object at that
path then reject the patch entirely.
Amen. Exactly what I just said.


Nicolas
Nicolas Pitre
2006-04-05 16:21:49 UTC
Post by Randal L. Schwartz
Post by Jakub Narebski
IIRC bsdiff is used by Firefox to distribute binary software updates.
Xdelta is generic (not optimized for binaries like bsdiff and edelta), but
supposedly offers worse compression (bigger diffs).
Nicolas> We already have our own delta code for pack storage.
I think the issue is related to being able to cherry-pick and merge
when binaries are involved. I've been worried about that myself.
How well are binaries supported these days for all the operations
we're taking for granted? When is a "diff" expected to be a real
"diff" and not just "binary files differ"?
First of all, is cherry-picking binary patches a sensible thing to
do?

Do you expect, say, a Word document, a JPEG image, or an MP3 file to
still be valid and error free if two binary patches modifying a
different part of the same file (same revision) are successively
applied? I seriously doubt it.

And what do you do with conflicts? Using diff3 might be sensible for
text data, but for binaries you really need a tool that understands the
type of data your binary contains, which means one tool for each
possible type of binary data which is outside the scope of GIT.

For example, if you patch a .wav file adding some data, then you end up
with the additional samples and a new length in the file header. If
another patch to that .wav is applied, then it is easy to find the
"surrounding context" where the second patch is adding/removing some
other samples, but then you really need knowledge about the .wav format
to handle the conflict that will occur on the .wav header modification.
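
The header conflict can be sketched with Python's standard wave module. The sample values and the 8-bit mono 8 kHz format are arbitrary choices for illustration:

```python
import io
import struct
import wave

def make_wav(samples: bytes) -> bytes:
    """Build a mono, 8-bit, 8 kHz WAV file in memory from raw sample bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(1)
        w.setframerate(8000)
        w.writeframes(samples)
    return buf.getvalue()

def data_size(wav: bytes) -> int:
    """Read the 32-bit little-endian length stored right after the 'data' tag."""
    pos = wav.index(b"data")
    return struct.unpack_from("<I", wav, pos + 4)[0]

base = make_wav(b"\x80" * 100)
side_a = make_wav(b"\x80" * 100 + b"\x90" * 10)   # one change appends 10 samples
side_b = make_wav(b"\x70" * 20 + b"\x80" * 100)   # another prepends 20 samples

print(data_size(base), data_size(side_a), data_size(side_b))
```

The two changes touch different sample regions, yet both rewrite the same length field, to 110 and 120. A correct merge would need 130, a value that appears in neither patch, so only a tool that understands the format can produce it.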

And so on for all possible binary types.

So IMHO a binary patch format is only useful for easy _transport_ along
with other text patches. And the binary patch must either apply
perfectly against the same source file or it must not apply at all.
That's the only sensible accommodation we can make with a generic binary
patch format.

When the patch doesn't apply to your tree, then nothing prevents you
from hooking a dedicated tool that will pick up the original file, the
reconstructed remote version according to the binary patch you received
and your own modified version so that tool can process them and do the
necessary changes with proper knowledge of the data format.


Nicolas
Junio C Hamano
2006-04-05 18:34:55 UTC
Post by Randal L. Schwartz
I think the issue is related to being able to cherry-pick and merge
when binaries are involved. I've been worried about that myself.
How well are binaries supported these days for all the operations
we're taking for granted? When is a "diff" expected to be a real
"diff" and not just "binary files differ"?
First of all, binary files are handled by cherry-pick and merge
without needing to involve "diff"+"patch" (which is not so
useful for binary files anyway). They use 3-way read-tree merge
which compares the object names and leaves the index unmerged if
there are conflicting changes, so you should be able to sort it
out by running up to three "git-cat-file blob $sha1".
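
A simplified sketch of that per-path decision in Python. It only models the comparison of object names, not the full set of read-tree rules, and the function name and return shape are made up for illustration. Stage numbers 1, 2 and 3 stand for the common ancestor, our side and their side:

```python
from typing import Optional

def merge_path(base: Optional[str], ours: Optional[str], theirs: Optional[str]):
    """Three-way merge of one path using only object names (None = path absent).

    Returns ("merged", name) or ("unmerged", {stage: name}).
    """
    if ours == theirs:
        return ("merged", ours)       # both sides agree
    if base == ours:
        return ("merged", theirs)     # only their side changed
    if base == theirs:
        return ("merged", ours)       # only our side changed
    # Both sides changed differently: keep every version for the user to inspect.
    stages = {1: base, 2: ours, 3: theirs}
    return ("unmerged", {s: n for s, n in stages.items() if n is not None})

print(merge_path("aaa", "aaa", "bbb"))
print(merge_path("aaa", "bbb", "ccc"))
```

No file content is examined at any point, which is why this works the same for binary and text files. In the unmerged case the user is left with up to three object names to feed to git-cat-file.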

What involves "diff"+"patch" are rebases and processing mailed-in
patches as in the example by the original poster.

In our diff output, we record the blob object name of preimage
and postimage, along with filemode, on the "index" line.
git-apply does not do anything with it by default, but if:

- --binary flag is given,

- the postimage blob is already available locally, and,

- the file the patch is being applied to is the same as the
recorded preimage,

then the file is _replaced_ with the postimage.
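
The three conditions can be sketched in Python as follows. This assumes the "index" line carries full object names, models the local object store as a plain set of names, and uses hypothetical helper names. It is not how git-apply is implemented:

```python
import re

# "index <preimage>..<postimage> [<mode>]" as it appears in Git diff output.
INDEX_LINE = re.compile(r"^index ([0-9a-f]+)\.\.([0-9a-f]+)(?: ([0-7]+))?$")

def parse_index_line(line: str):
    """Return (preimage, postimage, mode) or None if this is not an index line."""
    m = INDEX_LINE.match(line.strip())
    return m.groups() if m else None

def can_replace(line: str, current_name: str, local_objects: set, binary: bool) -> bool:
    """True only when all three conditions from the list above hold."""
    parsed = parse_index_line(line)
    if parsed is None or not binary:
        return False
    pre, post, _mode = parsed
    return post in local_objects and current_name == pre

pre, post = "a" * 40, "b" * 40
line = "index %s..%s 100644" % (pre, post)
print(can_replace(line, pre, {post}, binary=True))   # all conditions met
print(can_replace(line, pre, set(), binary=True))    # postimage not available locally
```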

This is good enough for git-rebase (which uses format-patch
piped to am) and is safe (we do not "apply delta" -- only
replace when the file "being patched" matches the recorded
preimage). It does not do any good for transferring a postimage
that the person who applies the patch does not yet have.

I think "applying delta" to a binary file is not a very useful
thing to do. Depending on the nature of the file being patched,
it may produce a perfectly good result, but having the end user
verify that the result makes sense and hand-fix it if it does
not, which can be done for text files, is near impossible for
binary files. A "replace with postimage only when you are
applying to the same preimage" rule would be the only practical,
sane thing.

If we wanted to use the patch+diff (i.e. "format-patch,
send-email, and then am" workflow) to transfer new versions of
binary files to a recipient, which I think is useful in some
projects, the sanest way to handle this is probably to add
Nico's delta, going from preimage to postimage, encoded for
safer transport, to our diff output. For safety and sanity, we
will not "apply" the patch unless the patched file exactly
matches the preimage that is recorded in the diff, and as long
as the recipient has the preimage, such a patch would be able to
reproduce the postimage and hopefully be smaller than
transferring the whole thing.

We've been trying to keep our diff output reversible (e.g. we
show what the filemode of the preimage is), so if we take the
above route, it probably should record deltas for both going
from preimage to postimage _and_ going the other way (unless
xdelta can be applied in-reverse, which I do not think is the
case).

Of course, to be _completely_ generic, you could include both
compressed then uuencoded preimage and postimage, and let the
recipient sort it out. An advantage of that approach is that
the applicability of such a "patch" improves as the tools to
apply it improve, after the patch was originally generated. I
however think that is only a theoretical advantage, not a very
practical one.
Randal L. Schwartz
2006-04-05 18:51:39 UTC
Junio> If we wanted to use the patch+diff (i.e. "format-patch,
Junio> send-email, and then am" workflow) to transfer new version of
Junio> binary files to a recipient, which I think is useful in some
Junio> projects, the sanest way to handle this is probably to add
Junio> Nico's delta, going from preimage to postimage, encoded for
Junio> safer transport, to our diff output.

This is what I was looking for, and thanks for confirming that at least within
a local repository, everything already works. Yeay.
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<***@stonehenge.com> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Nicolas Pitre
2006-04-05 19:31:05 UTC
Post by Junio C Hamano
If we wanted to use the patch+diff (i.e. "format-patch,
send-email, and then am" workflow) to transfer new version of
binary files to a recipient, which I think is useful in some
projects, the sanest way to handle this is probably to add
Nico's delta, going from preimage to postimage, encoded for
safer transport, to our diff output. For safety and sanity, we
will not "apply" the patch unless the patched file exactly
matches the preimage that is recorded in the diff, and as long
as the recipient has the preimage, such a patch would be able to
reproduce the postimage and hopefully be smaller than
transferring the whole thing.
Exactly the point.
Post by Junio C Hamano
We've been trying to keep our diff output reversible (e.g. we
show what the filemode of the preimage is), so if we take the
above route, it probably should record deltas for both going
from preimage to postimage _and_ going the other way (unless
xdelta can be applied in-reverse, which I do not think is the
case).
You cannot reverse a delta. However if you were able to apply a delta
from preimage to postimage that means you must already have had preimage
in your object store. Therefore reverting such a patch would simply
involve restoring preimage.
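
A small Python sketch of why that is. It assumes a copy/insert style delta in the spirit of the one used for pack storage; the tuple encoding is a made-up simplification, not Git's actual byte format:

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base using copy/insert instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":            # ("copy", offset, length): bytes taken from the base
            _, offset, length = op
            out += base[offset:offset + length]
        else:                          # ("insert", data): literal bytes carried in the delta
            out += op[1]
    return bytes(out)

preimage = b"HEADER" + b"old-pixels" + b"TRAILER"
delta = [("copy", 0, 6), ("insert", b"new"), ("copy", 16, 7)]
postimage = apply_delta(preimage, delta)

print(postimage)
# The replaced bytes appear neither in the postimage nor in the delta itself.
carried = b"".join(op[1] for op in delta if op[0] == "insert")
print(b"old-pixels" in postimage, b"old-pixels" in carried)
```

The delta only says which parts of the preimage to keep and what to add, so the dropped bytes cannot be recovered from the postimage and the delta alone. Anyone who could apply it, though, necessarily had the preimage to begin with.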
Post by Junio C Hamano
Of course, to be _completely_ generic, you could include both
compressed then uuencoded preimage and postimage, and let the
recipient sort it out.
I think this is just too much and beside the point of a diff. If the
workflow is so convoluted that the simple binary patch as a delta
doesn't apply, then it would probably be a better idea to simply transfer
those binaries as email attachments. In other words, if a binary patch
transfer mechanism is added, it should cover the common case and leave
the rest for a better process like git-fetch/pull.


Nicolas
Junio C Hamano
2006-04-05 20:20:34 UTC
Post by Nicolas Pitre
Post by Junio C Hamano
We've been trying to keep our diff output reversible (e.g. we
show what the filemode of the preimage is), so if we take the
above route, it probably should record deltas for both going
from preimage to postimage _and_ going the other way (unless
xdelta can be applied in-reverse, which I do not think is the
case).
You cannot reverse a delta. However if you were able to apply a delta
from preimage to postimage that means you must already have had preimage
in your object store. Therefore reverting such a patch would simply
involve restoring preimage.
The case I had in mind was where you shipped a tarball of the
tip to somebody (or "a shallow clone") and, after seeing him
have problems with that release, sent him a patch telling
him "reverting this might help, could you please give it a try?"

Of course you could be nicer to him and generate the reverse
diff on your end in such a case instead.
