Discussion:
git pack/unpack over bittorrent - works!
Luke Kenneth Casson Leighton
2010-09-01 14:36:16 UTC
Permalink
http://gitorious.org/python-libbittorrent/pybtlib

hurrah - success! git fsck shows a "dangling commit"!

so, as a proof-of-concept, 400 lines of python code plus a bittorrent
library shows that it's possible to create a peer-to-peer distributed
version of "git fetch", by treating the pack objects as "files to be
shared".

as this code has only existed for less than three days, there are a
lot of loose ends. such as the pack object files being cached
in-memory, and it being strictly speaking unnecessary to obtain
absolutely every single pack object from the git repository at
start-up time, but to optimise that to being "on-demand" requires some
ferreting around in the bittorrent library, etc. etc.

if anyone is interested in helping out, or knows of a way to get this
sponsored and completed, please speak up. especially the sponsorship
bit because i am in a severely critical and ridiculous financial
situation and am in urgent and immediate need of money.

l.
Nguyen Thai Ngoc Duy
2010-09-01 22:04:03 UTC
Permalink
On Thu, Sep 2, 2010 at 12:36 AM, Luke Kenneth Casson Leighton
<***@gmail.com> wrote:
> http://gitorious.org/python-libbittorrent/pybtlib
>
> hurrah - success! =C2=A0git fsck shows a "dangling commit"!
>
> so, as a proof-of-concept, 400 lines of python code plus a bittorrent
> library shows that it's possible to create a peer-to-peer distributed
> version of "git fetch", by treating the pack objects as "files to be
> shared".

You should have a look at gittorrent [1] (and finish it too if you are
interested). There were discussions whether a pack is stable enough to
be shared like this, one of the reason "commit reel" was introduced.

[1] http://code.google.com/p/gittorrent/
--=20
Duy
Luke Kenneth Casson Leighton
2010-09-02 13:37:30 UTC
Permalink
On Wed, Sep 1, 2010 at 11:04 PM, Nguyen Thai Ngoc Duy <***@gmail.co=
m> wrote:
> On Thu, Sep 2, 2010 at 12:36 AM, Luke Kenneth Casson Leighton
> <***@gmail.com> wrote:
>> http://gitorious.org/python-libbittorrent/pybtlib
>>
>> hurrah - success! =C2=A0git fsck shows a "dangling commit"!
>>
>> so, as a proof-of-concept, 400 lines of python code plus a bittorren=
t
>> library shows that it's possible to create a peer-to-peer distribute=
d
>> version of "git fetch", by treating the pack objects as "files to be
>> shared".
>
> You should have a look at gittorrent [1] (and finish it too if you ar=
e
> interested).

it's in perl, and has been shelved afaik. sam (hi sam, you still
here? :) abandoned bittorrent as the underlying mechanism, and i
disagree with that decision, hence why i created what i have, to prove
that it's viable.

so a) i don't do perl so would need to re-create what's been done
(which i don't understand, and, because it's incomplete, i can't do a
perl-to-python translation and "have something working") b) i don't
believe in reinventing the wheel ESPECIALLY on something as complex as
peer-to-peer file distribution.

cameron dale created apt-p2p which is a recreation of a peer-to-peer
file distribution mechanism, and, not surprisingly, it's slow and
problematic.

> There were discussions whether a pack is stable enough to
> be shared like this,

it seems to be. as long as each version of git produces the exact
same pack object, off of the command "git pack-objects --all --stdout
--thin {ref} < {objref}"

is that what you're referring to?

because if it isn't, then yes, the sharing of files (named by a
virtual filename of packs/{ref}/{objref} of course) which _might_ have
differences, yeah, it becomes a bit of a fuck-up.

> one of the reason "commit reel" was introduced.

ah _ha_. i need to look that up. will get back to you.

l.
Luke Kenneth Casson Leighton
2010-09-02 13:53:54 UTC
Permalink
>> one of the reason "commit reel" was introduced.
>
> =C2=A0ah _ha_. =C2=A0i need to look that up. =C2=A0will get back to y=
ou.
>
> =C2=A0l.
>

http://www.mail-archive.com/***@lists.utsl.gen.nz/msg00032.html

=2E... nnnope. don't understand it. at all. there's no context.

http://www.mail-archive.com/***@lists.utsl.gen.nz/msg00003.html =
-->

http://gittorrent.utsl.gen.nz/rfc.html#anchor7

"Service Not Available".

greeat.

http://www.mail-archive.com/***@lists.utsl.gen.nz/msg00023.html

ah HA!
+<t hangText=3D"References:">
+
+ The References are the Git refs of the repository being shared.
+
+</t>
+<t hangText=3D"Commit reel offset:">
+
+ An offset into the list of all revisions sorted by their natura=
l
+ topological order, their commit date and their SHA-1.
+
+</t>
+<t hangText=3D"Commit reel:">
+
+ A commit reel consists of two commit reel offsets, and all the =
objects
+ that are reachable from the revisions between the two offsets, =
but
+ not the revisions after the second offset. In Git terms, a com=
mit
+ reel is a bundle.
+
+</t>
+<t hangText=3D"Block:">
+
+ A block is the actual content of a commit reel, i.e. the object=
s.
+ In Git terms, it is a pack.
+
+</t>

oh - right.... so, translating that: the concern is not that the git
pack-objects might be different (could someone pleaaaase confirm
that!) - the concern is that the order of _unpacking_ *has* to be done
in the specific order in which they were (originally) committed.

if that's all it is, then yes, i thought that that was plainly
obvious, and had taken it into consideration already, by creating a
virtual file which contains the order of the commits. this is
achieved merely by making the contents of "git rev-list" available.
voila, dead-simple: you now have enough information to be able to
apply the pack objects in the right order.

l.
Ævar Arnfjörð Bjarmason
2010-09-02 14:08:46 UTC
Permalink
On Thu, Sep 2, 2010 at 13:37, Luke Kenneth Casson Leighton
<***@gmail.com> wrote:

> =C2=A0cameron dale created apt-p2p which is a recreation of a peer-to=
-peer
> file distribution mechanism, and, not surprisingly, it's slow and
> problematic.

apt is somewhat of a bad fit for bittorrent. You can get a lot of
throughput over torrent, but connecting to the swarm and beginning the
download generally takes much longer than downloading & installing
dozens of packages using the normal apt transport.

Also as a matter of implementation they're using a really resource
hungry Python implementation of BitTorrent instead of something like
libtorrent, which is why I stopped running it.

But presumably most uses for GitTorrent (and what you're doing)
wouldn't suffer so badly from the latency, you could just leave some
daemon on which would download commits in the background.

So e.g. if someone submitted a series against git.git to the list 30
minutes ago and he/I were running some git-p2p thingy I could rely on
those commits having made it to my repository by now.

Just a thought.
A Large Angry SCM
2010-09-02 15:33:48 UTC
Permalink
On 09/02/2010 09:37 AM, Luke Kenneth Casson Leighton wrote:
> On Wed, Sep 1, 2010 at 11:04 PM, Nguyen Thai Ngoc Duy<***@gmail.com> wrote:
[...]
>> There were discussions whether a pack is stable enough to
>> be shared like this,
>
> it seems to be. as long as each version of git produces the exact
> same pack object, off of the command "git pack-objects --all --stdout
> --thin {ref}< {objref}"

This is not guaranteed.
Luke Kenneth Casson Leighton
2010-09-02 15:42:55 UTC
Permalink
On Thu, Sep 2, 2010 at 4:33 PM, A Large Angry SCM <***@gmail.com> =
wrote:
> On 09/02/2010 09:37 AM, Luke Kenneth Casson Leighton wrote:
>>
>> On Wed, Sep 1, 2010 at 11:04 PM, Nguyen Thai Ngoc Duy<***@gmail.=
com>
>> =C2=A0wrote:
>
> [...]
>>>
>>> There were discussions whether a pack is stable enough to
>>> be shared like this,
>>
>> =C2=A0it seems to be. =C2=A0as long as each version of git produces =
the exact
>> same pack object, off of the command "git pack-objects --all --stdou=
t
>> --thin {ref}< =C2=A0{objref}"
>
> This is not guaranteed.

ok. greeeat.

so, some sensible questions:

* what _can_ be guaranteed?

* diffs?

* git-format-patches? (which i am aware can do binary files and also
rms)?

* individual files in the .git/objects directory?

and, asking perhaps some silly questions:

* why is it not guaranteed?

* under what circumstances is it not guaranteed? and, crucially, is
it necessary to care? i.e. if someone does a shallow git clone, i
couldn't give a stuff.

* is it possible to _make_ the repository guaranteed to produce
identical pack objects?

* does for example "git gc" change the object store in such a way such
that one git repo will produce a different pack-object from the same
ref? if so, can running "git gc" prior to producing the pack-objects
gurantee that the pack-objects will be the same?

* is it a versioning issue? is it because there are different
versions (2 and 3)? if so, that's ok, you just force people to use
the same pack-object versions.

etc. etc.

l.
Luke Kenneth Casson Leighton
2010-09-02 15:51:46 UTC
Permalink
On Thu, Sep 2, 2010 at 4:42 PM, Luke Kenneth Casson Leighton
<***@gmail.com> wrote:
> On Thu, Sep 2, 2010 at 4:33 PM, A Large Angry SCM <***@gmail.com> wrote:

> * is it possible to _make_ the repository guaranteed to produce
> identical pack objects?

i.e. looking at these options, listed from
Documentation/technical/protocol.txt:

00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack
thin-pack side-band side-band-64k ofs-delta shallow no-progress
include-tag

is it possible to use shallow, thin-pack, side-band or side-band-64k
to guarantee that the pack object will be identical?

another important question:

* if after performing a "git unpack" of one pack-object, can it be
guaranteed that performing a "git pack-object" on the *exact* same ref
and the *exact* same object-ref, will produce the *exact* same
pack-object that was used by "git unpack", as long as the exact same
arguments are used? if not, why not, and if not under _some_
circumstances, under what circumstances _can_ the exact same
pack-object be retrieved that was just used?

if there is absolutely absolutely no way to guarantee that the
pack-objects can be the same, under no circumstances or combinations
of arguments or by forcing only compatible versions to communicate
etc. etc., a rather awful work-around can be applied which is to share
and permanently cache every single pack-object, rather than use what's
gone into the repo.

l.
A Large Angry SCM
2010-09-02 17:06:32 UTC
Permalink
On 09/02/2010 11:51 AM, Luke Kenneth Casson Leighton wrote:
> On Thu, Sep 2, 2010 at 4:42 PM, Luke Kenneth Casson Leighton
> <***@gmail.com> wrote:
>> On Thu, Sep 2, 2010 at 4:33 PM, A Large Angry SCM<***@gmail.com> wrote:
>
>> * is it possible to _make_ the repository guaranteed to produce
>> identical pack objects?
>
> i.e. looking at these options, listed from
> Documentation/technical/protocol.txt:
>
> 00887217a7c7e582c46cec22a130adf4b9d7d950fba0 HEAD\0multi_ack
> thin-pack side-band side-band-64k ofs-delta shallow no-progress
> include-tag
>
> is it possible to use shallow, thin-pack, side-band or side-band-64k
> to guarantee that the pack object will be identical?

No. Looking to use identical packs created on different systems is not
something that git guarantees or, likely, will ever guarantee. If you
need that, you need to create and implement the canonical pack-like
definition for your transfer protocol.

> another important question:
>
> * if after performing a "git unpack" of one pack-object, can it be
> guaranteed that performing a "git pack-object" on the *exact* same ref
> and the *exact* same object-ref, will produce the *exact* same
> pack-object that was used by "git unpack", as long as the exact same
> arguments are used? if not, why not, and if not under _some_
> circumstances, under what circumstances _can_ the exact same
> pack-object be retrieved that was just used?

Write your own "packer" is really the best answer.

> if there is absolutely absolutely no way to guarantee that the
> pack-objects can be the same, under no circumstances or combinations
> of arguments or by forcing only compatible versions to communicate
> etc. etc., a rather awful work-around can be applied which is to share
> and permanently cache every single pack-object, rather than use what's
> gone into the repo.

Sounds ugly and inefficient.
Jeff King
2010-09-02 15:58:13 UTC
Permalink
On Thu, Sep 02, 2010 at 04:42:55PM +0100, Luke Kenneth Casson Leighton =
wrote:

> >> =C2=A0it seems to be. =C2=A0as long as each version of git produce=
s the exact
> >> same pack object, off of the command "git pack-objects --all --std=
out
> >> --thin {ref}< =C2=A0{objref}"
> >
> > This is not guaranteed.
> [...]
> * under what circumstances is it not guaranteed? and, crucially, is
> it necessary to care? i.e. if someone does a shallow git clone, i
> couldn't give a stuff.

pack-objects will reuse previously found deltas. So the deltas you have
in your existing packs matter. The deltas you have in your existing
packs depend on many things. At least:

1. Options you used when packing (e.g., --depth and --window).

2. Probably exactly _when_ you packed. You could find a good delta
from A to B. Later, object C comes into existence, and would
provide a better delta base for B. I don't think we will ever try =
A
against C, unless --no-reuse-delta is set.

You have a different pack than somebody who packed after A, B, and
C all existed.

In practice, this tends not to happen much because the best deltas
are usually going backwards in time to a previous version. But it
can happen.

-Peff
Nicolas Pitre
2010-09-02 16:41:45 UTC
Permalink
On Thu, 2 Sep 2010, Jeff King wrote:

> On Thu, Sep 02, 2010 at 04:42:55PM +0100, Luke Kenneth Casson Leighton wrote:
>
> > >>  it seems to be.  as long as each version of git produces the exact
> > >> same pack object, off of the command "git pack-objects --all --stdout
> > >> --thin {ref}<  {objref}"
> > >
> > > This is not guaranteed.
> > [...]
> > * under what circumstances is it not guaranteed? and, crucially, is
> > it necessary to care? i.e. if someone does a shallow git clone, i
> > couldn't give a stuff.
>
> pack-objects will reuse previously found deltas. So the deltas you have
> in your existing packs matter. The deltas you have in your existing
> packs depend on many things. At least:
>
> 1. Options you used when packing (e.g., --depth and --window).
>
> 2. Probably exactly _when_ you packed. You could find a good delta
> from A to B. Later, object C comes into existence, and would
> provide a better delta base for B. I don't think we will ever try A
> against C, unless --no-reuse-delta is set.
>
> You have a different pack than somebody who packed after A, B, and
> C all existed.
>
> In practice, this tends not to happen much because the best deltas
> are usually going backwards in time to a previous version. But it
> can happen.

I would go as far as stating that this is never guaranteed by design.
And I will oppose any attempt to introduce such restrictions as this
will only prevent future enhancements to packing heuristics.

For example, right now you already can't rely on having the exact same
pack output even on the same machine using the same arguments and the
same inputs simply by using threads. As soon as you're using more than
one thread (most people do these days) then your pack output becomes non
deterministic.


Nicolas
A Large Angry SCM
2010-09-02 17:09:06 UTC
Permalink
On 09/02/2010 12:41 PM, Nicolas Pitre wrote:

[...]

> I would go as far as stating that this is never guaranteed by design.
> And I will oppose any attempt to introduce such restrictions as this
> will only prevent future enhancements to packing heuristics.
>
> For example, right now you already can't rely on having the exact same
> pack output even on the same machine using the same arguments and the
> same inputs simply by using threads. As soon as you're using more than
> one thread (most people do these days) then your pack output becomes non
> deterministic.

Finally, the real pack expert weighs in!
Nicolas Pitre
2010-09-02 17:31:39 UTC
Permalink
On Thu, 2 Sep 2010, A Large Angry SCM wrote:

> On 09/02/2010 12:41 PM, Nicolas Pitre wrote:
>
> > For example, right now you already can't rely on having the exact same
> > pack output even on the same machine using the same arguments and the
> > same inputs simply by using threads. As soon as you're using more than
> > one thread (most people do these days) then your pack output becomes non
> > deterministic.
>
> Finally, the real pack expert weighs in!

BTW I just have a little time to quickly scan through my git mailing
list backlog these days, and stumbled on this by luck. So if people
want my opinion on such matters it is safer to CC me directly.


Nicolas
Luke Kenneth Casson Leighton
2010-09-02 19:17:39 UTC
Permalink
On Thu, Sep 2, 2010 at 6:31 PM, Nicolas Pitre <***@fluxnic.net> wrote:
> On Thu, 2 Sep 2010, A Large Angry SCM wrote:
>
>> On 09/02/2010 12:41 PM, Nicolas Pitre wrote:
>>
>> > For example, right now you already can't rely on having the exact =
same
>> > pack output even on the same machine using the same arguments and =
the
>> > same inputs simply by using threads. =C2=A0As soon as you're using=
more than
>> > one thread (most people do these days) then your pack output becom=
es non
>> > deterministic.
>>
>> Finally, the real pack expert weighs in!
>
> BTW I just have a little time to quickly scan through my git mailing
> list backlog these days, and stumbled on this by luck. =C2=A0So if pe=
ople
> want my opinion on such matters it is safer to CC me directly.

appreciated nicolas. will keep it short. ish :)

* based on what you kindly mentioned about "git repack -f", would a
(well-written!) patch to git pack-objects to add a
"--single-thread-only" option be acceptable?

* would you, or anyone else with enough knowledge of how this stuff
reaallly works, be willing to put some low-priority back-of-mind
thought into how to create a "canonical" pack format - one that can be
enabled with a command-line-option? the reason i ask is because if i
even attempted such a task, i'd die of laughing (probably manically)
if it was ever accepted. i'd rather live :)


questions (not necessarily for nicolas) - can anyone think of any
good reasons _other_ than for multiple file-sharing to have a
"canonical" pack-object?

off the top of my head i can think of one: rsync if the transfer is
interrupted. if the pack-objects are large - and not guaranteed to be
the same - then an interrupted rsync transfer would be a bit of a
waste of bandwidth. however if the pack-object could always be made
the same, the partial transfer could carry on. musing a bit
further... mmm... i supooose the same thing applies equally to http
and ftp. it's a bit lame, i know: can anyone think of any better
reasons?

l.
Shawn O. Pearce
2010-09-02 19:29:10 UTC
Permalink
Luke Kenneth Casson Leighton <***@gmail.com> wrote:
>
> * based on what you kindly mentioned about "git repack -f", would a
> (well-written!) patch to git pack-objects to add a
> "--single-thread-only" option be acceptable?

Probably not. I can't think of a good reason to limit the number
of threads that get used. We already have pack.threads as a
configuration variable to support controlling this for the system,
but that's about the only thing that really makes sense.

> * would you, or anyone else with enough knowledge of how this stuff
> reaallly works, be willing to put some low-priority back-of-mind
> thought into how to create a "canonical" pack format

We have. We've even talked about it on the mailing list. Multiple
times. Most times about how to support a p2p Git transport.
That whole Gittorrent thing you are ignoring, we put some effort
into coming up with a pack-like format that would be more stable,
at the expense of being larger in total size.

> questions (not necessarily for nicolas) - can anyone think of any
> good reasons _other_ than for multiple file-sharing to have a
> "canonical" pack-object?

Yes, its called resuming a clone over git://.

Right now if you abort git:// you break the pack stream, and it
cannot be restarted. If we had a more consistent encoding we may
be able to restart an aborted clone.

But we can't solve it. Its a _very_ hard problem.

Nico, myself, and a whole lot of other very smart folks who really
understand how Git works today have failed to identify a way to do
this that we actually want to write, include in git, and maintain
long-term. Sure, anyone can come up with a specification that says
"put this here, that there, break ties this way". But we don't
want to bind our hands and maintain those rules.

> off the top of my head i can think of one: rsync if the transfer is
> interrupted. if the pack-objects are large - and not guaranteed to be
> the same - then an interrupted rsync transfer would be a bit of a
> waste of bandwidth. however if the pack-object could always be made
> the same, the partial transfer could carry on. musing a bit
> further... mmm... i supooose the same thing applies equally to http
> and ftp. it's a bit lame, i know: can anyone think of any better
> reasons?

We already do with this http:// and ftp:// during fetch or clone.
We try to resume with a byte range request, and validate the SHA-1
trailer on the end of the pack file after download. If it doesn't
match, we throw the file away and restart the entire thing.

In general pack files don't change that often, so there are fairly
good odds that resuming an aborted clone only a few hours after
it aborted would succeed by simply resuming the file download.
But every week or two (or even nightly!) its common for packs to
be completely rewritten (when the repository owner does `git gc`),
so we really cannot rely on packs being stable long-term.

--
Shawn.
Luke Kenneth Casson Leighton
2010-09-02 19:51:32 UTC
Permalink
On Thu, Sep 2, 2010 at 8:29 PM, Shawn O. Pearce <***@spearce.org> w=
rote:
> Luke Kenneth Casson Leighton <***@gmail.com> wrote:
>>
>> =C2=A0* based on what you kindly mentioned about "git repack -f", wo=
uld a
>> (well-written!) patch to git pack-objects to add a
>> "--single-thread-only" option be acceptable?
>
> Probably not. =C2=A0I can't think of a good reason to limit the numbe=
r
> of threads that get used. =C2=A0We already have pack.threads as a
> configuration variable to support controlling this for the system,
> but that's about the only thing that really makes sense.
>
>> =C2=A0* would you, or anyone else with enough knowledge of how this =
stuff
>> reaallly works, be willing to put some low-priority back-of-mind
>> thought into how to create a "canonical" pack format
>
> We have. =C2=A0We've even talked about it on the mailing list. =C2=A0=
Multiple
> times. =C2=A0Most times about how to support a p2p Git transport.
> That whole Gittorrent thing you are ignoring,

i'm not ignoring it - it was abandoned and sam created mirrorsync
instead! and i can't ignore something when all the damn information
on it has been withdrawn from the internet! i _have_ been looking,
and just can't darn well find anything. fortunately, i'm reasonably
bright, catch on fast, and listen well. ok. _sometimes_ i listen
well :)

> we put some effort
> into coming up with a pack-like format that would be more stable,
> at the expense of being larger in total size.

ahhh goood.

> Nico, myself, and a whole lot of other very smart folks who really
> understand how Git works today have failed to identify a way to do
> this that we actually want to write, include in git, and maintain
> long-term.

bugger. *sigh* ok. so, scratch that question, nico (the
canonical-pack question but not the --single-thread one)

so, this, and...

> In general pack files don't change that often, so there are fairly

... this, all tend to point towards the idea of sharing packs by
{ref}-{commitref}-{SHA1}.torrent as being a reasonabe and "good
enough" idea. on the basis that anyone who happens to be doing
git-sharing _right now_ is likely to end up sharing the exact same
pack object generated by the same one (original) seed.

i'd better start looking at bittornado in more detail...

l.

p.s. thank you to everyone who's responding, i dunno about you but
this is fascinating.
Luke Kenneth Casson Leighton
2010-09-02 20:06:22 UTC
Permalink
On Thu, Sep 2, 2010 at 8:29 PM, Shawn O. Pearce <***@spearce.org> w=
rote:
> Luke Kenneth Casson Leighton <***@gmail.com> wrote:
>>
>> =C2=A0* based on what you kindly mentioned about "git repack -f", wo=
uld a
>> (well-written!) patch to git pack-objects to add a
>> "--single-thread-only" option be acceptable?
>
> Probably not. =C2=A0I can't think of a good reason to limit the numbe=
r
> of threads that get used.

i can - so that git pack-objects, after "git repack -f", returns a
canonical pack! :)

>=C2=A0We already have pack.threads as a
> configuration variable to support controlling this for the system,
> but that's about the only thing that really makes sense.

ookaaay, so that would work, but would force that particular repo to
be _entirely_ single-threaded, which is non-optimal when all that's
needed is "git pack-objects" to be single-threaded. sure, i could
write a hack with some shell-script which takes the current option /
value for "pack.threads", stores it, changes it to 1, runs "git
pack-objects" and then changes it back again, but... eeuw. (not to
mention race-conditions....)

l.
Nicolas Pitre
2010-09-03 00:36:31 UTC
Permalink
On Thu, 2 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Thu, Sep 2, 2010 at 8:29 PM, Shawn O. Pearce <***@spearce.org> wrote:
> > Luke Kenneth Casson Leighton <***@gmail.com> wrote:
> >>
> >>  * based on what you kindly mentioned about "git repack -f", would a
> >> (well-written!) patch to git pack-objects to add a
> >> "--single-thread-only" option be acceptable?
> >
> > Probably not.  I can't think of a good reason to limit the number
> > of threads that get used.
>
> i can - so that git pack-objects, after "git repack -f", returns a
> canonical pack! :)

But did you try it? The -f means "don't reuse any existing pack data
and recompute every delta from scratch to find the best matches". This
is a very very costly operation that most people happily live without.


Nicolas
Luke Kenneth Casson Leighton
2010-09-03 10:34:21 UTC
Permalink
On Fri, Sep 3, 2010 at 1:36 AM, Nicolas Pitre <***@fluxnic.net> wrote:

>> =C2=A0i can - so that git pack-objects, after "git repack -f", retur=
ns a
>> canonical pack! :)
>
> But did you try it? =C2=A0The -f means "don't reuse any existing pack=
data
> and recompute every delta from scratch to find the best matches".

yehh, i tried it - on a small 2mb repo. whoopsie. ok. still one or
two ideas left.
Junio C Hamano
2010-09-03 17:03:15 UTC
Permalink
Nicolas Pitre <***@fluxnic.net> writes:

>> i can - so that git pack-objects, after "git repack -f", returns a
>> canonical pack! :)
>
> But did you try it? The -f means "don't reuse any existing pack data
> and recompute every delta from scratch to find the best matches". This
> is a very very costly operation that most people happily live without.

Also "the best matches" will change, hopefully in a better way, with newer
vintage of git. Optimization like c83f032 (apply delta depth bias to
already deltified objects, 2007-07-12) that is based on heuristics derived
from empirical statistics can and should be allowed to happen.
Brandon Casey
2010-09-02 20:28:28 UTC
Permalink
On 09/02/2010 02:29 PM, Shawn O. Pearce wrote:
> Luke Kenneth Casson Leighton <***@gmail.com> wrote:
>>
>> * based on what you kindly mentioned about "git repack -f", would a
>> (well-written!) patch to git pack-objects to add a
>> "--single-thread-only" option be acceptable?
>
> Probably not. I can't think of a good reason to limit the number
> of threads that get used. We already have pack.threads as a
> configuration variable to support controlling this for the system,
> but that's about the only thing that really makes sense.

I think pack-objects already has a --threads option allowing to
specify the number of threads to use.

--threads=1

should do it.

-Brandon
Luke Kenneth Casson Leighton
2010-09-02 20:48:44 UTC
Permalink
On Thu, Sep 2, 2010 at 9:28 PM, Brandon Casey
<***@nrlssc.navy.mil> wrote:
> I think pack-objects already has a --threads option allowing to
> specify the number of threads to use.
>
> =C2=A0 --threads=3D1
>
> should do it.

git pack-objects --help
....
....
--threads=3D<n>
Specifies the number of threads to spawn when searching for =
best
delta matches.

wha-hey! thank youuu brandon.

okaay... so maaaybeee there's a workaround (not involving patches to g=
it. whew)

l.
Jakub Narebski
2010-09-02 20:45:43 UTC
Permalink
"Shawn O. Pearce" <***@spearce.org> writes:
> Luke Kenneth Casson Leighton <***@gmail.com> wrote:

> > * would you, or anyone else with enough knowledge of how this stuff
> > reaallly works, be willing to put some low-priority back-of-mind
> > thought into how to create a "canonical" pack format
>
> We have. We've even talked about it on the mailing list. Multiple
> times. Most times about how to support a p2p Git transport.
> That whole GitTorrent thing you are ignoring, we put some effort
> into coming up with a pack-like format that would be more stable,
> at the expense of being larger in total size.

If I remember it correctly the main idea of GitTorrent was instead of
dividing file into pieces of data like in BitTorrent (pieces being
downloaded in parallel from different peers) it divides set of objects
into "reels" (which are special case of bundles, IIRC).

> > questions (not necessarily for nicolas) - can anyone think of any
> > good reasons _other_ than for multiple file-sharing to have a
> > "canonical" pack-object?
>
> Yes, its called resuming a clone over git://.
>
> Right now if you abort git:// you break the pack stream, and it
> cannot be restarted. If we had a more consistent encoding we may
> be able to restart an aborted clone.
>
> But we can't solve it. Its a _very_ hard problem.
>
> Nico, myself, and a whole lot of other very smart folks who really
> understand how Git works today have failed to identify a way to do
> this that we actually want to write, include in git, and maintain
> long-term. Sure, anyone can come up with a specification that says
> "put this here, that there, break ties this way". But we don't
> want to bind our hands and maintain those rules.

If I remember the discussion stalled (i.e. no working implementation),
and one of the latest proposals was to have some way of recovering
objects from partially downloaded file, and a way to request packfile
without objects that got already downloaded.

IIRC.
--
Jakub Narebski
Poland
ShadeHawk on #git
Luke Kenneth Casson Leighton
2010-09-02 21:10:55 UTC
Permalink
On Thu, Sep 2, 2010 at 9:45 PM, Jakub Narebski <***@gmail.com> wrote:

> If I remember the discussion stalled (i.e. no working implementation),
> and one of the latest proposals was to have some way of recovering
> objects from partially downloaded file, and a way to request packfile
> without objects that got already downloaded.

oo. ouch. i can understand why things stalled, then. you're
effectively adding an extra layer in, and even if you could add a
unique naming scheme on those objects (if one doesn't already exist?),
those object might (or might not!) come up the second time round (for
reasons mentioned already - threads resulting in different deltas
being picked etc.) ... and if they weren't picked for the re-generated
pack, you'd have to _delete_ them from the receiving end so as to
avoid polluting the recipient's object store haaarrgh *spit*, *cough*.

ok.

what _might_ work however iiiiIiis... to split the pack-object into
two parts. or, to add an "extra part", to be more precise:

a) complete list of all objects. _just_ the list of objects.
b) existing pack-object format/structure.

in this way, the sender having done all the hard work already of
determining what objects are to go into a pack-object, transfers that
*first*. _theeen_ you begin transferring the pack-object. theeeen,
if the pack-object transfer is ever interrupted, you simply send back
that list of objects, and ask "uhh, you know that list of objects we
were talking about? well, here it is *splat* - are you able to
recreate the pack-object from that, for me, and if so please gimme
again"

and, 10^N-1 times out of 10^N, for reasons that shawn kindly
explained, i bet you the answer would be "yes".

... um... in fact... um... i believe i'm merely talking about the .idx
index file, aren't i? because... um... the index file contains the
list of object refs in the pack, yes?

sooo.... taking a wild guess, here: if you were to parse the .idx file
and extract the list of object-refs, and then pass that to "git
pack-objects --window=0 --delta=0", would you end up with the exact
same pack file, because you'd forced git pack-objects to only return
that specific list of object-refs?

l.
Luke Kenneth Casson Leighton
2010-09-02 21:19:48 UTC
Permalink
sorry, shouldn't have hit send so quick.

On Thu, Sep 2, 2010 at 10:10 PM, Luke Kenneth Casson Leighton
<***@gmail.com> wrote:

> were talking about? =C2=A0well, here it is *splat* - are you able to
> recreate the pack-object from that, for me, and if so please gimme
> again"

> sooo.... taking a wild guess, here: if you were to parse the .idx fil=
e
> and extract the list of object-refs, and then pass that to "git
> pack-objects --window=3D0 --delta=3D0", would you end up with the exa=
ct
> same pack file, because you'd forced git pack-objects to only return
> that specific list of object-refs?

becauuuuse... if soooo... then that's a solution! you get the .idx
file first, you make sure that that's cached (and it's small, isn't
it, so that would be ok), then you transfer it around the network, and
you ask each repo if they can re-create the pack-object from the
contents of the .idx, if they can, blam, they're a seed/sharer for
that pack-object; if they can't, tough titty, they've probably had git
gc run, or had some more commits added, or whatever, and can't
participate, wow big deal, no great loss.

does that fly? :) if it doesn't fly (with --window=3D0 --delta=3D0) wo=
uld
it be easy to add an option to say "oi, gimme a pack with nothing but
these refs, in exactly this order, no quibbling, no questions, just
gimme"? and you could call it "the same as the original" because,
duh, it would contain exactly the same objects as the original.

am i even vaaaguely along the right lines? *scratches head*.

l.
Nicolas Pitre
2010-09-03 00:29:26 UTC
Permalink
On Thu, 2 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Thu, Sep 2, 2010 at 9:45 PM, Jakub Narebski <***@gmail.com> wrote:
>
> > If I remember the discussion stalled (i.e. no working implementation),
> > and one of the latest proposals was to have some way of recovering
> > objects from partially downloaded file, and a way to request packfile
> > without objects that got already downloaded.
>
> oo. ouch. i can understand why things stalled, then. you're
> effectively adding an extra layer in, and even if you could add a
> unique naming scheme on those objects (if one doesn't already exist?),
> those object might (or might not!) come up the second time round (for
> reasons mentioned already - threads resulting in different deltas
> being picked etc.) ... and if they weren't picked for the re-generated
> pack, you'd have to _delete_ them from the receiving end so as to
> avoid polluting the recipient's object store haaarrgh *spit*, *cough*.

Well, actually there is no need to delete anything. Git can cope with
duplicated objects just fine. A subsequent gc will get rid of the
duplicates automatically.

> what _might_ work however iiiiIiis... to split the pack-object into
> two parts. or, to add an "extra part", to be more precise:
>
> a) complete list of all objects. _just_ the list of objects.
> b) existing pack-object format/structure.
>
> in this way, the sender having done all the hard work already of
> determining what objects are to go into a pack-object, transfers that
> *first*. _theeen_ you begin transferring the pack-object. theeeen,
> if the pack-object transfer is ever interrupted, you simply send back
> that list of objects, and ask "uhh, you know that list of objects we
> were talking about? well, here it is *splat* - are you able to
> recreate the pack-object from that, for me, and if so please gimme
> again"

Well, it isn't that simple.

First, a resumable clone is useful only when there is a big transfer in
play. Otherwise it isn't worth the trouble.

So, if the clone is big, then this list of objects can be in the
millions. For example my linux kernel repo with a couple branches
currently has:

$ git rev-list --all --objects | wc -l
2808136

So, 2808136 objects, with 20-byte SHA1 for each of them, and you have a
54 MB object list to transfer already. This is a significant overhead
that we prefer to avoid, given the actual pack transfer which is:

$ git pack-objects --all --stdout --progress < /dev/null | wc -c
Counting objects: 2808136, done.
Compressing objects: 100% (384219/384219), done.
645201934
Total 2808136 (delta 2422420), reused 2788225 (delta 2402700)

The output from wc is 645201934 = 615 MB for this repository. Hence the
list of object alone is quite significant.

And even then, what if the transfer crashes during that object list
transfer? On flaky connections this might happen within 54 MB.

> and, 10^N-1 times out of 10^N, for reasons that shawn kindly
> explained, i bet you the answer would be "yes".

For the list of objects, sure. But that isn't a big deal. It is easy
enough to tell the remote about the commits we already have and ask for
the rest. With a commit SHA1, the remote can figure out all the objects
we have. But all is in that determination of the latest commit we have.
If we get a partial pack, it is possible to somehow salvage as many
objects from it, and determine what top commit(s) that correspond to.
It is possible to set your local repo just as if you had requested a
shallow clone and then the resume would simply be a deepening of that
shallow clone.

But usually the very first commit in a pack is huge as it typically
isn't delta compressed (a delta chain has to start somewhere). And this
first commit will roughly represent the same size as a tarball for that
commit. And if you don't get at least that first commit then you are
screwed. Or if you don't get a complete second commit when deepening a
clone you are still screwed.

Another issue is what to do with objects that are themselves huge.

Yet another issue: what to do with all those objects I've got in my
partial pack, but that I can't connect to any commit yet. We don't want
them transferred again but it isn't easy to tell the remote about them.

You could tell the remote: "I have this pack for this commit from this
commit but I got only this amount of bytes from it, please resume
transfer here." But as mentioned before the pack stream is not
deterministic, and we really don't want to make it single-threaded on a
server. Furthermore this is a lot of work for the server as even if the
pack stream is deterministic, then the server still has to recreate the
first part of the pack just to throw it away until the desired offset is
reached. And caching pack results also has all sorts of implications
we've prefered to avoid on a server for security reasons (better keep
serving operations read-only).

> ... um... in fact... um... i believe i'm merely talking about the .idx
> index file, aren't i? because... um... the index file contains the
> list of object refs in the pack, yes?

In one pack, yes. You might have multiple packs. And that doesn't mean
that all the objects from a pack are all relevant to the actual branches
you are willing to export.

> sooo.... taking a wild guess, here: if you were to parse the .idx file
> and extract the list of object-refs, and then pass that to "git
> pack-objects --window=0 --delta=0", would you end up with the exact
> same pack file, because you'd forced git pack-objects to only return
> that specific list of object-refs?

If you do this i.e. turn off delta compression, then the 615 MB
repository above will turn itself into a multi-gigabyte pack!


Nicolas
Nguyen Thai Ngoc Duy
2010-09-03 02:48:21 UTC
Permalink
On Fri, Sep 3, 2010 at 10:29 AM, Nicolas Pitre <***@fluxnic.net> wrote=
:
> But usually the very first commit in a pack is huge as it typically
> isn't delta compressed (a delta chain has to start somewhere). =C2=A0=
And this
> first commit will roughly represent the same size as a tarball for th=
at
> commit. =C2=A0And if you don't get at least that first commit then yo=
u are
> screwed. =C2=A0Or if you don't get a complete second commit when deep=
ening a
> clone you are still screwed.

Elijah's recent work on "rev-list --objects -- pathspec" [1] may help
split a commit into many parts that can be sent separately.

[1] http://mid.gmane.org/1282803711-10253-1-git-send-email-***@gmail=
=2Ecom

> Another issue is what to do with objects that are themselves huge.

=46or big blobs, it's probably best sending them separately so they can
be resumed.
--=20
Duy
Luke Kenneth Casson Leighton
2010-09-03 10:55:57 UTC
Permalink
On Fri, Sep 3, 2010 at 3:48 AM, Nguyen Thai Ngoc Duy <***@gmail.com=
> wrote:
> On Fri, Sep 3, 2010 at 10:29 AM, Nicolas Pitre <***@fluxnic.net> wro=
te:
>> But usually the very first commit in a pack is huge as it typically
>> isn't delta compressed (a delta chain has to start somewhere). =C2=A0=
And this
>> first commit will roughly represent the same size as a tarball for t=
hat
>> commit. =C2=A0And if you don't get at least that first commit then y=
ou are
>> screwed. =C2=A0Or if you don't get a complete second commit when dee=
pening a
>> clone you are still screwed.
>
> Elijah's recent work on "rev-list --objects -- pathspec" [1] may help
> split a commit into many parts that can be sent separately.

thank you nguyen, will take a look later in the day at that. much
appreciated.
Luke Kenneth Casson Leighton
2010-09-03 10:23:41 UTC
Permalink
very quick reply am off out

On Fri, Sep 3, 2010 at 1:29 AM, Nicolas Pitre <***@fluxnic.net> wrote:
>> sooo.... taking a wild guess, here: if you were to parse the .idx file
>> and extract the list of object-refs, and then pass that to "git
>> pack-objects --window=0 --delta=0", would you end up with the exact
>> same pack file, because you'd forced git pack-objects to only return
>> that specific list of object-refs?
>
> If you do this i.e. turn off delta compression, then the 615 MB
> repository above will turn itself into a multi-gigabyte pack!

sorry nicolas, i believe you may have misunderstood. first obtain
.idx which will return a delta-compressed list of objects, yes? then
use --window=0 --delta=0 with exact same list, surely you will end up
with the exact same list which you gave to the *previous* command,
yes?

$ git pack-objects
>> .pack and .idx
$ mv .pack .pack2
$ extract_pack_objects_from_idx.sh .idx > foo
$ git pack-objects --window=0 --delta=0 < foo
$ diff .pack .pack2

so no, of _course_ not just ask --window=0 --delta=0 for the very
first run: if that's what i'd said, that would indeed be dumb.

l.
Luke Kenneth Casson Leighton
2010-09-03 10:54:12 UTC
Permalink
have a bit more time.

On Fri, Sep 3, 2010 at 1:29 AM, Nicolas Pitre <***@fluxnic.net> wrote:
>> pack, you'd have to _delete_ them from the receiving end so as to
>> avoid polluting the recipient's object store haaarrgh *spit*, *cough=
*.
>
> Well, actually there is no need to delete anything. =C2=A0Git can cop=
e with
> duplicated objects just fine. =C2=A0A subsequent gc will get rid of t=
he
> duplicates automatically.

excellent. good to hear.

>> =C2=A0what _might_ work however iiiiIiis... to split the pack-object=
into
>> two parts. =C2=A0or, to add an "extra part", to be more precise:
>>
>> a) complete list of all objects. =C2=A0_just_ the list of objects.
>> b) existing pack-object format/structure.
>>
>> in this way, the sender having done all the hard work already of
>> determining what objects are to go into a pack-object, transfers tha=
t
>> *first*. =C2=A0_theeen_ you begin transferring the pack-object. =C2=A0=
theeeen,
>> if the pack-object transfer is ever interrupted, you simply send bac=
k
>> that list of objects, and ask "uhh, you know that list of objects we
>> were talking about? =C2=A0well, here it is *splat* - are you able to
>> recreate the pack-object from that, for me, and if so please gimme
>> again"
>
> Well, it isn't that simple.
>
> First, a resumable clone is useful only when there is a big transfer =
in
> play. =C2=A0Otherwise it isn't worth the trouble.
>
> So, if the clone is big, then this list of objects can be in the
> millions. =C2=A0For example my linux kernel repo with a couple branch=
es
> currently has:
>
> $ git rev-list --all --objects | wc -l
> 2808136
>
> So, 2808136 objects, with 20-byte SHA1 for each of them, and you have=
a
> 54 MB object list to transfer already.

ok:

a) that's fine. first time, you have to do that, you have to do that.

b) i have some ideas in mind, to say things like "i already have the
following objects up to here, please give me a list of everything
since then".

> And even then, what if the transfer crashes during that object list
> transfer? =C2=A0On flaky connections this might happen within 54 MB.

that's fine: i envisage the object list being cached at the remote
end (by the first seed), and also being a "shared file", such that
there may even be complete copies of that "file" out there already,
such that resumption is a non-issue.

>> and, 10^N-1 times out of 10^N, for reasons that shawn kindly
>> explained, i bet you the answer would be "yes".
>
> For the list of objects, sure. =C2=A0But that isn't a big deal. =C2=A0=
It is easy
> enough to tell the remote about the commits we already have and ask f=
or
> the rest. =C2=A0With a commit SHA1, the remote can figure out all the=
objects
> we have. But all is in that determination of the latest commit we hav=
e.
> If we get a partial pack, it is possible to somehow salvage as many
> objects from it, and determine what top commit(s) that correspond to.
> It is possible to set your local repo just as if you had requested a
> shallow clone and then the resume would simply be a deepening of that
> shallow clone.

i'll need to re-read this when i have more time. apologies.

> Another issue is what to do with objects that are themselves huge.

that's fine, too: in fact, that's the perfect scenario where a
file-sharing protocol excels.

> Yet another issue: what to do with all those objects I've got in my
> partial pack, but that I can't connect to any commit yet. =C2=A0We do=
n't want
> them transferred again but it isn't easy to tell the remote about the=
m.
>
> You could tell the remote: "I have this pack for this commit from thi=
s
> commit but I got only this amount of bytes from it, please resume
> transfer here."

ok i have a couple of ideas/thoughts

a) one of which was to send the commit index list back to the remote
end, but that would be baaaaad as it could be 54mb as you say, so it
would be necessary to say "here is the SHA1 of the index file you gave
me earlier, do you still have it, if so please can we resume

b) as long as _somebody_ has a complete copy, distributed throughout
the file-sharing network, of that complete pack, "resume transfer
here" isn't .... the concept is moot. the bittorrent protocol covers
that concept of "resume" very very easily.

>=C2=A0But as mentioned before the pack stream is not
> deterministic,

one a one-off basis, it is; and even then, i believe that it could be
made to "not matter". so you ask a server for a pack object and get a
different SHA-1? so what, you just make that part of the
file-sharing-network unique key: {ref}-{objref}-{SHA-1} instead of
just {ref}-{objref}. if the connection's lost, wow big deal, you just
ask again and you end up with a different SHA-1. you're back to a
situation which is actually no different from and no less efficient
than the present http transfer system.

... but i'd rather avoid this scenario, if possible.

> and we really don't want to make it single-threaded on a
> server. =C2=A0Furthermore this is a lot of work for the server as eve=
n if the
> pack stream is deterministic, then the server still has to recreate t=
he
> first part of the pack just to throw it away until the desired offset=
is
> reached. =C2=A0And caching pack results also has all sorts of implica=
tions
> we've prefered to avoid on a server for security reasons (better keep
> serving operations read-only).

i've already thrown out the idea of cacheing the pack objects
themselves, but am still exploring the concept of cacheing the .idx
file, even for short periods of time.

so the server does a lot of work creating that .idx file, but it
contains the complete list of all objects, which you _could_ just
obtain again by just asking explicitly for each and every single one
of those objects, no more, no less, no deltas, no windows - the list,
the whole list and nothing but the list.

>> ... um... in fact... um... i believe i'm merely talking about the .i=
dx
>> index file, aren't i? =C2=A0because... um... the index file contains=
the
>> list of object refs in the pack, yes?
>
> In one pack, yes. =C2=A0You might have multiple packs. =C2=A0And that=
doesn't mean
> that all the objects from a pack are all relevant to the actual branc=
hes
> you are willing to export.

yes that's fine. multiple packs are considered to be independent
files of the "VFS layer" in the file-sharing network. that's taken
care of. what i need to know is: can you recreate a pack object given
the list of objects in its .idx file?

>> sooo.... taking a wild guess, here: if you were to parse the .idx fi=
le
>> and extract the list of object-refs, and then pass that to "git
>> pack-objects --window=3D0 --delta=3D0", would you end up with the ex=
act
>> same pack file, because you'd forced git pack-objects to only return
>> that specific list of object-refs?
>
> If you do this i.e. turn off delta compression, then the 615 MB
> repository above will turn itself into a multi-gigabyte pack!

ok this was covered in my previous post, hope it's clearer. perhaps
"git pack-objects --window=3D0 --delta=3D0 <
{list-of-objects-extracted-from-the-idx-file}" isn't the way to
achieve what i envisage - if not, does anyone have any ideas on how
extracting the exact list of objects as previously given by a .idx
file can be achieved?

l.
Luke Kenneth Casson Leighton
2010-09-02 18:07:07 UTC
Permalink
On Thu, Sep 2, 2010 at 4:58 PM, Jeff King <***@peff.net> wrote:

> pack-objects will reuse previously found deltas. So the deltas you ha=
ve
> in your existing packs matter. The deltas you have in your existing
> packs depend on many things. At least:
>
> =C2=A01. Options you used when packing (e.g., --depth and --window).
>
> =C2=A02. Probably exactly _when_ you packed. You could find a good de=
lta
> =C2=A0 =C2=A0 from A to B. Later, object C comes into existence, and =
would
> =C2=A0 =C2=A0 provide a better delta base for B. I don't think we wil=
l ever try A
> =C2=A0 =C2=A0 against C, unless --no-reuse-delta is set.
>
> =C2=A0 =C2=A0 You have a different pack than somebody who packed afte=
r A, B, and
> =C2=A0 =C2=A0 C all existed.
>
> =C2=A0 =C2=A0 In practice, this tends not to happen much because the =
best deltas
> =C2=A0 =C2=A0 are usually going backwards in time to a previous versi=
on. But it
> =C2=A0 =C2=A0 can happen.

jeff, thanks for explaining (and to nicolas, i see, since beginning th=
is)

mrhmfffh. just been reading Documentation/technical/pack-heuristics.t=
xt.

so... options include:

* writing an alternative "canonical" pack-object algorithm. i'm
inclined to select "git format-patch"! :) but that would be the lazy
way....

* taking the seeder's pack-objects as the "canonical" ones,
regardless. cacheing of the results would, sadly, be virtually
unavoidable, given the situation (multi-threading etc.)

* throw away bittorrent entirely as a transport mechanism.

* force-feed one peer to be "the" provider of a particular given
pack. doesn't matter whom you contact to _obtain_ a pack from, as
long as you solely and exclusively get the pack from that particular
peer.

* slight improvement / variation on the above: if two peers just
coincidentally happen to create or have the same pack (as can be shown
by having the same SHA-1, and/or by having the same data in their
cache) then ta-daaa, you have a file-sharing network for that
particular pack.

i think.... i think i miiight be able, hmmm... i believe it would be
possible to implement this last option by creating separate .torrents
for packs (one each!). by splitting things down, so that pack objects
are named as {ref}-{objref}-{SHA-1}.torrent and by providing a "top
level" torrent which contains the refs/heads/* and the associate
rev-list(s)... each set of rev-lists would have the SHA-1 of the
pack-object that happened to be created (and shared) at that
particular time, from that particular client: you then genuinely don't
give a stuff about who has what, it's all the same, and...

hmmm, i feel a modification to / deviation from the bittorrent
protocol coming on :)

l.
Casey Dahlin
2010-09-02 18:23:07 UTC
Permalink
On Thu, Sep 02, 2010 at 07:07:07PM +0100, Luke Kenneth Casson Leighton wrote:
> On Thu, Sep 2, 2010 at 4:58 PM, Jeff King <***@peff.net> wrote:
> i think.... i think i miiight be able, hmmm... i believe it would be
> possible to implement this last option by creating separate .torrents
> for packs (one each!). by splitting things down, so that pack objects
> are named as {ref}-{objref}-{SHA-1}.torrent and by providing a "top
> level" torrent which contains the refs/heads/* and the associate
> rev-list(s)... each set of rev-lists would have the SHA-1 of the
> pack-object that happened to be created (and shared) at that
> particular time, from that particular client: you then genuinely don't
> give a stuff about who has what, it's all the same, and...
>

You seem to have some misconceptions about refs too. refs are absolutely
not common between two machines. The branch "master" on my box might
match the branch "linux-2.6" on git.kernel.org. And it might be from two
days ago.

A ref is just an annotation in /one particular repository/ that a
particular commit is interesting to it for some reason.

--CJD
A Large Angry SCM
2010-09-02 16:58:14 UTC
Permalink
On 09/02/2010 11:42 AM, Luke Kenneth Casson Leighton wrote:
> On Thu, Sep 2, 2010 at 4:33 PM, A Large Angry SCM<***@gmail.com> wrote:
>> On 09/02/2010 09:37 AM, Luke Kenneth Casson Leighton wrote:
>>>
>>> On Wed, Sep 1, 2010 at 11:04 PM, Nguyen Thai Ngoc Duy<***@gmail.com>
>>> wrote:
>>
>> [...]
>>>>
>>>> There were discussions whether a pack is stable enough to
>>>> be shared like this,
>>>
>>> it seems to be. as long as each version of git produces the exact
>>> same pack object, off of the command "git pack-objects --all --stdout
>>> --thin {ref}< {objref}"
>>
>> This is not guaranteed.
>
> ok. greeeat.
>
> so, some sensible questions:
>
> * what _can_ be guaranteed?
>
> * diffs?

Given a pre-image and a post-image, the diff/delta created will recreate
the post-image from the pre-image. The bit level representation of the
diff/delta is not guaranteed.

> * git-format-patches? (which i am aware can do binary files and also
> rms)?

See above.

> * individual files in the .git/objects directory?

Uncompressed: yes. Compressed: no.

> and, asking perhaps some silly questions:
>
> * why is it not guaranteed?

The pack format was created to move objects from one repository to
another. To do that efficiently, it uses many heuristics to decide how
much information is _sufficient_ to do the job but but leaves it to the
implementation and user to decide the various trade offs. For instance,
there is no canonical the order of the object information in a pack or
that the pack must be minimal. This also allows for the multi-threaded
pack implementation.

> * under what circumstances is it not guaranteed? and, crucially, is
> it necessary to care? i.e. if someone does a shallow git clone, i
> couldn't give a stuff.

Pretty much all of the time.

> * is it possible to _make_ the repository guaranteed to produce
> identical pack objects?

Identical code on identical systems with identical repositories without
multi-threading _might_ work.

> * does for example "git gc" change the object store in such a way such
> that one git repo will produce a different pack-object from the same
> ref? if so, can running "git gc" prior to producing the pack-objects
> gurantee that the pack-objects will be the same?

See above.

> * is it a versioning issue? is it because there are different
> versions (2 and 3)? if so, that's ok, you just force people to use
> the same pack-object versions.

No.

> etc. etc.
>
> l.
>
Nicolas Pitre
2010-09-02 17:21:24 UTC
Permalink
On Thu, 2 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Thu, Sep 2, 2010 at 4:33 PM, A Large Angry SCM <***@gmail.com> wrote:
> > On 09/02/2010 09:37 AM, Luke Kenneth Casson Leighton wrote:
> >>
> >> On Wed, Sep 1, 2010 at 11:04 PM, Nguyen Thai Ngoc Duy<***@gmail.com>
> >>  wrote:
> >
> > [...]
> >>>
> >>> There were discussions whether a pack is stable enough to
> >>> be shared like this,
> >>
> >>  it seems to be.  as long as each version of git produces the exact
> >> same pack object, off of the command "git pack-objects --all --stdout
> >> --thin {ref}<  {objref}"
> >
> > This is not guaranteed.
>
> ok. greeeat.
>
> so, some sensible questions:
>
> * what _can_ be guaranteed?

You can guarantee that if the SHA1 name of different packs is the same
then they contain the same set of objects. Obviously their packed
encoding will be different, and even the pack sizes might be quite
different too.

> * diffs?

Again that depends. Over the evolution of Git, its diff library was
modified resulting in slightly different but valid equivalent diff
outputs.

> * git-format-patches? (which i am aware can do binary files and also
> rms)?

Same as above.

> * individual files in the .git/objects directory?

Well, even then you can't guarantee they will be identical from one
system to another. That may depend on the zlib library version used for
example.

> and, asking perhaps some silly questions:
>
> * why is it not guaranteed?

Because it doesn't need to.

> * under what circumstances is it not guaranteed? and, crucially, is
> it necessary to care? i.e. if someone does a shallow git clone, i
> couldn't give a stuff.

Like I said, even repeating some repacking on the same machine with same
input is likely to produce slightly different packs because of
threading. This is because the work set is divided between threads, and
since thread scheduling is not deterministic then some threads might not
have the same amount of CPU cycles given to them in relation with the
other threads. And when a thread is done with its work set, it will go
and steal half of the work set from another thread with the most
amount of work
still left. This has the effect of changing the delta pairing outcome
on the workset edges.

> * is it possible to _make_ the repository guaranteed to produce
> identical pack objects?

Sure, but performance will suck.

> * does for example "git gc" change the object store in such a way such
> that one git repo will produce a different pack-object from the same
> ref? if so, can running "git gc" prior to producing the pack-objects
> gurantee that the pack-objects will be the same?

No. The gc operation will combine multiple small packs into one and try
to reuse as much data from those existing packs as possible without
recomputing it. So you'll end up reusing whatever delta pairing you
were given from your peer the last time you cloned a repo or fetched an
update. And of course that clone/fetch was the result of a pack
combining operation on the sending end which itself tried to reuse as
much of the existing data from different packs without recomputing it
too. Only the edges between different packs will be delta compressed in
those cases, using the particular heuristics that happen to be
implemented in the involved Git versions. So you may end up with a
totally different pack content containing data segments that originated
from wildly random places on the net.

The only way to get a bit-for-bit reproducible pack one one specific
system is to use 'git repack' with the -f switch, and limit it to only
one thread.

> * is it a versioning issue? is it because there are different
> versions (2 and 3)? if so, that's ok, you just force people to use
> the same pack-object versions.

Not at all. FYI version 3 never was actually deployed so there is
effectively only version 2 in play. There are "features" such as
OFS_DELTA that are negotiated when a pack is transferred over the git
protocol and if the receiver doesn't advertise them then the sender will
convert them on the fly into a compatible form.

But as the actual pack bitstream goes, it is totally unstable for all
the reasons I've stated so far. Of course, Git being distributed must
rely on some stable and universal representation of object content,
hence their SHA1 references. But their encoding doesn't have to be when
all peers can cope with all the variations.

I'm sorry as this isn't going to help you much unfortunately.










>
> etc. etc.
>
> l.
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to ***@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Luke Kenneth Casson Leighton
2010-09-02 19:41:57 UTC
Permalink
nicolas, thanks for responding: you'll see this some time in the
future when you catch up, it's not a high priority, nothing new, just
thinking out loud, for benefit of archives.

On Thu, Sep 2, 2010 at 6:21 PM, Nicolas Pitre <***@fluxnic.net> wrote:
>> =C2=A0* what _can_ be guaranteed?
>
> You can guarantee that if the SHA1 name of different packs is the sam=
e
> then they contain the same set of objects. =C2=A0Obviously their pack=
ed
> encoding will be different, and even the pack sizes might be quite
> different too.

ack. ok, so the idea of creating lots and lots of 2nd level
=2Etorrents by name {ref}-{commitref}-{SHA-1}.torrent is about the only
way to get around that.

@begin lots of no, no and hell no...
> [...]
@end

dang. diffs, versions, threads and zlibs as well, all conspiring agai=
nst me :)

>> * is it possible to _make_ the repository guaranteed to produce
>> identical pack objects?
>
> Sure, but performance will suck.

that's fiiine :) as i've learned on the pyjamas project, it's rare
that you have speed and interoperability at the same time...

> The only way to get a bit-for-bit reproducible pack one one specific
> system is to use 'git repack' with the -f switch, and limit it to onl=
y
> one thread.

whew - a way out, at last. you had me going, for a minute :)

>> * is it a versioning issue? =C2=A0is it because there are different
>> versions (2 and 3)? =C2=A0if so, that's ok, you just force people to=
use
>> the same pack-object versions.
>
> Not at all. =C2=A0FYI [....]

appreciated.


> I'm sorry as this isn't going to help you much unfortunately.

neeh, i'm flexible. it looks like i'm going to need to deviate from
bittorrent, after all, start adding new commands over which the git
rev-list gets transferred, rather than as a VFS layer. the reason is
that bittorrent depends on the files and the data in the files being
all the same, so that a hash can be taken of the whole lot and the
end-result verified.

if the pack-objects are going to vary, then the VFS layer idea is
blown completely out the water, except for the absolute basic
meta-info such as "refs/heads/*". so i might as well just use
"actual" bittorrent to transfer packs via
{ref}-{commitref}-{SHA-1}.torrent.

ho hum, drawing board we come...


l.
A Large Angry SCM
2010-09-02 19:52:02 UTC
Permalink
On 09/02/2010 03:41 PM, Luke Kenneth Casson Leighton wrote:

[...]

> neeh, i'm flexible. it looks like i'm going to need to deviate from
> bittorrent, after all, start adding new commands over which the git
> rev-list gets transferred, rather than as a VFS layer. the reason is
> that bittorrent depends on the files and the data in the files being
> all the same, so that a hash can be taken of the whole lot and the
> end-result verified.
>
> if the pack-objects are going to vary, then the VFS layer idea is
> blown completely out the water, except for the absolute basic
> meta-info such as "refs/heads/*". so i might as well just use
> "actual" bittorrent to transfer packs via
> {ref}-{commitref}-{SHA-1}.torrent.
>
> ho hum, drawing board we come...

You can treat the git object store (both loose and packed objects) as a
VFS and use the refs of interest (the I-needs and the I-haves) to create
a view...
Nicolas Pitre
2010-09-02 23:09:45 UTC
Permalink
On Thu, 2 Sep 2010, Luke Kenneth Casson Leighton wrote:

> nicolas, thanks for responding: you'll see this some time in the
> future when you catch up, it's not a high priority, nothing new, just
> thinking out loud, for benefit of archives.

Well, I might as well pay more attention to *you* now. :-)

> >> * is it possible to _make_ the repository guaranteed to produce
> >> identical pack objects?
> >
> > Sure, but performance will suck.
>
> that's fiiine :) as i've learned on the pyjamas project, it's rare
> that you have speed and interoperability at the same time...

Well, did you hear about this thing called Git? It appears that those
Git developers are performance freaks. :-) Yet, Git is interoperable
across almost all versions ever released because we made sure that only
fundamental things are defined and relied upon. And that excludes
actual delta pairing and pack object ordering. That's why a pack file
may have many different byte sequences and yet still represent the same
canonical data.

> if the pack-objects are going to vary, then the VFS layer idea is
> blown completely out the water, except for the absolute basic
> meta-info such as "refs/heads/*". so i might as well just use
> "actual" bittorrent to transfer packs via
> {ref}-{commitref}-{SHA-1}.torrent.

For the archive benefit, here's what I think on the whole idea.

The BitTorrent model is simply unappropriate for Git. It doesn't fit to
the Git model at all as BitTorrent works on stable and static data, and
requires a lot of people wanting that same data.

When you perform a fetch, Git does actually negociate with the server to
figure out what's missing locally, and the server does produce a custom
pack for you that is optimized so that only what's needed for you to be
up to date is transferred.

Even if you try to cache a set of packs to suit the BitTorrent static
data model, you'll need so many packs to cover all the possible gaps
between a server and a random number of clients each with a random
repository state. Of course it is possible to have bigger packs
covering larger gaps, but then you lose the biggest advantage that the
smart Git protocol has. And with smaller, more fine grained packs,
you'll end up with so many of them that finding a live torrent for the
actual one you need is going to be difficult.

> ho hum, drawing board we come...

Yep. Instead of transferring packs, a BitTorrent-alike transfer should
be based on the transfer of _objects_. Therefore you can make the
correspondance between file chunks in BitTorrent with objects in a Git
aware system. So, when contacting a peer, you could negociate what is
the set of objects that the peer has that you don't, and vice versa.
Objects in Git are stable and immutable, and they all have a unique SHA1
signature. And to optimize the negociation, the pack index content can
be used, first by exchanging the content of the first level
fan-out table and ignoring those entries that are equal. This for each
peer.

Then, each peer make requests to connected peers for objects that those
peers have but that isn't available locally, just like chunks in
BitTorrent.

But here's the twist to make this scale well. Since the object sender
knows what objects the receiver already has, it therefore can choose the
object encoding. Meaning that the sender can simply *reuse* a delta
encoding for an object it is requested to send if the requestor already
has the base object for this delta.

So in most cases, the object to send will be small, especially if it is
a delta object. That should fit the
chunk model. But if an object is bigger than a certain treshold, then
its transfer could be chunked across multiple peers just like classic
BitTorrent. In this case, the chunking would need to be done on
the non delta uncompressed object data as this is the only thing that is
universally stable (doesn't mean that the _transfer_ of those chunks
can't be compressed).

Now this design has many open questions, such as finding out what is the
latest set of refs amongst all the peers, whether or not what we have
locally are ancestors of the remote refs, etc.

And of course, while this will make for a speedy object transfer, the
resulting mess on the receiver's end will have to be validated
and repacked in the end. So overall this might not end up being faster
overall for the fetcher.


Nicolas
Theodore Tso
2010-09-03 10:37:12 UTC
Permalink
On Sep 2, 2010, at 7:09 PM, Nicolas Pitre wrote:

> The BitTorrent model is simply unappropriate for Git. It doesn't fit to
> the Git model at all as BitTorrent works on stable and static data, and
> requires a lot of people wanting that same data.

I wonder if this would work. Assume for the moment that once every N days (where N might be one month), the "Gittorrent master" generates a "canonical pack construction file" that contains information about the objects, their delta pairing, the version of diff and zlib required, etc., and the SHA1 of the expected resulting encoding of this pack file. This would contain the canonical pack for the entire repository, and for the Linux kernel, it might be released on www.kernel.org, and contain all of the objects from the beginning of time to the latest commit in Linus Torvalds' repository.

Although everybody's git repository will have different packs that they use and store natively, by following the "canonical pack construction file" they will be able to build up a canonical encoding for the pack that can then be used to more quickly allow newbie developers who are pulling the full clone of Linus's tree for the first time to download using a peer2peer Bittorrent style download. So people who are willing to participate as part of the peer2peer network can download the instructions for how to make the canonical pack once a month, and use it to create the canonical pack. If the "Gittorrent master" has spent a lot of time to carefully compute the most efficient set of delta pairings, they will get the slight benefit of a more efficient pack which they could use instead of th
eir local one without having to use large values of --window and --depth to "git repack".

This allows the peer2peer download to be used where it most matters --- for the bulk download for people who are cloning from the "canonical repository" for the first time. After that, they will no doubt find it far more efficient to download incremental uploads using the git protocol.

-- Ted
Luke Kenneth Casson Leighton
2010-09-03 11:04:52 UTC
Permalink
On Fri, Sep 3, 2010 at 11:37 AM, Theodore Tso <***@mit.edu> wrote:

> So people who are willing to participate as part of the peer2peer
> network can download the instructions for how to make the
> canonical pack once a month, and use it to create the canonical pack.

yes.

also it's already becoming clear that people mayy need to run a
"front" copy of the peertopeer git repository, to which they locally
perform "git push", for various reasons including not wishing to
expose "random experimentation commits" out onto the wider internet
until they're actually ready to do so. i realise that merging can get
round this (flattening many patches into one big commit) but there is
also the issue of having to run a daemon on the peertopeer repo - you
miiight not want to interfere with that.

so, yes, running a "special" git repo is probably a good idea.

thanks theo.

l.
Junio C Hamano
2010-09-03 17:12:29 UTC
Permalink
Theodore Tso <***@MIT.EDU> writes:

> ... So people who are willing
> to participate as part of the peer2peer network can download the
> instructions for how to make the canonical pack once a month, and use it
> to create the canonical pack. If the "Gittorrent master" has spent a
> lot of time to carefully compute the most efficient set of delta
> pairings, they will get the slight benefit of a more efficient pack
> which they could use instead of th eir local one without having to use
> large values of --window and --depth to "git repack".

Hmm, is the idea essentially to tell people "Here is a snapshot of Linus
repository as of a few weeks ago, carefully repacked. Instead of running
"git clone" yourself, please bootstrap your repository by copying it over
bittorrent and then "git pull" to update it"?
Ted Ts'o
2010-09-03 18:31:20 UTC
Permalink
On Fri, Sep 03, 2010 at 10:12:29AM -0700, Junio C Hamano wrote:
> Theodore Tso <***@MIT.EDU> writes:
>
> > ... So people who are willing
> > to participate as part of the peer2peer network can download the
> > instructions for how to make the canonical pack once a month, and use it
> > to create the canonical pack. If the "Gittorrent master" has spent a
> > lot of time to carefully compute the most efficient set of delta
> > pairings, they will get the slight benefit of a more efficient pack
> > which they could use instead of th eir local one without having to use
> > large values of --window and --depth to "git repack".
>
> Hmm, is the idea essentially to tell people "Here is a snapshot of Linus
> repository as of a few weeks ago, carefully repacked. Instead of running
> "git clone" yourself, please bootstrap your repository by copying it over
> bittorrent and then "git pull" to update it"?

Essentially, yes. I just don't think bittorrent makes sense for
anything else, because the git protocol is so much more efficient for
tiny incremental updates...

So the only other part of my idea is that we could construct a special
set of instructions that would allow them to recreate the carefully
repacked snapshot of Linus's repository without having to download it
from a central seed site. Instead, they could download a small set of
instructions, and use that in combination with the objects already in
their repository, to create a bit-identical version of the carefully
repacked Linus repository. It's basically rip-off of jigdo, but
applied to git repositories instead of Debian .iso files.

- Ted
Nicolas Pitre
2010-09-03 19:41:26 UTC
Permalink
On Fri, 3 Sep 2010, Ted Ts'o wrote:

> On Fri, Sep 03, 2010 at 10:12:29AM -0700, Junio C Hamano wrote:
> > Theodore Tso <***@MIT.EDU> writes:
> >
> > > ... So people who are willing
> > > to participate as part of the peer2peer network can download the
> > > instructions for how to make the canonical pack once a month, and use it
> > > to create the canonical pack. If the "Gittorrent master" has spent a
> > > lot of time to carefully compute the most efficient set of delta
> > > pairings, they will get the slight benefit of a more efficient pack
> > > which they could use instead of th eir local one without having to use
> > > large values of --window and --depth to "git repack".
> >
> > Hmm, is the idea essentially to tell people "Here is a snapshot of Linus
> > repository as of a few weeks ago, carefully repacked. Instead of running
> > "git clone" yourself, please bootstrap your repository by copying it over
> > bittorrent and then "git pull" to update it"?
>
> Essentially, yes. I just don't think bittorrent makes sense for
> anything else, because the git protocol is so much more efficient for
> tiny incremental updates...
>
> So the only other part of my idea is that we could construct a special
> set of instructions that would allow them to recreate the carefully
> repacked snapshot of Linus's repository without having to download it
> from a central seed site. Instead, they could download a small set of
> instructions, and use that in combination with the objects already in
> their repository, to create a bit-identical version of the carefully
> repacked Linus repository. It's basically rip-off of jigdo, but
> applied to git repositories instead of Debian .iso files.

Small? Well...

Let's see what such instructions for how to make the canonical pack
might look like:

First you need the full ordered list of objects. That's a 20-byte SHA1
per object. The current Linux repo has 1704556 objects, therefore this
list is 33MB already.

Then you need to identify which of those objects are deltas, and against
which object. Assuming we can index in the list of objects, that means,
say, one bit to identify a delta, and 31 bits for indexing the base. In
my case this is currently 1393087 deltas, meaning 5.3 MB of additional
information.

But then, the deltas themselves can have variations in their encoding.
And we did change the heuristics for the actual delta encoding in the
past too (while remaining backward compatible), but for a canonical pack
creation we'd need to describe that in order to make things totally
reproducible.

So there are 2 choices here: Either we specify the Git version to make
sure identical delta code is used, but that will put big pressure on
that code to remain stable and not improve anymore as any behavior
change will create a compatibility issue forcing people to upgrade their
Git version all at the same time. That's not something I want to see
the world rely upon.

The other choice is to actually provide the delta output as part of the
instruction for the canonical pack creation.

In my case, the delta output represents:

$ git verifi-pack -v .git/objects/pack/*.pack | \
awk --posix '/^[0-9a-f]{40}/ && $6 { tot += 1; size += $4 } \
END { print tot, size }'
1393087 155022247

We therefore have 148 MB of purely delta data here.

So that makes for a grand total of 33 MB + 148 MB = 181 MB of data just
to be able to unambiguously reproduce a pack with a full guarantee of
perfect reproducibility.

But even with the presumption of stable delta code, the recipee would
still take 38 MB that everyone would have to download every month which
is far more than what a monthly incremental update of a kernel repo
requires. Of course you could create a delta between consecutive
recipees, but that is becoming rather awkward.

I still think that if someone really want to apply the P2P principle à
la BitTorrent to Git, then it should be based on the distributed
exchange of _objects_ as I outlined in a previous email, and not file
chunks like BitTorrent does. The canonical Git _objects_ are fully
defined, while their actual encoding may change.


Nicolas
Luke Kenneth Casson Leighton
2010-09-03 21:11:22 UTC
Permalink
On Fri, Sep 3, 2010 at 8:41 PM, Nicolas Pitre <***@fluxnic.net> wrote:

> I still think that if someone really want to apply the P2P principle =
=C3=A0
> la BitTorrent to Git, then it should be based on the distributed
> exchange of _objects_ as I outlined in a previous email, and not file
> chunks like BitTorrent does. =C2=A0The canonical Git _objects_ are fu=
lly
> defined, while their actual encoding may change.

ok - missed it. let's go back... ah _ha_ - with this:

"Yep. Instead of transferring packs, a BitTorrent-alike transfer shoul=
d
be based on the transfer of _objects_. Therefore you can make the
correspondance between file chunks in BitTorrent with objects in a Git
aware system. So, when contacting a peer, you could negociate what is
the set of objects that the peer has that you don't, and vice versa.
Objects in Git are stable and immutable, and they all have a unique SHA=
1
signature. And to optimize the negociation, the pack index content can
be used, first by exchanging the content of the first level
fan-out table and ignoring those entries that are equal. This for each
peer."

ok, so, great! it does actually seem that, despite us using different
terminologies, we're thinking along the same sort of lines. i'm
marginally hampered by being unfamiliar with git, for which i
apologise.

so, when i mentioned extracting the objects from the index file of
"git pack-object", i was debating whether to then use that to
re-create the pack object (in some nebulous way) - that's sort-of the
same thing. i was also debating whether to mention the idea of using
git pack-object to extract one and only one object ( there is likely a
more efficient way of doing that ). but, yes: i was thinking of
making the vfs-layer expose individual objects, i just hadn't
mentioned it yet (and missed your earlier reply, nicolas, for which i
apologise).

btw the idea of parsing the fan-out table would not have occurred to
me in a miiiilllion years :)

i'll take a look at that. but whilst i'm doing that, the main
question i really need to know is: how do you get one single explicit
object out of git?

tia,

l.
Nguyen Thai Ngoc Duy
2010-09-04 00:24:24 UTC
Permalink
On Sat, Sep 4, 2010 at 7:11 AM, Luke Kenneth Casson Leighton
<***@gmail.com> wrote:
> =C2=A0i'll take a look at that. =C2=A0but whilst i'm doing that, the =
main
> question i really need to know is: how do you get one single explicit
> object out of git?

git cat-file <type> <sha-1>

However if you are going to send objects, one by one, it is extremely
inefficient. I think Nico has pointed that out. Individual object
sending should only be done for large blobs.
--=20
Duy
Nguyen Thai Ngoc Duy
2010-09-04 00:57:24 UTC
Permalink
On Sat, Sep 4, 2010 at 10:24 AM, Nguyen Thai Ngoc Duy <***@gmail.co=
m> wrote:
> On Sat, Sep 4, 2010 at 7:11 AM, Luke Kenneth Casson Leighton
> <***@gmail.com> wrote:
>> =C2=A0i'll take a look at that. =C2=A0but whilst i'm doing that, the=
main
>> question i really need to know is: how do you get one single explici=
t
>> object out of git?
>
> git cat-file <type> <sha-1>
>
> However if you are going to send objects, one by one, it is extremely
> inefficient. I think Nico has pointed that out. Individual object
> sending should only be done for large blobs.

On second thought, I don't know, maybe it could work if this sort of
individual object sending is based on bup. Bup splits files into small
pieces. Common parts of files are likely shared, reducing the need to
delta. But then you would deal with a huge number of small blobs.
--=20
Duy
Artur Skawina
2010-09-04 01:52:59 UTC
Permalink
Hmm, taking a few steps back, what is the expected usage of git-p2p?
Note it's a bit of a trick question; what i'm really asking is what _else_,
other than pulling/tracking Linus' kernel tree will/can be done with it?

Because once you accept that all peers are equal, but some peers are more
equal than others, deriving a canonical representation of the object store
becomes relatively simple. Then, it's just a question of fetching the missing
bits, whether using a dumb (rsync-like) transport, or a git-aware protocol.
(I've no idea why you'd want to base a transfer protocol on the unstable packs,
building it on top of objects seems to be the only sane choice)

I'm mostly git-ignorant and i'm assuming the following two things -- if someone
more familiar w/ git internals could confirm/deny, that would be great:

1) "git pull git:..." would (or could be made to) work w/ a client that asks for
"A..E", but also tells the server to omit "B,C and D" from the wire traffic.

2) Git doesn't use chained deltas. IOW given commits "A --d1-> B --d2-> C",
"C" can be represented as a delta against "A" or "B", but _not_ against "d1".
(Think of the case where "C" reverts /part of/ "B")


Then there are security implications... Which pretty much mandate having "special"
peers anyway, at least for transferring heads (branches/tags etc). Which means
the second paragraph above applies. And as the "special peer" in practice can be
just a signed tag/commit, like "v2.6.35", it's not such a big limitation like it
may seem at first...

artur
Nicolas Pitre
2010-09-04 04:39:56 UTC
Permalink
On Sat, 4 Sep 2010, Artur Skawina wrote:

> Hmm, taking a few steps back, what is the expected usage of git-p2p?
> Note it's a bit of a trick question; what i'm really asking is what _else_,
> other than pulling/tracking Linus' kernel tree will/can be done with it?

Dunno.

> Because once you accept that all peers are equal, but some peers are more
> equal than others, deriving a canonical representation of the object store
> becomes relatively simple.

That depends what you consider a canonical representation. I don't
think the actual object store should ever be "canonicalized".

> Then, it's just a question of fetching the missing
> bits, whether using a dumb (rsync-like) transport, or a git-aware protocol.

But Git does that already.

> (I've no idea why you'd want to base a transfer protocol on the unstable packs,
> building it on top of objects seems to be the only sane choice)

There seems to be quite some confusion around objects and packs.

The Git "database" is _only_ a big pile of objects that is content
addressable i.e. each object has a name which is derived from its
content. This is the 40 hexadecimal string.

There are only 4 types of objects. Roughly they are:

1) A "blob" object contains plain data, usually used for file content.

2) A "tree" object contains a list of entries made of a file or
directory name, and the object name that corresponds to it. For
files, the referenced objects are "blobs". For directories, the
referenced objects are some other "trees". This is how the file and
directory hierarchy are represented.

3) A "commit" object contains a reference to the top tree object
corresponding to the root directory of the project, a reference to
the previous "commit" object, and a text message to describe this
commit. If this commit represents a merge, then there will be more
than one reference to previous commits. This is how the commit
history is represented.

4) And finally a "tag" object contains a reference to any other object
and a text message. Most of the time, only commit objects are
referenced that way. This is used to identify some particular
commits.

And finally, there are a few files, one for each "branch", used to
contain a reference to the latest commit object for each of those
branches.

That's it! Here you have the *whole* architecture of Git!

Now... one way to store those objects on disk is to simply deflate them
with zlib and put the result in a file, one file per object. The first
2 chars from the object name are used to create ssubdirectories under
.git/objects/ and the remaining 38 chars are used for the actual file
name within those subdirectories. This is the "loose" object format or
encoding.

Another way to store those objects is to cram them together in one or
multiple (or many) pack files. The advantage with the pack file is that
we can encode any object as a delta against any other object in the same
pack file. This is the "packed" object format or encoding.

> I'm mostly git-ignorant and i'm assuming the following two things -- if someone
> more familiar w/ git internals could confirm/deny, that would be great:
>
> 1) "git pull git:..." would (or could be made to) work w/ a client that asks for
> "A..E", but also tells the server to omit "B,C and D" from the wire traffic.

What Git does when transferring data on the wire is actually to create a
special pack file that contains _only_ those objects that the sender has
but that the receiver doesn't, and stream that over the net. So if the
client tells the server that it already has commit A, then the server
will create a pack that contains only those objects that were created
after commit A, and omit all the objects that can be reached through
commit A that are also used by later commits (think unchanged files).
If you also have commits B, C and D, then the server will also exclude
all the objects that are reachable through those commits from that
special pack.

On the receiving end, Git simply writes the received pack into a file
along with the other existing packs, and compute a pack index for it.

> 2) Git doesn't use chained deltas. IOW given commits "A --d1-> B --d2-> C",
> "C" can be represented as a delta against "A" or "B", but _not_ against "d1".
> (Think of the case where "C" reverts /part of/ "B")

Git does use chained deltas indeed. But deltas are used only at the
object level within a pack file. Any blob object can be represented as
a delta against any other blob in the pack, regardless of the commit(s)
those blob objects belong to. Same thing for tree objects. So you can
have deltas going in total random directions if you look them from a
commit perspective. So "C" can have some of its objects being deltas
against objects from "B", or "A", or any other commit for that matter,
or even objects belonging to the same commit "C". And some other objects
from "B" can delta against objects from "C" too. There is simply no
restrictions at all on the actual delta direction. The only rule is
that an object may only delta against another object of the same type.

Of course we don't try to delta each object against all the other
available objects as that would be a O(n^2) operation (imagine with n =
1.7 million objects). So we use many heuristics to make this delta
packing efficient without taking an infinite amount of time.

For example, if we have objects X and Y that need to be packed together
and sent to a client over the net, and we find that Y is already a delta
against X in one pack that exists locally, then we simply and literally
copy the delta representation of Y from that local pack file and send it
out without recomputing that delta.

> Then there are security implications... Which pretty much mandate having "special"
> peers anyway, at least for transferring heads (branches/tags etc). Which means
> the second paragraph above applies.

Well... Actually, all you need is only one trusted peer to provide those
heads i.e. the top commit SHA1 name for each branches you need. From
that one SHA1 name per branch, you can validate the entire repository as
every object reference throughout is based on the content of the object
it refers to. For example, to validate the authenticity of everything
from a random copy of the Linux kernel repository, I need only 20 bytes
from a trusted source. No need to have this information distributed
amongst multiple peers.

And even if the delta encoding is different from the one used in Linus'
repository, or even if the packing is done differently (different number
of packs, etc.) then the final SHA1 will always be the same. This is
because the actual content from all referenced objects is the same
regardless of their effective encoding or format.


Nicolas
Artur Skawina
2010-09-04 05:42:12 UTC
Permalink
>> 2) Git doesn't use chained deltas. IOW given commits "A --d1-> B --d2-> C",
>> "C" can be represented as a delta against "A" or "B", but _not_ against "d1".
>> (Think of the case where "C" reverts /part of/ "B")
>
> Git does use chained deltas indeed. But deltas are used only at the
> object level within a pack file. Any blob object can be represented as
> a delta against any other blob in the pack, regardless of the commit(s)
> those blob objects belong to. Same thing for tree objects. So you can
> have deltas going in total random directions if you look them from a
> commit perspective. So "C" can have some of its objects being deltas
> against objects from "B", or "A", or any other commit for that matter,
> or even objects belonging to the same commit "C". And some other objects
> from "B" can delta against objects from "C" too. There is simply no
> restrictions at all on the actual delta direction. The only rule is
> that an object may only delta against another object of the same type.
>
> Of course we don't try to delta each object against all the other
> available objects as that would be a O(n^2) operation (imagine with n =
> 1.7 million objects). So we use many heuristics to make this delta
> packing efficient without taking an infinite amount of time.
>
> For example, if we have objects X and Y that need to be packed together
> and sent to a client over the net, and we find that Y is already a delta
> against X in one pack that exists locally, then we simply and literally
> copy the delta representation of Y from that local pack file and send it
> out without recomputing that delta.

What i meant by 'chained deltas' is a representation that takes delta#1 and
applies delta#2 to the first delta, and applies the result to the source of
delta#1. Which could be a more compact representation of eg. a partial revert.

IOW, if I have commits A..Y, ask (via git pull) for commits X and Z, then I'm
guaranteed to receive them either raw, or as a delta vs commits A..X, right?
What I'm really asking is, if a (modified) git-upload-pack skips transferring
commit X, and just sends me commit Z (possibly as delta vs 'X'), _and_ I
obtain commit 'X" in some other way, I will be able to reconstruct 'Z', correct?

TIA,

artur
Nicolas Pitre
2010-09-04 06:13:23 UTC
Permalink
On Sat, 4 Sep 2010, Artur Skawina wrote:

> >> 2) Git doesn't use chained deltas. IOW given commits "A --d1-> B --d2-> C",
> >> "C" can be represented as a delta against "A" or "B", but _not_ against "d1".
> >> (Think of the case where "C" reverts /part of/ "B")
> >
> > Git does use chained deltas indeed. But deltas are used only at the
> > object level within a pack file. Any blob object can be represented as
> > a delta against any other blob in the pack, regardless of the commit(s)
> > those blob objects belong to. Same thing for tree objects. So you can
> > have deltas going in total random directions if you look them from a
> > commit perspective. So "C" can have some of its objects being deltas
> > against objects from "B", or "A", or any other commit for that matter,
> > or even objects belonging to the same commit "C". And some other objects
> > from "B" can delta against objects from "C" too. There is simply no
> > restrictions at all on the actual delta direction. The only rule is
> > that an object may only delta against another object of the same type.
> >
> > Of course we don't try to delta each object against all the other
> > available objects as that would be a O(n^2) operation (imagine with n =
> > 1.7 million objects). So we use many heuristics to make this delta
> > packing efficient without taking an infinite amount of time.
> >
> > For example, if we have objects X and Y that need to be packed together
> > and sent to a client over the net, and we find that Y is already a delta
> > against X in one pack that exists locally, then we simply and literally
> > copy the delta representation of Y from that local pack file and send it
> > out without recomputing that delta.
>
> What i meant by 'chained deltas' is a representation that takes delta#1 and
> applies delta#2 to the first delta, and applies the result to the source of
> delta#1. Which could be a more compact representation of eg. a partial revert.

You can have deltas on top of deltas. And they can be in any direction
i.e. object #1 can be a delta that only takes half of a bigger object
#2, or object #2 can be a delta copying a smaller object #1 and adding
more data to it. In the end both representation will take more or less
the same amount of space.

> IOW, if I have commits A..Y, ask (via git pull) for commits X and Z, then I'm
> guaranteed to receive them either raw, or as a delta vs commits A..X, right?

In such case you will receive only those new objects that commits X and
Z introduced, and those new objects may indeed be encoded either as
whole objects, or as deltas against either objects that you already
have, or even as deltas against objects that are part of the transfer.

> What I'm really asking is, if a (modified) git-upload-pack skips transferring
> commit X, and just sends me commit Z (possibly as delta vs 'X'), _and_ I
> obtain commit 'X" in some other way, I will be able to reconstruct 'Z', correct?

Yes. Although it is 'git pack-objects' that decides what objects to
send, not 'git-upload-pack'.


Nicolas
Luke Kenneth Casson Leighton
2010-09-04 11:58:31 UTC
Permalink
ok, there's one other question i need some info on (thank you to
nguyen for answering about git cat-file): is there a way to make
git-pack-objects _just_ give the index file, rather than go to all the
trouble of creating the pack file as well?

the reason i ask is because i would like to get the index file, parse
it for objects (as nicolas recommends) then present the contents of
"git cat-file" as files within subdirectories, via this
pseudo-VFS-layer over bittorrent.

tia,

l.
Luke Kenneth Casson Leighton
2010-09-04 13:14:36 UTC
Permalink
On Sat, Sep 4, 2010 at 12:58 PM, Luke Kenneth Casson Leighton
<***@gmail.com> wrote:
> ok, there's one other question i need some info on (thank you to
> nguyen for answering about git cat-file): is there a way to make
> git-pack-objects _just_ give the index file, rather than go to all the
> trouble of creating the pack file as well?
>
> the reason i ask is because i would like to get the index file, parse
> it for objects (as nicolas recommends) then present the contents of
> "git cat-file" as files within subdirectories, via this
> pseudo-VFS-layer over bittorrent.

pack-objects.c - write_pack_file():
...
if (pack_to_stdout) {
sha1close(f, sha1, CSUM_CLOSE);
} else if (nr_written == nr_remaining) {
sha1close(f, sha1, CSUM_FSYNC);
} else {
int fd = sha1close(f, sha1, 0);
fixup_pack_header_footer(fd, sha1, pack_tmp_name,
nr_written, sha1, offset);
close(fd);
}

if (!pack_to_stdout) {
struct stat st;
const char *idx_tmp_name;
char tmpname[PATH_MAX];

idx_tmp_name = write_idx_file(NULL, written_list,
nr_written, sha1);

....

ahh, whoops. any advances on that?

grep write_idx_file *.c */*.c

* git-index-pack requires a pack file in order to re-create the index:
i don't want that
* git-pack-objects appears to have no way of telling it "just gimme
index file please"
* fast-import.c appears not to be what's needed either.

so - any other methods for just getting the index file (exclusively?)
any other commands i've missed? if not, are there any other ways of
getting a pack's index of objects without err... getting the index
file? (i believe the answer to be no, but i'm just making sure) and
on that basis i believe it is safe to ask: any objections to a patch
which adds "--index-only" to builtin/pack-objects.c?

l.
Nicolas Pitre
2010-09-05 02:16:02 UTC
Permalink
On Sat, 4 Sep 2010, Luke Kenneth Casson Leighton wrote:

> * git-index-pack requires a pack file in order to re-create the index:
> i don't want that
> * git-pack-objects appears to have no way of telling it "just gimme
> index file please"
> * fast-import.c appears not to be what's needed either.
>
> so - any other methods for just getting the index file (exclusively?)
> any other commands i've missed? if not, are there any other ways of
> getting a pack's index of objects without err... getting the index
> file? (i believe the answer to be no, but i'm just making sure) and
> on that basis i believe it is safe to ask: any objections to a patch
> which adds "--index-only" to builtin/pack-objects.c?

No patch is needed.

First, what you want is an index of objects you are willing to share,
and not the index of whatever pack file you might have on your disk,
especially if you have multiple packs which is typical.

Try this instead:

git rev-list --objects HEAD | cut -c -40 | sort

That will give you a sorted list of all objects reachable from the
current branch. With the Linux repo, you may replace "HEAD" with
"v2.6.34..v2.6.35" if you wish, and that would give you the list of the
new objects that were introduced between v2.6.34 and v2.6.35. This will
provide you with 84642 objects instead of the 1.7 million objects that
the Linux repo contains (easier when testing stuff).

That sorted list of objects is more or less what the pack index file
contains, plus an offset in the pack for each entry. It is used to
quickly find the offset for a given object in the corresponding pack
file, and the fanout is only a way to cut 3 iterations in the binary
search.

But anyway, what you want is really to select the precise set of objects
you wish to share, and not blindly using the pack index file. If you
have a public branch and a private branch in your repository, then
objects from both branches may end up in the same pack and you probably
don't want to publish those objects from the private branch. The only
reliable way to generate a list of object is to use the output from 'git
rev-list'. Those objects may come from one or multiple packs, or be
loose in the object subdirectories, or even borrowed from another
repository through the alternates mechanism. But rev-list will dig
those object SHA1s for you and only those you asked for.

You should look at the Git documentation for plumbing commands. The
plumbing is actually a toolset that allows you to manipulate and extract
information from a Git repository. This is really handy for prototyping
new functionalities. Initially, the Git user interface was all
implemented in shell scripts on top of that plumbing.

Back to that rev-list output... OK, you want the equivalent of a fanout
table. You may do something like this then:

git rev-list --objects v2.6.34..v2.6.35 | cut -c -2 | sort | uniq -c

And so on.


Nicolas
Luke Kenneth Casson Leighton
2010-09-05 18:05:57 UTC
Permalink
On Sun, Sep 5, 2010 at 3:16 AM, Nicolas Pitre <***@fluxnic.net> wrote:
> On Sat, 4 Sep 2010, Luke Kenneth Casson Leighton wrote:
>
>> * git-index-pack requires a pack file in order to re-create the inde=
x:
>> i don't want that
>> * git-pack-objects appears to have no way of telling it "just gimme
>> index file please"
>> * fast-import.c appears not to be what's needed either.
>>
>> so - any other methods for just getting the index file (exclusively?=
)
>> any other commands i've missed? =C2=A0if not, are there any other wa=
ys of
>> getting a pack's index of objects without err... getting the index
>> file? =C2=A0(i believe the answer to be no, but i'm just making sure=
) and
>> on that basis i believe it is safe to ask: any objections to a patch
>> which adds "--index-only" to builtin/pack-objects.c?
>
> No patch is needed.
>
> First, what you want is an index of objects you are willing to share,
> and not the index of whatever pack file you might have on your disk,
> especially if you have multiple packs which is typical.

blast. so *sigh* ignoring the benefits that can be obtained by the
delta-compression thing, somewhat; ignoring the fact that perhaps less
traffic miight be transferred by happening to borrow objects from
another branch (which is the situation that, i believe, happens with
"git pull" over http:// or git://); ignoring the fact that i actually
implemented using the .idx file yesterday ... :)

... there is a bit of a disadvantage to using pack index files that
it goes all the way down (if i am reading things correctly) and cannot
be told "give me just the objects related to a particular commit"....


> Try this instead:
>
> =C2=A0 =C2=A0git rev-list --objects HEAD | cut -c -40 | sort
>
> That will give you a sorted list of all objects reachable from the
> current branch. =C2=A0With the Linux repo, you may replace "HEAD" wit=
h
> "v2.6.34..v2.6.35" if you wish, and that would give you the list of t=
he
> new objects that were introduced between v2.6.34 and v2.6.35.

... unlike this, which is in fact much more along the lines of what i
was looking for (minus the loveliness of the delta compression oh
well)

>=C2=A0This will
> provide you with 84642 objects instead of the 1.7 million objects tha=
t
> the Linux repo contains (easier when testing stuff).

hurrah! :) [but, then if you actually want to go back and get alll
commits, that's ... well, we'll not worry about that too much, given
the benefits of being able to get smaller chunks.]

> That sorted list of objects is more or less what the pack index file
> contains, plus an offset in the pack for each entry. =C2=A0It is used=
to
> quickly find the offset for a given object in the corresponding pack
> file, and the fanout is only a way to cut 3 iterations in the binary
> search.
>
> But anyway, what you want is really to select the precise set of obje=
cts
> you wish to share, and not blindly using the pack index file. =C2=A0I=
f you
> have a public branch and a private branch in your repository, then
> objects from both branches may end up in the same pack

slightly confused: are you of the belief that i intend to ignore
refs/branches/* starting points?

> and you probably
> don't want to publish those objects from the private branch.

ahh, i wondered where i'd seen the bit about "confusing" two
branches, i thought it was in another message. so many flying back &
forth :) from what i can gather, this is exactly what happens with
git fetch from http:// or git:// so what's the big deal about that?
why stop gitp2p from benefitting from the extra compression that could
result from "borrowing" bits of another branch's objects, neh?

or .. have i misunderstood?

> The only
> reliable way to generate a list of object is to use the output from '=
git
> rev-list'. =C2=A0Those objects may come from one or multiple packs, o=
r be
> loose in the object subdirectories, or even borrowed from another
> repository through the alternates mechanism. =C2=A0But rev-list will =
dig
> those object SHA1s for you and only those you asked for.

excellent. that's proobably what i need right now.

> You should look at the Git documentation for plumbing commands. =C2=A0=
The
> plumbing is actually a toolset that allows you to manipulate and extr=
act
> information from a Git repository. =C2=A0This is really handy for pro=
totyping
> new functionalities. Initially, the Git user interface was all
> implemented in shell scripts on top of that plumbing.

i'm using gitdb (ok don't need that any more, if i don't walk the
pack-index file *sigh*) and python-git - am quite happy with the speed
at which i can knock stuff together, using it. the only tricky wobbly
moment i had was not being able to pass in a file-handle to stdin (git
pack-objects) and i got round that with "input =3D os.tmpfile();
input.write(objref+"\n"); input.seek(0)".

> Back to that rev-list output... OK, you want the equivalent of a fano=
ut
> table. =C2=A0You may do something like this then:
>
> =C2=A0 =C2=A0git rev-list --objects v2.6.34..v2.6.35 | cut -c -2 | so=
rt | uniq -c

ack. got it.

thanks nicolas.

l.
Nicolas Pitre
2010-09-05 23:52:23 UTC
Permalink
On Sun, 5 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Sun, Sep 5, 2010 at 3:16 AM, Nicolas Pitre <***@fluxnic.net> wrote:
> > On Sat, 4 Sep 2010, Luke Kenneth Casson Leighton wrote:
> >
> >> * git-index-pack requires a pack file in order to re-create the index:
> >> i don't want that
> >> * git-pack-objects appears to have no way of telling it "just gimme
> >> index file please"
> >> * fast-import.c appears not to be what's needed either.
> >>
> >> so - any other methods for just getting the index file (exclusively?)
> >> any other commands i've missed?  if not, are there any other ways of
> >> getting a pack's index of objects without err... getting the index
> >> file?  (i believe the answer to be no, but i'm just making sure) and
> >> on that basis i believe it is safe to ask: any objections to a patch
> >> which adds "--index-only" to builtin/pack-objects.c?
> >
> > No patch is needed.
> >
> > First, what you want is an index of objects you are willing to share,
> > and not the index of whatever pack file you might have on your disk,
> > especially if you have multiple packs which is typical.
>
> blast. so *sigh* ignoring the benefits that can be obtained by the
> delta-compression thing, somewhat; ignoring the fact that perhaps less
> traffic miight be transferred by happening to borrow objects from
> another branch (which is the situation that, i believe, happens with
> "git pull" over http:// or git://); ignoring the fact that i actually
> implemented using the .idx file yesterday ... :)

Please, let's get it slow.

There are 2 concepts you really need to master in order to come up with
a solution. And those concepts are completely independent from
each other, but at the moment you are blending them up together and
that's not good.

The first one is all about object enumeration. And object enumeration
is all about 'git rev-list'. This is important when offering objects to
the outside world that you actually do offer _all_ the needed objects,
but _only_ the needed objects. If some objects are missing you get a
broken repository. But more objects can also be a security problem as
those extra objects may contain confidential data that you never
intended to publish.

And object enumeration has absolutely nothing to do with packs, nor .idx
files for that matter. As I said, the objects you want might be split
across multiple packs, and also in loose form, and also in some
alternate location that is shared amongst many repositories on the same
filesystem. But a single pack may also contain more than what you want
to offer, and it is extremely important that you do _not_ offer those
objects that are not reachable from the branch you want to publish.

Following me so far?

The second concept is all about object _representation_ or _encoding_.
That's where the deltas come into play. So the idea is to grab the list
of objects you want to publish, and then look into existing packs to see
if you could find them in delta form. So, for each object, if you do
find them in delta form, and the objec the delta is made against is 1)
also part of the list of objects you want to send, or 2) is already
available at the remote end, then you may simply reuse that delta data
as is from the pack. Finding if a particular pack has the wanted object
is easy: you just need to look it up in the .idx file. Then, in the
corresponding pack file you parse the object header to find out if it is
a delta, and what its base object is.

> ... there is a bit of a disadvantage to using pack index files that
> it goes all the way down (if i am reading things correctly) and cannot
> be told "give me just the objects related to a particular commit"....

Exact. The .idx file gives you a list of objects that exists in the
corresponding pack. That list of object might belong to a totally
random number of random commits. You may also have a random number of
packs across which some or all objects are distributed. Because, of
course, not all the objects you need are always packed.

So... I hope you understand now that there is no relation between
commits and .idx files. The only exception is when you do create a
custom pack with 'git pack-objects'.

> > Try this instead:
> >
> >    git rev-list --objects HEAD | cut -c -40 | sort
> >
> > That will give you a sorted list of all objects reachable from the
> > current branch.  With the Linux repo, you may replace "HEAD" with
> > "v2.6.34..v2.6.35" if you wish, and that would give you the list of the
> > new objects that were introduced between v2.6.34 and v2.6.35.
>
> ... unlike this, which is in fact much more along the lines of what i
> was looking for (minus the loveliness of the delta compression oh
> well)

Again, delta compression is a _separate_ issue.

> > This will
> > provide you with 84642 objects instead of the 1.7 million objects that
> > the Linux repo contains (easier when testing stuff).
>
> hurrah! :) [but, then if you actually want to go back and get alll
> commits, that's ... well, we'll not worry about that too much, given
> the benefits of being able to get smaller chunks.]

If you want all commits then you just need --all instead of HEAD.

> > That sorted list of objects is more or less what the pack index file
> > contains, plus an offset in the pack for each entry.  It is used to
> > quickly find the offset for a given object in the corresponding pack
> > file, and the fanout is only a way to cut 3 iterations in the binary
> > search.
> >
> > But anyway, what you want is really to select the precise set of objects
> > you wish to share, and not blindly using the pack index file.  If you
> > have a public branch and a private branch in your repository, then
> > objects from both branches may end up in the same pack
>
> slightly confused: are you of the belief that i intend to ignore
> refs/branches/* starting points?

I don't know what your exact understanding of Git is, and although I
know one or two things about the Git storage model, I get confused
myself by some of your comments, such as this one above.

> > and you probably
> > don't want to publish those objects from the private branch.
>
> ahh, i wondered where i'd seen the bit about "confusing" two
> branches, i thought it was in another message. so many flying back &
> forth :) from what i can gather, this is exactly what happens with
> git fetch from http:// or git:// so what's the big deal about that?
> why stop gitp2p from benefitting from the extra compression that could
> result from "borrowing" bits of another branch's objects, neh?

No. git:// will _never_ ever transfer any object that is not part of
the published branch(es). If an object that does get transmitted is
actually a delta against an object that is only part of a branch that is
not published, then the delta will be expanded and redone against
another suitable object before transmission.


Nicolas
Luke Kenneth Casson Leighton
2010-09-06 13:23:48 UTC
Permalink
On Mon, Sep 6, 2010 at 12:52 AM, Nicolas Pitre <***@fluxnic.net> wrote=
:

>> another branch (which is the situation that, i believe, happens with
>> "git pull" over http:// or git://); ignoring the fact that i actuall=
y
>> implemented using the .idx file yesterday ... :)
>
> Please, let's get it slow.

ack :)

> There are 2 concepts you really need to master in order to come up wi=
th
> a solution. =C2=A0And those concepts are completely independent from
> each other, but at the moment you are blending them up together and
> that's not good.

i kinda get it - but i realise that's not good enough: i need to be
able to _say_ i get it, in a way that satisfies you.

> The first one is all about object enumeration. =C2=A0And object enume=
ration
> is all about 'git rev-list'. =C2=A0This is important when offering ob=
jects to
> the outside world that you actually do offer _all_ the needed objects=
,
> but _only_ the needed objects. =C2=A0If some objects are missing you =
get a
> broken repository. =C2=A0But more objects can also be a security prob=
lem as
> those extra objects may contain confidential data that you never
> intended to publish.

ack.

> And object enumeration has absolutely nothing to do with packs, nor .=
idx
> files for that matter.

mmm packs not being to do with object enumeration i get. i
understand that .idx files contain "lists of objects" which isn't the
same thing (and also happen to contain pointers/offsets to the objects
of its associated .pack)

at some point i'd really like to know what the object list is (not
the objects themselves) that comes out of "git pack-objects --thin"
but my curiosity can wait.

>=C2=A0As I said, the objects you want might be split
> across multiple packs, and also in loose form, and also in some
> alternate location that is shared amongst many repositories on the sa=
me
> filesystem.

ok - this tells me (and it's confirmed, below) that you're describing
the situation based on what can be found in .git - _not_ what comes
out of "git pack-objects". i wouldn't _dream_ of digging around in a
=2Egit/ location looking for packs or idx files, but because i have
mentioned them _without_ prefixing every mention with "the
custom-generated .idx and/or .pack as generated by git pack-objects",
you may have got the wrong impression, for which i apologise.

> =C2=A0But a single pack may also contain more than what you want
> to offer, and it is extremely important that you do _not_ offer those
> objects that are not reachable from the branch you want to publish.
>
> Following me so far?

yep :)

> The second concept is all about object _representation_ or _encoding_=
=2E
> That's where the deltas come into play. =C2=A0So the idea is to grab =
the list
> of objects you want to publish, and then look into existing packs to =
see
> if you could find them in delta form. =C2=A0So, for each object, if y=
ou do
> find them in delta form, and the objec the delta is made against is 1=
)
> also part of the list of objects you want to send, or 2) is already
> available at the remote end, then you may simply reuse that delta dat=
a
> as is from the pack. =C2=A0Finding if a particular pack has the wante=
d object
> is easy: you just need to look it up in the .idx file. =C2=A0Then, in=
the
> corresponding pack file you parse the object header to find out if it=
is
> a delta, and what its base object is.

ok. all of this makes sense - but it's enough for me to be able to
ask questions, rather than "do", if you know what i mean.

>> =C2=A0... there is a bit of a disadvantage to using pack index files=
that
>> it goes all the way down (if i am reading things correctly) and cann=
ot
>> be told "give me just the objects related to a particular commit"...=
=2E
>
> Exact. =C2=A0The .idx file gives you a list of objects that exists in=
the
> corresponding pack. =C2=A0That list of object might belong to a total=
ly
> random number of random commits. =C2=A0You may also have a random num=
ber of
> packs across which some or all objects are distributed. =C2=A0Because=
, of
> course, not all the objects you need are always packed.
>
> So... I hope you understand now that there is no relation between
> commits and .idx files. =C2=A0The only exception is when you do creat=
e a
> custom pack with 'git pack-objects'.

yes. ahh... that's what i've been doing: using "git pack-objects
--thin". and the reason for that is because i've seen it used in the
http implementation of "git fetch".

so, my questions up until now regarding .pack and .idx have all been
targetted at that, and based on that context, _not_ the packs+idx
files that are in .git/

>> > Try this instead:
>> >
>> > =C2=A0 =C2=A0git rev-list --objects HEAD | cut -c -40 | sort
>> >
>> > That will give you a sorted list of all objects reachable from the
>> > current branch. =C2=A0With the Linux repo, you may replace "HEAD" =
with
>> > "v2.6.34..v2.6.35" if you wish, and that would give you the list o=
f the
>> > new objects that were introduced between v2.6.34 and v2.6.35.
>>
>> =C2=A0... unlike this, which is in fact much more along the lines of=
what i
>> was looking for (minus the loveliness of the delta compression oh
>> well)
>
> Again, delta compression is a _separate_ issue.
>
>> >=C2=A0This will
>> > provide you with 84642 objects instead of the 1.7 million objects =
that
>> > the Linux repo contains (easier when testing stuff).
>>
>> =C2=A0hurrah! :) =C2=A0[but, then if you actually want to go back an=
d get alll
>> commits, that's ... well, we'll not worry about that too much, given
>> the benefits of being able to get smaller chunks.]
>
> If you want all commits then you just need --all instead of HEAD.

no, i want commits separated and individual and "compoundable". the p=
lan is:

* to get the ref associated with refs/heads/master
* to get the list of all commits associated with that master ref
* to work out how far local deviates from remote along that list of com=
mits
* to get the objects which will make up the missing commits (if they
aren't already in the local store)
* to apply those commits in the correct order

in other words, the plan is to follow what git http fetch and/org git
git:// fetch does as much as possible (ok, perhaps not).

the reason for getting the objects individually (blobs etc.) should be
clear: prior commits _could_ have resulted in that exact object having
been obtained already.

so far i have implemented:

* get the master ref using git for-each-ref
* get the list of all commits using git rev-list
* enumerate the list of objects associated with an individual commit by=
:
i) creating a CUSTOM pack+idx using git pack-objects {ref}
ii) *parsing* the idx file using gitdb's FileIndex to get the list
of objects
iii) transferring that list to the local machine
* requesting *individual* objects from the enumerated list out of the i=
dx file
by using a CUSTOM "git pack-objects --thin {ref} < {ref}" command

that's as far as i've got, before you mentioned that it would be
better to use "git rev-list --objects commit1..commit2" and to use
"git cat-file" to obtain the actual object [what's not clear in this
plan is how to store that cat'ed file at the local end, hence the
continued use of git pack-objects --thin {ref} < {ref}]

the prior implementation was to treat the custom pack-object as if it
was "the atomic leaf-node operation" instead of individual objects
(blobs, trees).

>> > That sorted list of objects is more or less what the pack index fi=
le
>> > contains, plus an offset in the pack for each entry. =C2=A0It is u=
sed to
>> > quickly find the offset for a given object in the corresponding pa=
ck
>> > file, and the fanout is only a way to cut 3 iterations in the bina=
ry
>> > search.
>> >
>> > But anyway, what you want is really to select the precise set of o=
bjects
>> > you wish to share, and not blindly using the pack index file. =C2=A0=
If you
>> > have a public branch and a private branch in your repository, then
>> > objects from both branches may end up in the same pack
>>
>> =C2=A0slightly confused: are you of the belief that i intend to igno=
re
>> refs/branches/* starting points?
>
> I don't know what your exact understanding of Git is, and although I
> know one or two things about the Git storage model, I get confused
> myself by some of your comments, such as this one above.

soorree. i believe the source of the confusion is that you believed
that i intend to "blindly use a pack index file" as in "blindly go
rummaging around in .git/ at the remote end" when i have absolutely no
intention of doing so.

what i _have_ been doing however is custom-generating pack-objects
and associated pack-indexes (just like git http fetch) _including_
using the --thin option because that's what git http fetch does.

i believe that this results in the concerns that you raised (about
having access to unauthorised data) being dealt with.

>> > don't want to publish those objects from the private branch.
>>
>> =C2=A0ahh, i wondered where i'd seen the bit about "confusing" two
>> branches, i thought it was in another message. =C2=A0so many flying =
back &
>> forth :) =C2=A0from what i can gather, this is exactly what happens =
with
>> git fetch from http:// or git:// so what's the big deal about that?
>> why stop gitp2p from benefitting from the extra compression that cou=
ld
>> result from "borrowing" bits of another branch's objects, neh?
>
> No. =C2=A0git:// will _never_ ever transfer any object that is not pa=
rt of
> the published branch(es).

... because it uses, from what i can gather, git pack-objects --thin

>=C2=A0If an object that does get transmitted is
> actually a delta against an object that is only part of a branch that=
is
> not published, then the delta will be expanded and redone against
> another suitable object before transmission.

and that's handled by git pack-objects --thin (am i right?)

ok.

so. we have a hierarchical plan: get the commit list, get a
per-commit object-list, get the objects (if needed), store the
objects.

problem: despite looking through virtually every single builtin/*.c
file which uses write_sha1_file (which i believe i have correctly
identified, from examining git unpack-objects, as being the function
which stores actual objects, including their type), i do not see a git
command (yet) which performs the reverse operation of "git cat-file".

builtin/apply.c - that's for patches
builtin/checkout.c - that's for the merge result.
builtin/notes.c - creating a note
builtin/tag.c - creating a tag
builtin/mktree.c - creating a tree object but *only* from a text listin=
g

ok - maybe this is one of the ones that i need, but only if i use "git
cat-file -p" to pretty-print the output of tree objects but i don't
think that's a good idea.

what else...

cache_tree.c - nope.
commit.c - nope.
read-cache.c - beh? nope. has blank args "", 0

so... um... unless i actually manually create a pack object (perhaps
using python-gitdb to construct it) out of the data obtained by "git
cat-file" i don't see how this would work.

l.
Nicolas Pitre
2010-09-06 16:51:47 UTC
Permalink
On Mon, 6 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Mon, Sep 6, 2010 at 12:52 AM, Nicolas Pitre <***@fluxnic.net> wrote:
>
> > And object enumeration has absolutely nothing to do with packs, nor .idx
> > files for that matter.
>
> mmm packs not being to do with object enumeration i get. i
> understand that .idx files contain "lists of objects" which isn't the
> same thing (and also happen to contain pointers/offsets to the objects
> of its associated .pack)
>
> at some point i'd really like to know what the object list is (not
> the objects themselves) that comes out of "git pack-objects --thin"

You need to feed 'git pack-objects' a list of objects in the first
place for it to pack anything. So you must have that list even before
pack-objects can produce any output. And that list is usually generated
by 'git rev-list'. So, typically, you'd do:

git rev-list --objects <commit_range> | git pack-objects foo

But these days the ability to enumerate objects was integrated into
pack-objects directly, so you can do:

echo "<commit_range>" | git pack-objects --revs foo

But you should get the idea.

> > So... I hope you understand now that there is no relation between
> > commits and .idx files.  The only exception is when you do create a
> > custom pack with 'git pack-objects'.
>
> yes. ahh... that's what i've been doing: using "git pack-objects
> --thin". and the reason for that is because i've seen it used in the
> http implementation of "git fetch".

Well, the HTTP implementation is a rather tricky example as there are
actually two implementations: one that is dumb and only slurps packs and
loose objects out of a remote .git/ directory, and another that is smart
enough to carry the smarter Git protocol across HTTP requests.

When using the "smart" Git protocol, the client tells the server what it
already has, and then the server uses pack-objects to produce a pack
with only those objects that the client doesn't have, and stream that
pack directly without even storing it on disk. The server doesn't even
produce a .idx file in that case. It is up to the client to store the
pack on disk, and feed it through 'git index-pack' to construct a .idx
file for it locally.

> so, my questions up until now regarding .pack and .idx have all been
> targetted at that, and based on that context, _not_ the packs+idx
> files that are in .git/

Tell me if the above clears them up.

> > If you want all commits then you just need --all instead of HEAD.
>
> no, i want commits separated and individual and "compoundable". the plan is:
>
> * to get the ref associated with refs/heads/master

You can do:

git rev-parse refs/heads/master

> * to get the list of all commits associated with that master ref

Just use (without the --objects argument):

git rev-list refs/heads/master

> * to work out how far local deviates from remote along that list of commits

That's an operation that only the peer with the most recent commits can
do, unless you transfer that huge list of commits from above across the
network. So, on a server (i.e. the peer sending objects) you'd do:

git rev-list <refs_that_I_publish> --not <refs_that_the_remote_has>

> * to get the objects which will make up the missing commits (if they
> aren't already in the local store)

Again, that's a task for the peer with objects to offer. It just has to
use the above rev-list invocation and add the --objects argument to it
(or feed the equivalent ref specifications to pack-objects directly as
shown previously).

> * to apply those commits in the correct order

Why would you care about this? There is nothing to "apply" as all you
have to do is simply transfer objects.

> in other words, the plan is to follow what git http fetch and/org git
> git:// fetch does as much as possible (ok, perhaps not).

Well... I don't think it would be easy to do the same in a P2P context.
Those fetch operations are totally stream oriented between 2 peers, and
not many to many.

> the reason for getting the objects individually (blobs etc.) should be
> clear: prior commits _could_ have resulted in that exact object having
> been obtained already.

Sure. But objects known to exist on the remote side won't be listed by
rev-list.

> so far i have implemented:
>
> * get the master ref using git for-each-ref
> * get the list of all commits using git rev-list

So far so good.

> * enumerate the list of objects associated with an individual commit by:
> i) creating a CUSTOM pack+idx using git pack-objects {ref}
> ii) *parsing* the idx file using gitdb's FileIndex to get the list
> of objects

That's where you're going so much out of your way to give you trouble.
A simple rev-list would give you that list:

git rev-list --objects <this_commit> --not <this_commit''s_parents>

That's it.

> iii) transferring that list to the local machine
> * requesting *individual* objects from the enumerated list out of the idx file
> by using a CUSTOM "git pack-objects --thin {ref} < {ref}" command

That's where you'll have to get your hands real dirty and write actual
code to serve individual objects but not through cat-file. In a P2P
setup you'd want to transfer as little amount of data as possible,
meaning that you'd want to serve deltas as much as possible. It's then
a matter of finding if the requested object exists already in delta
form, if so then whether or not its base object is something that the
other end has in which case you send that as is, otherwise figuring out
if that would be worth creating a delta against another object known to
exist at the other end.

On the receiving end, you'd simply have to store those objects in
the .git/objects/ directories as loose objects after expanding the
deltas, or even stuff everything
into a pack and run 'git index-pack' on it when the transfer is
complete (and run 'git repack' to optimize the pack eventually).

Once the transfer is complete, you do a reachability and validity check
on the refs you are supposed to have received the objects for, and if
everything is OK then you update the refs and you're done.

> that's as far as i've got, before you mentioned that it would be
> better to use "git rev-list --objects commit1..commit2" and to use
> "git cat-file" to obtain the actual object [what's not clear in this
> plan is how to store that cat'ed file at the local end, hence the
> continued use of git pack-objects --thin {ref} < {ref}]
>
> the prior implementation was to treat the custom pack-object as if it
> was "the atomic leaf-node operation" instead of individual objects
> (blobs, trees).

Well, OK. But suppose that you have only 2 new commits with a big
amount of objects for each. Typically the very first commit of a
project corresponds to the import of that project into Git, and it is
equivalent to the whole work tree. Don't you want to spread the request
for those objects across as many peers as possible?

> what i _have_ been doing however is custom-generating pack-objects
> and associated pack-indexes (just like git http fetch) _including_
> using the --thin option because that's what git http fetch does.

Well, let's get back to that HTTP fetch which has a double personality.
The "smart" HTTP fetch doesn't involve any pack index at all. It ends
up streaming pack-objects stdout's output over the net and the pack
index is recreated on the other end.

However, what the _dumb_ HTTP fetch does (and that is the same idea for
the FTP fetch, or the rsync fetch) is to dig into the remote's .git
directory and grab those .idx files, look into them to see if the
corresponding .pack file actually contain the wanted objects, so to only
downloads the needed packs afterwards. And those dumb protocols are
what they are: dumb. They usually end up transferring way more data
than actually necessary.

> > If an object that does get transmitted is
> > actually a delta against an object that is only part of a branch that is
> > not published, then the delta will be expanded and redone against
> > another suitable object before transmission.
>
> and that's handled by git pack-objects --thin (am i right?)

Right. But the --thin flag here is unrelated to this.

What --thin does is to tell pack-objects that it can produce deltas
against objects that will _not_ be included in the produced pack. That
is OK only if the consumer of that pack is 1) aware of that fact and
2) is going to "fix" the pack by appending those objects to the pack
from a local copy. For a pack to be "valid" in your .git directory, it
has to be self contained with regards to deltas. It is not allowed to
have deltas across different packs as this makes the issue of delta
loops extremely difficult to deal with in the context of incremental
repacks.

> so. we have a hierarchical plan: get the commit list, get a
> per-commit object-list, get the objects (if needed), store the
> objects.
>
> problem: despite looking through virtually every single builtin/*.c
> file which uses write_sha1_file (which i believe i have correctly
> identified, from examining git unpack-objects, as being the function
> which stores actual objects, including their type), i do not see a git
> command (yet) which performs the reverse operation of "git cat-file".

It is 'git hash-object'.


Nicolas
Luke Kenneth Casson Leighton
2010-09-06 22:33:57 UTC
Permalink
nicolas, thank you very brief reply (busy for 2 days)

On Mon, Sep 6, 2010 at 5:51 PM, Nicolas Pitre <***@fluxnic.net> wrote:

>> * to work out how far local deviates from remote along that list of =
commits
>
> That's an operation that only the peer with the most recent commits c=
an
> do, unless you transfer that huge list of commits from above across t=
he
> network.

i Have A Plan for dealing with that.

> So, on a server (i.e. the peer sending objects) you'd do:
>
> =C2=A0 =C2=A0 =C2=A0 =C2=A0git rev-list <refs_that_I_publish> --not <=
refs_that_the_remote_has>

sadly that involves telling the sender what the recipient has.

>> =C2=A0problem: despite looking through virtually every single builti=
n/*.c
>> file which uses write_sha1_file (which i believe i have correctly
>> identified, from examining git unpack-objects, as being the function
>> which stores actual objects, including their type), i do not see a g=
it
>> command (yet) which performs the reverse operation of "git cat-file"=
=2E
>
> It is 'git hash-object'.

ah _haa_ - thank you! ok, so i have to create a pack-object-like
format, putting the object type at the beginning of the format, then
put the contents of "git cat-file" after it.

so - apologies, will be dealing with some work-related stuff for a
day or so. thank you for everything so far nicolas.

l.
Junio C Hamano
2010-09-06 23:34:00 UTC
Permalink
Nicolas Pitre <***@fluxnic.net> writes:

>> * enumerate the list of objects associated with an individual commit by:
>> i) creating a CUSTOM pack+idx using git pack-objects {ref}
>> ii) *parsing* the idx file using gitdb's FileIndex to get the list
>> of objects
>
> That's where you're going so much out of your way to give you trouble.
> A simple rev-list would give you that list:
>
> git rev-list --objects <this_commit> --not <this_commit''s_parents>
>
> That's it.

I didn't want to get into this discussion, but where in the above picture
does the usual "want/ack" exchange fit?

The biggest trouble before object transfer actually happens is that the
sending end needs to find a set of commits that are known to exist at the
receiving end, but it needs to do that starting from a state where the tip
commits the receiving end has are not known by it. That is why the
receiver must go back in his history and keep asking the sender "I have
this, this, this, this...; now have you heard enough?" until the sender
sees a commit that it knows about. After that happens, it can list them
on its "git rev-list --objects <my tips> --not <he has these>" command
line in order to enumerate objects it needs to send.

If the receiver is purely following the sender, not doing any work on its
own, the tips the receiver has may always be known by the sender, but that
is not an interesting case at all.
Nicolas Pitre
2010-09-06 23:57:43 UTC
Permalink
On Mon, 6 Sep 2010, Junio C Hamano wrote:

> Nicolas Pitre <***@fluxnic.net> writes:
>
> >> * enumerate the list of objects associated with an individual commit by:
> >> i) creating a CUSTOM pack+idx using git pack-objects {ref}
> >> ii) *parsing* the idx file using gitdb's FileIndex to get the list
> >> of objects
> >
> > That's where you're going so much out of your way to give you trouble.
> > A simple rev-list would give you that list:
> >
> > git rev-list --objects <this_commit> --not <this_commit''s_parents>
> >
> > That's it.
>
> I didn't want to get into this discussion, but where in the above picture
> does the usual "want/ack" exchange fit?

Before object enumeration obviously. But I think that Luke has enough
to play with already by only assuming the easy case for now. If Git P2P
is to be viable, it has to prove itself at least with the easy case
first.


Nicolas
Luke Kenneth Casson Leighton
2010-09-07 00:17:24 UTC
Permalink
On Tue, Sep 7, 2010 at 12:57 AM, Nicolas Pitre <***@fluxnic.net> wrote=
:
> On Mon, 6 Sep 2010, Junio C Hamano wrote:
>
>> Nicolas Pitre <***@fluxnic.net> writes:
>>
>> >> * enumerate the list of objects associated with an individual com=
mit by:
>> >> =C2=A0 =C2=A0 i) creating a CUSTOM pack+idx using git pack-object=
s {ref}
>> >> =C2=A0 =C2=A0 ii) *parsing* the idx file using gitdb's FileIndex =
to get the list
>> >> of objects
>> >
>> > That's where you're going so much out of your way to give you trou=
ble.
>> > A simple rev-list would give you that list:
>> >
>> > =C2=A0 =C2=A0 git rev-list --objects <this_commit> --not <this_com=
mit''s_parents>
>> >
>> > That's it.
>>
>> I didn't want to get into this discussion, but where in the above pi=
cture
>> does the usual "want/ack" exchange fit?
>
> Before object enumeration obviously. =C2=A0But I think that Luke has =
enough
> to play with already by only assuming the easy case for now. =C2=A0If=
Git P2P
> is to be viable, it has to prove itself at least with the easy case
> first.

:) yes. worry about that later. optimisation. time to think.
idea earlier (from 2 hours ago) unworkable, thought of another one,
split commit list into multi-level "virtual hierarchical
subdirectories" of say 256 entries each. can therefore easily trip
down each "subdirectory" which will quickly get you to the right place
where the commits are different, with only a few roundtrips. sort-of
binary search but 256-way search. binary search not optimal here
because of multiple network round-trips. sorry very obtuse will write
up better.

l.
Luke Kenneth Casson Leighton
2010-09-07 00:29:25 UTC
Permalink
On Tue, Sep 7, 2010 at 12:57 AM, Nicolas Pitre <***@fluxnic.net> wrote=
:
> =C2=A0But I think that Luke has enough
> to play with already by only assuming the easy case for now.

um, yes.

i only just noticed that bittorrent's multi-file mode has blocking
that doesn't line up with the bloody file beginnings and ends:

| block 0 256k | block 1 256k | block N 11bytes |
|file1 | file2 | file3 | file4 |

this is why cameron dale threw his hands up in horror at bittorrent
when he created debtorrent, but i didn't understand why, fully, as i
thought that he was creating one .torrent per .deb _anyway_ so if he
had it would have been moot.

so i have to do some tests to see if multiple individual torrents can
be added using the BitTornado API to the same server, and redesign the
damn code so that it adds new "files" to be downloaded based on the
previous one completing.

so, the first quotes file quotes to be requested will be the
"rev-list", and the "finished" callback function will result in the
next layer of "files" as .torrents to be requested, and then finally
the files representing the actual "objects" - generated as git
cat-files on the remote end - get added and requested.

so - all event-driven. eek! fuun. about as understandable as nmbd
in samba (*) :)

l.

(*) i can say that because i did its 1st major rewrite ha ha
Artur Skawina
2010-09-04 13:42:05 UTC
Permalink
On 09/04/10 08:13, Nicolas Pitre wrote:
> On Sat, 4 Sep 2010, Artur Skawina wrote:
>> What I'm really asking is, if a (modified) git-upload-pack skips transferring
>> commit X, and just sends me commit Z (possibly as delta vs 'X'), _and_ I
>> obtain commit 'X" in some other way, I will be able to reconstruct 'Z', correct?
>
> Yes. Although it is 'git pack-objects' that decides what objects to
> send, not 'git-upload-pack'.

Thank you very much for the detailed answers.

AFAIU both previously mentioned assumptions hold, so here's an example of
git-p2p-v3 use, simplified and with most boring stuff (p2p,ref and error
handling omitted.
(the first version made a canonical, shared, virtual representation of the
object store, the second added more git-awareness to the transport, and then
I started wondering if all of that is actually necessary; hence...).

Let's say I'm a git repo tracking Linus' tree, right now the newest commit that
i have is "v2.6.33" (but it could be anything, including "" for a fresh, empty
clone) and I want to become up to date.

1) I fetch a list of IPs of well known seeds, eg from kernel.org.

2) I send an UDP packet to some of them, containing the repo
("git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6.git"),
the ref that I'm interested in ("master") and the hash of the last commit
that I have ("60b341b7").
This is enough to start participating in the cloud by serving ".."60b341b7",
but we'll skip the server part in this example.

3) I receive answers to some of the above queries, containing the status of
these peers wrt to the given repo and ref, ie the same data I sent above.
Plus a list of random other live peers known to be tracking this ref, which
I'll use to repeat step #2 and #3 until I have a list of enough peers to
continue.

4) Now i know of 47 peers that already have the tag or commit "v2.6.37" (either
I already knew that I wanted this one, or determined it during #3 and/or #1;
ref handling omitted from this example for brevity).

So i connect to one of the peers, and basically ask for the equivalent of
"git fetch peer01 v2.6.37".
But that would pull all new objects from that one peer, and that isn't what
i want. So i need to make it not only send me a thin pack, but also to omit
some of the objects. As at this point i don't actually know anything about
the objects in between "v2.6.33" and "v2.6.37" I can not split the request
into smaller ones.

So I'll cheat -- I'll take the number of available peers ("47") and the
number of this peer ("0"), send these two integers over and ask the other
side to skip transferring me any object whose
(HASH%available_peers)!=this_peer .

5) for (int this_peer=1; this_peer<available_peers; this_peer++)
Repeat#4(this_peer);
/* in parallel until i saturate the link */

6) Now i have 47 different packs, which probably do not make any sense
individually, because they contain deltas vs nonexisting objects, but
as a whole can be used to reconstruct the full tree.
6a) Except of course if there are circular dependencies, which can
occur eg. if peer#1 decided to encode object A as delta(B) and
peer#2 did B=delta(A), but this will be rare, and I'll just need
to refetch either A or B to break the cycle, this time with real
I-HAVES, hence this is guaranteed to succeed.

What am I missing?

artur
Artur Skawina
2010-09-04 17:23:38 UTC
Permalink
On 09/04/10 17:56, Bernhard R. Link wrote:
> While your approach looks like it could work with one commonly looked
> for branch/tag, it might be worthwhile to look at some more complicated
> cases and see if the protocol can be extended to also support those.
>
> Assume you are looking to update abranch B built on top of some
> say arm-specific branch A based on top of the main-line kernel L.
>
> There you have 6 types of peers:
>
> 1) those that have the current branch B
> 2) those that have an older state of B
> 3) those that have the current branch A
> 4) those that have an older state of A
> 5) those that have the current branch L
> 6) those that have an older state of L.
>
> Assuming you want a quite obscure branch, type 1 and 2 peers will not
> be that many, so would be nice if there was some way to also get stuff
> from the others.
>
> Peers of type 6 do not interest you, as there will be enough of type 5.
>
> But already peers of type 3 might not be much and even less of type 1
> so types 2 and 4 get interesting.
>
> If you get first a tree of all commit-ids you still miss (you only
> need that information once and every peer of type 1 can give it to you)
> and have somehow a way to look at the heads of each peer, it should be
> streight forward to split the needed objects using your way (not specifying
> the head you have but the one you hope to get from others), they should be
> able to send you a partial pack in the way you describe.

I doubt git-p2p would work well w/ any "quite obscure branch";
obviously, the p2p approach works best for popular content...

But there's the case of _new_ content, that has not already propagated
thrugh the swarm. And this was the very reason I chose to have 'ref' in
the protocol.

Let's say Linus releases a new kernel, I want to fetch it, and find 200
peers of which only three have already updated.
Here another field in the initial UDP protocol comes in, that i omitted
in the original description in order to keep things simple (it's just
an optimization).
The UDP response from the peer, in addition to the latest commit_id that
it has, also contains an integer "commits_ahead" that says how many
commits ahead of the _my_ id (that I've sent in the request) this peer is.
So in the above situation I immediately find out that eg I'm missing 2000
commits. It would be extremely stupid to try to update using just the
three seeds, of course.
But by looking at all the commit_aheads of all peers I can see
(make an educated guess, really) that 150 of them already have 1500 of
the new commits. So what i do is pick one of the IDs given by one of the
peers, such that a sufficient number of other sources are likely to
already have it (ie they all have a lower 'commit_ahead'). And start
fetching that commit from the 150 sources, instead of my real target commit.
Once I'm done with this, I'll repeat the process again; by now hopefully
more peers have already updated. If the target commit is still rare I can
again look for an intermediate commit, that has a sufficient numbers of
seeds.
No extra traffic, and it also discourages abuse, as leeching from just the
few seeds will likely result in overall slower download.

This works best if the ref isn't rewound, obviously, but i think that would
be the common case.

Does this address at least part of your concerns above? The 'obscure branch'
case I'm not sure is easily solvable; I don't know if having a lot of rarely
used content widely distributed in the swarm would be a great idea...

One other interesting case is reusing objects needed by (large) merges, that
could already be available in the cloud. I'm not sure it happens often enough
to care though...

artur
Artur Skawina
2010-09-04 18:46:43 UTC
Permalink
On 09/04/10 17:56, Bernhard R. Link wrote:
> Assume you are looking to update abranch B built on top of some
> say arm-specific branch A based on top of the main-line kernel L.
>
> There you have 6 types of peers:
>
> 1) those that have the current branch B
> 2) those that have an older state of B
> 3) those that have the current branch A
> 4) those that have an older state of A
> 5) those that have the current branch L
> 6) those that have an older state of L.
>
> Assuming you want a quite obscure branch, type 1 and 2 peers will not
> be that many, so would be nice if there was some way to also get stuff
> from the others.
>
> Peers of type 6 do not interest you, as there will be enough of type 5.
>
> But already peers of type 3 might not be much and even less of type 1
> so types 2 and 4 get interesting.
>
> If you get first a tree of all commit-ids you still miss (you only
> need that information once and every peer of type 1 can give it to you)
> and have somehow a way to look at the heads of each peer, it should be
> streight forward to split the needed objects using your way (not specifying
> the head you have but the one you hope to get from others), they should be
> able to send you a partial pack in the way you describe.

Actually, since the seeds of this quite obscure branch know that this branch
is rare and are most likely also sharing other, much more popular, refs, they
can just add another note to the initial reply, saying
"this_ref_is_descendant_of($popular_branch, $common_base)". Then the client
can fetch old..$common_base and $common_base..$rare_head independently (even
in parallel).

artur
Theodore Tso
2010-09-04 01:57:10 UTC
Permalink
On Sep 3, 2010, at 3:41 PM, Nicolas Pitre wrote:

>=20
> Let's see what such instructions for how to make the canonical pack=20
> might look like:

But we don't need to replicate any particular pack. We just need to pr=
ovide instructions that can be replicated everywhere to provide *a* can=
onical pack.

>=20
> First you need the full ordered list of objects. That's a 20-byte SH=
A1
> per object. The current Linux repo has 1704556 objects, therefore th=
is
> list is 33MB already.

Assume the people creating this "gitdo" pack (i.e., much like jigdo) ha=
ve a superset of Linus's objects. So if we have all of the branches in=
Linus's repository, we can construct all of the necessary objects goin=
g back in time to constitute his repository. If Linus has only one br=
anch in his repo, we only need a single 20-byte SHA1 branch identifier.=
For git, presumbly we would need three (one for next, maint, and mas=
ter).

What abort the order of the objects in the pack? Well, ordering doesn'=
t matter, right? So let's assume the pack is sorted by hash id. Is =
there any downside to that? I can't think of any, but you're the pack =
expert...

If we do that, we would thus only need to send 20 bytes instead of 33MB=
=2E =20

> Then you need to identify which of those objects are deltas, and agai=
nst
> which object. Assuming we can index in the list of objects, that mea=
ns,
> say, one bit to identify a delta, and 31 bits for indexing the base. =
In
> my case this is currently 1393087 deltas, meaning 5.3 MB of additiona=
l
> information.

OK, this we'll need which means 5.3MB.


>=20
> But then, the deltas themselves can have variations in their encoding=
=2E
> And we did change the heuristics for the actual delta encoding in the
> past too (while remaining backward compatible), but for a canonical p=
ack
> creation we'd need to describe that in order to make things totally
> reproducible.
>=20
> So there are 2 choices here: Either we specify the Git version to mak=
e=20
> sure identical delta code is used, but that will put big pressure on=20
> that code to remain stable and not improve anymore as any behavior=20
> change will create a compatibility issue forcing people to upgrade th=
eir=20
> Git version all at the same time. That's not something I want to see=
=20
> the world rely upon.

I don't think the choice is that stark. It does mean that in addition =
to whatever pack encoding format is used by git natively, the code woul=
d also need to preserve one version of the delta hueristics for "Canoni=
cal pack version 1". After this version is declared, it's true that you=
might come up with a stunning new innovation that saves some disk spac=
e. How much is that likely to be? 3%? 5%? Worst case, it means tha=
t (1) the bittorent-distributed packs might not be as efficient, and (2=
) the code would be made more complex because we would either need to (=
a) keep multiple versions of the code, or (b) the code might need to ha=
ve some conditionals:

if (canonical pack v1)
do_this_code;
else
do_this_more_clever_code;

Is that really that horrible? And certainly we should be able to set t=
hings up so that it won't be a brake on innovation...

>=20
> The other choice is to actually provide the delta output as part of t=
he=20
> instruction for the canonical pack creation.
>=20
> So that makes for a grand total of 33 MB + 148 MB =3D 181 MB of data =
just
> to be able to unambiguously reproduce a pack with a full guarantee of
> perfect reproducibility.

So if we use the methods I've suggested, we would only need to send 5.3=
MB instead of 33MB or 181MB....

>=20
> But even with the presumption of stable delta code, the recipee would=
=20
> still take 38 MB that everyone would have to download every month whi=
ch=20
> is far more than what a monthly incremental update of a kernel repo=20
> requires. Of course you could create a delta between consecutive=20
> recipees, but that is becoming rather awkward.

The "recipee" would only need to download this if they are willing to p=
articipate as being one of the "seeders" in the BitTorrent network. P=
eople who are willing to do this are presumably willing to transmit man=
y more megabytes of data than 5MB or 33MB or 181MB. Given the draconia=
n policies of various ISP such as Comcast, it's not clear to me how man=
y people will be willing to be seeders. But if they are, I don't thin=
k downloading 5.3MB of instructions to generate a 600MB canonical pack =
to be distributed to hundreds or thousands of strangers will stop them.=
:-)

> I still think that if someone really want to apply the P2P principle =
=E0=20
> la BitTorrent to Git, then it should be based on the distributed=20
> exchange of _objects_ as I outlined in a previous email, and not file=
=20
> chunks like BitTorrent does. The canonical Git _objects_ are fully=20
> defined, while their actual encoding may change.

The advantages of sending a canonical pack is that it's relatively less=
code to write, since we can reuse the standard BitTorrent clients and =
servers to transmit the git repository. The downsides are that it's ma=
inly useful for downloading the entire repository, but I think that's t=
he most useful place for peer2peer anyway.

The advantage of a distributed exchange of _objects_ is you can use it =
to update a random repository --- but normally it's so efficient to dow=
nload an incremental set of objects from github or kernel.org, so I'm n=
ot sure what would be the point. It would also require exchanging a l=
ot more metadata, since presumably the client would first have to recei=
ve all of the object id's (which would be 33MB), and then use that to d=
ecide how to distribute asking for those objects from his/her peers. =
Which means sending object lists to different servers. If these peers=
do not yet have a complete set of objects, they'll have to nack some o=
f the object requests. Furthermore with only a partial set of objects=
downloaded, how will the client do the delta compression? Which mean=
s the client will need to store all of the objects in a decompressed fo=
rm, and only when it has received all of the objects will it be able t=
o compress them. So it's going to be fairly inefficient in terms of d=
isk space. I suppose the server could send the delta information (anot=
her 5.3MB) and then the client could use that to prioritize its object =
request lists. But still, this is quite different form the git protoc=
ol, and a lot will have to be written from scratch.

-- Ted
Kyle Moffett
2010-09-04 05:23:54 UTC
Permalink
Ted,

I think your "canonical pack" idea has value, but I'd be inclined to
try to optimize more for the "common case" of developing on a fast
local network with many local checkouts, where you occasionally
push/fetch external sources via a slow link.

Specifically, let's look at the very reasonable scenario of a
developer working over a slow DSL or dialup connection. He's probably
got many copies of various GIT repositories cloned all over the place
(hey, disk is cheap!), but right now he just wants a fresh clean copy
of somebody else's new tree with whatever its 3 feature branches are.
Furthermore, he's probably even got 80% of the commit objects from
that tree archived in his last clone from linux-next.

In theory he could very carefully arrange his repositories with
judicious use of alternate object directories. From personal
experience, though, such arrangements are *VERY* prone to accidentally
purging wanted objects; unless you *never* ever delete a branch in the
"reference" repository.

So I think the real problem to solve would be: Given a collection of
local computers each with many local repositories, what is the best
way to optimize a clone of a "new" remote repository (over a slow
link) by copying most of the data from other local repositories
accessible via a fast link?

The goal would be to design a P2P protocol capable of rapidly and
efficiently building distributed searchable indexes of ordered commits
that identify which peer(s) contain that each commit.

When you attempt to perform a "git fetch --peer" from a repository, it
would quickly connect to a few of the metadata index nodes in the P2P
network and use them to negotiate "have"s with the upstream server.
The client would then sequentially perform the local "fetch"
operations necessary to obtain all the objects it used to minimize the
commit range with the server. Once all of those "fetch" operations
completed, it could proceed to fetch objects from the server normally.

Some amount of design and benchmarking would need to be done in order
to figure out the most efficient indexing algorithm for finding a
minimal set of "have"s of potentially thousands of refs, many with
independent root commits. For example if the index was grouped
according to "root commit" (of which there may be more than one), you
*should* be able to quickly ask the server about a small list of root
commits and then only continue asking about commits whose roots are
all known to the server.

The actual P2P software would probably involve 2 different daemon
processes. The first would communicate with each other and with the
repositories, maintaining the ref and commit indexes. These daemons
would advertise themselves with Avahi, or alternatively in an
enterprise environment they would be managed by your sysadmins and be
automatically discovered using DNS-SD. Clients looking to perform a
P2P fetch would first ask these.

The second daemon would be a modified git-daemon that connects to the
advertised "index" daemons and advertises its own refs and commit
lists, as well as its IP address and port.

My apologies if there are any blatant typos or thinkos, it's a bit
later here than I would normally be writing about technical topics.

Cheers,
Kyle Moffett
Theodore Tso
2010-09-04 11:46:15 UTC
Permalink
On Sep 4, 2010, at 1:23 AM, Kyle Moffett wrote:

>
> Specifically, let's look at the very reasonable scenario of a
> developer working over a slow DSL or dialup connection. He's probably
> got many copies of various GIT repositories cloned all over the place
> (hey, disk is cheap!), but right now he just wants a fresh clean copy
> of somebody else's new tree with whatever its 3 feature branches are.
> Furthermore, he's probably even got 80% of the commit objects from
> that tree archived in his last clone from linux-next.
>
> In theory he could very carefully arrange his repositories with
> judicious use of alternate object directories. From personal
> experience, though, such arrangements are *VERY* prone to accidentally
> purging wanted objects; unless you *never* ever delete a branch in the
> "reference" repository.

This is a reasonable problem, but there's a much easier solution that's worked for me, and it's very simple. Simply use Linus's repository as the reference repository. IOW, whenever I start working on a new computer, I'll do keep a bare repository containing only Linus's repo. I might pull that down fresh from kernel.org, or if I'm on a local computer, it's usually much easier just to scp -r over a local copy of /usr/projects/linux/bare to my local machine. That's something like 98% of all of the needed commits right there. And since Linus never rewinds any branches on his repo. I'm done. It works for me. :-)

For a repo like git which does have a 'pu' branch which rewinds, life is a bit more tricky, I'll admit. OTOH, the fact that there is a 'pu' branch means that there are few people who keep externally published trees, since it's much easier to let Junio collect people's various patchsets into the 'pu' branch, at which point it's all nicely integrated. In some ways, the 'pu' branch is really the equivalent of the linux-next tree. :-)

OK, suppose we have a project which is large enough to have large numbers of downstream trees, but which also has one or more branches that happen to be rewinding. I can't think of any project which has those characteristics, but OK, it's worth exploring. Now what? Well, if I'm doing this all on one machine (the use case is my laptop), and I have a *very* slow external connection (the use case I can think of it is when I am on a cruise ship, where not only the bandwidth slow, but there is also a very long latency since it's a satellite link), what I'd do is find all my trees that might be related to the tree that I want to clone, and hard link them into my target repository. I'd then create a temp .git/refs/hack directory, and create branch pointers to the tips of all of the branches
from the donating repositories. I'd then just do a straightforward git pull, which would transfer over just the objects I need, then delete the .git/refs/hack directory, and then do a git gc.

Ok, so how do we generalize this if we have a large number of local machines? The first insight is that we want to treat local machines very differently from remote machines, since we obviously want to do as many of the object transfers from the local machines. The second is to note the reason why I had to hackishly create the .git/refs/hack directory. That's because git fetch already optimizes things based on branch heads to minimize the need to send vast object sets back and forth to figure out what objects are required.

So I suspect we might be able to do something where the peer2peer downloader first contacts all of its local "buddies", and gets their branch heads, and then, using the standard git protocol, contacts the remote git server (i.e., github, git.kernel.org, etc.), and sends it all of its available branch heads. It will then be able to assemble what it needs from remote server and from its local buddies. In effect, this is basically an "octopus fetch", and it seems like most of the machinery can be reused from the git protocol, with only some minor modifications.

I agree this is a very different use case than the "initial download" case. At least for the Linux kernel, I have a solution that works pretty much well enough. I suspect for the git tree, I could just as easily solve it by just doing a brute force copy of Junio's tree and then pulling down the extra objects using "git fetch". The reality is most projects have a canonical tree, and 99% of the time people's subtrees contain very few objects that aren't already in the canonical tree. So while we could create some complicated design to try to optimize picking a few commits that weren't in Linus's tree, but were in linux-next, when downloading someone's subsystem development tree, how many commits do we expect will be unique in someone's subsystem tree? A hundred, at most? Even wit
h a slow link, simply doing a git clone --bare of a local copy of Linus's tree, and then doing a straightforward "git fetch" is probably going to good enough, even if you are stuck on a cruise ship with a laptop that you smuggled aboard without your partner/spouse knowing so you can get that hacking fix. :-)

-- Ted
Luke Kenneth Casson Leighton
2010-09-04 14:06:29 UTC
Permalink
On Sat, Sep 4, 2010 at 6:23 AM, Kyle Moffett <***@moffetthome.net> wro=
te:
> So I think the real problem to solve would be: =C2=A0Given a collecti=
on of
> local computers each with many local repositories, what is the best
> way to optimize a clone of a "new" remote repository (over a slow
> link) by copying most of the data from other local repositories
> accessible via a fast link?

the most immediate algorithm that occurs to me that would be ideal
would be - rsync! whilst i had the privilege of being able to listen
10 years ago to tridge describe rsync in detail, so am aware that it
is parallelisable (which was the whole point of his thesis), a) i'm
not sure how to _distribute_ it b) i wouldn't have a clue where to
start!

so, i believe that a much simpler algorithm is to follow nicolas' advic=
e, and:

* split up a pack-index file by its fanout (1st byte of SHAs in the idx=
)
* create SHA1s of the list of object-refs within an individual fanout
* compare the per-fanout SHA1s remote and local
* if same, deduce "oh look, we have that per-fanout list already"
* grab the per-fanout object-ref list using standard p2p filesharing

in this way you'd end up breaking down e.g. 50mb of pack-index (for
e.g. linux-2.6.git) into rouughly 200k chunks, and you'd exchange
rouughly 50k of network traffic to find out that you'd got some of
those fanout object-ref-lists already. which is nice.

(see Documentation/technical/pack-format.txt, "Pack Idx File" for
description of fanouts, but according to gitdb/pack.py it's just the
1st byte of the SHA1s it points to)

> The goal would be to design a P2P protocol capable of rapidly and
> efficiently building distributed searchable indexes of ordered commit=
s
> that identify which peer(s) contain that each commit.

yyyyup.

> When you attempt to perform a "git fetch --peer" from a repository, i=
t
> would quickly connect to a few of the metadata index nodes in the P2P
> network and use them to negotiate "have"s with the upstream server.
> The client would then sequentially perform the local "fetch"
> operations necessary to obtain all the objects it used to minimize th=
e
> commit range with the server. =C2=A0Once all of those "fetch" operati=
ons
> completed, it could proceed to fetch objects from the server normally=
=2E

why stop at just fetching objects only from the server? why not have
the objects distributed as well? after all, if one peer has just gone
to all the trouble of getting an object, surely it can share it, too?

or am i misunderstanding what you're describing?

> Some amount of design and benchmarking would need to be done in order
> to figure out the most efficient indexing algorithm for finding a
> minimal set of "have"s of potentially thousands of refs, many with
> independent root commits. =C2=A0For example if the index was grouped
> according to "root commit" (of which there may be more than one), you
> *should* be able to quickly ask the server about a small list of root
> commits and then only continue asking about commits whose roots are
> all known to the server.

intuitively i follow what you're saying.

> The actual P2P software would probably involve 2 different daemon
> processes. =C2=A0The first would communicate with each other and with=
the
> repositories, maintaining the ref and commit indexes. =C2=A0These dae=
mons
> would advertise themselves with Avahi,

NO.

ok. more to the point: you want to waste time forcing people to
install a pile of shite called d-bus, just so that people can use git,
go ahead.

can we gloss quickly over the mention of avahi as that *sweet-voice*
most delightful be-all and solve-all solution *normal-voice* and move
on?

> or alternatively in an
> enterprise environment they would be managed by your sysadmins and be
> automatically discovered using DNS-SD.

and what about on the public hostile internet? no - i feel that the
basis should be something that's proven already, that's had at least
ten years to show for itself. deviating from that basis: fine - at
least there's not a massive amount of change required for
re-implementors to rewrite a compatible version (in c, or *shudder*
java)

> The second daemon would be a modified git-daemon that connects to the
> advertised "index" daemons and advertises its own refs and commit
> lists, as well as its IP address and port.

yes, there are definitely two distinct purposes. i'm not sure it's
necessary - or a good idea - to split the two out into separate
daemons, for the reason that you may, just like bittorrent can use a
single port to share multiple torrents comprising multiple files, wish
to use a single daemon to serve multiple git repositories.

if you start from the basis of splitting things out then you have a
bit of a headache on your hands wrt publishing multiple git repos.

l.
Nicolas Pitre
2010-09-05 01:32:04 UTC
Permalink
On Sat, 4 Sep 2010, Luke Kenneth Casson Leighton wrote:

> so, i believe that a much simpler algorithm is to follow nicolas' advice, and:
>
> * split up a pack-index file by its fanout (1st byte of SHAs in the idx)
> * create SHA1s of the list of object-refs within an individual fanout
> * compare the per-fanout SHA1s remote and local
> * if same, deduce "oh look, we have that per-fanout list already"
> * grab the per-fanout object-ref list using standard p2p filesharing
>
> in this way you'd end up breaking down e.g. 50mb of pack-index (for
> e.g. linux-2.6.git) into rouughly 200k chunks, and you'd exchange
> rouughly 50k of network traffic to find out that you'd got some of
> those fanout object-ref-lists already. which is nice.

Scrap that idea -- this won't work. The problem is that, by nature,
SHA1 is totally random. So if you have, say, 256 objects to transfer
(and 256 objects is not that much) then, statistically, the probability
that the SHA1s for those objects end up uniformly distributed across all
the 256 fanouts is quite high. the algorithm I mentioned completely
breaks down in that case.


Nicolas
Luke Kenneth Casson Leighton
2010-09-05 17:16:26 UTC
Permalink
On Sun, Sep 5, 2010 at 2:32 AM, Nicolas Pitre <***@fluxnic.net> wrote:
> On Sat, 4 Sep 2010, Luke Kenneth Casson Leighton wrote:
>
>> so, i believe that a much simpler algorithm is to follow nicolas' ad=
vice, and:
>>
>> * split up a pack-index file by its fanout (1st byte of SHAs in the =
idx)
>> * create SHA1s of the list of object-refs within an individual fanou=
t
>> * compare the per-fanout SHA1s remote and local
>> * if same, deduce "oh look, we have that per-fanout list already"
>> * grab the per-fanout object-ref list using standard p2p filesharing
>>
>> in this way you'd end up breaking down e.g. 50mb of pack-index (for
>> e.g. linux-2.6.git) into rouughly 200k chunks, and you'd exchange
>> rouughly 50k of network traffic to find out that you'd got some of
>> those fanout object-ref-lists already. =C2=A0which is nice.
>
> Scrap that idea -- this won't work. =C2=A0The problem is that, by nat=
ure,
> SHA1 is totally random. =C2=A0So if you have, say, 256 objects to tra=
nsfer
> (and 256 objects is not that much) then, statistically, the probabili=
ty
> that the SHA1s for those objects end up uniformly distributed across =
all
> the 256 fanouts is quite high. =C2=A0the algorithm I mentioned comple=
tely
> breaks down in that case.

mmm... that's no so baad. requesting a table/pseudo-file with 1
fanout or 256 fanouts is still only one extra round-trip. if i split
it into pseudo-subdirectories _then_ yes you have 256 requests. that
can be avoided with a bit of work. so, no biggie :)

l.
Nicolas Pitre
2010-09-04 05:40:35 UTC
Permalink
On Fri, 3 Sep 2010, Theodore Tso wrote:

>
> On Sep 3, 2010, at 3:41 PM, Nicolas Pitre wrote:
>
> >
> > Let's see what such instructions for how to make the canonical pack
> > might look like:
>
> But we don't need to replicate any particular pack. We just need to
> provide instructions that can be replicated everywhere to provide *a*
> canonical pack.

But that canonical pack could be any particular pack.

> > First you need the full ordered list of objects. That's a 20-byte SHA1
> > per object. The current Linux repo has 1704556 objects, therefore this
> > list is 33MB already.
>
> Assume the people creating this "gitdo" pack (i.e., much like jigdo)
> have a superset of Linus's objects. So if we have all of the branches
> in Linus's repository, we can construct all of the necessary objects
> going back in time to constitute his repository. If Linus has only
> one branch in his repo, we only need a single 20-byte SHA1 branch
> identifier. For git, presumbly we would need three (one for next,
> maint, and master).

Sure, but that's not sufficient. All this 20-byte SHA1 gives you is a
set of objects. That says nothing about their encoding.

> What about the order of the objects in the pack? Well, ordering
> doesn't matter, right? So let's assume the pack is sorted by hash id.
> Is there any downside to that? I can't think of any, but you're the
> pack expert...

Ordering does matter a big deal. Since object IDs are the SHA1 of their
content, those IDs are totally random. So if you store objects
according to their sorted IDs, then the placement of objects belonging
to, say, the top commit will be totally random. And since you are the
filesystem expert, I don't have to tell you what performance impacts
this random access of small segments of data scattered throughout a
400MB file will have on a checkout operation.

> If we do that, we would thus only need to send 20 bytes instead of 33MB.
>
> > Then you need to identify which of those objects are deltas, and against
> > which object. Assuming we can index in the list of objects, that means,
> > say, one bit to identify a delta, and 31 bits for indexing the base. In
> > my case this is currently 1393087 deltas, meaning 5.3 MB of additional
> > information.
>
> OK, this we'll need which means 5.3MB.
>
> >
> > But then, the deltas themselves can have variations in their encoding.
> > And we did change the heuristics for the actual delta encoding in the
> > past too (while remaining backward compatible), but for a canonical pack
> > creation we'd need to describe that in order to make things totally
> > reproducible.
> >
> > So there are 2 choices here: Either we specify the Git version to make
> > sure identical delta code is used, but that will put big pressure on
> > that code to remain stable and not improve anymore as any behavior
> > change will create a compatibility issue forcing people to upgrade their
> > Git version all at the same time. That's not something I want to see
> > the world rely upon.
>
> I don't think the choice is that stark. It does mean that in addition
> to whatever pack encoding format is used by git natively, the code
> would also need to preserve one version of the delta hueristics for
> "Canonical pack version 1". After this version is declared, it's true
> that you might come up with a stunning new innovation that saves some
> disk space. How much is that likely to be? 3%? 5%? Worst case, it
> means that (1) the bittorent-distributed packs might not be as
> efficient, and (2) the code would be made more complex because we
> would either need to (a) keep multiple versions of the code, or (b)
> the code might need to have some conditionals:
>
> if (canonical pack v1)
> do_this_code;
> else
> do_this_more_clever_code;
>
> Is that really that horrible? And certainly we should be able to set things up so that it won't be a brake on innovation...

Well, this would still be a non negligible maintenance cost. And for
what purpose already? What is the real advantage?

> The advantages of sending a canonical pack is that it's relatively
> less code to write, since we can reuse the standard BitTorrent clients
> and servers to transmit the git repository. The downsides are that
> it's mainly useful for downloading the entire repository, but I think
> that's the most useful place for peer2peer anyway.

Sure. But I don't think it is worth making Git less flexible just for
the purpose of ensuring that people could independently create identical
packs. I'd advocate for "no code to write at all" instead, and simply
have one person create and seed the reference pack.

And if you are willing to participate in the seeding of such a torrent,
then you better not be bandwidth limited, meaning that you certainly can
afford to download that reference pack in the first place.

And that reference pack doesn't have to change that often either. If
you update it only on every major kernel releases, then you'll need to
fetch it about once every 3 months. Incremental updates from those
points should be relatively small.

Yet... it should be possible in practice to produce identical packs,
given that the Git version is specified, the zlib version is specified,
the number of threads for the repack is equal to 1, the -f flag is used
meaning a full repack is performed, the delta depth and window size is
specified, and the head branches are specified. Given that torrents are
also identified by a hash of their content, it should be pretty easy to
see if the attempt to reproduce the reference pack worked, and start
seeding right away if it did.

But again, I don't think it is worth freezing the pack format into a
canonical encoding for this purpose.


Nicolas
Theodore Tso
2010-09-04 12:00:40 UTC
Permalink
On Sep 4, 2010, at 1:40 AM, Nicolas Pitre wrote:
>> What about the order of the objects in the pack? Well, ordering
>> doesn't matter, right? So let's assume the pack is sorted by hash id.
>> Is there any downside to that? I can't think of any, but you're the
>> pack expert...
>
> Ordering does matter a big deal. Since object IDs are the SHA1 of their
> content, those IDs are totally random. So if you store objects
> according to their sorted IDs, then the placement of objects belonging
> to, say, the top commit will be totally random. And since you are the
> filesystem expert, I don't have to tell you what performance impacts
> this random access of small segments of data scattered throughout a
> 400MB file will have on a checkout operation.

Does git repack optimize the order so that certain things (like checkouts for example) are really fast? I admit I hadn't noticed. Usually until the core packs are in my page cache, it has always seemed to me that things are pretty slow. And of course, the way objects and grouped together and ordered for "gitk" or "git log" to be fast won't be the safe as a checkout operation...

> Sure. But I don't think it is worth making Git less flexible just for
> the purpose of ensuring that people could independently create identical
> packs. I'd advocate for "no code to write at all" instead, and simply
> have one person create and seed the reference pack.

I don't think it's a matter of making Git "less flexible", it's just simply a code maintenance headache of needing to be able to support encoding both a canonical format as well as the latest bleeding-edge, most efficient encoding format. And how often are you changing/improving the encoding process, anyway? It didn't seem to me like that part fo the code was constantly being tweaked/improved.

Still, you're right, it might not be worth it. To be honest, I was more interested about the fact that this might also be used to give people hints about how to better repack their local repositories so that they didn't have to run git repack with large --window and --depth arguments. But that would only provide very small improvements in storage space in most cases, so it's probably not even worth it for that.

Quite frankly, I'm a little dubious about how critical peer2peer really is, for pretty much any use case. Most of the time, I can grab the base "reference" tree and drop it on my laptop before I go off the grid and have to rely on EDGE or some other slow networking technology. And if the use case is some small, but illegal-in-some-jurisdiction code, such as ebook DRM liberation scripts (the kind which today are typically distributed via pastebin's :-), my guess is that zipping up a git repository and dropping it on a standard bittorrent server run by the Swedish Pirate party is going to be much more effective. :-)

-- Ted
Luke Kenneth Casson Leighton
2010-09-04 12:44:30 UTC
Permalink
On Sat, Sep 4, 2010 at 1:00 PM, Theodore Tso <***@mit.edu> wrote:

> Quite frankly, I'm a little dubious about how critical peer2peer real=
ly is, for pretty much any use case. =C2=A0Most of the time, I can grab=
the base "reference" tree and drop it on my laptop before I go off the=
grid and have to rely on EDGE or some other slow networking technology=
=2E

yes... but if you meet up with other people, where you have a fast
LAN segment or set up your own private wifi mesh, are you able to a)
sync up the mailing lists using git b) sync up the project's bug-list
using git c) sync up the wiki (if there is one) using git d)
seamlessly continue to appear to be talking (with the people you're
meeting) on the "mailing lists" as if you actually had a decent
connection to the server e) report, comment on, change the status of
and share bugs between all of the people you're meeting as if you had
a decent connection to the bugtracker server...

you see how it's not just about "The Source Code"? sure, yes, you or
anyone else _on their own_ can do code development, isolated from
everyone else and the internet...

example: i went to europython, and met the moinmoin developers (nice
people). i wanted to help with the sprint after hours: it turned out
that we'd got the day wrong, so we then went "oh well, let's find
somewhere to do a bit of hacking", and _immediately_ the discussion
turned into "how we can all of us find internet connectivity". i said
that i had a 3G USB but i didn't have ndiswrapper installed for the
bcm4328 on my laptop, so i couldn't offer a mesh; someone else said
that their laptop's WIFI didn't even _do_ master-mode, and we then had
a nice discussion about various little network routers that ran
openwrt and lamented the fact that none of us had brought one along.

needless to say, we didn't do any hacking :)

l.
Luke Kenneth Casson Leighton
2010-09-04 14:50:29 UTC
Permalink
On Sat, Sep 4, 2010 at 1:00 PM, Theodore Tso <***@mit.edu> wrote:

> such as ebook DRM liberation scripts (the kind which today
> are typically distributed via pastebin's :-), my guess is that
> zipping up a git repository and dropping it on a standard
> bittorrent server run by the Swedish Pirate party is going to
> be much more effective. =C2=A0 :-)

:) the legality or illegality isn't interesting - or is a... red
herring, being one of the unfortunate anarchistic-word-associations
with the concept of "file sharing". the robustness and convenience
aspects - to developers not users - is where it gets reaaally
interesting.

i do not know of a single free software development tool - not a
single one - which is peer-to-peer distributed. just... none. what
does that say?? and we have people bitching about how great but
non-free skype is. there seems to be a complete lack of understanding
of the benefits of peer-to-peer infrastructure in the free software
community as a whole, and a complete lack of interest in the benefits,
too - perhaps for reasons no more complex than the tools don't exist
so it's catch-22, and the fact that the word "distributed" is
_already_ associated with the likes of SMTP, DNS and "git" so
everybody thinks "we're okay, jack, go play with your nice dreams of
p2p networking, we're gonna write _real_ code now".

=2E.. mmmm :)

l.
Ted Ts'o
2010-09-04 18:14:05 UTC
Permalink
On Sat, Sep 04, 2010 at 03:50:29PM +0100, Luke Kenneth Casson Leighton wrote:
>
> :) the legality or illegality isn't interesting - or is a... red
> herring, being one of the unfortunate anarchistic-word-associations
> with the concept of "file sharing". the robustness and convenience
> aspects - to developers not users - is where it gets reaaally
> interesting.

I ask the question because I think being clear about what goals might
be are critically important. If in fact the goals is to evade
detection be spreading out responsibility for code which is illegal in
some jurisdictions (even if they are commonly used and approved of by
people who aren't spending millions of dollars purchasing
congresscritters), there are many additional requirements that are
imposed on such a system.

If the goal is speeding up git downloads, then we need to be careful
about exactly what problem we are trying to solve.

> i do not know of a single free software development tool - not a
> single one - which is peer-to-peer distributed. just... none. what
> does that say?? and we have people bitching about how great but
> non-free skype is. there seems to be a complete lack of understanding
> of the benefits of peer-to-peer infrastructure in the free software
> community as a whole, and a complete lack of interest in the benefits,
> too...

Maybe it's because the benefits don't exist for many people? At least
where I live, my local ISP (Comcast, which is very common internet
provider in the States) deliberately degrades the transfer of
peer2peer downloads. As a result, it doesn't make sense for me to use
bittorrent to download the latest Ubuntu or Fedora iso image. It's in
fact much faster for me to download it from an ftp site or a web site.

And git is *extremely* efficient about its network usage, since it
sends compressed deltas --- especially if you already have a base
responsitory estlablished. For example, I took a git repository which
I haven't touched since August 4th --- exactly one month ago --- and
did a "git fetch" to bring it up to date by downloading from
git.kernel.org. How much network traffic was required, after being
one month behind? 2.8MB of bytes received, 133k of bytes transmitted.

That's not a lot. And it's well within the capabilities of even a
really busy server to handle. Remember, peer2peer only helps if the
aggregate network bandwidth of the peers is greater than (a) your
download pipe, or (b) a central server's upload pipe. And if we're
only transmitting 2.8MB, and a git.kernel.org has an aggregate
connection of over a gigabit per second to the internet --- it's not
likely that peer2peer would in fact result in a faster download. Nor
is it likely that that git updates are likely to be something which
the kernel.org folks would even notice as a sizeable percentage of
their usable network bandwidth. First of all, ISO image files are
much bigger, and secondly, there are many more users downloading ISO
files than there are developers downloading git updates, and certainly
relatively few developers downloading full git repositories (since
everybody genreally tries really hard to only do this once).

> - perhaps for reasons no more complex than the tools don't exist
> so it's catch-22, and the fact that the word "distributed" is
> _already_ associated with the likes of SMTP, DNS and "git" so
> everybody thinks "we're okay, jack, go play with your nice dreams of
> p2p networking, we're gonna write _real_ code now".

Well, perhaps it's because what we have right now works pretty well. :-)

Which brings me back to my original question --- what problem exactly
are you trying to solve? What's the scenario?

If the answer is "peer2peer is cool technology, and we want to play",
that's fine. Put it would confirm the hypothesis that in this case,
peer2peer is a solution looking for a problem...

- Ted
Luke Kenneth Casson Leighton
2010-09-04 20:00:56 UTC
Permalink
On Sat, Sep 4, 2010 at 7:14 PM, Ted Ts'o <***@mit.edu> wrote:
>=C2=A0At least
> where I live, my local ISP (Comcast, which is very common internet
> provider in the States) deliberately degrades the transfer of
> peer2peer downloads.

if microsoft can add ncacn_http to MSRPC for the exact same sorts of
reasons, and even skype likewise provides a user-config option to
specify "port 80" or port "3128", then it's perfectly possible to do
likewise. ncacn_http actually has HTTP 1.1 headers on it, and, once
you've negotiated enough to look like HTTP, the raw socket is hander
over to MSRPC for it to play with.

> Which brings me back to my original question --- what problem exactly
> are you trying to solve? =C2=A0What's the scenario?

i described those in prior messages. to summarise: they're basically
reduction of dependence on centralised infrastructure, and to allow
developers to carry on doing code-sprints using bugtrackers, wikis and
anything else that can be "git-able" as its back-end, _even_ in the
cases where there is little or absolutely no bandwidth... and _still_
sync up globally once any one of the developers gets back online.

so i'm _not_ just thinking of the "code, code, code" scenario, and
i'm not just thinking in terms of the single developer "i code,
therefore i am" scenario. i'm thinking of scenarios which increase
the productivity of collaborative development even in the face of
unreliable or non-existent connectivity [carrier pigeons...]

l.
Ted Ts'o
2010-09-04 22:41:39 UTC
Permalink
On Sat, Sep 04, 2010 at 09:00:56PM +0100, Luke Kenneth Casson Leighton =
wrote:
> > Which brings me back to my original question --- what problem exact=
ly
> > are you trying to solve? =A0What's the scenario?
>=20
> i described those in prior messages. to summarise: they're basically
> reduction of dependence on centralised infrastructure, and to allow
> developers to carry on doing code-sprints using bugtrackers, wikis an=
d
> anything else that can be "git-able" as its back-end, _even_ in the
> cases where there is little or absolutely no bandwidth... and _still_
> sync up globally once any one of the developers gets back online.

So at all of the code sprints I've been at, the developers all have
locally very good bandwidth between each other. And if they don't
have wifi, what *will* they have? In the example you gave, you never
were able to bring up a local area network, because you had one or two
lamers who couldn't even do wifi in adhoc mode. Hell, even if you had
to hook up someone's laptop using an RS-232 line and PPP, that would
be plenty of bandwidth for git. So you weren't specific enough in
your scenario. How could it happen? And is it really all that realist=
ic?

Even if it did, it wouldn't be hard to just set up a git server on one
of the laptop. What makes peer2peer so critically important in this
use case? (And no, carrier pigeons are not particularly realistic for
a code sprint....)

- Ted
Luke Kenneth Casson Leighton
2010-09-05 17:22:59 UTC
Permalink
On Sat, Sep 4, 2010 at 11:41 PM, Ted Ts'o <***@mit.edu> wrote:
> On Sat, Sep 04, 2010 at 09:00:56PM +0100, Luke Kenneth Casson Leighto=
n wrote:
>> > Which brings me back to my original question --- what problem exac=
tly
>> > are you trying to solve? =C2=A0What's the scenario?
>>
>> i described those in prior messages. =C2=A0to summarise: they're bas=
ically
>> reduction of dependence on centralised infrastructure, and to allow
>> developers to carry on doing code-sprints using bugtrackers, wikis a=
nd
>> anything else that can be "git-able" as its back-end, _even_ in the
>> cases where there is little or absolutely no bandwidth... and _still=
_
>> sync up globally once any one of the developers gets back online.
>
> So at all of the code sprints I've been at, the developers all have
> locally very good bandwidth between each other. =C2=A0And if they don=
't

ted - with respect, much as i'd like to debate the merits or
otherwise of the purpose of this work, i'd far rather actually focus
on actually doing it. can i leave it to you and the other people here
on the list to debate both sides - actually three sides because there
is a case for helping casey to get "git hive" going, as well?

i look forward to seeing lots more ideas and use-cases beyond those
which i can envisage, and am grateful to the people who have been
privately contacting me to express gratitude at potentially having
something which makes software development and free software
involvement easier, in circumstance such as difficult or expensive
internet connectivity.

l.
Jakub Narebski
2010-09-04 20:20:42 UTC
Permalink
Ted Ts'o <***@mit.edu> writes:
> On Sat, Sep 04, 2010 at 03:50:29PM +0100, Luke Kenneth Casson Leighton wrote:
> >
> > :) the legality or illegality isn't interesting - or is a... red
> > herring, being one of the unfortunate anarchistic-word-associations
> > with the concept of "file sharing". the robustness and convenience
> > aspects - to developers not users - is where it gets reaaally
> > interesting.
>
> I ask the question because I think being clear about what goals might
> be are critically important. If in fact the goals is to evade
> detection be spreading out responsibility for code which is illegal in
> some jurisdictions (even if they are commonly used and approved of by
> people who aren't spending millions of dollars purchasing
> congresscritters), there are many additional requirements that are
> imposed on such a system.
>
> If the goal is speeding up git downloads, then we need to be careful
> about exactly what problem we are trying to solve.
>
> > i do not know of a single free software development tool - not a
> > single one - which is peer-to-peer distributed. just... none. what
> > does that say?? and we have people bitching about how great but
> > non-free skype is. there seems to be a complete lack of understanding
> > of the benefits of peer-to-peer infrastructure in the free software
> > community as a whole, and a complete lack of interest in the benefits,
> > too...

Luke, you don't have to be peer-to-peer to be decentralized and
distributed. People from what I understand bitch most about
centralized (and closed) services.

> Maybe it's because the benefits don't exist for many people? At least
> where I live, my local ISP (Comcast, which is very common internet
> provider in the States) deliberately degrades the transfer of
> peer2peer downloads. As a result, it doesn't make sense for me to use
> bittorrent to download the latest Ubuntu or Fedora iso image. It's in
> fact much faster for me to download it from an ftp site or a web site.
>
> And git is *extremely* efficient about its network usage, since it
> sends compressed deltas --- especially if you already have a base
> responsitory established. For example, I took a git repository which
> I haven't touched since August 4th --- exactly one month ago --- and
> did a "git fetch" to bring it up to date by downloading from
> git.kernel.org. How much network traffic was required, after being
> one month behind? 2.8MB of bytes received, 133k of bytes transmitted.

I think the major problem git-p2p wants to solve is if base repository
is *not* established, i.e. the initial fetch / full clone operation.

Note that with --reference argument to git clone, if you have similar
related repository, you don't have to do a full fetch cloning a fork
of repository you already have (e.g. you have Linus repo, and want to
fetch linux-next).

> That's not a lot. And it's well within the capabilities of even a
> really busy server to handle. Remember, peer2peer only helps if the
> aggregate network bandwidth of the peers is greater than (a) your
> download pipe, or (b) a central server's upload pipe. And if we're
> only transmitting 2.8MB, and a git.kernel.org has an aggregate
> connection of over a gigabit per second to the internet --- it's not
> likely that peer2peer would in fact result in a faster download. Nor
> is it likely that that git updates are likely to be something which
> the kernel.org folks would even notice as a sizeable percentage of
> their usable network bandwidth. First of all, ISO image files are
> much bigger, and secondly, there are many more users downloading ISO
> files than there are developers downloading git updates, and certainly
> relatively few developers downloading full git repositories (since
> everybody genreally tries really hard to only do this once).

Well, full initial clone of Linux kernel repository (or any other
large project with long history) is quite large. Also, not all
projects have big upload pipe.

Additional problem is that clone is currrently non-resumable (at all),
so if you have flaky web connection it might be hard to do initial
clone.


One way of solving this problem that (as I have heard some projects
use) is to prepare "initial" bundle; this bundle can be downloaded via
HTTP or FTP resumably, or be shared via ordinary P2P like BitTorrent.

The initial pack could be 'kept' (nt subject to repacking); with some
code it could serve as canonical starting packfile for cloning, I
think.

--
Jakub Narebski
Poland
ShadeHawk on #git
Luke Kenneth Casson Leighton
2010-09-04 20:47:19 UTC
Permalink
On Sat, Sep 4, 2010 at 9:20 PM, Jakub Narebski <***@gmail.com> wrote=
:

> Luke, you don't have to be peer-to-peer to be decentralized and
> distributed. =C2=A0People from what I understand bitch most about
> centralized (and closed) services.

i've covered this in the FAQ i wrote:

=46AQ:

Q: is git a "distributed source control system"?
A: yeees, but the "distribution" part has to be done the hard way,
by setting up servers, forcing developers and users to configure
git to use those [single-point-of-failure] servers. so it's
"more correct" to say that git is a "distributable" source control
system.


so if you believe that git is "distributed" just because people can
set up a server.... mmm :) i'd say that your administrative and
technical skills are way above the average persons' capabilities.
proper peer-to-peer networking infrastructure takes care of things
like firewall-busting, by using UPnP automatically, as part of the
infrastructure.

come on, people - _think_. we're so used to being able to run our own
infrastructure and workaround problems or server down-time, but most
people still use "winzip", if they have any kind of revision control
_at all_.

l.
Jakub Narebski
2010-09-04 21:16:30 UTC
Permalink
On Sat, Sep 4, 2010, Luke Kenneth Casson Leighton wrote:
> On Sat, Sep 4, 2010 at 9:20 PM, Jakub Narebski <***@gmail.com> wro=
te:
>=20
> > Luke, you don't have to be peer-to-peer to be decentralized and
> > distributed. =C2=A0People from what I understand bitch most about
> > centralized (and closed) services.
>=20
> i've covered this in the FAQ i wrote:
>=20
> FAQ:
>=20
> Q: is git a "distributed source control system"?
> A: yeees, but the "distribution" part has to be done the hard way,
> by setting up servers, forcing developers and users to configure
> git to use those [single-point-of-failure] servers. so it's
> "more correct" to say that git is a "distributable" source control
> system.
=20
"Distributed" is not equivalent to "peer to peer".
=20
> so if you believe that git is "distributed" just because people can
> set up a server.... mmm :) i'd say that your administrative and
> technical skills are way above the average persons' capabilities.

Git is distributed at least in the sense that it is opposite
to centralized version control systems (like Subversion)
where you have and can have only single server with full repository.

Setting up server (git, smart HTTP, ssh) is not that hard.

> proper peer-to-peer networking infrastructure takes care of things
> like firewall-busting, by using UPnP automatically, as part of the
> infrastructure.

With "smart" HTTP transport support there is no need for any=20
firewall-busting.
=20
> come on, people - _think_. we're so used to being able to run our ow=
n
> infrastructure and workaround problems or server down-time, but most
> people still use "winzip", if they have any kind of revision control
> _at all_.

In what way this paragraph refers to and is relewant with respect to
discussion in this subthread?

::plonk::
--=20
Jakub Narebski
Poland
Luke Kenneth Casson Leighton
2010-09-04 21:24:37 UTC
Permalink
On Sat, Sep 4, 2010 at 10:16 PM, Jakub Narebski <***@gmail.com> wrot=
e:
> On Sat, Sep 4, 2010, Luke Kenneth Casson Leighton wrote:
>> On Sat, Sep 4, 2010 at 9:20 PM, Jakub Narebski <***@gmail.com> wr=
ote:
>>
>> > Luke, you don't have to be peer-to-peer to be decentralized and
>> > distributed. =C2=A0People from what I understand bitch most about
>> > centralized (and closed) services.
>>
>> i've covered this in the FAQ i wrote:
>>
>> FAQ:
>>
>> Q: is git a "distributed source control system"?
>> A: yeees, but the "distribution" part has to be done the hard way,
>> =C2=A0 =C2=A0by setting up servers, forcing developers and users to =
configure
>> =C2=A0 =C2=A0git to use those [single-point-of-failure] servers. =C2=
=A0so it's
>> =C2=A0 =C2=A0"more correct" to say that git is a "distributable" sou=
rce control
>> =C2=A0 =C2=A0system.
>
> "Distributed" is not equivalent to "peer to peer".

correct. exactly.

> Setting up server (git, smart HTTP, ssh) is not that hard.

for you and me, and for the majority of people developing git, this is=
correct.

>> proper peer-to-peer networking infrastructure takes care of things
>> like firewall-busting, by using UPnP automatically, as part of the
>> infrastructure.
>
> With "smart" HTTP transport support there is no need for any
> firewall-busting.

the assumption is that the users are capable of deploying a server
(at all), and are happy to reconfigure to use that server.

ok.

this is getting off-topic and is distracting, both for me and for
people wishing to read about git. with respect, jakob, i'm going to
do something which i don't normally do, and that's begin to be
selective about what i reply to. apologies, but i'm on very short
timescales to get this code working, for financial reasons.

l.
Ted Ts'o
2010-09-04 22:47:26 UTC
Permalink
On Sat, Sep 04, 2010 at 09:47:19PM +0100, Luke Kenneth Casson Leighton wrote:
>
> Q: is git a "distributed source control system"?
> A: yeees, but the "distribution" part has to be done the hard way,
> by setting up servers, forcing developers and users to configure
> git to use those [single-point-of-failure] servers. so it's
> "more correct" to say that git is a "distributable" source control
> system.

Is typing the command "git daemon" on the command line really that
hard? For bonus points it could register with Avahi (i.e., the
Zeroconf protocol), which would make it easier to set up adhoc sharing
arrangements. But if that's your goal, I'd suggest making "git
daemon" simpler to set up --- since it's pretty trivial to set up as
it is. I'd then suggest adding to "git gui" an option to allow a user
to browse local git servers who have advertised themselves using the
Zeroconf protocol and do a git fetch from the local server.

But if you want to work on peer2peer because it's "cool", feel free.
But I suspect telling people that all if you have to do is run "git
daemon" on one machine, and using "git gui" to browse through the
available git servers on the other machines, it really isn't going to
get any easier....

- Ted
Tomas Carnecky
2010-09-05 01:43:38 UTC
Permalink
On 9/5/10 12:47 AM, Ted Ts'o wrote:
> hard? For bonus points it could register with Avahi (i.e., the
> Zeroconf protocol), which would make it easier to set up adhoc sharing
> arrangements. But if that's your goal, I'd suggest making "git

http://github.com/toolmantim/bananajour, and there's another project
with similar goals, but I forgot its name and can't find it right now.

tom
Nicolas Pitre
2010-09-05 01:18:16 UTC
Permalink
On Sat, 4 Sep 2010, Theodore Tso wrote:

>
> On Sep 4, 2010, at 1:40 AM, Nicolas Pitre wrote:
> >> What about the order of the objects in the pack? Well, ordering
> >> doesn't matter, right? So let's assume the pack is sorted by hash id.
> >> Is there any downside to that? I can't think of any, but you're the
> >> pack expert...
> >
> > Ordering does matter a big deal. Since object IDs are the SHA1 of their
> > content, those IDs are totally random. So if you store objects
> > according to their sorted IDs, then the placement of objects belonging
> > to, say, the top commit will be totally random. And since you are the
> > filesystem expert, I don't have to tell you what performance impacts
> > this random access of small segments of data scattered throughout a
> > 400MB file will have on a checkout operation.
>
> Does git repack optimize the order so that certain things (like
> checkouts for example) are really fast? I admit I hadn't noticed.

It does indeed. The object ordering in the pack is so that checking out
the most recent commit will perform an almost perfect linear read from
the beginning of the pack file. Then the further back you go in the
commit history the more random the access in the pack will be. But
that's OK as the most accessed commits are the recent ones, so we
optimize object placement for that case.

> Usually until the core packs are in my page cache, it has always
> seemed to me that things are pretty slow.

Cold cache numbers could be much much worse. In theory, checking out
the latest commit after a repack should be similar to extracting the
equivalent source tree from a tarball.

> And of course, the way objects and grouped together and ordered for
> "gitk" or "git log" to be fast won't be the safe as a checkout
> operation...

Indeed. But that case is optimized too, as all the commit objects are
put together at the front of the pack so the walking of the commit
history has good access locality too.

> > Sure. But I don't think it is worth making Git less flexible just for
> > the purpose of ensuring that people could independently create identical
> > packs. I'd advocate for "no code to write at all" instead, and simply
> > have one person create and seed the reference pack.
>
> I don't think it's a matter of making Git "less flexible", it's just
> simply a code maintenance headache of needing to be able to support
> encoding both a canonical format as well as the latest bleeding-edge,
> most efficient encoding format.

Indeed. But what I'm saying is that one would have to put some efforts
into that canonical format by removing the current flexibility that Git
has in producing a pack. For example, right now Git is relying on that
flexibility when using threads. The algorithm for load balancing ends up
making slight differences between different repack invocations even when
they're started with the same input. With multiple threads, the output
from pack-objects is therefore not deterministic.

> And how often are you changing/improving the encoding process, anyway?
> It didn't seem to me like that part of the code was constantly being
> tweaked/improved.

It used to change quite often. It is true that these days things have
slowed down in that area, but I prefer to keep options open unless there
is a really convincing reason not to.

> Still, you're right, it might not be worth it. To be honest, I was
> more interested about the fact that this might also be used to give
> people hints about how to better repack their local repositories so
> that they didn't have to run git repack with large --window and
> --depth arguments. But that would only provide very small
> improvements in storage space in most cases, so it's probably not even
> worth it for that.
>
> Quite frankly, I'm a little dubious about how critical peer2peer
> really is, for pretty much any use case.

I agree. So far it has been an interesting topic for discussion, but in
practice I doubt the actual benefits will justify the required efforts
and/or constraints on the protocol. Otherwise we would have a working
implementation in use already. People tried in the past, and so far
none of those attempts passed the reality test nor kept people motivated
enough to work on them further.

> Most of the time, I can grab
> the base "reference" tree and drop it on my laptop before I go off the
> grid and have to rely on EDGE or some other slow networking
> technology. And if the use case is some small, but
> illegal-in-some-jurisdiction code, such as ebook DRM liberation
> scripts (the kind which today are typically distributed via pastebin's
> :-), my guess is that zipping up a git repository and dropping it on a
> standard bittorrent server run by the Swedish Pirate party is going to
> be much more effective. :-)

Who cares about the development history for such scripts anyway?


Nicolas
Luke Kenneth Casson Leighton
2010-09-05 17:25:42 UTC
Permalink
On Sun, Sep 5, 2010 at 2:18 AM, Nicolas Pitre <***@fluxnic.net> wrote
> I agree. =C2=A0So far it has been an interesting topic for discussion=
, but in
> practice I doubt the actual benefits will justify the required effort=
s
> and/or constraints on the protocol. Otherwise we would have a working
> implementation in use already. =C2=A0People tried in the past, and so=
far
> none of those attempts passed the reality test nor kept people motiva=
ted
> enough to work on them further.

then i'm all the more grateful that you continue to drop technical
hints in my direction. thank you for not judging.

l.
Nicolas Pitre
2010-09-06 00:05:26 UTC
Permalink
On Sun, 5 Sep 2010, Luke Kenneth Casson Leighton wrote:

> On Sun, Sep 5, 2010 at 2:18 AM, Nicolas Pitre <***@fluxnic.net> wrote
> > I agree.  So far it has been an interesting topic for discussion, but in
> > practice I doubt the actual benefits will justify the required efforts
> > and/or constraints on the protocol. Otherwise we would have a working
> > implementation in use already.  People tried in the past, and so far
> > none of those attempts passed the reality test nor kept people motivated
> > enough to work on them further.
>
> then i'm all the more grateful that you continue to drop technical
> hints in my direction. thank you for not judging.

Well, either you'll come to the same conclusion as the other people
before you (myself included), or you'll surprise us all with some clever
solution. But that part is up to you. In either cases, I think it is a
good thing if I can help you grasp the technical limitations and issues
faster so you don't waste your time on false assumptions.


Nicolas
Luke Kenneth Casson Leighton
2010-09-04 12:33:10 UTC
Permalink
On Sat, Sep 4, 2010 at 6:40 AM, Nicolas Pitre <***@fluxnic.net> wrote:

> But again, I don't think it is worth freezing the pack format into a
> canonical encoding for this purpose.

... and, even if different seeds produce different pack index files,
or even if the same seed produces a different pack index file, it
doesn't matter because the objects will still be there, can still be
shared, can still be used to reconstruct the commit.

so even if you end up asking for and sharing different objects
because you're using a different index file, it really really doesn't
matter: all roads lead to rome.

you might end up not _having_ some particular objects (due to git gc)
but that's ok: just re-request the index file again (and go through
the process of evaluating / comparing the index fan-objects again
*sigh* you can't have everything).

l.
Continue reading on narkive:
Loading...