Discussion: Local clones (aka forks) disk size optimization
Javier Domingo
2012-11-14 23:42:00 UTC
Hi,

I have come up with this while doing some local forks for work.
Currently, when you clone a repo using a path (not the file:/// protocol),
you get all the common objects hardlinked.

But as you work, each clone keeps growing on its own, even though
they may share common objects.

Is there any way to avoid this? I mean, could git check (when
pulling) whether the same objects already exist in the other forks?

Though this doesn't make much sense for clients, when you have to
maintain 20 forks of very big projects on the server side, it eats
precious disk space.

I don't know whether this should have [RFC] in the subject or not, but
here is my idea.

As hardlinking is already done by git, if it checked how many links
there are for its files, it would be able to find the other directories
to search in. The easiest way is to check the oldest pack.

Hope you like this idea,

Javier Domingo
Andrew Ardill
2012-11-15 00:18:27 UTC
Post by Javier Domingo
Hi,
I have come up with this while doing some local forks for work.
Currently, when you clone a repo using a path (not the file:/// protocol),
you get all the common objects hardlinked.
But as you work, each clone keeps growing on its own, even though
they may share common objects.
Is there any way to avoid this? I mean, could git check (when
pulling) whether the same objects already exist in the other forks?
How to share objects between existing repositories?
---------------------------------------------------------------------------
Do
echo "/source/git/project/.git/objects/" > .git/objects/info/alternates
and then follow it up with
git repack -a -d -l
where the '-l' means that it will only put local objects in the pack-file
(strictly speaking, it will put any loose objects from the alternate tree
too, so you'll have a fully packed archive, but it won't duplicate objects
that are already packed in the alternate tree).
[1] https://git.wiki.kernel.org/index.php/GitFaq#How_to_share_objects_between_existing_repositories.3F
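For example, sharing objects between two existing server-side clones might look like this (only a sketch; the paths are made up and both repositories are assumed to be bare):

cd /srv/git/fork1.git
echo "/srv/git/main.git/objects" > objects/info/alternates
git repack -a -d -l    # keep only objects not already packed in main.git
du -sh objects         # the fork's object store should now be much smaller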


Regards,

Andrew Ardill
Javier Domingo
2012-11-15 00:40:58 UTC
Hi Andrew,

The problem with that is that if I want to delete the first repo, I
will lose objects... Or does that repack also hard-link the objects
in the other repos? I don't want to accidentally lose data, so it would
be nice if, while avoiding repacking things, it also hardlinked them.
Javier Domingo
Post by Andrew Ardill
Post by Javier Domingo
Hi,
I have come up with this while doing some local forks for work.
Currently, when you clone a repo using a path (not the file:/// protocol),
you get all the common objects hardlinked.
But as you work, each clone keeps growing on its own, even though
they may share common objects.
Is there any way to avoid this? I mean, could git check (when
pulling) whether the same objects already exist in the other forks?
How to share objects between existing repositories?
---------------------------------------------------------------------------
Do
echo "/source/git/project/.git/objects/" > .git/objects/info/alternates
and then follow it up with
git repack -a -d -l
where the '-l' means that it will only put local objects in the pack-file
(strictly speaking, it will put any loose objects from the alternate tree
too, so you'll have a fully packed archive, but it won't duplicate objects
that are already packed in the alternate tree).
[1] https://git.wiki.kernel.org/index.php/GitFaq#How_to_share_objects_between_existing_repositories.3F
Regards,
Andrew Ardill
Andrew Ardill
2012-11-15 00:53:19 UTC
Post by Javier Domingo
Hi Andrew,
The problem with that is that if I want to delete the first repo, I
will lose objects... Or does that repack also hard-link the objects
in the other repos? I don't want to accidentally lose data, so it would
be nice if, while avoiding repacking things, it also hardlinked them.
How to stop sharing objects between repositories?
To copy the shared objects into the local repository, repack without the -l flag
git repack -a
Then remove the pointer to the alternate object store
rm .git/objects/info/alternates
(If the repository is edited between the two steps, it could become corrupted
when the alternates file is removed. If you're unsure, you can use git fsck to
check for corruption. If things go wrong, you can always recover by replacing
the alternates file and starting over).
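Put together, making a fork self-contained again might look like this (a sketch with a hypothetical path, assuming a bare repository):

cd /srv/git/fork1.git
git repack -a                 # copy the borrowed objects into a local pack
rm objects/info/alternates    # stop borrowing from the other repository
git fsck --full               # verify nothing is missing now that borrowing has stopped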
Regards,

Andrew Ardill
Javier Domingo
2012-11-15 01:15:07 UTC
Hi Andrew,

Doing this would require me to keep track of which one comes from which,
so it would imply some logic (and a database) on top of it. With the
hardlinking way, nothing extra would be required. The idea is that you
don't have to do anything else on the server.

I understand that it would be impossible to do for Windows users
(unless using Cygwin), but for *nix ones, yes...
Javier Domingo
Post by Andrew Ardill
Post by Javier Domingo
Hi Andrew,
The problem with that is that if I want to delete the first repo, I
will lose objects... Or does that repack also hard-link the objects
in the other repos? I don't want to accidentally lose data, so it would
be nice if, while avoiding repacking things, it also hardlinked them.
How to stop sharing objects between repositories?
To copy the shared objects into the local repository, repack without the -l flag
git repack -a
Then remove the pointer to the alternate object store
rm .git/objects/info/alternates
(If the repository is edited between the two steps, it could become corrupted
when the alternates file is removed. If you're unsure, you can use git fsck to
check for corruption. If things go wrong, you can always recover by replacing
the alternates file and starting over).
Regards,
Andrew Ardill
Andrew Ardill
2012-11-15 01:34:13 UTC
Post by Javier Domingo
Hi Andrew,
Doing this would require me to keep track of which one comes from which,
so it would imply some logic (and a database) on top of it. With the
hardlinking way, nothing extra would be required. The idea is that you
don't have to do anything else on the server.
I understand that it would be impossible to do for Windows users
(unless using Cygwin), but for *nix ones, yes...
Javier Domingo
Paraphrasing from git-clone(1):

When cloning a repository, if the source repository is specified with
/path/to/repo syntax, the default is to clone the repository by making
a copy of HEAD and everything under objects and refs directories. The
files under .git/objects/ directory are hardlinked to save space when
possible. To force copying instead of hardlinking (which may be
desirable if you are trying to make a back-up of your repository)
--no-hardlinks can be used.

So hardlinks should be used where possible, and if they are not, try
upgrading Git.
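You can check whether the packs really were hardlinked with something like this (hypothetical paths; identical inode numbers and a link count greater than 1 mean the files are shared, not copied):

git clone /srv/git/main.git fork2
ls -li /srv/git/main.git/objects/pack/ fork2/.git/objects/pack/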

I think that covers all the use cases you have?

Regards,

Andrew Ardill
Sitaram Chamarty
2012-11-15 03:44:10 UTC
Post by Andrew Ardill
Post by Javier Domingo
Hi Andrew,
Doing this would require me to keep track of which one comes from which,
so it would imply some logic (and a database) on top of it. With the
hardlinking way, nothing extra would be required. The idea is that you
don't have to do anything else on the server.
I understand that it would be impossible to do for Windows users
(unless using Cygwin), but for *nix ones, yes...
Javier Domingo
When cloning a repository, if the source repository is specified with
/path/to/repo syntax, the default is to clone the repository by making
a copy of HEAD and everything under objects and refs directories. The
files under .git/objects/ directory are hardlinked to save space when
possible. To force copying instead of hardlinking (which may be
desirable if you are trying to make a back-up of your repository)
--no-hardlinks can be used.
So hardlinks should be used where possible, and if they are not try
upgrading Git.
I think that covers all the use cases you have?
I am not sure it does. My understanding is this:

'git clone -l' saves space on the initial clone, but subsequent pushes
end up with the same objects duplicated across all the "forks"
(assuming most of the forks keep up with some canonical repo).

The alternates mechanism can give you ongoing savings (as long as you
push to the "main" repo first), but it is dangerous, in the words of
the git-clone manpage. You have to be confident no one will delete a
ref from the "main" repo and then do a gc or let it auto-gc.

He's looking for something that addresses both these issues.

As an additional idea, I suspect this is what the namespaces feature
was created for, but I am not sure, and have never played with it till
now.

Maybe someone who knows namespaces very well will chip in...
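For what it's worth, gitnamespaces(7) suggests serving several "forks" out of a single repository, roughly like this (untested on my side; the repository path and namespace names are made up):

git init --bare /srv/git/shared.git
# push fork1's master into its own namespace of the shared repo
git push ext::'git --namespace=fork1 %s /srv/git/shared.git' master
# a clone sees only the refs inside that namespace, but all objects live in one store
git clone ext::'git --namespace=fork1 %s /srv/git/shared.git' fork1-checkout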
Michael J Gruber
2012-11-16 11:25:29 UTC
Post by Sitaram Chamarty
Post by Andrew Ardill
Post by Javier Domingo
Hi Andrew,
Doing this would require me to keep track of which one comes from which,
so it would imply some logic (and a database) on top of it. With the
hardlinking way, nothing extra would be required. The idea is that you
don't have to do anything else on the server.
I understand that it would be impossible to do for Windows users
(unless using Cygwin), but for *nix ones, yes...
Javier Domingo
When cloning a repository, if the source repository is specified with
/path/to/repo syntax, the default is to clone the repository by making
a copy of HEAD and everything under objects and refs directories. The
files under .git/objects/ directory are hardlinked to save space when
possible. To force copying instead of hardlinking (which may be
desirable if you are trying to make a back-up of your repository)
--no-hardlinks can be used.
So hardlinks should be used where possible, and if they are not try
upgrading Git.
I think that covers all the use cases you have?
'git clone -l' saves space on the initial clone, but subsequent pushes
end up with the same objects duplicated across all the "forks"
(assuming most of the forks keep up with some canonical repo).
The alternates mechanism can give you ongoing savings (as long as you
push to the "main" repo first), but it is dangerous, in the words of
the git-clone manpage. You have to be confident no one will delete a
ref from the "main" repo and then do a gc or let it auto-gc.
He's looking for something that addresses both these issues.
As an additional idea, I suspect this is what the namespaces feature
was created for, but I am not sure, and have never played with it till
now.
Maybe someone who knows namespaces very well will chip in...
I dunno about namespaces, but a safe route with alternates seems to be:

Provide one "main" clone which is bare, pulls automatically, and is
there to stay (no pruning), so that all others can use that as a
reliable alternates source.
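A sketch of that setup (URLs and paths are hypothetical; --reference just writes objects/info/alternates for you):

git clone --mirror https://example.com/project.git /srv/git/main.git
(cd /srv/git/main.git && git config gc.pruneExpire never)   # never prune, so borrowers stay safe
git clone --bare --reference /srv/git/main.git https://example.com/project.git /srv/git/fork1.git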

Michael
Enrico Weigelt
2012-11-16 18:04:35 UTC
Post by Michael J Gruber
Provide one "main" clone which is bare, pulls automatically, and is
there to stay (no pruning), so that all others can use that as a
reliable alternates source.
The problem here, IMHO, is the assumption that the main repo will
never be cleaned up. But what do you do if you don't want to let it grow
forever?

hmm, distributed GC is a tricky problem.

Maybe it could be easier to have two kinds of alternates:

a) classical: gc and friends will drop local objects that are
already there
b) fallback: normal operations fetch objects if they are not accessible
from anywhere else, but gc and friends do not skip objects from there.

And extend the prune machinery to put a backup of the dropped objects
into some separate store.

This way we could use some kind of rotating archive:

* GC'ed objects will be stored in the backup repo for a while
* there are multiple active (rotating) backups kept for some time;
each cycle, only the oldest one is dropped (and maybe objects
in a newer backup are removed from the older ones)
* downstream repos must be synced often enough that removed objects
are fetched back from the backups early enough

You could see this as a kind of heap:

* the currently active objects (directly referenced) are always
on the top
* once they're not referenced, they sink a level deeper
* when they're referenced again, they immediately jump up to the top
* at some point in time unreferenced objects sink so deep that
they're dropped completely
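A very rough sketch of what such a rotating archive could look like with today's plumbing (entirely hypothetical, bash; nothing like this exists in git itself):

cd /srv/git/main.git
backup=backup/$(date +%Y-%m-%d)
mkdir -p "$backup"
# park the loose objects that prune is about to drop
git prune --dry-run --expire=2.weeks.ago |
while read sha type; do
    cp "objects/${sha:0:2}/${sha:2}" "$backup/$sha"
done
git prune --expire=2.weeks.ago
# rotate: keep only the newest ~90 days of backups
find backup -mindepth 1 -maxdepth 1 -type d -mtime +90 -exec rm -rf {} +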



cu
--
Mit freundlichen Grüßen / Kind regards

Enrico Weigelt
VNC - Virtual Network Consult GmbH
Head Of Development

Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59

***@vnc.biz; www.vnc.de
Sitaram Chamarty
2012-11-18 10:42:06 UTC
Post by Enrico Weigelt
Post by Michael J Gruber
Provide one "main" clone which is bare, pulls automatically, and is
there to stay (no pruning), so that all others can use that as a
reliable alternates source.
The problem here, IMHO, is the assumption that the main repo will
never be cleaned up. But what do you do if you don't want to let it grow
forever?
That's not the only problem. I believe you only get the savings when
the main repo gets the commits first. Which is probably ok most of
the time but it's worth mentioning.
Post by Enrico Weigelt
hmm, distributed GC is a tricky problem.
Except for one little issue (see other thread, subject line "cloning a
namespace downloads all the objects"), namespaces appear to do
everything we want in terms of the typical use cases for alternates,
and/or 'git clone -l', at least on the server side.
Enrico Weigelt
2012-11-18 17:02:37 UTC
Hi,
Post by Sitaram Chamarty
That's not the only problem. I believe you only get the savings when
the main repo gets the commits first. Which is probably ok most of
the time but it's worth mentioning.
Well, the saving will just be deferred to the point where the commit
has finally reached the main repo and the downstreams are gc'ed.
Post by Sitaram Chamarty
Post by Enrico Weigelt
hmm, distributed GC is a tricky problem.
Except for one little issue (see other thread, subject line "cloning a
namespace downloads all the objects"), namespaces appear to do
everything we want in terms of the typical use cases for alternates,
and/or 'git clone -l', at least on the server side.
Hmm, I'm not sure about the actual internals, but that namespace filtering
should work in such a way that a local clone never sees (or considers)
remote refs outside of the requested namespace. Perhaps that
should be handled entirely on the server side, so that all invoked commands
treat those refs as nonexistent.

By the way: what happens if one tries to clone from a broken repo
(which has several refs pointing to nonexistent objects)?


cu
--
Mit freundlichen Grüßen / Kind regards

Enrico Weigelt
VNC - Virtual Network Consult GmbH
Head Of Development

Pariser Platz 4a, D-10117 Berlin
Tel.: +49 (30) 3464615-20
Fax: +49 (30) 3464615-59

***@vnc.biz; www.vnc.de
Pyeron, Jason J CTR (US)
2012-11-16 14:55:21 UTC
-----Original Message-----
From: Javier Domingo
Sent: Wednesday, November 14, 2012 8:15 PM
Hi Andrew,
Doing this would require me to keep track of which one comes from which,
so it would imply some logic (and a database) on top of it. With the
hardlinking way, nothing extra would be required. The idea is that you
don't have to do anything else on the server.
I understand that it would be impossible to do for Windows users
Not true, it is a file system issue, not an OS issue. FAT does not support hard links, but ext2/3/4 and NTFS do.
(unless using Cygwin), but for *nix ones, yes...
Javier Domingo
Jörg Rosenkranz
2012-11-18 17:18:56 UTC
Post by Javier Domingo
Is there any way to avoid this? I mean, could git check (when
pulling) whether the same objects already exist in the other forks?
I've been using git-new-workdir
(https://github.com/git/git/blob/master/contrib/workdir/git-new-workdir)
for a similar problem. Maybe that's what you're searching for?
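For anyone who hasn't seen it, usage is roughly (paths and branch name made up):

git-new-workdir /path/to/existing/repo /path/to/new/workdir somebranch

It symlinks the object store (and refs, config, ...) of the existing repository into the new working directory, so nothing is duplicated on disk.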

Joerg.
