Discussion:
remove_duplicates() in builtin/fetch-pack.c is O(N^2)
Michael Haggerty
2012-05-21 08:13:33 UTC
Permalink
I just noticed that the remove_duplicates() function in
builtin/fetch-pack.c is O(N^2) in the number of heads. Empirically,
this function takes on the order of 25 seconds to process 100k references.

I know that 100k heads is kind of absurd. Perhaps handling this many
heads is unrealistic for other reasons. But I vaguely recall numbers
like this being mentioned on the mailing list.

It would be pretty trivial to reduce the work to O(N) by using a hash
set to keep track of the references that have already been seen.

I don't plan to work on this, but I thought I would point it out in case
it is causing somebody pain.

Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Junio C Hamano
2012-05-21 09:09:34 UTC
Permalink
Post by Michael Haggerty
It would be pretty trivial to reduce the work to O(N) by using a hash
set to keep track of the references that have already been seen.
Sounds like a sensible thing to do.
demerphq
2012-05-21 09:42:06 UTC
Permalink
Post by Michael Haggerty
I just noticed that the remove_duplicates() function in builtin/fetch-pack.c
is O(N^2) in the number of heads. Empirically, this function takes on the
order of 25 seconds to process 100k references.
Does "heads" in this context include tags? Sorry if this is a stupid question.

Yves


--
perl -Mre=debug -e "/just|another|perl|hacker/"
Jeff King
2012-05-21 17:45:25 UTC
Permalink
Post by Michael Haggerty
I just noticed that the remove_duplicates() function in
builtin/fetch-pack.c is O(N^2) in the number of heads. Empirically,
this function takes on the order of 25 seconds to process 100k
references.
I know that 100k heads is kind of absurd. Perhaps handling this many
heads is unrealistic for other reasons. But I vaguely recall numbers
like this being mentioned on the mailing list.
The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.

I've never triggered this one, though, because it relies not just on
having a repo with a lot of refs, but actually fetching all of them at
one time (which we don't tend to do).

But it would be nice to fix it. There is a similar case in filter_refs,
which I mentioned here:

http://article.gmane.org/gmane.comp.version-control.git/186994
Post by Michael Haggerty
It would be pretty trivial to reduce the work to O(N) by using a hash
set to keep track of the references that have already been seen.
I don't think there is any reason we can't sort the list of heads, and
then we can get rid of the duplicates with an O(n) traversal, like the
(largely untested) patch below.
Post by Michael Haggerty
I don't plan to work on this, but I thought I would point it out in
case it is causing somebody pain.
I'll clean up the patch and make one for the filter_refs case, too.

-Peff

---
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index b6cc75e..7efcf2f 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -859,25 +859,23 @@ static struct ref *do_fetch_pack(int fd[2],
return ref;
}

+static int compare_heads(const void *a, const void *b)
+{
+ return strcmp(*(const char **)a, *(const char **)b);
+}
+
static int remove_duplicates(int nr_heads, char **heads)
{
int src, dst;

- for (src = dst = 0; src < nr_heads; src++) {
- /* If heads[src] is different from any of
- * heads[0..dst], push it in.
- */
- int i;
- for (i = 0; i < dst; i++) {
- if (!strcmp(heads[i], heads[src]))
- break;
- }
- if (i < dst)
- continue;
- if (src != dst)
- heads[dst] = heads[src];
- dst++;
- }
+ if (!nr_heads)
+ return 0;
+
+ qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+
+ for (src = dst = 1; src < nr_heads; src++)
+ if (strcmp(heads[src], heads[dst-1]))
+ heads[dst++] = heads[src];
return dst;
}
Jeff King
2012-05-21 22:14:17 UTC
Permalink
Post by Jeff King
Post by Michael Haggerty
I don't plan to work on this, but I thought I would point it out in
case it is causing somebody pain.
I'll clean up the patch and make one for the filter_refs case, too.
Here are the patches. I tested these with three repositories:

- the rails/rails network repo, with ~400K refs

- the git/git network repo, with ~220k refs

- a fake repo made from git.git, with 3 numbered refs pointing at each
commit for a total of ~100K refs (I didn't have a real dataset that
was ~100K).

Before these patches, I never managed to complete a "git clone --mirror
file://$repo" on the first two, as I got impatient after 5-10 minutes
and killed the process. The third one completed in ~110s.

[1/5]: fetch-pack: sort incoming heads
[2/5]: fetch-pack: avoid quadratic behavior in remove_duplicates

After this one, the "fake" case was down to ~63s.

[3/5]: add sorting infrastructure for list refs
[4/5]: fetch-pack: sort the list of incoming refs
[5/5]: fetch-pack: avoid quadratic loop in filter_refs

And after this one, the "fake" case was down to ~32s. Notably, most of
the time was spent on the actual object transfer (i.e., before it would
hang before even getting to "Counting objects...", and now it starts
that almost immediately). Perf corroborates this by showing most of the
time in zlib inflate and delta resolution.

The rails and git cases run in ~28s and ~37s, respectively, again mostly
going to the actual object transfer. So I think this series removes all
of the asymptotically bad behavior from this code path.

One thing to note about all of these repos is that they tend to have
several refs pointing to a single commit. None of the speedups in this
series depends on that fact, but it may be that on a repo with more
independent refs, we may uncover other code paths (e.g., I know that my
fix for mark_complete in ea5f220 improves the case with duplicate refs,
but would not help if you really have 400K refs pointing to unique
commits[1]).

Martin, let me know if this improves things for your many-ref cases (and
if you are still seeing slowness in your repos with many refs, let me
know which operations cause it).

-Peff

[1] At the time of that commit, I proposed a fix with a priority queue
replacing the commit_list, but we deemed it too much code for
a case that was unlikely to happen. But now it sounds like that case
is less unlikely than we thought, and now that we have René's
linked-list merge-sort code, I think we could fix it with a 2-line
change. I'll look into it.
Jeff King
2012-05-21 22:17:02 UTC
Permalink
There's no reason to preserve the incoming order of the
heads we're requested to fetch. By having them sorted, we
can replace some of the quadratic algorithms with linear
ones.

Signed-off-by: Jeff King <***@peff.net>
---
I actually wouldn't be surprised if these were typically sorted already,
as they frequently come from the ref-mapping functions, which in turn
process the lists we get from the remote. But we also might get random
junk on the command-line of fetch-pack, so we need to be careful.

builtin/fetch-pack.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 10db15b..380743e 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -1057,6 +1057,11 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
return ret;
}

+static int compare_heads(const void *a, const void *b)
+{
+ return strcmp(*(const char **)a, *(const char **)b);
+}
+
struct ref *fetch_pack(struct fetch_pack_args *my_args,
int fd[], struct child_process *conn,
const struct ref *ref,
@@ -1076,6 +1081,8 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
st.st_mtime = 0;
}

+ qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+
if (heads && nr_heads)
nr_heads = remove_duplicates(nr_heads, heads);
if (!ref) {
--
1.7.10.1.19.g711d603
Junio C Hamano
2012-05-22 20:08:42 UTC
Permalink
Post by Jeff King
There's no reason to preserve the incoming order of the
heads we're requested to fetch. By having them sorted, we
can replace some of the quadratic algorithms with linear
ones.
---
I actually wouldn't be surprised if these were typically sorted already,
as they frequently come from the ref-mapping functions, which in turn
process the lists we get from the remote. But we also might get random
junk on the command-line of fetch-pack, so we need to be careful.
builtin/fetch-pack.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 10db15b..380743e 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -1057,6 +1057,11 @@ int cmd_fetch_pack(int argc, const char **argv, const char *prefix)
return ret;
}
+static int compare_heads(const void *a, const void *b)
+{
+ return strcmp(*(const char **)a, *(const char **)b);
+}
+
struct ref *fetch_pack(struct fetch_pack_args *my_args,
int fd[], struct child_process *conn,
const struct ref *ref,
@@ -1076,6 +1081,8 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
st.st_mtime = 0;
}
+ qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+
if (heads && nr_heads)
nr_heads = remove_duplicates(nr_heads, heads);
Hrm, could heads and/or nr_heads be NULL/0 here when we try to run qsort()
in this codepath?
Post by Jeff King
if (!ref) {
Jeff King
2012-05-22 20:23:36 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
@@ -1076,6 +1081,8 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
st.st_mtime = 0;
}
+ qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+
if (heads && nr_heads)
nr_heads = remove_duplicates(nr_heads, heads);
Hrm, could heads and/or nr_heads be NULL/0 here when we try to run qsort()
in this codepath?
Good catch. I had originally put the qsort into remove_duplicates, but
hoisted it out, as the second optimization depends on the sorting, too.
heads can be NULL here (for example, if you run fetch-pack without any
arguments, and without --stdin; though why you would do so is a
mystery, but we should protect against it).

A sane qsort would see that its second parameter is 0 and never try to
dereference the array. But I'm not sure all qsort implementations we
will see are sane, so it's probably better to protect it by putting it
inside the conditional block just below.

-Peff
Jeff King
2012-05-24 06:04:51 UTC
Permalink
Post by Jeff King
Post by Junio C Hamano
Post by Jeff King
@@ -1076,6 +1081,8 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
st.st_mtime = 0;
}
+ qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+
if (heads && nr_heads)
nr_heads = remove_duplicates(nr_heads, heads);
Hrm, could heads and/or nr_heads be NULL/0 here when we try to run qsort()
in this codepath?
[...]
A sane qsort would see that its second parameter is 0 and never try to
dereference the array. But I'm not sure all qsort implementations we
will see are sane, so it's probably better to protect it by putting it
inside the conditional block just below.
I eye-balled what you queued in pu, and I wonder if you mis-read my
"just below" as "just below the existing line in the conditional" and
not "in the conditional that is just below the code we are talking
about".

I think we need this on top (or squashed in, but it looks like it is
already in next):

-- >8 --
Subject: fetch-pack: sort incoming heads list earlier

Commit 4435968 started sorting heads fed to fetch-pack so
that later commits could use more optimized algorithms;
commit 7db8d53 switched the remove_duplicates function to
such an algorithm.

Of course, the sorting is more effective if you do it
_before_ the algorithm in question.

Signed-off-by: Jeff King <***@peff.net>
---
I suspect that all parts of git feed the refs in sorted order already,
which is why there were no test failures. But we should handle arbitrary
order from the command-line.

builtin/fetch-pack.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 8a72473..b18ba05 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -1082,8 +1082,8 @@ struct ref *fetch_pack(struct fetch_pack_args *my_args,
}

if (heads && nr_heads) {
- nr_heads = remove_duplicates(nr_heads, heads);
qsort(heads, nr_heads, sizeof(*heads), compare_heads);
+ nr_heads = remove_duplicates(nr_heads, heads);
}

if (!ref) {
--
1.7.10.1.25.g7031a0f
Jeff King
2012-05-21 22:17:20 UTC
Permalink
We remove duplicate entries from the list of refs we are
fed in fetch-pack. The original algorithm is quadratic over
the number of refs, but since the list is now guaranteed to
be sorted, we can do it in linear time.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/fetch-pack.c | 21 ++++++---------------
1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 380743e..3522d8e 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -834,21 +834,12 @@ static int remove_duplicates(int nr_heads, char **heads)
{
int src, dst;

- for (src = dst = 0; src < nr_heads; src++) {
- /* If heads[src] is different from any of
- * heads[0..dst], push it in.
- */
- int i;
- for (i = 0; i < dst; i++) {
- if (!strcmp(heads[i], heads[src]))
- break;
- }
- if (i < dst)
- continue;
- if (src != dst)
- heads[dst] = heads[src];
- dst++;
- }
+ if (!nr_heads)
+ return 0;
+
+ for (src = dst = 1; src < nr_heads; src++)
+ if (strcmp(heads[src], heads[dst-1]))
+ heads[dst++] = heads[src];
return dst;
}
--
1.7.10.1.19.g711d603
Jeff King
2012-05-21 22:19:28 UTC
Permalink
Since we store lists of refs as linked lists, we can use
llist_mergesort to efficiently sort them.

Signed-off-by: Jeff King <***@peff.net>
---
Many thanks to Ren=C3=A9 for the llist_mergesort code.

remote.c | 22 ++++++++++++++++++++++
remote.h | 2 ++
2 files changed, 24 insertions(+)

diff --git a/remote.c b/remote.c
index b296d17..6833538 100644
--- a/remote.c
+++ b/remote.c
@@ -7,6 +7,7 @@
#include "dir.h"
#include "tag.h"
#include "string-list.h"
+#include "mergesort.h"

enum map_direction { FROM_SRC, FROM_DST };

@@ -918,6 +919,27 @@ void free_refs(struct ref *ref)
}
}

+int ref_compare_name(const void *va, const void *vb)
+{
+ const struct ref *a = va, *b = vb;
+ return strcmp(a->name, b->name);
+}
+
+static void *ref_list_get_next(const void *a)
+{
+ return ((const struct ref *)a)->next;
+}
+
+static void ref_list_set_next(void *a, void *next)
+{
+ ((struct ref *)a)->next = next;
+}
+
+void sort_ref_list(struct ref **l, int (*cmp)(const void *, const void *))
+{
+ *l = llist_mergesort(*l, ref_list_get_next, ref_list_set_next, cmp);
+}
+
static int count_refspec_match(const char *pattern,
struct ref *refs,
struct ref **matched_ref)
diff --git a/remote.h b/remote.h
index 9ad8eb6..251d8fd 100644
--- a/remote.h
+++ b/remote.h
@@ -72,6 +72,8 @@ extern const struct refspec *tag_refspec;
struct ref *alloc_ref(const char *name);
struct ref *copy_ref(const struct ref *ref);
struct ref *copy_ref_list(const struct ref *ref);
+void sort_ref_list(struct ref **, int (*cmp)(const void *, const void *));
+int ref_compare_name(const void *, const void *);

int check_ref_type(const struct ref *ref, int flags);

--
1.7.10.1.19.g711d603
Jeff King
2012-05-21 22:19:51 UTC
Permalink
Having the list sorted means we can avoid some quadratic
algorithms when comparing lists.

These should typically be sorted already, but they do come
from the remote, so let's be extra careful. Our ref-sorting
implementation does a mergesort, so we do not have to care
about performance degrading in the common case that the list
is already sorted.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/fetch-pack.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index 3522d8e..bee329f 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -777,6 +777,8 @@ static struct ref *do_fetch_pack(int fd[2],
struct ref *ref = copy_ref_list(orig_ref);
unsigned char sha1[20];

+ sort_ref_list(&ref, ref_compare_name);
+
if (is_repository_shallow() && !server_supports("shallow"))
die("Server does not support shallow clients");
if (server_supports("multi_ack_detailed")) {
--
1.7.10.1.19.g711d603
Jeff King
2012-05-21 22:23:29 UTC
Permalink
We have a list of refs that we want to compare against the
"match" array. The current code searches the match list
linearly, giving quadratic behavior over the number of refs
when you want to fetch all of them.

Instead, we can compare the lists as we go, giving us linear
behavior.

Signed-off-by: Jeff King <***@peff.net>
---
This is the only actual tricky patch in the series. I think I got the
list traversing right, but more eyeballs are appreciated.

This makes the result O(n). An alternative implementation would be to
simply binary-search the sorted "match" array for each ref. That's O(n
lg n), but we've already spent O(n lg n) sorting in the first place, so
it's not a big deal. The code for that would be a little bit simpler.
However, it's not as simple as you'd like, because we zero-out the match
array to mark entries as "found". So we'd have to add a separate
bit array.

I think this version is simple enough.

builtin/fetch-pack.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/builtin/fetch-pack.c b/builtin/fetch-pack.c
index bee329f..d091865 100644
--- a/builtin/fetch-pack.c
+++ b/builtin/fetch-pack.c
@@ -528,6 +528,7 @@ static void filter_refs(struct ref **refs, int nr_match, char **match)
struct ref **newtail = &newlist;
struct ref *ref, *next;
struct ref *fastarray[32];
+ int match_pos;

if (nr_match && !args.fetch_all) {
if (ARRAY_SIZE(fastarray) < nr_match)
@@ -540,6 +541,7 @@ static void filter_refs(struct ref **refs, int nr_match, char **match)
else
return_refs = NULL;

+ match_pos = 0;
for (ref = *refs; ref; ref = next) {
next = ref->next;
if (!memcmp(ref->name, "refs/", 5) &&
@@ -553,15 +555,20 @@ static void filter_refs(struct ref **refs, int nr_match, char **match)
continue;
}
else {
- int i;
- for (i = 0; i < nr_match; i++) {
- if (!strcmp(ref->name, match[i])) {
- match[i][0] = '\0';
- return_refs[i] = ref;
+ int cmp = -1;
+ while (match_pos < nr_match) {
+ cmp = strcmp(ref->name, match[match_pos]);
+ if (cmp < 0) /* definitely do not have it */
+ break;
+ else if (cmp == 0) { /* definitely have it */
+ match[match_pos][0] = '\0';
+ return_refs[match_pos] = ref;
break;
}
+ else /* might have it; keep looking */
+ match_pos++;
}
- if (i < nr_match)
+ if (!cmp)
continue; /* we will link it later */
}
free(ref);
--
1.7.10.1.19.g711d603
Junio C Hamano
2012-05-22 20:16:34 UTC
Permalink
Post by Jeff King
We have a list of refs that we want to compare against the
"match" array. The current code searches the match list
linearly, giving quadratic behavior over the number of refs
when you want to fetch all of them.
Instead, we can compare the lists as we go, giving us linear
behavior.
Nice. Thanks.
Jeff King
2012-05-21 23:52:19 UTC
Permalink
Post by Jeff King
The rails and git cases run in ~28s and ~37s, respectively, again mostly
going to the actual object transfer. So I think this series removes all
of the asymptotically bad behavior from this code path.
One thing to note about all of these repos is that they tend to have
several refs pointing to a single commit. None of the speedups in this
series depends on that fact, but it may be that on a repo with more
independent refs, we may uncover other code paths (e.g., I know that my
fix for mark_complete in ea5f220 improves the case with duplicate refs,
but would not help if you really have 400K refs pointing to unique
commits[1]).
Hmm. So I started to do some experimentation with this and noticed
something odd.

Try doing "git fetch . refs/*:refs/*" in a repository with a large
number of refs (e.g., 400K). With git v1.7.10, this takes about 9.5s on
my machine. But using the version of git in "next" takes about 16.5s.
Bisection points to your 432ad41 (refs: store references hierarchically,
2012-04-10). Perf shows sort_ref_dir and msort_with_tmp as hot spots.

-Peff
Jeff King
2012-05-22 00:07:47 UTC
Permalink
Post by Jeff King
Try doing "git fetch . refs/*:refs/*" in a repository with a large
number of refs (e.g., 400K). With git v1.7.10, this takes about 9.5s on
my machine. But using the version of git in "next" takes about 16.5s.
Bisection points to your 432ad41 (refs: store references hierarchically,
2012-04-10). Perf shows sort_ref_dir and msort_with_tmp as hot spots.
A few more facts:

- I usually compile with -O0 because I am debugging so frequently.
With -O3, those numbers become 6.0s and 11.2s, respectively. Much
faster, but there is still a performance regression.

- Almost all of the refs are packed. IIRC, the packed-refs file should
already be sorted. I wonder if we did not bother double-checking
that before your patch, and now we do. That could explain the
difference.

- I don't compile with INTERNAL_QSORT. So mentioning msort_with_tmp is
slightly confusing, as it is actually the version from libc, not
ours. IOW, it is really just qsort. But the primary culprit remains
sort_ref_dir.

-Peff
Michael Haggerty
2012-05-22 03:59:29 UTC
Permalink
Post by Jeff King
The rails and git cases run in ~28s and ~37s, respectively, again mostly
going to the actual object transfer. So I think this series removes all
of the asymptotically bad behavior from this code path.
One thing to note about all of these repos is that they tend to have
several refs pointing to a single commit. None of the speedups in this
series depends on that fact, but it may be that on a repo with more
independent refs, we may uncover other code paths (e.g., I know that my
fix for mark_complete in ea5f220 improves the case with duplicate refs,
but would not help if you really have 400K refs pointing to unique
commits[1]).
Hmm. So I started to do some experimentation with this and noticed
something odd.
Try doing "git fetch . refs/*:refs/*" in a repository with a large
number of refs (e.g., 400K). With git v1.7.10, this takes about 9.5s on
my machine. But using the version of git in "next" takes about 16.5s.
Bisection points to your 432ad41 (refs: store references hierarchically,
2012-04-10). Perf shows sort_ref_dir and msort_with_tmp as hot spots.
I'm looking into this.

For your test, were the 400k references all in one "directory", or were
they sharded somehow?

A large number of packed references in a single namespace is the
worst-case scenario for the hierarchical refs change. Since the refs in
a namespace are all stored in a single array, there is no benefit from
the hierarchical storage whereas there is some extra cost from
traversing the namespace tree for each lookup. But the performance hit
shouldn't be this large.

Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Jeff King
2012-05-22 04:11:23 UTC
Permalink
Post by Michael Haggerty
Post by Jeff King
Try doing "git fetch . refs/*:refs/*" in a repository with a large
number of refs (e.g., 400K). With git v1.7.10, this takes about 9.5s on
my machine. But using the version of git in "next" takes about 16.5s.
Bisection points to your 432ad41 (refs: store references hierarchically,
2012-04-10). Perf shows sort_ref_dir and msort_with_tmp as hot spots.
I'm looking into this.
For your test, were the 400k references all in one "directory", or
were they sharded somehow?
Sharded. This was the rails network repo, so it basically looks like
this:

refs/remotes/XXXXXXX/heads/master
refs/remotes/XXXXXXX/tags/v1.0
refs/remotes/XXXXXXX/tags/v1.1
...
refs/remotes/XXXXXXX/tags/v3.5
refs/remotes/YYYYYYY/heads/master
refs/remotes/YYYYYYY/tags/v1.0
refs/remotes/YYYYYYY/tags/v1.1
...

Basically the same 90 or so refs you would have in a normal repository
(a branch or two, and a bunch of tags), repeated over and over under
numbered remote hierarchies (i.e., each remote is basically a copy of
the refs from somebody's fork of the repo).

I can provide you with the exact repo if you want, but it is about 100M.

Interestingly, I don't seem to get nearly as much slow-down in my "fake"
many-refs repo, which has all 100K entries directly in refs/heads.

-Peff
Michael Haggerty
2012-05-22 07:15:55 UTC
Permalink
Post by Jeff King
Post by Michael Haggerty
Post by Jeff King
Try doing "git fetch . refs/*:refs/*" in a repository with a large
number of refs (e.g., 400K). With git v1.7.10, this takes about 9.5s on
my machine. But using the version of git in "next" takes about 16.5s.
Bisection points to your 432ad41 (refs: store references hierarchically,
2012-04-10). Perf shows sort_ref_dir and msort_with_tmp as hot spots.
I'm looking into this.
For your test, were the 400k references all in one "directory", or
were they sharded somehow?
Sharded. This was the rails network repo, so it basically looks like
refs/remotes/XXXXXXX/heads/master
refs/remotes/XXXXXXX/tags/v1.0
refs/remotes/XXXXXXX/tags/v1.1
...
refs/remotes/XXXXXXX/tags/v3.5
refs/remotes/YYYYYYY/heads/master
refs/remotes/YYYYYYY/tags/v1.0
refs/remotes/YYYYYYY/tags/v1.1
...
Basically the same 90 or so refs you would have in a normal repository
(a branch or two, and a bunch of tags), repeated over and over under
numbered remote hierarchies (i.e., each remote is basically a copy of
the refs from somebody's fork of the repo).
I can provide you with the exact repo if you want, but it is about 100M.
Interestingly, I don't seem to get nearly as much slow-down in my "fake"
many-refs repo, which has all 100K entries directly in refs/heads.
That could explain why I cannot reproduce your result. I have done all
of my testing in fake repos (with sharded and non-sharded variants) with
very little file content.

If it is not too much trouble, please let me know where I can obtain
your test repo and what commands you used to get your result. (Was the
local repo already a full clone of the remote, including all 400k
references? How was the remote set up--sharing data or not, local or
remote? Warm or cold disk cache?)

Thanks,
Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Jeff King
2012-05-22 07:37:40 UTC
Permalink
Post by Michael Haggerty
If it is not too much trouble, please let me know where I can obtain
your test repo and what commands you used to get your result. (Was
the local repo already a full clone of the remote, including all 400k
references? How was the remote set up--sharing data or not, local or
remote? Warm or cold disk cache?)
I've put the repo up at:

https://gist.github.com/raw/2767328/603926240fabb4d3e3abc3c93a1913d91852cc7e/rails.tar.gz

You can reproduce the slow-down with:

cd rails.git &&
git fetch . refs/*:refs/*

which should be a no-op, as we already have all of the refs. I don't
know if the problem is actually specific to fetch; that was just how I
noticed it.

When I try it with both 'next' and v1.7.10, I see that the latter is
much faster. I did my tests with a warm cache, but the interesting
number is the CPU time, which is quite different.

-Peff
Michael Haggerty
2012-05-22 13:28:32 UTC
Permalink
Post by Jeff King
Post by Michael Haggerty
If it is not too much trouble, please let me know where I can obtain
your test repo and what commands you used to get your result. (Was
the local repo already a full clone of the remote, including all 400k
references? How was the remote set up--sharing data or not, local or
remote? Warm or cold disk cache?)
https://gist.github.com/raw/2767328/603926240fabb4d3e3abc3c93a1913d91852cc7e/rails.tar.gz
cd rails.git &&
git fetch . refs/*:refs/*
which should be a no-op, as we already have all of the refs. I don't
know if the problem is actually specific to fetch; that was just how I
noticed it.
When I try it with both 'next' and v1.7.10, I see that the latter is
much faster. I did my tests with a warm cache, but the interesting
number is the CPU time, which is quite different.
I cannot reproduce anything as big as the performance regression that
you see. I did find a regression 9.5 s -> 10.1 s caused by

5fa044184 find_containing_dir(): use strbuf in implementation of this
function

It is fixed by the patch that I just sent to the mailing list [1], which
sizes the strbuf in that function to strlen(refname) instead of
PATH_MAX. Since your experiments suggest that the performance
regression is related to the size of the repository contents, it could
be that the same test produces more memory pressure on your system and
therefore a larger effect. Please try the patch and tell me if it fixes
the problem for you.

[1] http://article.gmane.org/gmane.comp.version-control.git/198197
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Jeff King
2012-05-22 17:33:55 UTC
Permalink
Post by Michael Haggerty
Post by Jeff King
When I try it with both 'next' and v1.7.10, I see that the latter is
much faster. I did my tests with a warm cache, but the interesting
number is the CPU time, which is quite different.
I cannot reproduce anything as big as the performance regression that
you see. I did find a regression 9.5 s -> 10.1 s caused by
5fa044184 find_containing_dir(): use strbuf in implementation of this
function
It is fixed by the patch that I just sent to the mailing list [1],
which sizes the strbuf in that function to strlen(refname) instead of
PATH_MAX. Since your experiments suggest that the performance
regression is related to the size of the repository contents, it
could be that the same test produces more memory pressure on your
system and therefore a larger effect. Please try the patch and tell
me if it fixes the problem for you.
That patch drops about a second off of the slow case, but the main
problem still remains. Just to be sure we are both doing the exact same
thing, here is a complete reproduction recipe:

GIT=/path/to/your/git/checkout
RAILS=/path/to/unpacked/rails.git

cd $GIT &&
git checkout 432ad41e60cedb87ceec446ab034d46a53f5f9d8^ &&
make &&

cd $RAILS &&
time $GIT/bin-wrappers/git fetch . refs/*:refs/* &&

cd $GIT &&
git checkout 432ad41e60cedb87ceec446ab034d46a53f5f9d8 &&
make &&

cd $RAILS &&
time $GIT/bin-wrappers/git fetch . refs/*:refs/*

produces:

[before]
real 0m9.128s
user 0m9.369s
sys 0m0.976s

[after]
real 0m15.926s
user 0m16.181s
sys 0m0.984s

I don't think memory pressure is involved. The git process maxes out at
~300M, and this machine has 7.5G available.

I wonder why we are getting different results. Could it be compilation
options? As I mentioned, I compile with -O0, but I got similar results
with -O3.

-Peff
Michael Haggerty
2012-05-24 12:05:16 UTC
Permalink
Post by Jeff King
Post by Michael Haggerty
Post by Jeff King
When I try it with both 'next' and v1.7.10, I see that the latter is
much faster. I did my tests with a warm cache, but the interesting
number is the CPU time, which is quite different.
I cannot reproduce anything as big as the performance regression that
you see. I did find a regression 9.5 s -> 10.1 s caused by
5fa044184 find_containing_dir(): use strbuf in implementation of this
function
It is fixed by the patch that I just sent to the mailing list [1],
which sizes the strbuf in that function to strlen(refname) instead of
PATH_MAX. Since your experiments suggest that the performance
regression is related to the size of the repository contents, it
could be that the same test produces more memory pressure on your
system and therefore a larger effect. Please try the patch and tell
me if it fixes the problem for you.
That patch drops about a second off of the slow case, but the main
problem still remains. Just to be sure we are both doing the exact same
GIT=/path/to/your/git/checkout
RAILS=/path/to/unpacked/rails.git
cd $GIT &&
git checkout 432ad41e60cedb87ceec446ab034d46a53f5f9d8^ &&
make &&
cd $RAILS &&
time $GIT/bin-wrappers/git fetch . refs/*:refs/* &&
cd $GIT &&
git checkout 432ad41e60cedb87ceec446ab034d46a53f5f9d8 &&
make &&
cd $RAILS &&
time $GIT/bin-wrappers/git fetch . refs/*:refs/*
[before]
real 0m9.128s
user 0m9.369s
sys 0m0.976s
[after]
real 0m15.926s
user 0m16.181s
sys 0m0.984s
I don't think memory pressure is involved. The git process maxes out at
~300M, and this machine has 7.5G available.
I wonder why we are getting different results. Could it be compilation
options? As I mentioned, I compile with -O0, but I got similar results
with -O3.
Thanks for the idiot-proof reproduction steps. I don't know what I was
doing differently last time, but now I can reproduce your slowdown.

The thing that triggers the problem is a reference directory
("refs/remotes/8514/pull/") with 3810 sub-namespaces
("refs/remotes/8514/pull/1091", "refs/remotes/8514/pull/1092", etc),
each subnamespace containing only one or two references. It wouldn't be
a problem having so many *references* in a namespace, because references
are just added to the end of the directory unsorted, and typically only
sorted once after all of them have been added. But every time that a
sub-namespace is accessed, the namespace has to be searched to see if
that sub-namespace already exists. The searching requires the namespace
to be sorted. So, for example, when adding the following sequence of
references:

1. refs/remotes/8514/pull/1091/head
2. refs/remotes/8514/pull/1091/merge
3. refs/remotes/8514/pull/1092/head
4. refs/remotes/8514/pull/1092/merge
5. refs/remotes/8514/pull/1093/head
6. refs/remotes/8514/pull/1093/merge

At step 1, sub-namespace "refs/remotes/8514/pull/1091/" is added to
"refs/remotes/8514/pull/". This makes the code think that
"refs/remotes/8514/pull/" is unsorted.

Step 2 is not problematic; the new reference is added to
"refs/remotes/8514/pull/1091/", but adding a reference doesn't require
the ref_dir to be sorted.

At step 3, namespace "refs/remotes/8514/pull/" is first checked to see
if sub-namespace "refs/remotes/8514/pull/1092/" already exists. This
search requires namespace "refs/remotes/8514/pull/" to be sorted because
step 1 caused it to be considered unsorted.

Again at step 5, namespace "refs/remotes/8514/pull/" needs to be sorted,
and so on every time a subnamespace is added.

I will submit a patch shortly.

Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Martin Fick
2012-05-25 00:17:45 UTC
Permalink
Post by Jeff King
Martin, let me know if this improves things for your
many-ref cases (and if you are still seeing slowness in
your repos with many refs, let me know which operations
cause it).
Peff,

I have not been ignoring you; I am sorry that I have not
replied yet. Unfortunately, I am having a very hard time
getting conclusive tests with my large repo. I believe I am
making plenty of mistakes in what I think I am testing, and
I am also having a hard time getting reproducible results so
far. And some tests take quite a while, so it is not always
obvious what I might be doing wrong.

I will let you know more when I figure out what I am doing
wrong, but please know that I have been doing a lot of
testing and plan to post some results eventually.

Were your tests mostly warm cache tests?

-Martin
--
Employee of Qualcomm Innovation Center, Inc. which is a
member of Code Aurora Forum
Jeff King
2012-05-25 00:39:20 UTC
Permalink
Post by Martin Fick
Post by Jeff King
Martin, let me know if this improves things for your
many-ref cases (and if you are still seeing slowness in
your repos with many refs, let me know which operations
cause it).
I have not been ignoring you; I am sorry that I have not
replied yet. Unfortunately, I am having a very hard time
getting conclusive tests with my large repo. I believe I am
making plenty of mistakes in what I think I am testing, and
I am also having a hard time getting reproducible results so
far. And some tests take quite a while, so it is not always
obvious what I might be doing wrong.
No worries. I am pretty confident that the patches I supplied so far are
a good thing whether they help your case or not. So if you come back in
a month and show that they solved all your problems, then great. And if
they don't, then it will just show us what new problems we have to work
on. :)
Post by Martin Fick
Were your tests mostly warm cache tests?
Yes, exclusively warm. And all of the refs were packed, which makes the
warm/cold difference less interesting (it's one 30MB or so file). I
don't think there's much point in thinking about the performance of 400K
loose refs (which would be absolutely horrific cold-cache on most
traditional filesystems). If you have that many, you would want to keep
the bulk of them packed.

-Peff
Martin Fick
2012-05-25 00:54:56 UTC
Permalink
On Thu, May 24, 2012 at 06:17:45PM -0600, Martin Fick
Post by Martin Fick
Were your tests mostly warm cache tests?
Yes, exclusively warm. And all of the refs were packed,
which makes the warm/cold difference less interesting
(it's one 30MB or so file). I don't think there's much
point in thinking about the performance of 400K loose
refs (which would be absolutely horrific cold-cache on
most traditional filesystems). If you have that many,
you would want to keep the bulk of them packed.
Mostly true, except for one strange case still I think?

When cloning a gerrit repo, users do not get the changes,
since those are not under refs/heads but refs/changes. So
later, if they choose to fetch refs/changes/*, all of those
new incoming refs are loose. Yes, someone should pack those
refs right away, but I think it actually churns the hell out
of my disk and takes a significant amount of time during the
initial fetch. I am not certain about this, and the
behavior may depend on the filesystem in use, but I think
that this time might even be asynchronous (journals and
all); it feels like my disk keeps churning for a while even
after the fetch is over. I believe that this might still be
the worst case left with refs, and it can be pretty bad,

-Martin
--
Employee of Qualcomm Innovation Center, Inc. which is a
member of Code Aurora Forum
Jeff King
2012-05-25 01:04:34 UTC
Permalink
Post by Martin Fick
Post by Jeff King
Yes, exclusively warm. And all of the refs were packed,
which makes the warm/cold difference less interesting
(it's one 30MB or so file). I don't think there's much
point in thinking about the performance of 400K loose
refs (which would be absolutely horrific cold-cache on
most traditional filesystems). If you have that many,
you would want to keep the bulk of them packed.
Mostly true, except for one strange case still I think?
When cloning a gerrit repo, users do not get the changes,
since those are not under refs/heads but refs/changes. So
later, if they choose to fetch refs/changes/*, all of those
new incoming refs are loose.
Hmm. Yeah, clone will always write a packed-refs file, but I think "git
fetch" will always write loose refs, under the assumption that the
former will be getting a lot more refs than the latter. But of course
that is only a guess. It would be nice if fetch could fetch straight
into packed refs if we are getting more than N items.

We'd have to give some thought to potential race conditions, though.
Usually pack-refs isn't modifying the ref, so it can just write out the
value to the packed-refs file, then delete the loose ref if nobody has
touched it since we wrote. But here we're combining it with a
modification, so I suspect there would be a race with another process
trying to modify it.
Post by Martin Fick
Yes, someone should pack those
refs right away, but I think it actually churns the hell out
of my disk and takes a significant amount of time during the
initial fetch. I am not certain about this, and the
behavior may depend on the filesystem in use, but I think
that this time might even be asynchronous (journals and
all), it feels like my disk keeps churning for a while even
after this is over. I believe that this might still be the
worst case left with refs, and it can be pretty bad,
Yeah, I wouldn't be surprised if this thrashes your disk. Writing
hundreds of thousands of 40-byte files is one of the most awful loads
for many filesystems, since each file gets its own inode. I haven't
tried btrfs, but my impression is that it can magically pack the data
from many files into one node.

-Peff
Martin Fick
2012-05-25 01:32:36 UTC
Permalink
On Thu, May 24, 2012 at 06:54:56PM -0600, Martin Fick
Post by Jeff King
Yes, exclusively warm. And all of the refs were
We'd have to give some thought to potential race
conditions, though. Usually pack-refs isn't modifying
the ref, so it can just write out the value to the
packed-refs file, then delete the loose ref if nobody
has touched it since we wrote. But here we're combining
it with a modification, so I suspect there would be a
race with another process trying to modify it.
Yeah, I thought about that. Could you just write the new
packed-refs file with the new refs plus the old refs which
were already in the file, then delete any loose refs that
were added by this operation, but only if they have not
changed since?

This way, if someone else modifies one of the same refs,
they could just win?

-Martin
--
Employee of Qualcomm Innovation Center, Inc. which is a
member of Code Aurora Forum
Jeff King
2012-05-25 06:50:40 UTC
Permalink
Post by Martin Fick
Post by Jeff King
Post by Jeff King
Yes, exclusively warm. And all of the refs were
We'd have to give some thought to potential race
conditions, though. Usually pack-refs isn't modifying
the ref, so it can just write out the value to the
packed-refs file, then delete the loose ref if nobody
has touched it since we wrote. But here we're combining
it with a modification, so I suspect there would be a
race with another process trying to modify it.
Yeah, I thought about that. Could you just write the new
packed-refs file with the new refs plus the old refs which
were already in the file, then delete any loose refs that
were added by this operation, but only if they have not
changed since?
This way, if someone else modifies one of the same refs,
they could just win?
I spent a bit of time thinking on this and trying to come up with an
exact sequence where we could lose the race, but I don't think we can
lose data. Either:

1. The loose ref is not changed afterwards, and we delete it, making
our packed version the official one.

2. The loose ref is changed, and we leave it, which means our packed
version is not interesting to anybody (the loose version takes
precedence). We have lost the race, and must consider our write a
failure.

It's OK to fail in (2); that is no different than the regular loose
case. It's also worth looking at things from the other writer's
perspective.

They might see our packed-refs file in place before we have a chance to
modify the loose ref. But that's OK, because they must double-check the
existence of the loose ref, and it will still be there. So they will
take that as the real value and ignore the packed value.

All of that is more or less the same set of rules that make "git
pack-refs" work at all. The only novel situation we are introducing here
is that we might be writing a packed-ref without having an existing
loose ref at all (whereas usually we would be packing an existing ref,
either loose or packed). But I don't think it makes a difference.
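The two cases can be modeled concretely. Here is a toy sketch (plain Python dicts with hypothetical helper names; not git code) of the "write packed first, then delete loose only if unchanged" rule:

```python
# Toy model of the rule analyzed above: write the packed file first,
# then delete each loose ref only if it still holds the value we packed.
# (Hypothetical Python, not git's actual C implementation.)

def pack_refs(packed, loose, snapshot):
    """snapshot: {name: sha} values read from the loose refs earlier."""
    packed.update(snapshot)                 # step 1: write packed-refs
    for name, sha in snapshot.items():
        if loose.get(name) == sha:          # step 2: untouched? then prune
            del loose[name]

def resolve(packed, loose, name):
    # A loose ref always takes precedence over the packed value.
    return loose.get(name, packed.get(name))

packed, loose = {}, {"refs/changes/1": "aaa", "refs/changes/2": "bbb"}
snapshot = dict(loose)
loose["refs/changes/2"] = "ccc"             # racing writer hits case (2)
pack_refs(packed, loose, snapshot)

print(resolve(packed, loose, "refs/changes/1"))  # aaa (case 1: packed wins)
print(resolve(packed, loose, "refs/changes/2"))  # ccc (case 2: loose wins)
```

In case (1) the loose ref is pruned and the packed value is authoritative; in case (2) the surviving loose ref shadows our stale packed entry, so no data is lost.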

I might be wrong, though. In the course of writing this email, I kept
thinking "wait, but here's a race", and then when I worked out the
details, it was OK. So either I was not clever enough to find a race, or
I am not clever enough to realize that there is not one. :)

-Peff

Nguyen Thai Ngoc Duy
2012-05-22 12:18:00 UTC
Permalink
Post by Jeff King
The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.
Off topic and pure speculation. With 400k refs, each one 20 byte in
length, the pathname part only can take 7MB. Perhaps packed-refs
should learn prefix compressing too, like index v4, to reduce size
(and hopefully improve startup speed). Compressing refs/heads/ and
refs/tags/ only could gain quite a bit already.
--
Duy
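The prefix-compression idea can be sketched like this (illustrative Python; index v4 actually stores a strip count and NUL-terminated suffix, and this simplified toy stores the reused-prefix length instead):

```python
# Sketch of prefix compression over a sorted refname list, in the
# spirit of index v4: each entry stores how many leading bytes to
# reuse from the previous name, plus the new suffix.

def compress(sorted_names):
    out, prev = [], ""
    for name in sorted_names:
        n = 0
        while n < min(len(prev), len(name)) and prev[n] == name[n]:
            n += 1
        out.append((n, name[n:]))   # (reused-prefix length, suffix)
        prev = name
    return out

def decompress(entries):
    names, prev = [], ""
    for n, suffix in entries:
        prev = prev[:n] + suffix
        names.append(prev)
    return names

refs = sorted([
    "refs/heads/maint",
    "refs/heads/master",
    "refs/tags/v1.7.10",
    "refs/tags/v1.7.10.1",
])
entries = compress(refs)
raw = sum(len(r) for r in refs)
squeezed = sum(1 + len(s) for _, s in entries)  # 1 byte for the count
assert decompress(entries) == refs
print(raw, squeezed)  # the shared refs/heads/ and refs/tags/ prefixes collapse
```

Because the list is sorted, neighbors share long prefixes, which is exactly why compressing refs/heads/ and refs/tags/ alone already gains a lot.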
Michael Haggerty
2012-05-22 13:35:22 UTC
Permalink
Post by Nguyen Thai Ngoc Duy
Post by Jeff King
The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.
Off topic and pure speculation. With 400k refs, each one 20 byte in
length, the pathname part only can take 7MB. Perhaps packed-refs
should learn prefix compressing too, like index v4, to reduce size
(and hopefully improve startup speed). Compressing refs/heads/ and
refs/tags/ only could gain quite a bit already.
I considered this idea but put it off for now for the reasons explained
in the docstring for struct ref_entry in refs.c:

* Please note that the name field contains the fully-qualified
* reference (or subdirectory) name. Space could be saved by only
* storing the relative names. But that would require the full names
* to be generated on the fly when iterating in do_for_each_ref(), and
* would break callback functions, who have always been able to assume
* that the name strings that they are passed will not be freed during
* the iteration.

Thus, all of the callers of the for_each_ref functions would have to be
audited (and perhaps adjusted) before such a change could be implemented.

Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Jeff King
2012-05-22 17:01:01 UTC
Permalink
Post by Nguyen Thai Ngoc Duy
Post by Jeff King
The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.
Off topic and pure speculation. With 400k refs, each one 20 byte in
length, the pathname part only can take 7MB. Perhaps packed-refs
should learn prefix compressing too, like index v4, to reduce size
(and hopefully improve startup speed). Compressing refs/heads/ and
refs/tags/ only could gain quite a bit already.
In this case, the packed-refs file is 30MB. Even just gzipping it takes
it down to 2MB. As far as I know, we don't ever do random access on the
file, but instead just stream it into memory.

-Peff
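The compression claim is easy to sanity-check on synthetic data (Python sketch; the refnames below are made up, so only the rough ratio is meaningful, not the exact 30MB -> 2MB figure from the real rails repo):

```python
# Generate a synthetic packed-refs-like buffer (sha + refname per line)
# and gzip it, to see that such files compress well.

import gzip
import hashlib

lines = []
for i in range(100_000):
    sha = hashlib.sha1(str(i).encode()).hexdigest()
    lines.append(f"{sha} refs/remotes/{i % 400}/pull/{i}/head\n")
raw = "".join(lines).encode()
compressed = gzip.compress(raw)
print(len(raw), len(compressed))  # compressed is a small fraction of raw
```

The hex sha portion is high-entropy and compresses only modestly; the bulk of the savings comes from the highly repetitive refnames.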
Junio C Hamano
2012-05-22 17:35:50 UTC
Permalink
Post by Jeff King
Post by Nguyen Thai Ngoc Duy
Post by Jeff King
The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.
Off topic and pure speculation. With 400k refs, each one 20 byte in
length, the pathname part only can take 7MB. Perhaps packed-refs
should learn prefix compressing too, like index v4, to reduce size
(and hopefully improve startup speed). Compressing refs/heads/ and
refs/tags/ only could gain quite a bit already.
In this case, the packed-refs file is 30MB. Even just gzipping it takes
it down to 2MB. As far as I know, we don't ever do random access on the
file, but instead just stream it into memory.
True.

The current code reads the whole thing in upon first use of _any_ element
in the file, just like the index codepath does for the index file.

But the calling pattern to the refs machinery is fairly well isolated and
all happens in refs.c file. Especially thanks to the recent work by
Michael Haggerty, for "I am about to create a new branch 'frotz'; do I
have 'refs/heads/frotz' or anything that begins with 'refs/heads/frotz/'?"
kind of callers, it is reasonably easy to design a better structured
packed-refs file format to allow us to read only a subtree portion of
refs/ hierarchy, and plug that logic into the lazy ref population code.
Such a "design a better packed-refs format for scalability to 400k refs"
is a very well isolated project that has high chance of succeeding without
breaking things. No code outside refs.c assumes that there is a flat
array of refs that records what was read from the packed-refs file and can
walk linearly over it, unlike the in-core index.

If you do "for_each_ref()" for everything (e.g. sending 'have' during the
object transfer, or repacking the whole repository), you would end up
needing to read the whole thing for obvious reasons.
Jeff King
2012-05-22 17:46:48 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
In this case, the packed-refs file is 30MB. Even just gzipping it takes
it down to 2MB. As far as I know, we don't ever do random access on the
file, but instead just stream it into memory.
True.
The current code reads the whole thing in upon first use of _any_ element
in the file, just like the index codepath does for the index file.
But the calling pattern to the refs machinery is fairly well isolated and
all happens in refs.c file. Especially thanks to the recent work by
Michael Haggerty, for "I am about to create a new branch 'frotz'; do I
have 'refs/heads/frotz' or anything that begins with 'refs/heads/frotz/'?"
kind of callers, it is reasonably easy to design a better structured
packed-refs file format to allow us to read only a subtree portion of
refs/ hierarchy, and plug that logic into the lazy ref population code.
Such a "design a better packed-refs format for scalability to 400k refs"
is a very well isolated project that has high chance of succeeding without
breaking things. No code outside refs.c assumes that there is a flat
array of refs that records what was read from the packed-refs file and can
walk linearly over it, unlike the in-core index.
Yeah. As gross as I find a 30MB packed-refs file, I don't actually think
that is the bottleneck at all. Sure, it wastes some disk cache, but
in the grand scheme of things, anybody with 400K refs probably has a
much, much bigger object database.

So there probably isn't much point in clever tricks to make it smaller
on disk, especially if they will make further partial-read optimizations
harder to do. But...
Post by Junio C Hamano
If you do "for_each_ref()" for everything (e.g. sending 'have' during the
object transfer, or repacking the whole repository), you would end up
needing to read the whole thing for obvious reasons.
I think Michael's work is really great for the loose refs, where hitting
each directory is very expensive. But for a repo with packed refs that
is just going to call for_each_ref, I'm worried that breaking apart the
ref list is going to cause noticeable overhead. But still, the
regression I'm seeing seems way too big for that overhead, so I think
something else is going on. We just need to find it.

-Peff
Michael Haggerty
2012-05-24 04:54:07 UTC
Permalink
Post by Junio C Hamano
The current code reads the whole thing in upon first use of _any_ element
in the file, just like the index codepath does for the index file.
But the calling pattern to the refs machinery is fairly well isolated and
all happens in refs.c file. Especially thanks to the recent work by
Michael Haggerty, for "I am about to create a new branch 'frotz'; do I
have 'refs/heads/frotz' or anything that begins with 'refs/heads/frotz/'?"
kind of callers, it is reasonably easy to design a better structured
packed-refs file format to allow us to read only a subtree portion of
refs/ hierarchy, and plug that logic into the lazy ref population code.
Such a "design a better packed-refs format for scalability to 400k refs"
is a very well isolated project that has high chance of succeeding without
breaking things. No code outside refs.c assumes that there is a flat
array of refs that records what was read from the packed-refs file and can
walk linearly over it, unlike the in-core index.
Even with the current file format, it would not be so difficult to
bisect the file, synchronizing on record boundaries by looking for the
next NL character. Because of the way the file is sorted, it would also
be reasonably efficient to read whole subtrees in one slurp (e.g., for
for_each_ref() with a prefix argument). Nontrivial modifications would
of course not be possible without a rewrite.
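That bisection idea can be sketched as follows (illustrative Python over an in-memory buffer; it ignores the header and peeled "^" lines of the real packed-refs format, and the find_ref helper is hypothetical):

```python
# Bisect a sorted, newline-delimited "sha name" buffer without parsing
# it all, resynchronizing on '\n' to find record boundaries.

def find_ref(buf, refname):
    lo, hi = 0, len(buf)        # invariant: lo sits at a line boundary
    while lo < hi:
        mid = (lo + hi) // 2
        nl = buf.rfind(b"\n", lo, mid)
        start = nl + 1 if nl != -1 else lo   # resync to start of line
        end = buf.find(b"\n", start)
        end = len(buf) if end == -1 else end
        sha, _, name = buf[start:end].partition(b" ")
        if name == refname:
            return sha.decode()
        if name < refname:
            lo = end + 1        # everything up to this line sorts earlier
        else:
            hi = start
    return None

buf = (b"1111111 refs/heads/maint\n"
       b"2222222 refs/heads/master\n"
       b"3333333 refs/tags/v1.0\n")
print(find_ref(buf, b"refs/heads/master"))  # 2222222
print(find_ref(buf, b"refs/heads/zzz"))     # None
```

Each probe reads one line, so a lookup touches O(log N) records instead of slurping the whole file; reading a whole subtree would bisect to the prefix and then scan forward.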

There would need to be some intelligence built-in; after enough
single-reference accesses come in a row, the refs module should
probably take it upon itself to read the whole packed-refs file to speed
up further lookups.
Post by Junio C Hamano
If you do "for_each_ref()" for everything (e.g. sending 'have' during the
object transfer, or repacking the whole repository), you would end up
needing to read the whole thing for obvious reasons.
Yes. ISTM that any hope to avoid O(number of refs) problems when
exchanging commits must involve using more intelligence about how
references are related to each other topologically to improve the
negotiation about what needs to be transferred.

Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Nguyen Thai Ngoc Duy
2012-05-23 01:20:21 UTC
Permalink
Post by Jeff King
Post by Nguyen Thai Ngoc Duy
Post by Jeff King
The rails/rails network repository at GitHub (i.e., a master repo with
all of the objects and refs for all of the forks) has about 400K refs,
and has been the usual impetus for me finding and fixing these sorts of
quadratic problems.
Off topic and pure speculation. With 400k refs, each one 20 byte in
length, the pathname part only can take 7MB. Perhaps packed-refs
should learn prefix compressing too, like index v4, to reduce size
(and hopefully improve startup speed). Compressing refs/heads/ and
refs/tags/ only could gain quite a bit already.
In this case, the packed-refs file is 30MB. Even just gzipping it takes
it down to 2MB. As far as I know, we don't ever do random access on the
file, but instead just stream it into memory.
I did not think really hard when I wrote what I wrote. File size is
not really a problem here. In index case, we do expensive sha-1 on the
entire file so size matters. We don't do anything that expensive in
packed-refs (, I think. Should we protect it with some sort of
checksum btw?). gzip can add more computation cost at startup. 30MB is
probably still ok.

Though sending 2MB on the wire is way better than 30MB when we
advertise refs at cloning/fetching.
--
Duy
Junio C Hamano
2012-05-22 16:16:47 UTC
Permalink
Post by Jeff King
I don't think there is any reason we can't sort the list of heads, and
then we can get rid of the duplicates with an O(n) traversal, like the
(largely untested) patch below.
Yeah, I think the only case the order matters in fetch is the traditional
"when multiple non-glob fetch refspecs are configured, only merge the
first one" logic, but it kicks in a much higher layer and the order in
this list should not have any effect on it.

Thanks.
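Both approaches discussed in this thread can be sketched briefly (illustrative Python; the actual patch is C against builtin/fetch-pack.c):

```python
# Two ways to beat the O(N^2) remove_duplicates(): sort-then-scan, and
# the hash set Michael originally suggested.

def dedup_sorted(heads):
    """O(N log N); loses the original order, so it is only safe when
    the caller does not depend on ordering, as discussed above."""
    out = []
    for h in sorted(heads):
        if not out or out[-1] != h:   # duplicates are now adjacent
            out.append(h)
    return out

def dedup_hashed(heads):
    """O(N) expected; preserves the original order."""
    seen, out = set(), []
    for h in heads:
        if h not in seen:
            seen.add(h)
            out.append(h)
    return out

heads = ["refs/heads/b", "refs/heads/a", "refs/heads/b"]
print(dedup_sorted(heads))   # ['refs/heads/a', 'refs/heads/b']
print(dedup_hashed(heads))   # ['refs/heads/b', 'refs/heads/a']
```

The ordering difference is why Junio's note matters: the sorted variant is only acceptable because nothing downstream depends on the order of this particular list.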
Martin Fick
2012-05-21 18:15:13 UTC
Permalink
Post by Michael Haggerty
I just noticed that the remove_duplicates() function in
builtin/fetch-pack.c is O(N^2) in the number of heads.
Empirically, this function takes on the order of 25
seconds to process 100k references.
I know that 100k heads is kindof absurd. Perhaps
handling this many heads is unrealistic for other
reasons. But I vaguely recall numbers like this being
mentioned on the mailing list.
Yes, I have mentioned 100K several times, and I greatly
appreciate the many fixes already made to help git better
handle large ref counts.

However, I would like to suggest that 100K not really be
viewed as absurd anymore. :) There are many users for whom
it is not absurd, certainly not if you are including tags.
But I know that some of the tag uses have been brushed off
as abuses, so I will focus on feature branches, of which we
actually have more than tags in our larger repos; we have
around 60K in our kernel repo.

Of course, we use Gerrit, so features tend to be called
changes and each change may get many revisions (patchsets),
so all of these get refs, but I think that it might be wrong
to consider that out of the ordinary anymore. After all,
should a version control system such as git not support 100K
revisions of features developed independently on separate
branches (within Gerrit or not)? 100K is not really that
many when you consider a large project. Even without
Gerrit, if someone wanted to track that many features
(likely over a few years), they will probably use up tons of
refs.

It may be too easy to think that because git is distributed,
features will get developed in a distributed way and
therefore no one would ever want to track them all in one
place, but I suspect that this may be a bad assumption.
That being said, I believe that we are not even tracking
external features, and we have over 60K feature revisions
(gerrit patchsets) in one repo, so if someone were to track
all the changes for the kernel, I can only imagine that this
number would be in the millions.

I am sure that 1M refs is even within the sights of some
individuals already, I know users who actually have 250K. I
hope that 10M or even perhaps 100M refs will be considered
feasible to use long term with git.

Again, I really hope that this will no longer be seen as
absurd, but rather just normal for large projects. After
all, the kernel was (still is?) the primary use case of git.
Our largest-ref project is the kernel, so I don't think we
should be considered fringe, and I hope that we, along with
other large kernel contributors, will be considered normal
git users. :)

-Martin
--
Employee of Qualcomm Innovation Center, Inc. which is a
member of Code Aurora Forum
Jeff King
2012-05-21 19:41:14 UTC
Permalink
Post by Martin Fick
Of course, we use Gerrit, so features tend to be called
changes and each change may get many revisions (patchsets),
so all of these get refs, but I think that it might be wrong
to consider that out of the ordinary anymore. After all,
should a version control system such as git not support 100K
revisions of features developed independently on separate
branches (within Gerrit or not)? 100K is not really that
many when you consider a large project. Even without
Gerrit, if someone wanted to track that many features
(likely over a few years), they will probably use up tons of
refs.
I think the more compelling line of argument is not "is this
reasonable?", but rather that git has been designed from the ground-up
to be efficient, and these are not fundamental design issues with git at
all. They are just silly little spots where we used a quick-to-write
quadratic algorithm instead of something more complex with better
asymptotic behavior. And if we can fix these silly spots easily, then
there's no reason not to. It helps the small-N case a tiny bit, and it
makes the big-N case feasible.

So far, the only quadratic case I have seen that is not easy to fix is
replacing "struct commit_list" with a priority queue or similar. But we
managed to hack around that long ago with fce87ae (Fix quadratic
performance in rewrite_one., 2008-07-12), and I don't think it's
generally a problem in practice.

Anyway, my point is that we don't even have to talk about "reasonable"
or "absurd". Git should be fast even on absurd cases, because 99% of the
work has already been done, and the last 1% is easy.

-Peff
Junio C Hamano
2012-05-21 22:08:46 UTC
Permalink
Post by Jeff King
Anyway, my point is that we don't even have to talk about "reasonable"
or "absurd". Git should be fast even on absurd cases,...
I agree with you 98%.

If we can make Git fast even on "absurd" cases without penalizing less
extreme and boring cases and/or maintainability of the code, we should
strive to do so. And I think the codepaths mentioned in this thread falls
into that category.

The remaining 2% is only to allow me to say that we reserve the right to
label cases "absurd" when it is extremely painful to support without
penalizing the code and performance of less extreme and boring cases ;-)
Jeff King
2012-05-21 22:24:40 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
Anyway, my point is that we don't even have to talk about "reasonable"
or "absurd". Git should be fast even on absurd cases,...
I agree with you 98%.
If we can make Git fast even on "absurd" cases without penalizing less
extreme and boring cases and/or maintainability of the code, we should
strive to do so. And I think the codepaths mentioned in this thread falls
into that category.
The remaining 2% is only to allow me to say that we reserve the right to
label cases "absurd" when it is extremely painful to support without
penalizing the code and performance of less extreme and boring cases ;-)
Yes, I should be more careful with what I promise. :) I agree very much
that we should reserve that right.

-Peff
Martin Fick
2012-05-22 05:51:16 UTC
Permalink
Post by Jeff King
Post by Martin Fick
Of course, we use Gerrit, so features tend to be called
changes and each change may get many revisions (patchsets),
so all of these get refs, but I think that it might be wrong
to consider that out of the ordinary anymore. After all,
should a version control system such as git not support 100K
revisions of features developed independently on separate
branches (within Gerrit or not)? 100K is not really that
many when you consider a large project. Even without
Gerrit, if someone wanted to track that many features
(likely over a few years), they will probably use up tons of
refs.
...
Post by Jeff King
Anyway, my point is that we don't even have to talk about "reasonable"
or "absurd". Git should be fast even on absurd cases, because 99% of
the work has already been done, and the last 1% is easy.
I hope you are right, but I don't quite share your optimism. Some of that last 1% is perhaps last exactly because it is hard. More specifically, I am talking about the git protocol's ref advertisement on connection. This has been considered a known issue for many years, yet it has not been fixed, because fixing it is hard: it requires breaking the protocol in a non-backwards-compatible way. I would be delighted if you had an easy fix for this rather fundamental ref scaling issue.

We talked with Junio and Shawn, and they agreed that it would be reasonable to put forward a proposal which does break backwards compatibility. So if there is still a chance of another way, I hope it is found before this gets underway (no, I don't really expect that to happen),

-Martin

Employee of Qualcomm Innovation Center, Inc. which is a member of Code Aurora Forum
Jeff King
2012-05-22 18:21:57 UTC
Permalink
Post by Martin Fick
Post by Jeff King
Anyway, my point is that we don't even have to talk about "reasonable"
or "absurd". Git should be fast even on absurd cases, because 99% of
the work has already been done, and the last 1% is easy.
I hope you are right, but I don't quite completely share your
optimism. Some of that last 1% is perhaps last exactly because it is
hard. More specifically, I am talking about the git protocol's ref
advertisement on connection. This has been considered a known issue
for many years, yet it has not been fixed because it is hard to fix
since it requires breaking the protocol in a non backwards compatible
way. I would be delighted if you had an easy fix for this rather
fundamental ref scaling issue? We talked with Junio and Shawn and they
agreed that it would be reasonable to put forward a proposal which
does break backwards compatibility. So if there is a chance that there
still may be another way, I hope it is found before this gets underway
(no, I don't really expect that to happen),
I may be overly optimistic, but I think I may also have not communicated
the limits to my optimism. Right now, it is the case that some
operations on many refs are just impossibly slow due to poor algorithm
selection in the implementation. I am optimistic that we can drop those
in favor of linear algorithms, or at least O(n lg n) algorithms in most
cases. And that would bring these cases down from "don't do that, it's
broken" to "it works, but obviously it's slower than not having a
zillion refs".

I am not nearly as optimistic about the work scaling better than linear.
That is going to take some clever algorithms, as well as possibly some
design tradeoffs. AFAIK, ref advertisement scales linearly. Which is
probably not acceptable over a slow link, and we could do better.

But in my mind, we are still doing the "make it at least linear" portion
of the process. People aren't really using repos with tons of refs
because they currently perform so poorly, so we don't yet have good data
on how to solve the "better than linear" problems. As the quadratic
problems clear up, hopefully the locations of (and solutions to) those
other problems will be more clear.

-Peff
Martin Fick
2012-05-22 22:19:30 UTC
Permalink
Post by Jeff King
AFAIK, ref advertisement scales linearly. Which is
probably not acceptable over a slow link, and we could
do better.
Right, although it is not just a slow link issue.
Linear or not, this does not scale: the cost needs to
be proportional to the number of refs which need
updating, not to the total number of refs in the repo.

I gave the following numbers to Junio and Shawn recently and
I figure it is probably worth mentioning them here to
perhaps give others some insights into just how bad this
problem can be...

Android consists of approximately 400 projects, and if
anyone tags their builds regularly, that means that they
will be tagging 400 projects per build. We have something
like 10K tags on average across 400 projects, so when we do
a simple No Op sync just to see if all 400 projects are up
to date, this yields about 200MB of data over the wire (4M
refs)!!!

That 200MB is using a -j4 on repo sync and it chews up about
1.5 cores on a high end server to serve that 200MB for over
1min. Now imagine a build bot needing to build about 25
images in parallel all just doing a No Op sync to see if
they are up to date! That's 25 * 200MB = 5GB data and 25 *
1.5 cores ~= 36 cores. That means our high-end 24-core
server falls over, all for a No Op.
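Martin's figures can be sanity-checked with a back-of-envelope
estimate. Each advertised ref costs roughly one pkt-line on the wire;
the per-ref sizes below are assumptions (real advertisements also
carry capabilities and peeled-tag lines, and refname lengths vary),
so this lands within a small factor of the reported 200MB/5GB rather
than matching it exactly:

```python
PKT_OVERHEAD = 4 + 40 + 1 + 1  # length prefix + SHA-1 hex + space + newline
AVG_REFNAME = 50               # assumed average refname length

def advert_bytes(n_refs, refname_len=AVG_REFNAME):
    """Rough size of one full ref advertisement for n_refs refs."""
    return n_refs * (PKT_OVERHEAD + refname_len)

per_repo = advert_bytes(10_000)  # ~1 MB for a single 10K-tag repo
per_sync = 400 * per_repo        # ~384 MB for a 400-project no-op sync
per_farm = 25 * per_sync         # ~9.6 GB when 25 build bots poll at once
```

The point stands regardless of the exact constants: the traffic is
proportional to the total ref count, not to the number of refs that
actually changed.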

As you can imagine, we really would like to see this
eliminated from our workflows. If we want to check 400
projects to see if they are up to date, it should be 400
refs, not 400 * 10K,

-Martin
--
Employee of Qualcomm Innovation Center, Inc. which is a
member of Code Aurora Forum
Junio C Hamano
2012-05-22 23:23:17 UTC
Permalink
Post by Martin Fick
Android consists of approximately 400 projects, and if
anyone tags their builds regularly, that means that they
will be tagging 400 projects per build. We have something
like 10K tags on average across 400 projects, so when we do
a simple No Op sync just to see if all 400 projects are up
to date, this yields about 200MB of data over the wire (4M
refs)!!!
...
As you can imagine, we really would like to see this
eliminated from our workflows. If we want to check 400
projects to see if they are up to date, it should be 400
refs, not 400 * 10K.
I think we can ignore that 400 part from the above for now. As they are
independent "projects", it is unreasonable to expect that any solution
would add costs less than linearly when you add more projects. But it
should be possible to make the cost of update discovery comparable between
the cases where you have 10K existing refs that both ends have seen and
one ref got updated vs you started from 20K common refs. The latter does
not have to cost twice (or 4 times if an inefficient way is quadratic).

I think the assumption in your build-bot scenario needs to be a bit more
clarified to give context. I am assuming, when you say "see if projects
are up to date", you mean your build-bot is interested in a single branch
in each repository, possibly together with new tags that have become
reachable from the updated tip of the branch (if the branch tip moved). I
also assume that the build-bot polls very often and that is why 200MB (or
divided by 400 it would be 0.5MB, which still is a lot when you talk about
a single repository) hurts a lot.

Everything below is what I think you already know about; I am giving an
overview of _my_ understanding of the issue for the benefit of others (and
have you sanity check if I misunderstood your pain points).

Even an attempt to optimize "can we skip updating" by asking "ls-remote"
about a single branch is futile with the current protocol, as we expect
the server to respond with the full ref advertisement and filter out the
result on our end. In the upload-pack protocol, the serving side talks
first and that is "here are all the refs I have and their values". The
other side that asks for objects does not have a way to tell it that it is
only interested in a subset, even if it wanted to. Unlike the capability
exchange that happens after the initial connection, there is no gap in
the protocol for the side that initiates the connection to tell the
serving side to skip/delay the initial ref advertisement.
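For reference, the framing behind that opening message: every line of
the advertisement is a pkt-line, a 4-hex-digit length (which counts
its own 4 bytes) followed by the payload, with a zero-length flush-pkt
("0000") marking the end. A minimal Python sketch of what the server
emits before the client gets to say anything:

```python
def pkt_line(payload: bytes) -> bytes:
    """Encode one pkt-line: 4 hex digits giving the total length
    (the 4-byte prefix included), then the payload."""
    return b"%04x" % (len(payload) + 4) + payload

FLUSH_PKT = b"0000"  # zero length = flush-pkt, terminates the advertisement

def advertise(refs):
    """Sketch of the server's opening move in the current protocol:
    one '<sha> <refname>' pkt-line per ref, sent unconditionally."""
    return b"".join(
        pkt_line(sha.encode() + b" " + name.encode() + b"\n")
        for sha, name in refs
    ) + FLUSH_PKT
```

Since the server writes all of this before reading a single byte from
the client, there is simply no place in the exchange where the client
could ask for a subset.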

What the above means is that the transition has to happen by keeping the
current upload-pack and a new protocol has to be invoked by a different
program ("upload-pack-2" or something needs to be taught to the git-daemon
and also invoked by ssh) even if an updated protocol is designed. The
updated connection initiator (i.e. ls-remote and fetch) may be able to try
upload-pack-2 first and fall back to upload-pack after getting a failure,
but afaik nobody figured out if this is doable by checking if a failure
signal that comes from currently deployed software is detailed enough for
our side to tell if the failure is because the other end does not support
upload-pack-2 or because of some other failure (e.g. auth failed, no such
repo, etc.). Regardless of auto-fallback, an updated connection initiator
needs a configuration switch to be explicitly told which protocol to talk
to with which remote.
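The try-then-fall-back idea can be sketched as follows (hypothetical
names, not real git code; note that the sketch blindly treats any
connection failure as "service not supported", which is exactly the
ambiguity described above, since a refusal from an old server may be
indistinguishable from, say, an auth failure):

```python
def connect(remote, transports):
    """Hypothetical client-side fallback: try the new service first,
    then the old one.  'transports' maps a service name to a callable
    that returns a session object or raises ConnectionError on any
    failure.  Swallowing ConnectionError here can mask real errors --
    the unsolved problem noted in the text."""
    for service in ("upload-pack-2", "upload-pack"):
        try:
            return service, transports[service](remote)
        except ConnectionError:
            continue  # assumed to mean "service not supported"
    raise RuntimeError("could not connect to %s" % remote)
```

An explicit per-remote configuration switch, as suggested above, would
sidestep the guesswork by skipping the first attempt entirely.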

What exact protocol "upload-pack-2" talks may be an interesting subject on
its own. It may still have the serving side talk first (probably show its
capabilities and tips of branches if there are not too many of them), or
it may instead let the connection initiator talk first and ask for
specific set of refs. But that is of lesser importance from the
high-level point of view and I won't go into the details in this thread.
Martin Fick
2012-05-23 00:46:37 UTC
Permalink
Post by Junio C Hamano
I think we can ignore that 400 part from the above for
now. As they are independent "projects", it is
unreasonable to expect that any solution would add costs
less than linearly when you add more projects.
As far as git is concerned, sure, they are independent,
but the projects aren't really independent. Separate
repos were likely selected for scalability reasons in
the first place, but this ends up being a "scale fail"
pattern. I don't expect git to fix this * 400 issue,
but I want to highlight how difficult it is to make
things scale in these cases, and that this problem is a
natural multiplier when you split something into
separate git repos, because your tags now tend to be
multiplied. So while 10K tags sounds like a lot but
manageable for 1 repo, it ends up actually being 4M
tags if you split your repos.


...
Post by Junio C Hamano
I think the assumption in your build-bot scenario needs
to be a bit more clarified to give context. I am
assuming, when you say "see if projects are up to date",
you mean your build-bot is interested in a single branch
in each repository, possibly together with new tags that
have become reachable from the updated tip of the branch
(if the branch tip moved).
Just the branch generally.
Post by Junio C Hamano
I also assume that the
build-bot polls very often and that is why 200MB (or
divided by 400 it would be 0.5MB, which still is a lot
when you talk about a single repository) hurts a lot.
To clarify for others (since you likely know), we
serialize our submits, so the "poll" is before every
build to ensure that we are up to date with what just
got merged. In our case we use Gerrit with slaves (read
only Gerrit copies), so we offload the "update" to the
slave, which incurs the same hit, although it is not
quite a No Op but close (there will have been a batch
of commits added to a few projects). But since the
slave may be slightly behind the master (replication
from the master to the slave takes time, partially
because of the same ref advertisement problem), we want
to verify the slave update by polling the master,
which, if replication was fast enough, would be a No
Op. So our attempts to scale by using slaves do not
help very much, since the real update and the No Op are
almost identical in their impacts... There are
workarounds, but mostly hacks.


...
Post by Junio C Hamano
Unlike the capability exchange that happens after the
initial connection, there is no gap in the protocol for
the side that initiates the connection to tell the
serving side to skip/delay the initial ref
advertisement.
Crazy stupid suggestion: could the URI somehow be exploited
to signal the option to disable advertisements? What if the
path part of the URI started with something like several
slashes? Older servers just strip the extra slashes right?
Newer servers could disable advertisements if they see the
slashes. Of course, this gets ugly if scripts add slashes
and expect them to be stripped (old clients talking to new
servers would probably then fail),

-Martin
--
Employee of Qualcomm Innovation Center, Inc. which is a
member of Code Aurora Forum