Discussion:
how to speed up "git log"?
Bruno Haible
2007-02-11 11:52:28 UTC
Hi,

Are there some known tricks to speed up the operation of "git log"?

On a file in a local copy of the coreutils git repository,
"git log tr.c > output" takes
- 33 seconds of CPU time (33 user, 0 system) on a Linux/x86 500MHz system,
- 24 seconds of CPU time (12 user, 12 system) on a MacOS X PowerPC 1.1 GHz
system.
The result shows only 147 commits and a total of 40 KB textual output.

1) Why so much user CPU time?
2) Why so much system CPU time, but only on MacOS X?

Bruno
Johannes Schindelin
2007-02-11 16:49:03 UTC
Hi,
Post by Bruno Haible
Are there some known tricks to speed up the operation of "git log"?
On a file in a local copy of the coreutils git repository,
"git log tr.c > output" takes
- 33 seconds of CPU time (33 user, 0 system) on a Linux/x86 500MHz system,
- 24 seconds of CPU time (12 user, 12 system) on a MacOS X PowerPC 1.1 GHz
system.
The result shows only 147 commits and a total of 40 KB textual output.
Yes, because there were only 147 commits which changed the file. But git
looked at all commits to find that.

Basically, we don't do file versions. File versions do not make sense,
since they strip away the context. See also

http://news.gmane.org/group/gmane.comp.version-control.git/thread=37838

for a real flamewar revolving around that very subject.
Post by Bruno Haible
1) Why so much user CPU time?
See above.

Plus, you are probably not really interested in _all_ revisions changing
that file, are you? Usually the output of git-log -- even with pathname
filtering -- starts almost instantaneously, and is piped to your pager. So,
your numbers are misleading.
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?

Hth,
Dscho
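
The two points made in this reply (a path-limited log has to examine every commit, yet the first match is available almost at once) can be illustrated with a toy model. This is only a sketch: the `history` list, its commit ids, and the `log_path` helper are invented for illustration and say nothing about git's real data structures.

```python
from itertools import islice

# Hypothetical linear history, newest first: (commit id, set of changed paths).
history = [
    ("c6", {"tr.c"}),
    ("c5", {"ls.c"}),
    ("c4", {"cp.c"}),
    ("c3", {"tr.c", "ls.c"}),
    ("c2", {"mv.c"}),
    ("c1", {"tr.c"}),
]

examined = 0  # how many commits the walk has looked at so far


def log_path(commits, path):
    """Lazily yield the ids of commits that touched `path`, newest first."""
    global examined
    for commit_id, changed in commits:
        examined += 1
        if path in changed:
            yield commit_id


# Asking only for the first match stops the walk early.
first = list(islice(log_path(history, "tr.c"), 1))
examined_for_first = examined

# Asking for every match forces a walk over the whole history.
examined = 0
all_matches = list(log_path(history, "tr.c"))
examined_for_all = examined

print(first, examined_for_first)      # ['c6'] 1
print(all_matches, examined_for_all)  # ['c6', 'c3', 'c1'] 6
```

Redirecting the complete output to a file, as in the original measurement, corresponds to the second case.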
Shawn O. Pearce
2007-02-11 23:00:35 UTC
Post by Johannes Schindelin
Post by Bruno Haible
1) Why so much user CPU time?
See above.
Some of the ideas Nico and I have kicked around for a pack v4 (post
1.5.0, obviously) would speed up revision traversal by bypassing
some of the costly decompression overheads.
Post by Johannes Schindelin
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
What does 1.5.0-rc4 do here that didn't happen before? Are you
referring to the mmap sliding window? Because NO_MMAP might be
faster on MacOS X than using mmap (thanks to its slower mmap)... but
I can't say I have performance tested it either way.
--
Shawn.
Johannes Schindelin
2007-02-11 23:08:31 UTC
Hi,
Post by Shawn O. Pearce
Post by Johannes Schindelin
Post by Bruno Haible
1) Why so much user CPU time?
See above.
Some of the ideas Nico and I have kicked around for a pack v4 (post
1.5.0, obviously) would speed up revision traversal by bypassing
some of the costly decompression overheads.
Maybe. But my point (which you did not quote) was this: git log _starts_
very fast, and the information you are most likely after is shown right
away. So I don't think it makes sense investing much time to enhance
performance for a full log.
Post by Shawn O. Pearce
Post by Johannes Schindelin
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
What does 1.5.0-rc4 do here that didn't happen before? Are you
referring to the mmap sliding window?
No. I was referring to v1.5.0-rc0~62, but was too lazy to look that up.

Ciao,
Dscho
Bruno Haible
2007-02-11 23:41:27 UTC
Hello Johannes,

Thanks for the helpful answer.
Post by Johannes Schindelin
Yes, because there were only 147 commits which changed the file. But git
looked at all commits to find that.
Ouch.
Post by Johannes Schindelin
Basically, we don't do file versions. File versions do not make sense,
since they strip away the context.
Is there some other concept or command that git offers? I'm in the situation
where I know that 'tr' in coreutils version 5.2.1 had a certain bug and
version 6.4 does not have the bug, and I want to review all commits that
are relevant to this. I know that the only changes in tr.c are relevant
for this, and I'm interested in a display of the minimum amount of relevant
commit messages. If "git log" is not the right command for this question,
which command is it?
Post by Johannes Schindelin
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
this command:
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)

Bruno
Shawn O. Pearce
2007-02-11 23:46:49 UTC
Post by Bruno Haible
Is there some other concept or command that git offers? I'm in the situation
where I know that 'tr' in coreutils version 5.2.1 had a certain bug and
version 6.4 does not have the bug, and I want to review all commits that
are relevant to this. I know that the only changes in tr.c are relevant
for this, and I'm interested in a display of the minimum amount of relevant
commit messages. If "git log" is not the right command for this question,
which command is it?
Two options come to mind:

`git log v5.2.1..v6.4 -- tr.c`
`git bisect`

The former has a few different flavors, e.g. you can run the
same arguments to `gitk` to view the changes in a graphical form.
The latter will help you do a binary search through the commits
which affected tr.c between the known good and known bad revisions,
allowing you to test the possible candidates for the defect.
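
The binary search that `git bisect` automates can be sketched roughly as follows. This is a toy model, not git's implementation: the commit list and the predicate are hypothetical, and the history is assumed to be linear with exactly one change of behaviour between the two known endpoints.

```python
def first_changed(commits, is_new_behaviour):
    """Return (commit, tests run) for the first commit, in oldest-to-newest
    order, for which is_new_behaviour() is true.

    Assumes commits[0] shows the old behaviour, commits[-1] the new one,
    and that the behaviour changes exactly once in between.
    """
    lo, hi = 0, len(commits) - 1   # lo: known old, hi: known new
    tested = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested += 1
        if is_new_behaviour(commits[mid]):
            hi = mid               # change happened at mid or earlier
        else:
            lo = mid               # change happened after mid
    return commits[hi], tested


# Hypothetical candidates: commits between two releases, c1 (oldest) to c16.
commits = [f"c{i}" for i in range(1, 17)]
culprit, tested = first_changed(commits, lambda c: int(c[1:]) >= 11)
print(culprit, tested)  # c11 4
```

With 16 candidates only 4 need to be built and tested, which is where the saving over reading every commit comes from.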
Post by Bruno Haible
Post by Johannes Schindelin
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)
That's not so good... This is `git log -- tr.c >/dev/null` ?
--
Shawn.
Johannes Schindelin
2007-02-11 23:56:36 UTC
Hi,
Post by Bruno Haible
Post by Johannes Schindelin
Yes, because there were only 147 commits which changed the file. But git
looked at all commits to find that.
Ouch.
Post by Johannes Schindelin
Basically, we don't do file versions. File versions do not make sense,
since they strip away the context.
You could have it faster, but you'd break a very useful concept by doing
so.
Post by Bruno Haible
Is there some other concept or command that git offers? I'm in the
situation where I know that 'tr' in coreutils version 5.2.1 had a
certain bug and version 6.4 does not have the bug, and I want to review
all commits that are relevant to this.
So, only look at those:

git log v5.2.1..v6.4 tr.c

(provided you have the tags for the releases). You can start reviewing
right away, since the output will start very fast (much faster than it
takes to complete the log!).

If you want to get the patches to tr.c with the logs, just add "-p":

git log -p v5.2.1..v6.4 tr.c
Post by Bruno Haible
Post by Johannes Schindelin
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)
Hmmm. I don't have MacOSX any more, so I cannot investigate. You might
find this the perfect opening into working on git ;-)

Hth,
Dscho
Robin Rosenberg
2007-02-11 23:59:17 UTC
Post by Bruno Haible
Hello Johannes,

Thanks for the helpful answer.
Post by Johannes Schindelin
Yes, because there were only 147 commits which changed the file. But git
looked at all commits to find that.
Ouch.
Post by Johannes Schindelin
Basically, we don't do file versions. File versions do not make sense,
since they strip away the context.
Is there some other concept or command that git offers? I'm in the situation
where I know that 'tr' in coreutils version 5.2.1 had a certain bug and
version 6.4 does not have the bug, and I want to review all commits that
are relevant to this. I know that the only changes in tr.c are relevant
for this, and I'm interested in a display of the minimum amount of relevant
commit messages. If "git log" is not the right command for this question,
which command is it?
Since you know that you are not interested in the whole history, you can limit your scan.

git log COREUTILS-5_2_1..COREUTILS-6_4 src/tr.c
Post by Bruno Haible
Post by Johannes Schindelin
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)

Could the UTF-8 stuff have anything to do with this?

-- robin
Bruno Haible
2007-02-12 02:02:00 UTC
Thanks for the responses.
Since you know that you are not interested in the whole history, you can limit your scan.
git log COREUTILS-5_2_1..COREUTILS-6_4 src/tr.c
Thanks, that indeed does the trick: it reduces the time from 33 sec to 11 sec.

To reduce the time even more, and to allow more flexibility among the
search criteria (e.g. "I need the commits from date X to date Y, on this
file set, from anyone except me"), I would need to connect git to a database.
git cannot store all kinds of indices and reverse mappings to allow all
kinds of queries; that's really a classical database application area.
Post by Bruno Haible
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)
Could the UTF-8 stuff have anything to do with this?
Actually, no. Brown paper bag on me for doing benches in different
conditions. The timing difference is an effect of the buffer cache / page
cache:

- After the second repetition of the command (i.e. when all files are cached
in RAM), the timings are
25 seconds real time, 24 seconds of CPU time (13 user, 11 system)
both in git-1.4.4 and -1.5.0-rc4.

- After unmounting and remounting the disk containing the repository (i.e.
when none of the files are cached in RAM), the timings are
49 seconds real time, 38 seconds of CPU time (20 user, 18 system)

Sorry for the false alarm.
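
One way to avoid comparing a cold-cache run with a warm-cache run is to repeat the measurement and report the first run separately. The helper below is a minimal sketch under that assumption; `time_runs` and the stand-in workload are hypothetical, and timing a child process such as `git log` would need the `children_user` and `children_system` fields of `os.times()` instead.

```python
import os
import time


def time_runs(workload, repeats=3):
    """Run `workload` several times; return (real, user, sys) seconds per run.

    The first run may include the cost of filling the OS page cache, so it
    should be reported separately from the later, warm-cache runs.
    """
    results = []
    for _ in range(repeats):
        cpu0, wall0 = os.times(), time.perf_counter()
        workload()
        cpu1, wall1 = os.times(), time.perf_counter()
        results.append(
            (wall1 - wall0, cpu1.user - cpu0.user, cpu1.system - cpu0.system)
        )
    return results


# Stand-in workload that runs inside this process.
runs = time_runs(lambda: sum(range(200_000)))
first_run, warm_runs = runs[0], runs[1:]
print("first:", first_run)
print("warm: ", warm_runs)
```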

Bruno
Johannes Schindelin
2007-02-12 11:19:24 UTC
Hi,
Post by Bruno Haible
Since you know that you are not interested in the whole history, you can limit your scan.
git log COREUTILS-5_2_1..COREUTILS-6_4 src/tr.c
Thanks, that indeed does the trick: it reduces the time from 33 sec to 11 sec.
To reduce the time even more, and to allow more flexibility among the
search criteria (e.g. "I need the commits from date X to date Y, on this
file set, from anyone except me"), I would need to connect git to a
database. git cannot store all kinds of indices and reverse mappings to
allow all kinds of queries; that's really a classical database
application area.
[in the following paragraph, "index" means the index on a classical
database table]

And -- as everywhere else with classical databases -- you have to ask if
it is worth it. Given the fact that a one-time use of such an index is
_worse_ than doing it without index at all (building and writing the
index is _at least_ as expensive as searching once without an index), I'd
rather doubt it.

However, if you do similar kinds of searches quite often, it makes tons of
sense to connect to a database. We already use sqlite in cvsserver, so I'd
try that.
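
A minimal sketch of what such an index could look like, using Python's bundled `sqlite3` module. The table layout, the column names, and the sample rows are all invented for illustration; in practice the rows would first have to be extracted from the repository, which is the up-front cost described above.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE touched (commit_id TEXT, author TEXT, date TEXT, path TEXT)"
)
# An ordinary database index on the columns we want to search by.
db.execute("CREATE INDEX touched_path_date ON touched (path, date)")

# Hypothetical sample data: one row per (commit, changed path).
rows = [
    ("c1", "alice", "2005-03-01", "src/tr.c"),
    ("c2", "me",    "2005-06-10", "src/tr.c"),
    ("c3", "bob",   "2005-09-20", "src/ls.c"),
    ("c4", "bob",   "2006-01-15", "src/tr.c"),
    ("c5", "alice", "2006-11-30", "src/tr.c"),
]
db.executemany("INSERT INTO touched VALUES (?, ?, ?, ?)", rows)

# "Commits from date X to date Y, on this file set, from anyone except me."
query = """
    SELECT DISTINCT commit_id, date FROM touched
    WHERE path IN ('src/tr.c', 'src/ls.c')
      AND date BETWEEN '2005-04-01' AND '2006-06-30'
      AND author <> 'me'
    ORDER BY date
"""
result = [row[0] for row in db.execute(query)]
print(result)  # ['c3', 'c4']
```

Dates are stored as ISO 8601 strings so that plain text comparison orders them correctly.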

Ciao,
Dscho
Junio C Hamano
2007-02-12 04:08:37 UTC
Post by Robin Rosenberg
Post by Bruno Haible
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)
Could the UTF-8 stuff have anything to do with this?
I doubt it -- sliding mmap() in the current git, while it is a good
change overall for handling really huge repos, would most likely
perform worse than the fixed mmap() in the 1.4.4 series on
platforms with slow mmap(), most notably on MacOS X.

It _might_ be possible that turning some sliding mmap() calls
into pread() makes it perform better on MacOS X.

I wonder what happens if git is compiled with NO_MMAP there...
Shawn O. Pearce
2007-02-12 06:06:41 UTC
Post by Junio C Hamano
I doubt it -- sliding mmap() in the current git, while it is a good
change overall for handling really huge repos, would most likely
perform worse than the fixed mmap() in the 1.4.4 series on
platforms with slow mmap(), most notably on MacOS X.
It _might_ be possible that turning some sliding mmap() calls
into pread() makes it perform better on MacOS X.
I wonder what happens if git is compiled with NO_MMAP there...
So I ran three trials, v1.5.0-rc4-26-gcc46a74 with and without
NO_MMAP against v1.4.4.4 on a freshly repacked git.git.

v150-mmap:
3.33 real 3.12 user 0.05 sys
3.32 real 3.12 user 0.05 sys
3.34 real 3.12 user 0.05 sys

v150-nommap:
3.46 real 3.13 user 0.16 sys
3.43 real 3.13 user 0.16 sys
3.46 real 3.13 user 0.16 sys

v1444-mmap:
3.30 real 3.09 user 0.05 sys
3.30 real 3.09 user 0.05 sys
3.25 real 3.09 user 0.04 sys

CFLAGS="-O2"; the above timings are three representative samples
out of 10 runs each, all hot cache.

Clearly the sliding mmap window isn't hurting us in this case by
very much, and NO_MMAP really isn't helping matters at all.
--
Shawn.
Junio C Hamano
2007-02-12 06:11:30 UTC
Post by Shawn O. Pearce
Post by Junio C Hamano
I doubt it -- sliding mmap() in the current git, while it is a good
change overall for handling really huge repos, would most likely
perform worse than the fixed mmap() in the 1.4.4 series on
platforms with slow mmap(), most notably on MacOS X.
It _might_ be possible that turning some sliding mmap() calls
into pread() makes it perform better on MacOS X.
I wonder what happens if git is compiled with NO_MMAP there...
So I ran three trials, v1.5.0-rc4-26-gcc46a74 with and without
NO_MMAP against v1.4.4.4 on a freshly repacked git.git.
I do not think freshly repacked git.git is a good test case for
a real-world workload where this really matters. Isn't your
default pack window large enough to cover it with a single
window, or perhaps two at most?
Shawn O. Pearce
2007-02-12 06:22:24 UTC
Post by Junio C Hamano
Post by Shawn O. Pearce
So I ran three trials, v1.5.0-rc4-26-gcc46a74 with and without
NO_MMAP against v1.4.4.4 on a freshly repacked git.git.
I do not think freshly repacked git.git is a good test case for
a real-world workload where this really matters. Isn't your
default pack window large enough to cover it with a single
window, or perhaps two at most?
It's one window, maybe two, as git.git is ~12 MiB and the window
size is 1 MiB (NO_MMAP) or 32 MiB (with mmap).

On linux.git:

v150-mmap:
2.23 real 1.99 user 0.10 sys
2.19 real 1.98 user 0.10 sys
2.19 real 1.98 user 0.10 sys

v150-nommap:
2.63 real 1.99 user 0.50 sys
2.67 real 1.98 user 0.51 sys
2.63 real 1.99 user 0.51 sys

v1444:
2.15 real 1.94 user 0.09 sys
2.19 real 1.95 user 0.10 sys
2.16 real 1.94 user 0.10 sys

Again, we aren't too far away from v1.4.4.4, but the NO_MMAP clearly
is hurting us, even on Mac OS X.
--
Shawn.
Shawn O. Pearce
2007-02-12 06:28:13 UTC
Post by Shawn O. Pearce
So I ran three trials, v1.5.0-rc4-26-gcc46a74 with and without
NO_MMAP against v1.4.4.4 on a freshly repacked git.git.
I probably should have mentioned, my run (in all cases) was:

git rev-list HEAD -- Makefile 2>/dev/null

cheap, a file that exists pretty much everywhere, and that triggers
the path limiter in the revision walking code.

BTW, I discovered by accident tonight that this works:

cp git-rev-list ../git-1444
../git-1444 rev-list

which is so not something I would have expected. :-) I honestly
expected the wrapper to puke and say it doesn't know what command
1444 is.
--
Shawn.
Linus Torvalds
2007-02-12 04:20:43 UTC
Post by Bruno Haible
Hello Johannes,
Thanks for the helpful answer.
Post by Johannes Schindelin
Yes, because there were only 147 commits which changed the file. But git
looked at all commits to find that.
Ouch.
This should become a FAQ.

Git simply DOES NOT HAVE per-file history. And having it is actually a
BUG in other systems.

Not having per-file history is what allows git to do

git log directory-or-file-set

rather than being able to track just one file. You can't do it sanely
with per-file history (because to tie the per-file histories back
together in a logical sequence, you need the global history to sort it
again!)

So:

- git is "slow" on single-file things, because such things DON'T EVEN
EXIST in git!

When you do "git log <path-limiter>", it really always ends up being a
full git log.

- but this is fundamentally what allows you to track multiple directories
well. It's what makes things like "gitk drivers/scsi/" actually work,
where you really can see the history for a random *collection* of
files. Nobody else can do it, afaik, and git just considers a single
filename to be a case of the "random collection of files".

The example I gave to corecode was to do

gitk builtin-rev-list.c
gitk builtin-rev-parse.c
gitk builtin-rev-parse.c builtin-rev-list.c

and realize that doing the history for two files together is NOT AT ALL
EQUIVALENT to doing the history for those files individually and stitching
it together.

(The reason the above is a great example is that both of the files alone
have a very simple linear history, but when you look at the *combined*
history you actually see concurrent development, and merges: you see
merge commits that simply don't "exist" when only looking at the history
of one of them separately).
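
The effect can be reproduced in a toy model of path-limited history walking. This is a simplified sketch that loosely follows the default history simplification described in the git-log documentation (a commit that is identical to a parent on the given paths is skipped and only that parent is followed). The commit graph, the file names `a` and `b`, and the helper names are hypothetical.

```python
# Hypothetical commit graph: id -> (parent ids, full tree as {path: content}).
commits = {
    "R":  ((),           {"a": 1, "b": 1}),   # root
    "A1": (("R",),       {"a": 2, "b": 1}),   # branch 1 changes only a
    "B1": (("R",),       {"a": 1, "b": 2}),   # branch 2 changes only b
    "M":  (("A1", "B1"), {"a": 2, "b": 2}),   # merge of both branches
}


def restrict(commit_id, paths):
    """The commit's tree, limited to the paths of interest."""
    tree = commits[commit_id][1]
    return {p: tree[p] for p in paths if p in tree}


def history(head, paths):
    """Commits shown for `paths`, using a simplified default simplification."""
    shown, seen, todo = [], set(), [head]
    while todo:
        cid = todo.pop(0)
        if cid in seen:
            continue
        seen.add(cid)
        parents = commits[cid][0]
        mine = restrict(cid, paths)
        same = [p for p in parents if restrict(p, paths) == mine]
        if not parents:
            if mine:                 # root commit that contains the paths
                shown.append(cid)
        elif same:
            todo.append(same[0])     # nothing changed: skip, follow that parent
        else:
            shown.append(cid)        # differs from every parent: show it
            todo.extend(parents)
    return shown


print(history("M", ["a"]))       # ['A1', 'R']
print(history("M", ["b"]))       # ['B1', 'R']
print(history("M", ["a", "b"]))  # ['M', 'A1', 'B1', 'R']
```

Each file alone yields a linear history without the merge `M`; only the combined path set shows `M`, so the combined history is not the union of the two individual ones.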
Post by Bruno Haible
Is there some other concept or command that git offers? I'm in the situation
where I know that 'tr' in coreutils version 5.2.1 had a certain bug and
version 6.4 does not have the bug, and I want to review all commits that
are relevant to this. I know that the only changes in tr.c are relevant
for this, and I'm interested in a display of the minimum amount of relevant
commit messages. If "git log" is not the right command for this question,
which command is it?
Do

git log v5.2.1..v6.4 -- tr.c

(or whatever your tag-names for releases are) where you can limit the log
generation cost by giving the beginning commit. But yeah, it *will* look
at the whole history in between, so if there is a long long history
between v5.2.1 and v6.4, you'll still end up using reasonable amounts of
CPU.
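
What the range notation selects (commits reachable from the newer tag but not from the older one) can be sketched on a toy graph. The commit ids and the placement of the two tags are invented; a path limiter would still have to examine every commit in the resulting set.

```python
# Hypothetical graph: id -> parent ids.
parents = {
    "c1": (),
    "c2": ("c1",),
    "c3": ("c2",),          # plays the role of the older tag
    "c4": ("c3",),
    "c5": ("c3",),
    "c6": ("c4", "c5"),     # plays the role of the newer tag
}


def reachable(tip):
    """All commits reachable from `tip` by following parent links."""
    seen, todo = set(), [tip]
    while todo:
        cid = todo.pop()
        if cid not in seen:
            seen.add(cid)
            todo.extend(parents[cid])
    return seen


def commit_range(old, new):
    """Commits selected by `old..new`: reachable from new but not from old."""
    return reachable(new) - reachable(old)


in_range = sorted(commit_range("c3", "c6"))
print(in_range)  # ['c4', 'c5', 'c6']
```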
Post by Bruno Haible
Post by Johannes Schindelin
Probably the mmap() problem. Does it go away when you use git 1.5.0-rc4?
No, it became even worse: git-1.5.0-rc4 is twice as slow as git-1.4.4 for
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)
That's an interesting fact in itself. Do you have the repo available
somewhere?

Yes, some of the operations can be improved upon by not wasting quite so
much time uncompressing stuff, so we could at least help this a bit. But
that's a long-term thing. The slowdown is bad, and that probably has some
simple explanation.

Linus
Bruno Haible
2007-02-12 11:27:15 UTC
Linus,
Post by Linus Torvalds
Post by Bruno Haible
git-1.4.4: 25 seconds real time, 24 seconds of CPU time (12 user, 12 system)
git-1.5.0: 50 seconds real time, 39 seconds of CPU time (20 user, 19 system)
That's an interesting fact in itself.
Sorry, these measurements happened to be done in different conditions:
repo fully cached in RAM vs. repo not yet in buffer cache / page cache.

When measured under the same conditions, no speed difference is visible
between git-1.4.4 and git-1.5.0-rc4.

Bruno
Bruno Haible
2007-02-11 23:52:23 UTC
- do not use "tr.c", unless you really need it: git has to read more
of a commit in this case. Just "git log" takes only 0.9 sec on the
machine above.
"git log" is indeed faster, but is useless for the given task, since it doesn't
show which of the 4 megabytes of commit messages apply to tr.c.
Post by Bruno Haible
On a file in a local copy of the coreutils git repository,
"git log tr.c > output" takes
Why do you need _all_ commits, btw?
I want to quickly find the cause of a behaviour change between tr.c of
coreutils 5.2.1 and the one of coreutils 6.4. It's a period of 1.5 years,
but limited to a single file. Can't git produce this quickly?
Post by Bruno Haible
2) Why so much system CPU time, but only on MacOS X?
MacOS X is famous for its bad performance when doing serious work.
The mmap(2) of it, in particular.
But at least, a MacOS X machine is still interactively usable when it uses
6 times more swap than the machine's RAM size. Whereas a Linux 2.4 machine
is interactively unusable already with 1.5 to 2 times more swap than the
machine has RAM.

Bruno
Bruno Haible
2007-02-17 19:19:20 UTC
MacOS X is famous for its bad performance when doing serious work.
The mmap(2) of it, in particular.
You can't blame MacOS X mmap(2) for git's slow execution of "git log".
Here are the execution times of "git log tr.c > output"

- with git-1.5.0-rc4 built with -DNO_MMAP

real 0m26.032s
user 0m13.580s
sys 0m11.730s

- with git-1.5.0-rc4 built with the default settings:

real 0m25.469s
user 0m13.530s
sys 0m11.490s

You can see that using mmap() provides a speedup of about 2% on MacOS X,
which is similar to the 4% that Shawn measured on Linux.
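
The percentages can be checked against the real-time figures quoted in the thread. The `speedup` helper is only an illustrative name; the second comparison uses the first sample of each build from Shawn's git.git trial.

```python
def speedup(slow, fast):
    """Relative reduction in wall-clock time, as a fraction of `slow`."""
    return (slow - fast) / slow


# Real times from this message: NO_MMAP build vs. default (mmap) build.
here = round(100 * speedup(26.032, 25.469), 1)
print(here)     # 2.2

# First real-time samples from the git.git trial: v150-nommap vs. v150-mmap.
git_git = round(100 * speedup(3.46, 3.33), 1)
print(git_git)  # 3.8
```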

Bruno
Johannes Schindelin
2007-02-17 23:20:49 UTC
Hi,
Post by Bruno Haible
MacOS X is famous for its bad performance when doing serious work.
The mmap(2) of it, in particular.
You can't blame MacOS X mmap(2) for git's slow execution of "git log".
No, but you can blame the person calling git log and waiting until it
finishes. See the list archives for reasons why.

If this comes up one more time, I'm very tempted to write a scathing
remark in the FAQ.

Ciao,
Dscho
Bruno Haible
2007-02-18 00:09:26 UTC
Post by Johannes Schindelin
you can blame the person calling git log and waiting until it
finishes. See the list archives for reasons why.
Usually the output of git-log -- even with pathname
filtering -- starts almost instantaneously, and is piped to your pager.
The pager ('less') in a console is not a good solution for everyone:
- People used to GUI editors (kate, nedit, ...) miss a scroll bar for
navigation. You can't use kate or nedit as a pager.
- PAGER="vi -" also reads all input before it displays anything.
- PAGER="xless" likewise.
- In Emacs shell-mode, with PAGER="", you see the output as it is produced,
but it's disturbing to work in a buffer which is growing, where the scrollbar
continues to change its position.

It's OK for many people, but not for everyone.

Bruno
Johannes Schindelin
2007-02-18 00:10:00 UTC
Hi,
Post by Bruno Haible
Post by Johannes Schindelin
you can blame the person calling git log and waiting until it
finishes. See the list archives for reasons why.
Usually the output of git-log -- even with pathname
filtering -- starts almost instantaneously, and is piped to your pager.
- People used to GUI editors (kate, nedit, ...) miss a scroll bar for
navigation. You can't use kate or nedit as a pager.
- PAGER="vi -" also reads all input before it displays anything.
- PAGER="xless" likewise.
- In Emacs shell-mode, with PAGER="", you see the output as it is produced,
but it's disturbing to work in a buffer which is growing, where the scrollbar
continues to change its position.
It's OK for many people, but not for everyone.
So why don't you go scratch that itch, and write a decent GUI pager?

Ciao,
Dscho
Shawn O. Pearce
2007-02-18 06:33:51 UTC
Post by Bruno Haible
MacOS X is famous for its bad performance when doing serious work.
The mmap(2) of it, in particular.
You can see that using mmap() provides a speedup of about 2% on MacOS X,
which is similar to the 4% that Shawn measured on Linux.
Uh, I was testing on Mac OS X (G4 PowerBook).
--
Shawn.