Discussion:
Eric Sink's blog - notes on git, dscms and a "whole product" approach
Martin Langhoff
2009-04-27 08:55:55 UTC
Permalink
Eric Sink has been working on the (commercial, proprietary) centralised
SCM Vault for a while. He's written recently about his explorations
around the new crop of DSCMs, and I think it's quite interesting. A
quick search of the list archives makes me think it wasn't discussed
before.

The guy is knowledgeable, and writes quite witty posts -- naturally,
there's plenty to disagree on, but I'd like to encourage readers not
to nitpick or focus on where Eric is wrong. It is interesting to read
where he thinks git and other DSCMs are missing the mark.

Maybe he's right, maybe he's wrong, but damn he's interesting :-)

So here's the blog - http://www.ericsink.com/

These are the best entry points
http://www.ericsink.com/entries/quirky.html
http://www.ericsink.com/entries/hg_denzel.html

To be frank, I think he's wrong in some details (as he's admittedly
only spent limited time with it) but right on the larger picture
(large userbases want it integrated and foolproof, bugtracking needs
to go distributed alongside the code, git is as powerful^Wdangerous as
C).

cheers,



martin
--
***@gmail.com
***@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
Jakub Narebski
2009-04-28 11:24:31 UTC
Permalink
Eric Sink has been working on the (commercial, proprietary) centralised
SCM Vault for a while. He's written recently about his explorations
around the new crop of DSCMs, and I think it's quite interesting. A
quick search of the list archives makes me think it wasn't discussed
before.

The guy is knowledgeable, and writes quite witty posts -- naturally,
there's plenty to disagree on, but I'd like to encourage readers not
to nitpick or focus on where Eric is wrong. It is interesting to read
where he thinks git and other DSCMs are missing the mark.

Maybe he's right, maybe he's wrong, but damn he's interesting :-)

So here's the blog - http://www.ericsink.com/
"Here's a blog"... and therefore my dilemma: should I post my reply
as a comment on this blog, or should I reply here on the git mailing list?

These are the best entry points
Because those two entries are quite different, I'll reply separately.

1. "Ten Quirky Issues with Cross-Platform Version Control"
http://www.ericsink.com/entries/quirky.html
which is a generic comment about (mainly) using version control
in a heterogeneous environment, where different machines have different
filesystem limitations. I'll concentrate here on that issue.

2. "Mercurial, Subversion, and Wesley Snipes"
http://www.ericsink.com/entries/hg_denzel.html
where, paraphrasing, Eric Sink says that he doesn't write about
Mercurial and Subversion because they are perfect. Or at least not
as controversial (and controversial means interesting).

To be frank, I think he's wrong in some details (as he's admittedly
only spent limited time with it) but right on the larger picture
(large userbases want it integrated and foolproof, bugtracking needs
to go distributed alongside the code, git is as powerful^Wdangerous as
C).
Neither of the blog posts mentioned above touches those issues, BTW...

----------------------------------------------------------------------
Ad 1. "Ten Quirky Issues with Cross-Platform Version Control"

Actually those are two issues: troubles with the different limitations
of different filesystems, and differences in dealing with line endings
in text files on different platforms.


Line endings (issue 8.) is in theory and in practice (at least for
Git) a non-issue.

In theory you should use the project's convention for the end-of-line
character in text files, and use a smart editor that can deal (or can
be configured to deal) with this issue correctly.

In practice this is a matter of correctly setting up core.autocrlf
(and in the more complicated cases -- which for git are very, very
rare -- configuring which files are text and which are not).
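For the common cases that boils down to something like this (a rough
sketch; the '*.dat' pattern is only an example):

    # Windows checkouts: CRLF in the working tree, LF in the repository
    git config --global core.autocrlf true
    # Linux/MacOS X: strip CRLF on the way in, never add it on checkout
    git config --global core.autocrlf input
    # and mark files that must never be converted
    echo '*.dat -crlf' >> .gitattributes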


There are a few classes of troubles with filesystems (with filenames).

1. Different limitations on file names (e.g. pathname length),
different special characters, different special filenames (if any).
Those are issues 2. (special basename PRN on MS Windows),
issue 3. (trailing dot, trailing whitespace), issue 4. (pathname
and filename length limit), issue 6. (special characters, in this
case colon being the path element delimiter on MacOS, but it is also
about special characters like colon, asterisk and question mark
on MS Windows) and also issue 7. (name that begins with a dash)
in Eric Sink's article.

The answer is a filename convention in the project. Simply DON'T
use filenames which can cause problems. There is no way to simply
solve this problem in the version control system, although I think if
you really, really, really need it you should be able to cobble
something together using low-level git tools to have a different
filename in the working directory from the one used in the repository
(and index).
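Hypothetically, the cobbling could look something like this (the
filenames are made up, and it is only a sketch of the idea, not a
recommendation):

    # working tree keeps a Windows-safe name, the repository records
    # the problematic one
    blob=$(git hash-object -w aux_.txt)
    git update-index --add --cacheinfo 100644 $blob "aux.txt"
    # (git status will of course complain about both names afterwards)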

See also David A. Wheeler's essay "Fixing Unix/Linux/POSIX Filenames:
Control Characters (such as Newline), Leading Dashes, and Other Problems"
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

DON'T DO THAT.


2. "Case-insensitive" but "case-preserving" filesystems; the case
where some different filenames are equivalent (like 'README' and
'readme' on case-insensitive filesystem), but are returned as you
created them (so if you created 'README', you would get 'README' in
directory listing, but filesystem would return that 'readme' exists
too). This is issue 1. ('README' and 'readme' in the same
directory) in Eric Sink's article.

The answer is like for the previous issue: don't. Simply DO NOT create
files with filenames which differ only in case (like the unfortunate
ct_conntrack.h and cn_CONNTRACK.h or similar in the Linux kernel).

But I think that even in the case where such an unfortunate incident
(two filenames differing only in case) occurs, you can deal with it in
Git by using lower-level tools (and editing only one of two such
files at once). You would get spurious info about modified files
in git-status, though... perhaps that could be improved using the
infrastructure created (IIRC) by Linus for dealing with 'insane'
filesystems.

DON'T DO THAT, SOLVABLE.


3. Non "Case-preserving" filesystems, where filename as sequence of
bytes differ between what you created, and what you get from
filesystem. An example here is MacOS X filesystem, which accepts
filenames in NFC composed normalized form of Unicode, but stores
them internally and returns them in NFD decomposed form. This is
issue 9. (Espa=F1ol being "Espa\u00f1ol" in NFC, but "Espan\u0303ol"
in NFD).

In this case 'don't do this' might not be an acceptable answer.
Perhaps you need non-ASCII characters in filenames, and you cannot
always pick a filesystem or specify a mount option that makes this
a non-problem.

I remember that this issue was discussed extensively on the git mailing
list, but I don't remember what the conclusion was (besides agreeing
that a filesystem that is not "*-preserving" is not a sane filesystem ;).
In particular I do not remember whether Git can deal with this issue
sanely (I remember Linus adding infrastructure for that, but did it
solve this problem...).

PROBABLY SOLVED.


4. Filesystems which cannot store all SCM-sane metainfo, for example
filesystems without support for symbolic links, or without support
for the executable permission (executable bit). This is an extension
of issue 10. (which is limited to symbolic links) in Eric Sink's
article.

In Git you have core.fileMode to ignore executable bit differences
(you would need to use SCM tools and not filesystem tools to
manipulate it), and core.symlinks to be able to check out symlinks as
plain text files (again using SCM tools to manipulate them).
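In practice that means something like this on the affected machine
(the script name is only an example):

    git config core.fileMode false   # stop trusting the filesystem's exec bit
    git config core.symlinks false   # check symlinks out as plain files
    # record the executable bit through the SCM instead of chmod:
    git update-index --chmod=+x build.sh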

SOLVED.


There is also a mistaken implicit assumption that version control
systems do (and should) preserve all metadata.

5. The issue of extra metadata that is not SCM-sane, and which
different filesystems can or cannot store. Examples include full
Unix permissions, Unix ownership (and the group a file belongs to),
other permission-related metadata such as ACLs, and extra resources
tied to a file such as EAs (extended attributes) for some Linux
filesystems or the (in)famous resource fork on MacOS. This is
issue 5. (resource fork on MacOS vs. xattrs on Linux) in Eric Sink's
article.

This is not an issue for an SCM, a _source_ code management system,
to solve. Preserving extra metadata indiscriminately can cause
problems, e.g. with full permissions and ownership. Therefore
SCMs preserve only a limited SCM-sane subset of metadata. If you
need to preserve extra metadata, you can use (in good SCMs) hooks
for that, like e.g. etckeeper uses metastore (in Git).
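Roughly, the etckeeper idea is this (a sketch, assuming metastore's
--save/--apply interface and its default .metadata file):

    # .git/hooks/pre-commit
    metastore --save          # dump owner/group/mode/xattrs to ./.metadata
    git add .metadata

    # .git/hooks/post-checkout (and post-merge)
    metastore --apply         # put the recorded metadata back on the files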

NOT A PROBLEM.

--
Jakub Narebski
Poland
ShadeHawk on #git
Robin Rosenberg
2009-04-28 21:00:56 UTC
Permalink
Post by Jakub Narebski
Line endings (issue 8.) is in theory and in practice (at least for
Git) a non-issue.
In theory you should use project's convention for end of line
character in text files, and use smart editor that can deal (or can be
configured to deal) with this issue correctly.
Windows people will disagree.
Post by Jakub Narebski
In practice this is a matter of correctly setting up core.autocrlf
(and in more complicated cases, where more complicated means for git
very very rare, configuring which files are text and which are not).
Which proves it is an issue, or we wouldn't need to tune settings
to make it work right. A non-issue is something that "just works"
without turning knobs. I have had to think more than once about
what the issue was and the right way to solve it. My case can
be considered weird: Eclipse on Linux generated files
with CRLF which I happily committed, and Git on Windows happily
converted them to LF and determined that HEAD and the index were out
of sync, but refused to commit the CRLF->LF change because there
was no "diff". You know the fix, but don't tell me it's not an issue.

-- robin
Martin Langhoff
2009-04-29 06:55:29 UTC
Permalink
DON'T DO THAT.
DON'T DO THAT, SOLVABLE.
As I mentioned, Eric is taking the perspective of offering a supported
SCM to a large and diverse audience. As such, his notes are
interesting not because he's right or he's wrong.

We can be "right" and say "don't do that" if we shrink our audience so
that it looks a lot like us. There, fixed.

But something tells me that successful tools are -- by definition --
tools that grow past their creators' use.

So from Eric's perspective, it is worthwhile to work on all those
issues, and get them right for the end user -- support things we don't
like, offer foolproof catches and warnings that prevent the user from
shooting their lovely toes off to Mars, etc.

His perspective is one of commercial licensing, but even if we aren't
driven by the "each new user is a new dollar" bit, the long term hopes
for git might also be to be widely used and to improve the version
control life of many unsuspecting users.

To get there, I suspect we have to understand more of Eric's perspective.

that's my 2c.



m
--
***@gmail.com
***@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
Jeff King
2009-04-29 07:21:05 UTC
Permalink
Post by Martin Langhoff
So from Eric's perspective, it is worthwhile to work on all those
issues, and get them right for the end user -- support things we don't
like, offer foolproof catches and warnings that prevent the user from
shooting their lovely toes off to Mars, etc.
I read a few of his blog postings. He kept complaining about the
features of git that I like the most. :)

So one thing I took away from it is that there probably isn't _one_
interface that works for everybody. I can see his arguments about how
"add -p" can be dangerous, and how history rewriting can be dangerous.
So for some users, blocking those features makes sense.

But for other users (myself included), those are critical features that
make me _way_ more productive. And I manage the risk that comes from
using them as part of my workflow, and it isn't a problem in practice.

While part of me is happy that cogito is now dead (not because I didn't
think it was good, but because having two sets of tools just seemed to
create maintenance and staleness headaches), I do sometimes wonder if we
would be better off with several "from scratch" git interfaces based
around the plumbing (or even a C library). And I don't just mean simple
wrappers around git commands, but whole new interfaces which make
decisions like "no history rewriting at all", and try to provide a safer
interface based on that.

Of course, _I_ wouldn't want to use such an interface. But in theory I
could seamlessly interoperate with people who did.

-Peff
Markus Heidelberg
2009-04-29 20:05:37 UTC
Permalink
Post by Jeff King
Post by Martin Langhoff
So from Eric's perspective, it is worthwhile to work on all those
issues, and get them right for the end user -- support things we don't
like, offer foolproof catches and warnings that prevent the user from
shooting their lovely toes off to Mars, etc.
I read a few of his blog postings. He kept complaining about the
features of git that I like the most. :)
I can see his arguments about how
"add -p" can be dangerous
Actually, I don't see a very special case here in committing a never
compiled/tested worktree state. You can do this with every VCS (even
without an index like git's), just by selectively committing files
instead of the whole current worktree.

Markus
Jakub Narebski
2009-04-29 07:52:16 UTC
Permalink
[I think you cut out a bit too much. Here I resurrected it]

JN> 1. Different limitations on file names (e.g. pathname length),
JN> different special characters, different special filenames
JN> (if any).
[...]
JN> The answer is convention for filenames in a project. Simply
JN> DON'T use filenames which can cause problems.
[...]
DON'T DO THAT.
What could be a proper solution to that, if you do not accept a social
rather than technical restriction? We can have a pre-commit hook that
checks filenames for portability (which is deployment-specific, and
shouldn't be part of the SCM, perhaps with the exception of being an
example hook), but it wouldn't help with non-portable filenames that
are already there, on a filesystem that cannot represent them.
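A rough sketch of such an example hook (the exact checks are project
policy; this one only flags the usual Windows suspects):

    #!/bin/sh
    # pre-commit: refuse newly added paths with reserved basenames,
    # forbidden characters, or a trailing dot/space
    bad=$(git diff --cached --name-only --diff-filter=A |
          grep -E '(^|/)(CON|PRN|AUX|NUL)(\.|/|$)|[<>:"\\|?*]|[. ]$')
    if test -n "$bad"; then
        echo >&2 "non-portable filenames staged for commit:"
        echo >&2 "$bad"
        exit 1
    fi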

If I remember correctly, Git has for some time had a layer which can
translate between filenames in the repository and filenames on the
filesystem, but I'm not sure it is generic enough to be a solution to
this problem, and currently there is no way to manipulate this mapping,
I think.


JN> 2. "Case-insensitive" but "case-preserving" filesystems. [...]
JN>
JN> The answer is like for previous issue: don't. Simply DO NOT
JN> create files with filenames which differ only in case [...]
DON'T DO THAT, SOLVABLE.
By 'solvable' here I mean that you should be able to modify only one of
the clashing files at once (checkout 'README', modify, add to index,
remove from filesystem, checkout 'readme', modify, etc.), and deal with
the annoyances in git-status output. It can be done in Git, with a
medium amount of hacking. I don't think any other SCM can do even this,
and I cannot think of a better, automatic solution that would somehow
deal with case-clashing.
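In command form the dance looks roughly like this (sketch only, with
'README'/'readme' as in the example above):

    git checkout README      # materialize one of the pair
    $EDITOR README
    git add README           # its new content is now safe in the index
    rm README                # get it out of the way of its twin
    git checkout readme      # now the other one can be worked on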

Note that all bets are off on a case-insensitive and non-preserving
filesystem.

By the way, wouldn't a better solution be to use a sane filesystem,
rather than complicating the SCM? ;-)
As I mentioned, Eric is taking the perspective of offering a supported
SCM to a large and diverse audience. As such, his notes are
interesting not because he's right or he's wrong.

We can be "right" and say "don't do that" if we shrink our audience so
that it looks a lot like us. There, fixed.
<quote source="Dune by Frank Herbert">
[...] the attitude of the knife -- chopping off what's incomplete and
saying: "Now it's complete because it's ended here."
</quote>

I could not resist posting this quote :-P
But something tells me that successful tools are -- by definition --
tools that grow past their creators' use.

So from Eric's perspective, it is worthwhile to work on all those
issues, and get them right for the end user -- support things we don't
like, offer foolproof catches and warnings that prevent the user from
shooting their lovely toes off to Mars, etc.
Warnings and catches I can accept; adding complications and corner
cases for situations which can be trivially avoided with a bit of
social engineering, aka project guidelines... not so much.

I simply cannot see the situation where you _must_ have dangerously
unportable file names (trailing dot, trailing whitespace) and
case-clashing files...
His perspective is one of commercial licensing, but even if we aren't
driven by the "each new user is a new dollar" bit, the long term hopes
for git might also be to be widely used and to improve the version
control life of many unsuspecting users.

To get there, I suspect we have to understand more of Eric's
perspective.

that's my 2c.
By the way, I think that the article on cross-platform version control
(version control in a heterogeneous environment) is quite a good article.
I don't quite like the "10 Issues"/"Top 10" way of writing, but the
article examines different ways that a heterogeneous environment can
trip up an SCM.

In my opinion Git does quite well here, where it can, and where the
issue is one to be solved by the SCM and not otherwise (extra metadata
like the resource fork).

--
Jakub Narebski
Poland
Martin Langhoff
2009-04-29 08:25:56 UTC
Permalink
Post by Jakub Narebski
DON'T DO THAT.
What could be a proper solution to that, if you do not accept a social
rather than technical restriction?
Let's say strong checks for case-sensitivity clashes, leading/trailing
dots, utf-8 encoding maladies, etc., switched on by default. And note
that to be user-friendly you want most of those checks at 'add' time.

If we don't like a particular FS, or we think it is messing up our
utf-8 filenames, say it up-front, at clone and checkout time. For
example, if the checkout has files with interesting utf-8 names, it'd
be reasonable to check for filename mangling.

Some things are hard or impossible to prevent - the utf-8 encoding
maladies of OSX for example. But it may be detectable on checkout.

In short, play on the defensive, for the benefit of users who are not
kernel developers.

It will piss off kernel & git developers and slow some operations
somewhat. It will piss off old-timers like me. But I'll say git config
--global core.trainingwheels no and life will be good.

It may be - as Jeff King points out - a matter of a polished git
porcelain. We've seen lots of porcelains, but no smooth user-targeted
porcelain yet.

cheers,



m
--
***@gmail.com
***@laptop.org -- School Server Architect
- ask interesting questions
- don't get distracted with shiny stuff - working code first
- http://wiki.laptop.org/go/User:Martinlanghoff
Jakub Narebski
2009-04-28 18:16:07 UTC
Permalink
Post by Martin Langhoff
Eric Sink has been working on the (commercial, proprietary) centralised
SCM Vault for a while. He's written recently about his explorations
around the new crop of DSCMs, and I think it's quite interesting. A
quick search of the list archives makes me think it wasn't discussed
before.
The guy is knowledgeable, and writes quite witty posts -- naturally,
there's plenty to disagree on, but I'd like to encourage readers not
to nitpick or focus on where Eric is wrong. It is interesting to read
where he thinks git and other DSCMs are missing the mark.
Maybe he's right, maybe he's wrong, but damn he's interesting :-)
So here's the blog - http://www.ericsink.com/
"Here's a blog"... and therefore my dilemma. Should I post my reply
as a comment to this blog, or should I reply here on git mailing list?

I think I will just add link to this thread in GMane mailing list
archive for git mailing list...
Post by Martin Langhoff
These are the best entry points
* "Ten Quirky Issues with Cross-Platform Version Control"
Post by Martin Langhoff
http://www.ericsink.com/entries/quirky.html
which I have answered in a separate post in this thread

* "Mercurial, Subversion, and Wesley Snipes"
Post by Martin Langhoff
http://www.ericsink.com/entries/hg_denzel.html
which I will comment on now. The 'ES>' prefix marks quotes from the
blog post.


First there is a list of earlier blog posts, with links, which makes
the article in question a good starting point.

ES> As part of that effort, I have undertaken an exploration of the
ES> DVCS world. Several weeks ago I started writing one blog entry
ES> every week, mostly focused on DVCS topics. In chronological
ES> order, here they are:
ES>
ES> * The one where I gripe about Git's index

where Eric complains that "git add -p" allows for committing untested
changes... not knowing about "git stash --keep-index", and not
understanding that committing is (usually) separate from publishing in
distributed version control systems (so you can check after the commit,
and amend the commit if it does not pass the tests).
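That is, with "git stash --keep-index" the check-before-commit workflow
looks something like this ('make test' standing in for whatever the
project's test command is):

    git add -p                    # stage only the hunks meant for the commit
    git stash save --keep-index   # put the unstaged rest aside
    make test                     # build/test exactly what will be committed
    git commit
    git stash pop                 # bring the rest of the work back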

ES> * The one where I whine about the way Git allows developers to
ES> rearrange the DAG

where Eric seems not to notice that you are strongly encouraged to do
the 'rearranging of the DAG' (rewriting history) _only_ in the
unpublished (not yet made public) part of history.

ES> * The one where it looks like I am against DAG-based version
ES> control but I'm really not

where Eric conflates linear versus merge workflows with the
update-before-commit versus commit-then-merge paradigm, not noticing
that you can have linear history using a sane commit-update-rebase
sequence rather than unsafe update-before-commit.
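That is, roughly (with 'origin/master' standing in for whatever the
upstream branch is):

    git commit -a -m "my change"   # record the work first, as-is
    git fetch origin
    git rebase origin/master       # then replay it on top of upstream:
                                   # history stays linear, and nothing is
                                   # lost if the rebase goes badly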

ES> * The one where I fuss about DVCSes that try to act like
ES> centralized tools

where the DVCS in question that behaves this way is Bazaar (if I
understood this correctly).

ES> * The one where I complain that DVCSes have a lousy story when it
ES> comes to bug-tracking

where Eric correctly notes that distributed version control would not
help much if you use a centralized bugtracker, and speculates about
the features that a distributed bugtracker should have. A very nice
post, in my opinion.

ES> * The one where I lament that I want to like Darcs but I can't

where Eric talks about the difference between parentage in a merge
commit (which is needed for good merging) and the "parentage"/weak link
of a cherry-picked commit; Git uses weak link = no link.

ES> * The one where I speculate cluelessly about why Git is so fast

where Eric guesses instead of asking on the git mailing list or the
#git channel... ;-)

ES> Along the way, I've been spending some time getting hands-on
ES> experience with these tools. I've been using Bazaar for several
ES> months. I don't like it very much. I am currently in the process
ES> of switching to Git, but I don't expect to like it very much
ES> either.

Aaaargh... if you expect not to like it very much, I would be very
surprised if you found it to your liking...

ES> So why don't I write about Mercurial? Because I'm pretty sure I
ES> would like it.
ES>
ES> I chose Bazaar and Git for the experience. But if I were choosing
ES> a DVCS as a regular user, I would choose Mercurial. I've used it
ES> some, and found it to be incredibly pleasant. It seems like the
ES> DVCS that got everything just about right. That's great if you're
ES> a user, but for a writer, what's interesting about that?

Well, Mercurial IMHO didn't get everything right. Leaving aside
implementation issues, like dealing with copies, binary files, and
large files, it got (IMHO) these wrong:
* branching with multiple branches per repository
* tags, which should be transferable but non-versioned
--
Jakub Narebski
Poland
ShadeHawk on #git
Sitaram Chamarty
2009-04-29 07:54:50 UTC
Permalink
Post by Jakub Narebski
ES> * The one where I lament that I want to like Darcs but I can't
where Eric talks about difference between parentage in merge commit
(which is needed for good merging) and "parentage"/weak link in
cherry-picked commit; Git uses weak link = no link.
Well the patch-id is a sort of "compute on demand" link, so
it would qualify as a weak link, especially because git
manages to use it during a rebase.
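For example (a sketch; 'origin/master' and 'topic' are placeholders):

    git show HEAD | git patch-id    # the "computed on demand" identity
    git cherry origin/master topic  # '-' marks commits whose patch is
                                    # already present upstream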

I wanted to point that out but I didn't see a link to post
comments so I didn't bother.
Jakub Narebski
2009-04-30 12:17:58 UTC
Permalink
Post by Jakub Narebski
Post by Martin Langhoff
Eric Sink has been working on the (commercial, proprietary) centralised
SCM Vault for a while. He's written recently about his explorations
around the new crop of DSCMs, and I think it's quite interesting.
[...]
Post by Jakub Narebski
Post by Martin Langhoff
So here's the blog - http://www.ericsink.com/
[...]
Post by Jakub Narebski
* "Mercurial, Subversion, and Wesley Snipes"
Post by Martin Langhoff
http://www.ericsink.com/entries/hg_denzel.html
which I will comment on now. The 'ES>' prefix marks quotes from the
blog post.
[...]
Post by Jakub Narebski
ES> * The one where I speculate cluelessly about why Git is so fast
where Eric guesses instead of asking on the git mailing list or the
#git channel... ;-)
This issue is interesting: what features and what design decisions
make Git fast? One of the goals of Git was good performance; are
we there?

All quotes marked 'es> ' below are from "Why is Git so Fast?" post
http://www.ericsink.com/entries/why_is_git_fast.html

es> One: Maybe Git is fast simply because it's a DVCS.
es>
es> There's probably some truth here. One of the main benefits touted
es> by the DVCS fanatics is the extra performance you get when
es> everything is "local".

This is, I think, quite obvious. Accessing memory is faster than
accessing disk, which in turn is faster than accessing the network. So
if commit and (change)log do not require access to a server via the
network, they are so much faster.

BTW, that is why Subversion stores 'pristine' versions of files
alongside the working copy: to make status and diff fast enough to be
usable. Which in turn might make an SVN checkout larger than a full
Git clone ;-)

es>
es> But this answer isn't enough. Maybe it explains why Git is faster
es> than Subversion, but it doesn't explain why Git is so often
es> described as being faster than the other DVCSs.

Not only described; see http://git.or.cz/gitwiki/GitBenchmarks
(although some, if not most of those benchmarks are dated,
and e.g. Bazaar claims to have much better performance now).

es>
es> Two: Maybe Git is fast because Linus Torvalds is so smart.

[non answer; the details are important]

es> Three: Maybe Git is fast because it's written in C instead of one
es> of those newfangled higher-level languages.
es>
es> Nah, probably not. Lots of people have written fast software in
es> C#, Java or Python.
es>
es> And lots of people have written really slow software in
es> traditional native languages like C/C++. [...]

Well, I guess that access to low-level optimization techniques like
mmap is important for performance. But here I am guessing and
speculating like Eric did; well, at least I am asking on the proper
forum ;-)

We have some anecdotal evidence supporting this possibility (which
Eric dismisses), namely the fact that pure-Python Bazaar is the slowest
of the three most common open source DVCSs (Git, Mercurial, Bazaar) and
the fact that parts of Mercurial were written in C for better performance.

We can also compare implementations of Git in other, higher-level
languages with the reference implementation in C (and shell scripts,
and Perl ;-)). Take, for example, the most complete (though still not
fully complete) Java implementation: JGit. I hope that JGit developers
can tell us whether using a higher-level language affects performance,
by how much, and which features of the higher-level language cause the
decrease in performance. Of course we have to take into account the
possibility that JGit simply isn't as well optimized because of less
manpower.

es>
es> Four: Maybe Git is fast because being fast is the primary goal for
es> Git.

[non answer; the details are important]

es>
es> Five: Maybe Git is fast because it does less.
es>
es> One of my favorite recent blog entries is this piece[1] which
es> claims that the way to make code faster is to have it do less.
es>
es> [1] "How to write fast code" by Kas Thomas
es> http://asserttrue.blogspot.com/2009/03/how-to-write-fast-code.html
[...]

es>
es> For example, the way you get something in the Git index is you use
es> the "git add" command. Git doesn't scan your working copy for
es> changed files unless you explicitly tell it to. This can be a
es> pretty big performance win for huge trees. Even when you use the
es> "remember the timestamp" trick, detecting modified files in a
es> really big tree can take a noticeable amount of time.

That of course depends on how you compare the performance of different
version control systems (so as not to compare apples with oranges). But
if you compare e.g. "<scm> commit" with the Git equivalent "git commit
-a", the above is simply not true.

BTW, when doing comparisons you also have to take care of the reverse,
e.g. git doing more, like calculating and displaying a diffstat by
default for merges/pulls.

es>
es> Or maybe Git's shortcut for handling renames is faster than doing
es> them more correctly[2] like Bazaar does.
es>
es> [2] "Renaming is the killer app of distributed version control"
es> http://www.markshuttleworth.com/archives/123

Errr... what?


es> Six: Maybe Git is fast because it doesn't use much external code.
es>
es> Very often, when you are facing a decision to use somebody else's
es> code or write it yourself, there is a performance tradeoff. Not
es> always, but often. Maybe the third party code is just slower than
es> the code you could write yourself if you had time to do it. Or
es> maybe there is an impedance mismatch between the API of the
es> external library and your own architecture.
es>
es> This can happen even when the library is very high quality. For
es> example, consider libcurl. This is a great library. Tons of
es> people use it. But it does have one problem that will cause
es> performance problems for some users: When using libcurl to fetch
es> an object, it wants to own the buffer. In some situations, this
es> can end up forcing you to use extra memcpys or temporary files.
es> The reason all the low level calls like send() and recv() allow
es> the caller to own the loop and the buffer is because this is the
es> best way to avoid the need to make extra copies of the data on
es> disk or in memory.
[...]

es>
es> Maybe Git is fast because every time they faced one of these "buy
es> vs. build" choices, they decided to just write it themselves.

I don't think so. Rather the opposite is true. Git uses libcurl for
HTTP transport. Git uses zlib for compression. Git uses SHA-1 from
OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for
(binary) deltaifying, for diffs and for merges.

OTOH Git includes several micro-libraries of its own: parseopt, strbuf,
ALLOC_GROW, etc. NIH syndrome? I don't think so; rather avoiding
extra dependencies (bstring vs strbuf), and existing solutions not
fitting all needs (popt/argp/getopt vs parse-options).

es> Seven: Maybe Git isn't really that fast.
es>
es> If there is one thing I've learned about version control it's that
es> everybody's situation is different. It is quite likely that Git
es> is a lot faster for some scenarios than it is for others.
es>
es> How does Git handle really large trees? Git was designed primary
es> to support the efforts of the Linux kernel developers. A lot of
es> people think the Linux kernel is a large tree, but it's really
es> not. Many enterprise configuration management repositories are
es> FAR bigger than the Linux kernel.

c.f. "Why Perforce is more scalable than Git" by Steve Hanov
http://gandolf.homelinux.org/blog/index.php?id=50

I don't really know about this.

But there is one issue Eric Sink didn't think about:

Eight: Git seems fast.
======================

Here I mean concentrating on low _latency_, which means that when git
produces more than one page of output (for example "git log"), it
tries to output the first page as fast as possible; in other words it
is "git <sth> | head -25 >/dev/null" that has to be fast, not
"git <sth> >/dev/null" itself.

Having a progress indicator appear whenever there is a longer wait
(quite a fresh feature) also helps the impression of being fast...


And what do you think about this?
--
Jakub Narebski
Poland
ShadeHawk on #git
Michael Witten
2009-04-30 12:56:35 UTC
Permalink
Post by Jakub Narebski
I hope that JGit developers can
tell us whether using higher level language affects performance, how
much, and what features of higher-level language are causing decrease
in performance.
Java is definitely higher than C, but you can do some pretty low-level
operations on bits and bytes and the like, not to mention the presence
of a JIT.

My point: I don't think that Java can tell us anything special in this regard.
Jakub Narebski
2009-04-30 15:28:04 UTC
Permalink
Post by Michael Witten
Post by Jakub Narebski
I hope that JGit developers can
tell us whether using higher level language affects performance, how
much, and what features of higher-level language are causing decrease
in performance.
Java is definitely higher than C, but you can do some pretty low-level
operations on bits and bytes and the like, not to mention the presence
of a JIT.
My point: I don't think that Java can tell us anything special in this regard.
Let's rephrase the question a bit, then: what low-level operations were
needed for good performance in JGit?
--
Jakub Narebski
Poland
Shawn O. Pearce
2009-04-30 18:52:44 UTC
Permalink
Post by Jakub Narebski
Let's rephrase the question a bit, then: what low-level operations were
needed for good performance in JGit?
Aside from the message I just posted:

- Avoid String, it's too expensive most of the time. Stick with
byte[], and better, stick with data that is a triplet of (byte[],
int start, int end) to define a region of data. Yes, it's annoying,
as it's 3 values you need to pass around instead of just 1, but
it makes a big difference in running time.

- Avoid allocating byte[] for SHA-1s, instead we convert to 5 ints,
which can be inlined into an object allocation.

- Subclass instead of contain references. We extend ObjectId to
attach application data, rather than contain a reference to an
ObjectId. Classical Java programming techniques would say this
is a violation of encapsulation. But it gets us the same memory
impact that C Git gets by saying:

struct appdata {
unsigned char sha1[20];
....
}

- We're hurting dearly for not having more efficient access to the
pack-*.pack file data. mmap in Java is crap. We implement our
own page buffer, reading in blocks of 8192 bytes at a time and
holding them in our own cache.

Really, we should write our own mmap library as an optional JNI
thing, and tie it into libz so we can efficiently run inflate()
off the pack data directly.

- We're hurting dearly for not having more efficient access to the
pack-*.idx files. Again, with no mmap we read the entire bloody
index into memory. But since you won't touch most of it we keep
it in large byte[], but since you are searching with an ObjectId
(5 ints) we pay a conversion price on every search step where
we have to copy from the large byte[] to 5 local variable ints,
and then compare to the ObjectId. It's an overhead C git doesn't
have to deal with.

Anyway.

I'm still just amazed at how well JGit runs given these limitations.
I guess that's Moore's Law for you. 10 years ago, JGit wouldn't
have been practical.
--
Shawn.
Kjetil Barvik
2009-04-30 20:36:03 UTC
Permalink
* "Shawn O. Pearce" <***@spearce.org> writes:
<snipp>
| - Avoid allocating byte[] for SHA-1s, instead we convert to 5 ints,
| which can be inlined into an object allocation.

What do people think about doing something similar in C Git?

That is, convert the current internal representation of the SHA-1 from
"unsigned char sha1[20]" to "unsigned long sha1[5]"?

Ok, I currently see 2 problems with it:

1) Will the type "unsigned long" always be unsigned 32 bit on all
platforms on all computers? do we need an "unit_32_t" thing?

2) Can we get in truble because of differences between litle- and
big-endian machines?

And, simmilar I can see or guess the following would be positive with
this change:

3) From a SHA1 library I worked with some time ago, I noticed that
it internaly used the type "unsigned long arr[5]", so it can
mabye be possible to get some shurtcuts or maybe speedups here,
if we want to do it.

4) The "static inline void hashcpy(....)" in cache.h could then
maybe be written like this:

static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5])
{
sha_dst[0] = sha_src[0];
sha_dst[1] = sha_src[1];
sha_dst[2] = sha_src[2];
sha_dst[3] = sha_src[3];
sha_dst[4] = sha_src[4];
}

And hopefully it will be compiled to just 5 store/move
instructions, or at least be faster than the current
memcpy() call. But maybe we get more compiled instructions compared
to a single call to memcpy()?

5) Similar to 4) for the other SHA-1 related hash functions near
hashcpy() in cache.h

OK, just some thoughts. Sorry if this has already been discussed,
but I could not find anything about it after a simple Google search.

-- kjetil
Shawn O. Pearce
2009-04-30 20:40:33 UTC
Permalink
Post by Kjetil Barvik
<snipp>
| - Avoid allocating byte[] for SHA-1s, instead we convert to 5 ints,
| which can be inlined into an object allocation.
What do people think about doing something similar in C Git?
That is, convert the current internal representation of the SHA-1 from
"unsigned char sha1[20]" to "unsigned long sha1[5]"?
It's not worth the code churn.
Post by Kjetil Barvik
1) Will the type "unsigned long" always be unsigned 32 bit on all
platforms on all computers? do we need an "unit_32_t" thing?
Yea, "unsigned long" isn't always 32 bits. So we'd need to use
uint32_t. Which we already use elsewhere, but still.
Post by Kjetil Barvik
2) Can we get in trouble because of differences between little- and
big-endian machines?
Yes, especially if compare was implemented using native uint32_t
compare and the processor was little-endian.
Post by Kjetil Barvik
4) The "static inline void hashcpy(....)" in cache.h could then
It's already done as "memcpy(a, b, 20)", which most compilers will
inline and probably reduce to 5 word moves anyway. That's why
hashcpy() itself is inline.
--
Shawn.
Kjetil Barvik
2009-04-30 21:36:07 UTC
Permalink
* "Shawn O. Pearce" <***@spearce.org> writes:
|> 4) The "static inline void hashcpy(....)" in cache.h could then
|> maybe be written like this:
|
| Its already done as "memcpy(a, b, 20)" which most compilers will
| inline and probably reduce to 5 word moves anyway. That's why
| hashcpy() itself is inline.

But would the compiler be able to trust that the hashcpy() is always
called with correct word alignment on variables a and b?

I made a test and compiled git with:

make USE_NSEC=1 CFLAGS="-march=core2 -mtune=core2 -O2 -g2 -fno-stack-protector" clean all

compiler: gcc (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3
CPU: Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz GenuineIntel

Then used gdb to get the following:

(gdb) disassemble write_sha1_file
Dump of assembler code for function write_sha1_file:
0x080e3830 <write_sha1_file+0>: push %ebp
0x080e3831 <write_sha1_file+1>: mov %esp,%ebp
0x080e3833 <write_sha1_file+3>: sub $0x58,%esp
0x080e3836 <write_sha1_file+6>: lea -0x10(%ebp),%eax
0x080e3839 <write_sha1_file+9>: mov %ebx,-0xc(%ebp)
0x080e383c <write_sha1_file+12>: mov %esi,-0x8(%ebp)
0x080e383f <write_sha1_file+15>: mov %edi,-0x4(%ebp)
0x080e3842 <write_sha1_file+18>: mov 0x14(%ebp),%ebx
0x080e3845 <write_sha1_file+21>: mov %eax,0x8(%esp)
0x080e3849 <write_sha1_file+25>: lea -0x44(%ebp),%edi
0x080e384c <write_sha1_file+28>: lea -0x24(%ebp),%esi
0x080e384f <write_sha1_file+31>: mov %edi,0x4(%esp)
0x080e3853 <write_sha1_file+35>: mov %esi,(%esp)
0x080e3856 <write_sha1_file+38>: mov 0x10(%ebp),%ecx
0x080e3859 <write_sha1_file+41>: mov 0xc(%ebp),%edx
0x080e385c <write_sha1_file+44>: mov 0x8(%ebp),%eax
0x080e385f <write_sha1_file+47>: call 0x80e0350 <write_sha1_file_prepare>
0x080e3864 <write_sha1_file+52>: test %ebx,%ebx
0x080e3866 <write_sha1_file+54>: je 0x80e3885 <write_sha1_file+85>

0x080e3868 <write_sha1_file+56>: mov -0x24(%ebp),%eax
0x080e386b <write_sha1_file+59>: mov %eax,(%ebx)
0x080e386d <write_sha1_file+61>: mov -0x20(%ebp),%eax
0x080e3870 <write_sha1_file+64>: mov %eax,0x4(%ebx)
0x080e3873 <write_sha1_file+67>: mov -0x1c(%ebp),%eax
0x080e3876 <write_sha1_file+70>: mov %eax,0x8(%ebx)
0x080e3879 <write_sha1_file+73>: mov -0x18(%ebp),%eax
0x080e387c <write_sha1_file+76>: mov %eax,0xc(%ebx)
0x080e387f <write_sha1_file+79>: mov -0x14(%ebp),%eax
0x080e3882 <write_sha1_file+82>: mov %eax,0x10(%ebx)

I admit that I am not particularly familiar with Intel machine
instructions, but I guess that the above 10 mov instructions are the
result of the compiled inline hashcpy() in the write_sha1_file()
function in sha1_file.c

Question: would it be possible for the compiler to compile it down to
just 5 mov instructions if we had used an unsigned 32-bit type? Or is
this the best we can reasonably hope for inside the write_sha1_file()
function?

I checked the output of "disassemble function_foo" for 3 other
functions, and it seems that those 3 functions also got 10 mov
instructions for the inline hashcpy(), as far as I can tell.

0x080e3885 <write_sha1_file+85>: mov %esi,(%esp)
0x080e3888 <write_sha1_file+88>: call 0x80e3800 <has_sha1_file>
0x080e388d <write_sha1_file+93>: xor %edx,%edx
0x080e388f <write_sha1_file+95>: test %eax,%eax
0x080e3891 <write_sha1_file+97>: jne 0x80e38b6 <write_sha1_file+134>
0x080e3893 <write_sha1_file+99>: mov 0xc(%ebp),%eax
0x080e3896 <write_sha1_file+102>: mov %edi,%edx
0x080e3898 <write_sha1_file+104>: mov %eax,0x4(%esp)
0x080e389c <write_sha1_file+108>: mov -0x10(%ebp),%ecx
0x080e389f <write_sha1_file+111>: mov 0x8(%ebp),%eax
0x080e38a2 <write_sha1_file+114>: movl $0x0,0x8(%esp)
0x080e38aa <write_sha1_file+122>: mov %eax,(%esp)
0x080e38ad <write_sha1_file+125>: mov %esi,%eax
0x080e38af <write_sha1_file+127>: call 0x80e1e40 <write_loose_object>
0x080e38b4 <write_sha1_file+132>: mov %eax,%edx
0x080e38b6 <write_sha1_file+134>: mov %edx,%eax
0x080e38b8 <write_sha1_file+136>: mov -0xc(%ebp),%ebx
0x080e38bb <write_sha1_file+139>: mov -0x8(%ebp),%esi
0x080e38be <write_sha1_file+142>: mov -0x4(%ebp),%edi
0x080e38c1 <write_sha1_file+145>: leave
0x080e38c2 <write_sha1_file+146>: ret
End of assembler dump.
(gdb)

So, maybe the compiler is doing the right thing after all?

-- kjetil
Steven Noonan
2009-05-01 00:23:57 UTC
Permalink
|> 4) The "static inline void hashcpy(....)" in cache.h could then
|> maybe be written like this:
|
| It's already done as "memcpy(a, b, 20)" which most compilers will
| inline and probably reduce to 5 word moves anyway. That's why
| hashcpy() itself is inline.
But would the compiler be able to trust that the hashcpy() is always
called with correct word alignment on variables a and b?

<snipp>

Question: would it be possible for the compiler to compile it down to
just 5 mov instructions if we had used an unsigned 32-bit type? Or is
this the best we can reasonably hope for inside the write_sha1_file()
function?

<snipp>

So, maybe the compiler is doing the right thing after all?
Well, I just tested this with GCC myself. I used this segment of code:

#include <memory.h>
void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
{
memcpy(sha_dst, sha_src, 20);
}

I compiled using Apple's GCC 4.0.1 (note that GCC 4.3 and 4.4 vanilla
yield the same code) with these parameters to get Intel assembly:
gcc -O2 -arch i386 -march=pentium3 -mtune=pentium3
-fomit-frame-pointer -fno-strict-aliasing -S test.c
and these parameters to get the equivalent PowerPC code:
gcc -O2 -mcpu=G5 -arch ppc -fomit-frame-pointer
-fno-strict-aliasing -S test.c

Intel code:
.text
.align 4,0x90
.globl _hashcpy
_hashcpy:
subl $12, %esp
movl 20(%esp), %edx
movl 16(%esp), %ecx
movl (%edx), %eax
movl %eax, (%ecx)
movl 4(%edx), %eax
movl %eax, 4(%ecx)
movl 8(%edx), %eax
movl %eax, 8(%ecx)
movl 12(%edx), %eax
movl %eax, 12(%ecx)
movl 16(%edx), %eax
movl %eax, 16(%ecx)
addl $12, %esp
ret
.subsections_via_symbols


and the PowerPC code:

.section __TEXT,__text,regular,pure_instructions
.section __TEXT,__picsymbolstub1,symbol_stubs,pure_instructions,32
.machine ppc970
.text
.align 2
.p2align 4,,15
.globl _hashcpy
_hashcpy:
lwz r0,0(r4)
lwz r2,4(r4)
lwz r9,8(r4)
lwz r11,12(r4)
stw r0,0(r3)
stw r2,4(r3)
stw r9,8(r3)
stw r11,12(r3)
lwz r0,16(r4)
stw r0,16(r3)
blr
.subsections_via_symbols


So it does look like GCC does what it should and it inlines the memcpy.

A bit off topic, but the results are rather interesting to me, and I
think I see a weakness in how GCC is doing this on Intel. Someone
please correct me if I'm wrong, but the PowerPC code seems much better
because it can yield very high instruction-level parallelism. It does
5 loads and then 5 stores, using 4 registers for temporary storage and
2 registers for pointers.

I realize the Intel x86 architecture is quite constrained in that it
has so few general purpose registers, but there has to be better code
than what GCC emitted above. It seems like the processor would stall
because of the quantity of sequential inter-dependent instructions
that can't be done in parallel (mov to memory that depends on a mov to
eax, etc).

I suppose the code might not be stalling if it's using the maximum
number of registers and doing as many memory accesses as it can per
clock, but based on known details about the architecture, does it seem
to be doing that?

- Steven
James Pickens
2009-05-01 01:25:21 UTC
Permalink
Post by Steven Noonan
A bit off topic, but the results are rather interesting to me, and I
think I see a weakness in how GCC is doing this on Intel. Someone
please correct me if I'm wrong, but the PowerPC code seems much better
because it can yield very high instruction-level parallelism. It does
5 loads and then 5 stores, using 4 registers for temporary storage and
2 registers for pointers.
I realize the Intel x86 architecture is quite constrained in that it
has so few general purpose registers, but there has to be better code
than what GCC emitted above. It seems like the processor would stall
because of the quantity of sequential inter-dependent instructions
that can't be done in parallel (mov to memory that depends on a mov to
eax, etc).
There aren't any unnecessary dependencies. Take this sequence:

1: movl (%edx), %eax
2: movl %eax, (%ecx)
3: movl 4(%edx), %eax
4: movl %eax, 4(%ecx)

There are two unavoidable dependencies - #2 depends on #1, and #4
depends on #3. #3 does not depend on #2, even though they both
use %eax, because #3 is a write to %eax. So whatever was in %eax
before #3 is irrelevant. The processor knows this and will use
register renaming to execute #1 and #3 in parallel, and #2 and #4
in parallel.

James
Kjetil Barvik
2009-05-01 09:19:04 UTC
Permalink
* Steven Noonan <***@uplinklabs.net> writes:
| On Thu, Apr 30, 2009 at 2:36 PM, Kjetil Barvik <***@broadpark.no> wrote:
|> * "Shawn O. Pearce" <***@spearce.org> writes:
|> |> 4) The "static inline void hashcpy(....)" in cache.h could then
|> |> maybe be written like this:
|> |
|> | Its already done as "memcpy(a, b, 20)" which most compilers will
|> | inline and probably reduce to 5 word moves anyway. That's why
|> | hashcpy() itself is inline.
|>
|> But would the compiler be able to trust that the hashcpy() is always
|> called with correct word alignment on variables a and b?

<snipp>

| Well, I just tested this with GCC myself. I used this segment of code:
|
| #include <memory.h>
| void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
| {
| memcpy(sha_dst, sha_src, 20);
| }

OK, here is a small test, which maybe shows at least one difference
between using "unsigned char sha1[20]" and "unsigned long sha1[5]".
Given the following file, memcpy_test.c:

#include <string.h>
extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src);
void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src)
{
memcpy(sha_dst, sha_src, 20);
}
extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src);
void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src)
{
memcpy(sha_dst, sha_src, 5);
}

And, compiled with the following:

gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c

It produced the following memcpy_test.s file:

.file "memcpy_test.c"
.text
.p2align 4,,15
.globl hashcpy_ulong
.type hashcpy_ulong, @function
hashcpy_ulong:
movl 8(%esp), %edx
movl 4(%esp), %ecx
movl (%edx), %eax
movl %eax, (%ecx)
movzbl 4(%edx), %eax
movb %al, 4(%ecx)
ret
.size hashcpy_ulong, .-hashcpy_ulong
.p2align 4,,15
.globl hashcpy_uchar
.type hashcpy_uchar, @function
hashcpy_uchar:
movl 8(%esp), %edx
movl 4(%esp), %ecx
movl (%edx), %eax
movl %eax, (%ecx)
movl 4(%edx), %eax
movl %eax, 4(%ecx)
movl 8(%edx), %eax
movl %eax, 8(%ecx)
movl 12(%edx), %eax
movl %eax, 12(%ecx)
movl 16(%edx), %eax
movl %eax, 16(%ecx)
ret
.size hashcpy_uchar, .-hashcpy_uchar
.ident "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3"
.section .note.GNU-stack,"",@progbits

So, the "unsigned long" type hashcpy() used 7 instructions, compared
to 13 for the "unsigned char" type hascpy().

Would I guess correct if the hashcpy_ulong() function will also use
less CPU cycles, and then would be faster than hashcpy_uchar()?

-- kjetil
Mike Hommey
2009-05-01 09:34:27 UTC
Permalink
Post by Kjetil Barvik
|> |> 4) The "static inline void hashcpy(....)" in cache.h could then
|> |
|> | Its already done as "memcpy(a, b, 20)" which most compilers will
|> | inline and probably reduce to 5 word moves anyway. That's why
|> | hashcpy() itself is inline.
|>
|> But would the compiler be able to trust that the hashcpy() is always
|> called with correct word alignment on variables a and b?
<snipp>
|
| #include <memory.h>
| void hashcpy(unsigned char *sha_dst, const unsigned char *sha_src)
| {
| memcpy(sha_dst, sha_src, 20);
| }
OK, here is a small test, which maybe shows at least one difference
between using "unsigned char sha1[20]" and "unsigned long sha1[5]".
#include <string.h>
extern void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src);
void hashcpy_uchar(unsigned char *sha_dst, const unsigned char *sha_src)
{
memcpy(sha_dst, sha_src, 20);
}
extern void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src);
void hashcpy_ulong(unsigned long *sha_dst, const unsigned long *sha_src)
{
memcpy(sha_dst, sha_src, 5);
}
gcc -O2 -mtune=core2 -march=core2 -S -fomit-frame-pointer memcpy_test.c
.file "memcpy_test.c"
.text
.p2align 4,,15
.globl hashcpy_ulong
movl 8(%esp), %edx
movl 4(%esp), %ecx
movl (%edx), %eax
movl %eax, (%ecx)
movzbl 4(%edx), %eax
movb %al, 4(%ecx)
ret
.size hashcpy_ulong, .-hashcpy_ulong
.p2align 4,,15
.globl hashcpy_uchar
movl 8(%esp), %edx
movl 4(%esp), %ecx
movl (%edx), %eax
movl %eax, (%ecx)
movl 4(%edx), %eax
movl %eax, 4(%ecx)
movl 8(%edx), %eax
movl %eax, 8(%ecx)
movl 12(%edx), %eax
movl %eax, 12(%ecx)
movl 16(%edx), %eax
movl %eax, 16(%ecx)
ret
.size hashcpy_uchar, .-hashcpy_uchar
.ident "GCC: (Gentoo 4.3.3-r2 p1.1, pie-10.1.5) 4.3.3"
So, the "unsigned long" type hashcpy() used 7 instructions, compared
to 13 for the "unsigned char" type hascpy().
But your "unsigned long" version only copies 5 bytes...

Mike
Kjetil Barvik
2009-05-01 09:42:05 UTC
Permalink
* Mike Hommey <***@glandium.org> writes:
<snipp>
| But your "unsigned long" version only copies 5 bytes...

Yes, that is true... OK, same result for hashcpy_uchar() and
hashcpy_ulong() when corrected for this.

--kjetil, with a brown paper bag
Tony Finch
2009-05-01 17:42:38 UTC
Permalink
Post by Kjetil Barvik
I admit that I am not particularly familiar with Intel machine
instructions, but I guess that the above 10 mov instructions are the
result of the compiled inline hashcpy() in the write_sha1_file()
function in sha1_file.c
Question: would it be possible for the compiler to compile it down to
just 5 mov instructions if we had used an unsigned 32-bit type?
No, because the x86 can't do direct memory-to-memory moves.

Tony.
--
f.anthony.n.finch <***@dotat.at> http://dotat.at/
GERMAN BIGHT HUMBER: SOUTHWEST 5 TO 7. MODERATE OR ROUGH. SQUALLY SHOWERS.
MODERATE OR GOOD.
Dmitry Potapov
2009-05-01 05:24:34 UTC
Permalink
Post by Kjetil Barvik
4) The "static inline void hashcpy(....)" in cache.h could then
static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5])
{
sha_dst[0] = sha_src[0];
sha_dst[1] = sha_src[1];
sha_dst[2] = sha_src[2];
sha_dst[3] = sha_src[3];
sha_dst[4] = sha_src[4];
}
And hopefully it will be compiled to just 5 store/move
instructions, or at least be faster than the current
memcpy() call. But maybe we get more compiled instructions compared
to a single call to memcpy()?
Good compilers can inline memcpy and should produce more efficient code
for the target architecture, which can be faster than manually written.
On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1
while the above code requires 5 operations.
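
To illustrate (a rough sketch of what the inlined memcpy() boils down to
on a 64-bit target, not literal compiler output): the 20 bytes can move
as two 8-byte words plus one 4-byte word.

#include <stdint.h>
#include <string.h>

void hashcpy_x86_64_sketch(unsigned char *dst, const unsigned char *src)
{
    uint64_t a, b;
    uint32_t c;

    memcpy(&a, src,      8);   /* bytes  0..7  */
    memcpy(&b, src + 8,  8);   /* bytes  8..15 */
    memcpy(&c, src + 16, 4);   /* bytes 16..19 */
    memcpy(dst,      &a, 8);
    memcpy(dst + 8,  &b, 8);
    memcpy(dst + 16, &c, 4);
}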

Dmitry
Mike Hommey
2009-05-01 09:42:21 UTC
Permalink
Post by Dmitry Potapov
Post by Kjetil Barvik
4) The "static inline void hashcpy(....)" in cache.h could then
static inline void hashcpy(unsigned long sha_dst[5], const unsigned long sha_src[5])
{
sha_dst[0] = sha_src[0];
sha_dst[1] = sha_src[1];
sha_dst[2] = sha_src[2];
sha_dst[3] = sha_src[3];
sha_dst[4] = sha_src[4];
}
And hopefully it will be compiled to just 5 store/move
instructions, or at least be faster than the current
memcpy() call. But maybe we get more compiled instructions compared
to a single call to memcpy()?
Good compilers can inline memcpy and should produce more efficient code
for the target architecture, which can be faster than manually written.
On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1
while the above code requires 5 operations.
I guess, though, that some enforced alignment could help produce
slightly more efficient code on some architectures (most notably sparc,
which really doesn't like to deal with unaligned words).
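
For instance (a sketch only; git itself keeps a plain unsigned char[20]),
wrapping the hash bytes in a union is one way to enforce word alignment:

/* The union is aligned for its most strictly aligned member, so the 20
 * hash bytes are guaranteed to start on a word boundary and can safely
 * be copied word-by-word even on alignment-picky CPUs such as sparc. */
union aligned_sha1 {
    unsigned char bytes[20];
    unsigned int words[5];
};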

Mike
Dmitry Potapov
2009-05-01 10:46:55 UTC
Permalink
Post by Mike Hommey
Post by Dmitry Potapov
Good compilers can inline memcpy and should produce more efficient code
for the target architecture, which can be faster than manually written.
On x86_64, memcpy() requires only 3 load/store operations to copy SHA-1
while the above code requires 5 operations.
I guess, though, that some enforced alignment could help produce
slightly more efficient code on some architectures (most notably sparc,
which really doesn't like to deal with unaligned words).
Agreed. Enforcing good alignment may be useful. My point was that avoiding
memcpy with modern compilers is rather pointless or even harmful, because
the compiler knows more about the target architecture than the author of
the code.

Dmitry
Shawn O. Pearce
2009-04-30 18:43:19 UTC
Permalink
Post by Michael Witten
Post by Jakub Narebski
I hope that the JGit developers can
tell us whether using a higher-level language affects performance, how
much, and which features of the higher-level language cause the decrease
in performance.
Java is definitely higher than C, but you can do some pretty low-level
operations on bits and bytes and the like, not to mention the presence
of a JIT.
But it's still costly compared to C.
Post by Michael Witten
My point: I don't think that Java can tell us anything special in this regard.
Sure it can.

Peff I think made a good point here, that we rely on a lot of small
tweaks in the C git code to get *really* good performance. 5% here,
10% there, and suddenly you are 60% faster than you were before.
Nico, Linus, Junio, they have all spent some time over the past
3 or 4 years trying to tune various parts of Git to just flat out
run fast.

Higher level languages hide enough of the machine that we can't
make all of these optimizations.

JGit struggles with not having mmap(); even when you do use Java NIO's
MappedByteBuffer, we still have to copy to a temporary byte[] in
order to do any real processing. C Git avoids that copy. Sure,
other higher level languages may offer a better mmap facility,
but they also tend to offer garbage collection and most try to tie
the mmap management into the GC "for safety and ease of use".
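
For contrast, a minimal sketch of the zero-copy access C Git gets from
mmap() (error handling trimmed; not git's actual code, which adds pack
window management on top):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

const unsigned char *map_pack(const char *path, size_t *len)
{
    struct stat st;
    void *map;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *len = st.st_size;
    map = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);   /* the mapping stays valid after close() */
    return map == MAP_FAILED ? NULL : map;
}

The mapped bytes are read in place; there is no intermediate buffer to
allocate, fill, or garbage-collect.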

JGit struggles with not having unsigned types in Java. There are
many locations in JGit where we really need "unsigned int32_t" or
"unsigned long" (largest machine word available) or "unsigned char"
but these types just don't exist in Java. Converting a byte up to
an int just to treat it as an unsigned requires an extra " & 0xFF"
operation to remove the sign extension.

JGit struggles with not having an efficient way to represent a SHA-1.
C can just say "unsigned char[20]" and have it inline into the
container's memory allocation. A byte[20] in Java will cost an
*additional* 16 bytes of memory, and be slower to access because
the bytes themselves are in a different area of memory from the
container object. We try to work around it by converting from a
byte[20] to 5 ints, but that costs us machine instructions.
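
As a rough sketch of what the C side gets for free (illustrative field
layout, not git's actual struct definitions):

/* The 20 hash bytes live inside the containing struct itself, so one
 * allocation covers both and the bytes sit next to the rest of the
 * object in memory. */
struct obj_entry {
    unsigned int type;
    unsigned int flags;
    unsigned char sha1[20];   /* embedded inline, no separate object */
};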

C Git takes for granted that memcpy(a, b, 20) is dirt cheap when
doing a copy from an inflated tree into a struct object. JGit has
to pay a huge penalty to copy that 20 byte region out into 5 ints,
because later on, those 5 ints are cheaper.

Other higher level languages also lack the ability to mark a
type unsigned, or face similar penalties when storing a 20-byte
binary region.

Native Java collection types have been a snare for us in JGit.
We've used java.util.* types when they seem to be handy and already
solve the data structure problem at hand, but they tend to perform
a lot worse than writing a specialized data structure.

For example, we have ObjectIdSubclassMap for what should be
Map<ObjectId,Object>. Only it requires that the Object type you
use as the "value" entry in the map extend from ObjectId, as the
instance serves as both key *and* value. But it screams when
compared to HashMap<ObjectId,Object>. (For those who don't know,
ObjectId is JGit's "unsigned char[20]" for a SHA-1.)
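
The C-side analogue is an intrusive table where the entry itself carries
its key, so no boxing and no separate key object is needed. A sketch
(purely illustrative: fixed size, no growth, no deletion):

#include <string.h>

struct object_rec {
    unsigned char sha1[20];
    /* ... payload fields ... */
};

#define TABLE_SIZE 1024   /* power of two; a real table would grow */
static struct object_rec *table[TABLE_SIZE];

static unsigned int hash_sha1(const unsigned char *sha1)
{
    unsigned int h;
    memcpy(&h, sha1, sizeof(h));   /* leading hash bytes are well mixed */
    return h & (TABLE_SIZE - 1);
}

struct object_rec *lookup_or_insert(struct object_rec *obj)
{
    unsigned int i = hash_sha1(obj->sha1);

    while (table[i]) {
        if (!memcmp(table[i]->sha1, obj->sha1, 20))
            return table[i];              /* already present */
        i = (i + 1) & (TABLE_SIZE - 1);   /* linear probing */
    }
    table[i] = obj;
    return obj;
}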

Just a day or so ago I wrote LongMap, a faster HashMap<Long,Object>,
for hashing objects by indexes in a pack file. Again, the boxing
costs in Java to convert a "long" (largest integer type) into an
Object that the standard HashMap type would accept was rather high.

Right now, JGit is still paying dearly when it comes to ripping
apart a commit or a tree object to follow the object links. Or when
invoking inflate(). We spend a lot more time doing this sort of work
than C git does, and yet we're trying to be as close to the machine
as we can go by using byte[] whenever possible, by avoiding copying
whenever possible, and avoiding memory allocation when possible.

Notably, `rev-list --objects --all` takes about 2x as long in
JGit as it does in C Git on a project like the linux kernel, and
`index-pack` for the full ~270M pack file takes about 2x as long.

Both parts of JGit are about as good as I know how to make them,
but we're really at the mercy of the JIT, and changes in the JIT
can cause us to perform worse (or better) than before. Unlike in
C Git where Linus has done assembler dumps of sections of code and
tried to determine better approaches. :-)

So. Yes, it's practical to build Git in a higher level language, but
you just can't get the same performance, or tight memory utilization,
that C Git gets. That's what that higher level language abstraction
costs you. But JGit performs reasonably well; well enough that
we use it internally at Google as a git server.
--
Shawn.
Jeff King
2009-04-30 14:22:44 UTC
Permalink
Post by Jakub Narebski
This is I think quite obvious. Accessing memory is faster than
accessing disk, which in turn is faster than accessing the network. So if
commit and (change)log do not require access to a server via the network,
they are so much faster.
Like all generalizations, this is only mostly true. Fast network servers
with big caches can outperform disks for some loads. And in many cases
with a VCS, you are performing a query that might look over the whole
dataset, but return only a small fraction of data.

So I wouldn't rule out the possibility of a pleasant VCS experience on a
network-optimized system backed by beefy servers on a local network. I
have never used perforce, but I get the impression that it is more
optimized for such a situation. Git is really optimized for open source
projects: slow servers across high-latency, low-bandwidth links.
Post by Jakub Narebski
es> Nah, probably not. Lots of people have written fast software in
es> C#, Java or Python.
es>
es> And lots of people have written really slow software in
es> traditional native languages like C/C++. [...]
Well, I guess that access to low-level optimization techniques like
mmap is important for performance. But here I am guessing and
speculating like Eric did; well, I am asking on a proper forum ;-)
Certainly there's algorithmic fastness that you can do in any language,
and I think git does well at that. Most operations are independent of
the total size of history (e.g., branching is O(1) and commit is
O(changed files), diff looks only at endpoints, etc). Operations which
deal only with history are independent of the size of the tree (e.g.,
"git log" and the history graph in gitk look only at commits, never at
the tree). And when we do have to look at the tree, we can drastically
reduce our I/O by comparing hashes instead of full files.
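
To make the O(1) branching claim concrete, creating a branch amounts to
writing one small file naming a commit (a sketch; real git goes through
lock files, validates the ref name, and may use packed refs):

#include <stdio.h>

int create_branch(const char *repo, const char *name, const char *hex_sha1)
{
    char path[4096];
    FILE *f;

    snprintf(path, sizeof(path), "%s/.git/refs/heads/%s", repo, name);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%s\n", hex_sha1);   /* 40 hex digits plus newline */
    return fclose(f);
}

The cost is independent of how many files or commits the repository
contains.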

But there are also some micro-optimizations that make a big difference
in practice. Some of them can be done in any language. For example, the
packfiles are ordered by type so that all of the commits have a nice I/O
pattern when doing a history walk.

Some other micro-optimizations are really language-specific, though. I
don't recall the numbers, but I think Linus got measurable speedups from
cutting the memory footprint of the object and commit structs (which
gave better cache usage patterns). Git uses some variable-length fields
inside structs instead of a pointer to a separate allocated string to
give better memory access patterns. Tricks like that won't give the
order-of-magnitude speedups that algorithmic optimizations will, but 10%
here and 20% there means you can get a system that is a few times faster
than the competition. For an operation that takes 0.1s anyway, that
doesn't matter. But with current hardware and current project size, you
are often talking about dropping a 3-second operation down to 1s or
0.5s, which just feels a lot snappier.
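
A sketch of that variable-length-field trick, using a C99 flexible array
member (git uses a similar FLEX_ARRAY idiom in some of its structs): the
name lives in the same allocation as the struct, so reading it never
chases a second pointer.

#include <stdlib.h>
#include <string.h>

struct entry {
    unsigned int namelen;
    char name[];   /* flexible array member, allocated with the struct */
};

static struct entry *make_entry(const char *name)
{
    size_t len = strlen(name);
    struct entry *e = malloc(sizeof(*e) + len + 1);

    if (!e)
        return NULL;
    e->namelen = (unsigned int)len;
    memcpy(e->name, name, len + 1);
    return e;
}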

And finally, git tries to do as little work as possible when starting a
new command, and streams output as soon as possible. Which means that in
a command-line setting, git can _feel_ snappier, because it starts
output immediately. Higher-level languages can often have a much longer
startup time, especially if they have a lot of modules to load. E.g.,:

# does enough work to easily fill your pager
$ time git log -100 >/dev/null
real 0m0.011s
user 0m0.008s
sys 0m0.004s

# does nothing, just starts perl and aborts with usage
$ time git send-email >/dev/null
real 0m0.150s
user 0m0.104s
sys 0m0.048s

Both are warm-cache times. C git gives you output almost instantaneously,
whereas just loading perl with a modest set of modules introduces a
noticeable pause before any work is actually done. In the grand scheme
of things, .1s probably isn't relevant, but I think avoiding that delay
adds to the perception of git as fast.
Post by Jakub Narebski
es> Or maybe Git's shortcut for handling renames is faster than doing
es> them more correctly[2] like Bazaar does.
es>
es> [2] "Renaming is the killer app of distributed version control"
es> http://www.markshuttleworth.com/archives/123
Errr... what?
Yeah, I had the same thought. Git's rename handling is _much_ more
computationally intensive than other systems. In fact, it is one of only
two places where I have ever wanted git to be any faster (the other
being repacking of large repos).
Post by Jakub Narebski
Eight: Git seems fast.
======================
Here I mean concentrating on low _latency_, which means that when git
I do think this helps (see above), but I wanted to note that it is more
than just "streaming"; I think other systems stream, as well. For
example, I am pretty sure that "cvs log" streamed (but thank god it has
been so long since I touched CVS that I can't really remember), but it
_still_ felt awfully slow.

So it is also about keeping start times low and having your data in a
format that is ready to use.

-Peff
Linus Torvalds
2009-05-01 18:43:49 UTC
Permalink
Post by Jeff King
Like all generalizations, this is only mostly true. Fast network servers
with big caches can outperform disks for some loads.
That's _very_ few loads.

It doesn't matter how good a server you have, network filesystems
invariably suck.

Why? It's not that the network or the server sucks - you can easily find
beefy NAS setups that have big raids etc and are much faster than most
local disks.

And they _still_ suck.

Simple reason: caching. It's a lot easier to cache local filesystems. Even
modern networked filesystems (ie NFSv4) that do a pretty good job on a
file-per-file basis with delegations etc still tend to suck
horribly at metadata.

In contrast, a workstation with local filesystems and enough memory to
cache it well will just be a lot nicer.
Post by Jeff King
So I wouldn't rule out the possibility of a pleasant VCS experience on a
network-optimized system backed by beefy servers on a local network.
Hey, you can always throw resources at it.
Post by Jeff King
I have never used perforce, but I get the impression that it is more
optimized for such a situation.
I doubt it. I suspect git will outperform pretty much anything else in
that kind of situation too.

One thing that git does - and some other VCS's avoid - is to actually
stat() the whole working tree in order to not need special per-file "I use
this file" locking semantics. That can in theory make git slower over a
network filesystem than such (very broken) alternatives.

If your VCS requires that you mark all files for editing somehow (ie you
can't just use your favourite editor or scripting to modify files, but
have to use "p4 edit" to say that you're going to write to the file, and
the file is otherwise read-only), then such a VCS can - by being annoying
and in your way - do some things faster than git can.

And yes, perforce does that (the "p4 edit" command is real, and exists).

And yes, in theory that can probably mean that perforce doesn't care so
much about the metadata caching problem on network filesystems - because
p4 will maintain some file of its own that contains the metadata.

But I suspect that the git "async stat" ("core.preloadindex") thing means
that git will kick p4 *ss even on that benchmark, and be a whole lot more
pleasant to use. Even on networked filesystems.
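
A rough sketch of the idea behind that async stat (illustrative only, not
git's implementation): spread the lstat() calls over a few threads so they
overlap instead of serializing on filesystem round-trips.

#include <pthread.h>
#include <sys/stat.h>

#define MAX_THREADS 8

struct stat_job {
    const char **paths;
    int begin, end;
};

static void *stat_worker(void *arg)
{
    struct stat_job *job = arg;
    struct stat st;
    int i;

    for (i = job->begin; i < job->end; i++)
        lstat(job->paths[i], &st);   /* result dropped here; git would
                                        compare it against the index */
    return NULL;
}

void preload_stat(const char **paths, int n, int nr_threads)
{
    pthread_t tid[MAX_THREADS];
    struct stat_job job[MAX_THREADS];
    int t;

    if (nr_threads > MAX_THREADS)
        nr_threads = MAX_THREADS;
    for (t = 0; t < nr_threads; t++) {
        job[t].paths = paths;
        job[t].begin = n * t / nr_threads;
        job[t].end = n * (t + 1) / nr_threads;
        pthread_create(&tid[t], NULL, stat_worker, &job[t]);
    }
    for (t = 0; t < nr_threads; t++)
        pthread_join(tid[t], NULL);
}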

Linus
Jeff King
2009-05-01 19:08:54 UTC
Permalink
Post by Linus Torvalds
Post by Jeff King
Like all generalizations, this is only mostly true. Fast network servers
with big caches can outperform disks for some loads.
[...]
In contrast, a workstation with local filesystems and enough memory to
cache it well will just be a lot nicer.
[...]
Post by Jeff King
I have never used perforce, but I get the impression that it is more
optimized for such a situation.
I doubt it. I suspect git will outperform pretty much anything else in
that kind of situation too.
Thanks for the analysis; what you said makes sense to me. However, there
is at least one case of somebody complaining that git doesn't scale as
well as perforce for their load:

http://gandolf.homelinux.org/blog/index.php?id=50

Part of his issue is with git-p4 sucking, which it probably does. But
part of it sounds like he has a gigantic workload (the description of
which sounds silly to me, but I respect the fact that he is probably
describing standard practice among some companies), and that workload is
just a little too gigantic for the workstations to handle. I.e., by
throwing resources at the central server they can avoid throwing as many
at each workstation.

But there are so few details it's hard to say whether he's doing
something else wrong or suboptimally. He does mention Windows, which
IIRC has horrific stat performance.

-Peff
d***@lang.hm
2009-05-01 19:13:50 UTC
Permalink
Post by Jeff King
Post by Linus Torvalds
Post by Jeff King
Like all generalizations, this is only mostly true. Fast network servers
with big caches can outperform disks for some loads.
[...]
In contrast, a workstation with local filesystems and enough memory to
cache it well will just be a lot nicer.
[...]
Post by Jeff King
I have never used perforce, but I get the impression that it is more
optimized for such a situation.
I doubt it. I suspect git will outperform pretty much anything else in
that kind of situation too.
Thanks for the analysis; what you said makes sense to me. However, there
is at least one case of somebody complaining that git doesn't scale as
http://gandolf.homelinux.org/blog/index.php?id=50
Part of his issue is with git-p4 sucking, which it probably does. But
part of it sounds like he has a gigantic workload (the description of
which sounds silly to me, but I respect the fact that he is probably
describing standard practice among some companies), and that workload is
just a little too gigantic for the workstations to handle. I.e., by
throwing resources at the central server they can avoid throwing as many
at each workstation.
But there are so few details it's hard to say whether he's doing
something else wrong or suboptimally. He does mention Windows, which
IIRC has horrific stat performance.
the key thing for his problem is the support for large binary objects.
there was discussion here a few weeks ago about ways to handle such things
without trying to pull them into packs. I suspect that solving those sorts
of issues would go a long way towards closing the gap on this workload.

there may be issues in doing a clone for repositories that large, I don't
remember exactly what happens when you have something larger than 4G to
send in a clone.

David Lang
Nicolas Pitre
2009-05-01 19:32:18 UTC
Permalink
the key thing for his problem is the support for large binary objects. there
was discussion here a few weeks ago about ways to handle such things without
trying to pull them into packs. I suspect that solving those sorts of issues
would go a long way towards closing the gap on this workload.
there may be issues in doing a clone for repositories that large, I don't
remember exactly what happens when you have something larger than 4G to send
in a clone.
If you have files larger than 4G then you definitely need a 64-bit
machine with plenty of RAM for git to at least be able to cope at the
moment.

It should be easy to add a config option to determine how big a "big"
file is, store those big files directly in a pack of their own instead
of as loose objects (for easy pack reuse during a further repack), and
never attempt to deltify them, etc. At that point git will handle
big files just fine even on a 32-bit machine, but it won't do more than
copy them in and out, and possibly deflate/inflate them while at
it, but nothing fancier.
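
A minimal sketch of the decision being described; the threshold variable
and the helper names are hypothetical, not existing git code:

/* Hypothetical sketch: route blobs above a configured size threshold
 * into their own pack, stored whole, and never delta-search them. */
static unsigned long big_file_threshold;   /* 0 = feature disabled */

static int is_big_file(unsigned long size)
{
    return big_file_threshold && size >= big_file_threshold;
}

/* At object-write time (pseudo-logic):
 *
 *     if (is_big_file(size)) {
 *         write_whole_to_dedicated_pack(obj);   // hypothetical helper
 *         mark_never_deltify(obj);              // hypothetical helper
 *     }
 */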


Nicolas
Daniel Barkalow
2009-05-01 21:17:31 UTC
Permalink
Post by Jeff King
Post by Linus Torvalds
Post by Jeff King
Like all generalizations, this is only mostly true. Fast network servers
with big caches can outperform disks for some loads.
[...]
In contrast, a workstation with local filesystems and enough memory to
cache it well will just be a lot nicer.
[...]
Post by Jeff King
I have never used perforce, but I get the impression that it is more
optimized for such a situation.
I doubt it. I suspect git will outperform pretty much anything else in
that kind of situation too.
Thanks for the analysis; what you said makes sense to me. However, there
is at least one case of somebody complaining that git doesn't scale as
http://gandolf.homelinux.org/blog/index.php?id=50
Part of his issue is with git-p4 sucking, which it probably does. But
part of it sounds like he has a gigantic workload (the description of
which sounds silly to me, but I respect the fact that he is probably
describing standard practice among some companies), and that workload is
just a little too gigantic for the workstations to handle. I.e., by
throwing resources at the central server they can avoid throwing as many
at each workstation.
I think his problem is that he's trying to replace his p4 repository with
a git repository, which is a bit like trying to download github, rather
than a project from github. Perforce is good at dealing with the case
where people check in a vast quantity of junk that you don't check out.

That is, you can back up your workstation into Perforce, and it won't
affect anyone's performance if you use a path that's not in the range that
anybody else checks out. And people actually do that. And Perforce doesn't
make a distinction between different projects and different branches of
the same project and different subdirectories of a branch of the same
project, so it's impossible to tease apart except by company policy.

Git doesn't scale in that it can't do the extremely narrow checkouts you
need if your repository root directory contains thousands of completely
unrelated projects with each branch of each project getting a
subdirectory. On the other hand, it does a great job when the data is
already partitioned into useful repositories.

-Daniel
*This .sig left intentionally blank*
Linus Torvalds
2009-05-01 21:37:28 UTC
Permalink
Post by Jeff King
Thanks for the analysis; what you said makes sense to me. However, there
is at least one case of somebody complaining that git doesn't scale as
So we definitely do have scaling issues, there's no question about that. I
just don't think they are about enterprise network servers vs the more
workstation-oriented OSS world..

I think they're likely about the whole git mentality of looking at the big
picture, and then getting swamped by just how _huge_ that picture can be
if somebody just put the whole world in a single repository..

With perforce, repository maintenance is such a central issue that the
whole p4 mentality seems to _encourage_ everybody to put everything into
basically one single p4 repository. And afaik, p4 basically works mostly
like CVS, ie it really ends up being pretty much oriented to a "one file
at a time" model.

Which is nice in that you can have a million files, and then only check
out a few of them - you'll never even _see_ the impact of the other
999,995 files.

And git obviously doesn't have that kind of model at all. Git
fundamentally never really looks at less than the whole repo. Even if you
limit things a bit (ie check out just a portion, or have the history go
back just a bit), git ends up still always caring about the whole thing,
and carrying the knowledge around.

So git scales really badly if you force it to look at everything as one
_huge_ repository. I don't think that part is really fixable, although we
can probably improve on it.

And yes, then there's the "big file" issues. I really don't know what to
do about huge files. We suck at them, I know. There are work-arounds (like
not deltaing big objects at all), but they aren't necessarily that great
either.

I bet we could probably improve git large-file behavior for many common
cases. Do we have a good test-case of some particular suckiness that is
actually relevant enough that people might decide to look at it? (And by
"people", I do mean myself too - but I'd need to be somewhat motivated by
it: a usage case that we suck at and that is available and relevant.)

Linus
d***@lang.hm
2009-05-01 22:11:03 UTC
Permalink
Post by Linus Torvalds
I bet we could probably improve git large-file behavior for many common
cases. Do we have a good test-case of some particular suckiness that is
actually relevant enough that people might decide to look at it (and by
"people", I do mean myself too - but I'd need to be somewhat motivated by
it. A usage case that we suck at and that is available and relevant).
I think that a sane use case that would make sense to people is based on
the 'game developer' example

they have source code, but they also have large images (and sometimes
movie clips), where a particular release of the game needs a particular
set of the images. during development you may change images frequently
(although most changesets probably only change a few, if any of the
images)

the images can be large (movies can be very large), and since they are
already compressed they don't diff or compress well.

David Lang
Nicolas Pitre
2009-04-30 18:56:23 UTC
Permalink
Post by Jakub Narebski
es> Two: Maybe Git is fast because Linus Torvalds is so smart.
[non answer; the details are important]
I think Linus is certainly responsible for a big part of Git's speed.
He came up with the basic data structure used by git, which has a lot to
do with that. Also, he designed Git specifically to fulfill a need for
which none of the alternatives were fast enough. Hence Git was designed
from the ground up with speed as one of the primary design goals, such
as being able to create multiple commits per second instead of the other
way around (several seconds per commit). And yes, Linus is usually smart
enough with the proper mindset to achieve such goals.
Post by Jakub Narebski
es> Three: Maybe Git is fast because it's written in C instead of one
es> of those newfangled higher-level languages.
es>
es> Nah, probably not. Lots of people have written fast software in
es> C#, Java or Python.
es>
es> And lots of people have written really slow software in
es> traditional native languages like C/C++. [...]
Well, I guess that access to low-level optimization techniques like
mmap is important for performance. But here I am guessing and
speculating like Eric did; well, I am asking on a proper forum ;-)
We have some anecdotal evidence supporting this possibility (which
Eric dismisses), namely the fact that pure-Python Bazaar is the slowest of
the three most common open source DVCSs (Git, Mercurial, Bazaar) and the
fact that parts of Mercurial were written in C for better performance.
We can also compare implementations of Git in other, higher-level
languages with the reference implementation in C (and shell scripts, and
Perl ;-)). For example the most complete, though still not fully
complete, Java implementation: JGit. I hope that the JGit developers can
tell us whether using a higher-level language affects performance, how
much, and which features of the higher-level language cause the decrease
in performance. Of course we have to take into account the
possibility that JGit simply isn't as well optimized because of less
manpower.
One of the main JGit developers is Shawn Pearce. If you look at Shawn's
contributions to C git, they are mostly related to performance
issues. Amongst other things, he is the author of git-fast-import, he
contributed the pack access windowing code, and he was also involved in
the initial design of pack v4. Hence Shawn is a smart guy who certainly
knows a thing or two about performance optimizations. Yet he
reported on this list that his efforts to make JGit faster were no longer
very successful, most probably due to the language overhead.
Post by Jakub Narebski
es> Four: Maybe Git is fast because being fast is the primary goal for
es> Git.
[non answer; the details are important]
Still, this is actually true (see about Linus above). Without such a
goal, you quickly lose sight of performance regressions.
Post by Jakub Narebski
es> Maybe Git is fast because every time they faced one of these "buy
es> vs. build" choices, they decided to just write it themselves.
I don't think so. Rather the opposite is true. Git uses libcurl for
HTTP transport. Git uses zlib for compression. Git uses SHA-1 from
OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for
(binary) deltaifying, for diffs and for merges.
Well, I think he's right on this point as well. libcurl is not so
relevant since it is rarely the bottleneck (the network bandwidth itself
usually is). zlib is already as fast as it can be as multiple attempts
to make it faster didn't succeed. Git already carries its own version
of SHA-1 code for ARM and PPC because the alternatives were slower.
The fact that libxdiff was made internal is indeed to have a better
impedance matching with the core code, otherwise it could have remained
fully external just like zlib. And the binary delta code is not
libxdiff anymore but a much smaller, straightforward, optimized-to-death
version that achieves speed over versatility (no need to be versatile
when strictly dealing with Git's needs only).
Post by Jakub Narebski
es> Seven: Maybe Git isn't really that fast.
es>
es> If there is one thing I've learned about version control it's that
es> everybody's situation is different. It is quite likely that Git
es> is a lot faster for some scenarios than it is for others.
es>
es> How does Git handle really large trees? Git was designed primary
es> to support the efforts of the Linux kernel developers. A lot of
es> people think the Linux kernel is a large tree, but it's really
es> not. Many enterprise configuration management repositories are
es> FAR bigger than the Linux kernel.
c.f. "Why Perforce is more scalable than Git" by Steve Hanov
http://gandolf.homelinux.org/blog/index.php?id=50
I don't really know about this.
Git certainly sucks big time with large files.

Git also sucks to a lesser extent (but still) with very large
repositories.

But large trees? I don't think Git is worse than anything out there
with a large tree of average size files.

Yet, this point is misleading because when people give Git the
reputation of being faster, it is certainly from comparing operations
performed on the same source tree. Who cares about scenarios
for which the tool was not designed? Those "enterprise configuration
management repositories" are not what Git was designed for indeed, but
neither were Mercurial, Bazaar, or any other contender to which Git is
usually compared.


Nicolas
Alex Riesen
2009-04-30 19:16:59 UTC
Permalink
Post by Nicolas Pitre
Yet, this point is misleading because when people give Git the
reputation of being faster, it is certainly from comparing operations
performed on the same source tree. Who cares about scenarios for which
the tool was not designed? Those "enterprise configuration management
repositories" are not what Git was designed for indeed, but
Especially when no sane developer will put in his repository the toolchain
(pre-compiled. For all supported platforms!), all the supporting tools
(like grep, find, etc. Pre-compiled _and_ source), the in-house framework
(pre-compiled and source, again), firmware (pre-compiled and put in the
repository weekly), and operating system code (pre-compiled, with
firmware-specific drivers, updated, you guessed it, weekly), and well,
there is the project itself (Java or C++, and documentation in .doc and
.xls)...
Now, what kind of self-hating idiot will design a system for that kind of
abuse?
(And if someone says that's not true in most enterprise f$%cking
configurations, he definitely hasn't had to live through a big enough
number of them.)
Andreas Ericsson
2009-05-04 08:01:57 UTC
Permalink
Post by Nicolas Pitre
Yet, this point is misleading because when people give Git the
reputation of being faster, it is certainly from comparing operations
performed on the same source tree. Who cares about scenarios
for which the tool was not designed? Those "enterprise configuration
management repositories" are not what Git was designed for indeed, but
Especially when no sane developer will put in his repository the toolchain
(pre-compiled. For all supported platforms!), all the supporting tools
(like grep,
find, etc. Pre-compiled _and_ source), the in-house framework (pre-compiled
and source, again), firmware (pre-compiled and put in the repository weekly),
and operating system code (pre-compiled, with firmware-specific drivers,
updated, you guessed it, weekly), and well, there is the project itself (Java or
C++, and documentation in .doc and .xls)...
Well, git could actually handle that just fine if the toolchain was in a
submodule or even in a separate repository that developers never had to
worry about. Then you'd design a little tool that said "re-create build 8149"
and it would pull the tools used to do that, and the code and the artwork,
and then set to work. It'd be an overnight (or over-weekend) job, but no
man-hours would be spent on it. That's how I'd do it anyways, probably
with the "build" repository as a master repo with "tools", "code" and
"artwork" as submodules to it.
Now, what kind of self-hating idiot will design a system for that kind of abuse?
No one, naturally, but one might design a system where each folder
in the repository root is considered a repository in its own right,
and then get that more or less for free.

The problem with git for such scenarios is that you have to think
*before* creating the repository, or play silly buggers when importing,
which makes it hard to see how the pieces fit together afterwards.

A tool that could take a repository from a different scm, create a
master repository and several submodule repositories from it would
probably solve many of the issues gaming companies have if they want
to switch to using git. Not least because it would open their eyes
to how that sort of separation can be done in git, and why it's
useful. The binary repos can then turn off delta-compression (and
zlib compression) for all their blobs using a .gitattributes file,
and things would be several orders of magnitude faster.
--
Andreas Ericsson ***@op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

Register now for Nordic Meet on Nagios, June 3-4 in Stockholm
http://nordicmeetonnagios.op5.org/

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
Jakub Narebski
2009-04-30 19:33:59 UTC
Permalink
Post by Jakub Narebski
es> Maybe Git is fast because every time they faced one of these "buy
es> vs. build" choices, they decided to just write it themselves.
I don't think so. Rather the opposite is true. Git uses libcurl for
HTTP transport. Git uses zlib for compression. Git uses SHA-1 from
OpenSSL or from Mozilla. Git uses (modified, internal) LibXDiff for
(binary) deltaifying, for diffs and for merges.
Well, I think he's right on this point as well. [...]
The fact that libxdiff was made internal is indeed to have a better
impedance matching with the core code, otherwise it could have remained
fully external just like zlib. And the binary delta code is not
libxdiff anymore but a much smaller, straightforward, optimized-to-death
version that achieves speed over versatility (no need to be versatile
when strictly dealing with Git's needs only).
Hrmmmm... I had thought that LibXDiff was internalized mainly for ease
of modification, as my impression is that LibXDiff is a single-developer
effort, while Git has had many contributors from the beginning (and
submodules didn't exist then). If I remember correctly, the rcsmerge/diff3
algorithm was first added in git's internalized xdiff... was it ever added
to LibXDiff proper?

BTW, I wonder what the other F/OSS version control systems (Bazaar,
Mercurial, Darcs, Monotone) use for binary deltas, for a diff engine,
and for a textual three-way merge engine. Hmmm... perhaps I'll ask
on #revctrl.
--
Jakub Narebski
Poland