Discussion:
Another bench on gitweb
Bruno Cesar Ribas
2008-02-10 03:09:19 UTC
Hello,

I made another simple benchmark on gitweb, testing the time spent in git-for-each-ref.

Using my 1000 projects I ran:
8<----------------
#!/bin/bash
# For each test repository, append the committer of the newest commit
# as a "lastref" line at the end of the repository's config file.
PEGAR_ref() {
	PROJ=projeto$1.git
	cd "$PROJ" || return
	printf "\tlastref = %s\n" \
		"$(git for-each-ref --sort=-committerdate --count=1 \
			--format='%(committer)')" >> config
	cd - > /dev/null
}
cd "$HOME/scm"
for ((i = 1; i <= 1000; i++)); do PEGAR_ref $i & done
wait
8<----------------

And in "git_get_last_activity", instead of running git-for-each-ref, I asked
for gitweb.lastref from the config.
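(For clarity, the read side would look roughly like this; it assumes, as in the
script above, that the appended "lastref" line lands under a final [gitweb]
section of the repository config:)
8<----------------
# Read the cached value back for one of the test repositories; this is
# essentially what the patched git_get_last_activity asks git for.
git config --file "$HOME/scm/projeto1.git/config" gitweb.lastref
8<----------------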

Here are the results. "dd" means: dd if=/dev/zero of=$HOME/dd/$i bs=1M count=400000

With two dd processes running to generate disk IO:

no projects_list    projects_list
7m56.55s            6m11.95s         cached last change, using gitweb.lastref
16m30.69s           15m10.74s        default gitweb, using FS's owner
16m07.40s           15m24.34s        patched to get gitweb.owner


Now the results for the same 1000 projects on an idle machine (no dd running
to generate IO):

no projects_list    projects_list
0m26.79s            0m38.70s         cached last change, using gitweb.lastref
1m19.08s            1m09.55s         default gitweb, using FS's owner
1m17.58s            1m09.55s         patched to get gitweb.owner


I found these results very interesting, so instead of trying to think of a
new way to store gitweb config, we should think of a way to cache this
information.
--
Bruno Ribas - ***@c3sl.ufpr.br
http://web.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br
Jakub Narebski
2008-02-12 00:44:23 UTC
Post by Bruno Cesar Ribas
I made another simple benchmark on gitweb, testing the time spent in git-for-each-ref.
8<----------------
#!/bin/bash
# For each test repository, append the committer of the newest commit
# as a "lastref" line at the end of the repository's config file.
PEGAR_ref() {
	PROJ=projeto$1.git
	cd "$PROJ" || return
	printf "\tlastref = %s\n" \
		"$(git for-each-ref --sort=-committerdate --count=1 \
			--format='%(committer)')" >> config
	cd - > /dev/null
}
cd "$HOME/scm"
for ((i = 1; i <= 1000; i++)); do PEGAR_ref $i & done
wait
8<----------------
Could you please not mix English and your native language (Portuguese?) in
the examples you show? Mixing two languages in one identifier name (unless
"ref" is a word in Brazilian Portuguese too) is especially bad form... TIA.

Besides, what I'm more interested in is a script used to generate
those 1000 projects...
Post by Bruno Cesar Ribas
And in "git_get_last_activity", instead of running git-for-each-ref, I asked
for gitweb.lastref from the config.
"dd" means: dd if=/dev/zero of=$HOME/dd/$i bs=1M count=400000
NO projects_list projects_list
7m56s55 6m11s95 cached last change, using gitweb.lastref
16m30s69 15m10s74 default gitweb, using FS's owner
16m07s40 15m24s34 patched to get gitweb.owner
Now results for a 1000projects on an idle machine. (No dd running to
generate IO)
NO projects_list projects_list
0m26s79 0m38s70 cached last change, using gitweb.lastref
1m19s08 1m09s55 default gitweb, using FS's owner
1m17s58 1m09s55 patched to get gitweb.owner
Are those the results of running gitweb as a standalone script, or of your
script running git-for-each-ref?

Besides, I'd rather see results from running ApacheBench. On Linux it
usually comes installed with Apache, and it is invoked as 'ab'. Instead of
adding artificial background load, your tests could use concurrent requests,
and more than one request per page to get a better average.
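Something along these lines, for example (the URL and the numbers here are
just placeholders):
8<----------------
# 100 requests to the projects list page, 10 at a time
ab -n 100 -c 10 'http://localhost/cgi-bin/gitweb.cgi'
8<----------------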
Post by Bruno Cesar Ribas
I found these results very interesting, so instead of trying to think of a
new way to store gitweb config, we should think of a way to cache this
information.
Below are my thoughts about caching information for gitweb:

First, the basis of any optimisation is finding the bottlenecks.
I think it was posted here some time ago that the pages generating the most
load are the projects list and the feeds.

Kernel.org even runs a modified version of gitweb with some caching support;
cgit (a git web interface in C) also has caching support.


Because gitweb produces relative times in the output of the projects list
page and the project summary page, it is unfortunately not easy to simply
cache the HTML output: one would have to either give up relative times, or
rewrite times from relative to absolute, either on the server (in gitweb) or
on the client (in JavaScript). So perhaps it would be better to cache the
costly-to-obtain information instead, like the last-changed time for each
project.

Or we can, for example, assume (i.e. do this only if the appropriate gitweb
feature is set) that projects are bare repositories that get pushed to, and
that git-update-server-info is run on repository update (for example for the
HTTP transport), and stat $GIT_DIR/info/refs and/or
$GIT_DIR/objects/info/packs instead of running git-for-each-ref.
Of course the column would then be called something like "Last Update"
instead of "Last Change".

The "Last Update" information is especially easy because it can be
invalidated / update externally, by the update / post-receive hook,
outside gitweb. So gitweb doesn't need to implement some caching
invalidation mechanism for this.
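For example, a minimal post-receive hook could refresh the cached value
itself; this is only a sketch, reusing the gitweb.lastref key from Bruno's
benchmark, and the cache directory is hypothetical:
8<----------------
#!/bin/bash
# hooks/post-receive (sketch): update the cached last-activity value and
# drop any cached gitweb pages for this repository.
git config gitweb.lastref \
	"$(git for-each-ref --sort=-committerdate --count=1 --format='%(committer)')"
rm -rf "/var/cache/gitweb/$(basename "$(pwd)")"   # hypothetical cache location
8<----------------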

We can store the lastref / lastchange information in the repository config,
for example under a "gitweb.lastref" key. We can store it in a gitweb-wide
config, for example in a $projectroot/gitwebconfig file, under a
"gitweb.<project>.lastref" key. Or we can store it as a hash initializer in
some sourced Perl file read from gitweb_config.perl (this, I think, can be
done even now without touching the gitweb code at all); we could use
Data::Dumper to save such information.
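The gitweb-wide variant could even be maintained with plain git-config, since
the file format is the same. A sketch (the layout follows Bruno's benchmark,
and the file name is just the one suggested above):
8<----------------
projectroot=$HOME/scm
# Record the last activity of one project in a central file, under a
# per-project subsection key: [gitweb "projeto1.git"] lastref = ...
git config --file "$projectroot/gitwebconfig" \
	'gitweb.projeto1.git.lastref' \
	"$(git --git-dir="$projectroot/projeto1.git" for-each-ref \
		--sort=-committerdate --count=1 --format='%(committer)')"
8<----------------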

The possibilities are many.
--
Jakub Narebski
Poland
ShadeHawk on #git
Bruno Cesar Ribas
2008-02-13 00:45:28 UTC
Post by Jakub Narebski
Could you please not mix English and your native language (Portuguese?) in
the examples you show? Mixing two languages in one identifier name (unless
"ref" is a word in Brazilian Portuguese too) is especially bad form... TIA.
I agree... that's not good =( I'll make sure to send everything in English.
Post by Jakub Narebski
Besides, what I'm more interested in is a script used to generate
those 1000 projects...
So, like I said, I made a simple test: I cloned a very small project[1] and
replicated it, just generating different owners and descriptions.
Post by Jakub Narebski
Post by Bruno Cesar Ribas
no projects_list    projects_list
7m56.55s            6m11.95s         cached last change, using gitweb.lastref
16m30.69s           15m10.74s        default gitweb, using FS's owner
16m07.40s           15m24.34s        patched to get gitweb.owner
Are those the results of running gitweb as a standalone script, or of your
script running git-for-each-ref?
Running gitweb as a standalone script.
Post by Jakub Narebski
Besides, I'd rather see results from running ApacheBench. On Linux it
usually comes installed with Apache, and it is invoked as 'ab'. Instead of
adding artificial background load, your tests could use concurrent requests,
and more than one request per page to get a better average.
Hmmm, I see, but then we would be benchmarking with the filesystem cache
warm. That could be a good idea if the machine only runs git! I find it
interesting to run with all those dds to simulate something like my
environment, which is shared with all of our mirrors. I can even run a test
on that machine, but the results may be very different depending on the time
of day.

As soon as the machine I ran those tests on is available again, I'll run
them with Apache. If you have ideas about which tests to run, tell me =) so
we don't waste time once I get the machine.
Post by Jakub Narebski
Post by Bruno Cesar Ribas
I found these results very interesting, so instead of trying to think of a
new way to store gitweb config, we should think of a way to cache this
information.
First, the basis of any optimisation is finding the bottlenecks.
I think it was posted here some time ago that the pages generating the most
load are the projects list and the feeds.
Kernel.org even runs a modified version of gitweb with some caching
support; cgit (a git web interface in C) also has caching support.
Is this gitweb version for kernel.org available somewhere?
Post by Jakub Narebski
<snip>
The "Last Update" information is especially easy because it can be
invalidated / update externally, by the update / post-receive hook,
outside gitweb. So gitweb doesn't need to implement some caching
invalidation mechanism for this.
That's what I thought.
Post by Jakub Narebski
We can store lastref / lastchange information in repository config, as
for example "gitweb.lastref" key. We can store it in gitweb wide
config, for example in $projectroot/gitwebconfig file, as for example
"gitweb.<project>.lastref" key. Or we can store it as hash initializer
in some sourced Perl file, read from gitweb_config.perl (this I think
can be done even now without touching gitweb code at all); we can use
Data::Dumper to save such information.
The possibilities are many.
That's right.

Caching lastref in $projectroot/gitwebconfig might be a good idea. I think
that caching it in $GIT_DIR/config is somewhat ugly, because then we have a
script modifying that file.

And having this $projectroot/gitwebconfig with lastref cached could also act
as a projects list, because it already lists all the directories from which
we should read gitweb configuration, like gitweb.description, gitweb.url,
gitweb.owner (soon?!) and others that will appear.
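(Then listing that one file would be enough to enumerate all projects and
their per-project keys without scanning directories; roughly:)
8<----------------
# Hypothetical listing of such a central file:
git config --file "$HOME/scm/gitwebconfig" --get-regexp '^gitweb\.'
# gitweb.projeto1.git.description ...
# gitweb.projeto1.git.owner ...
# gitweb.projeto1.git.lastref ...
# gitweb.projeto2.git.description ...
# [and so on for every project]
8<----------------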
Post by Jakub Narebski
--
Jakub Narebski
Poland
ShadeHawk on #git
--
Bruno Ribas - ***@c3sl.ufpr.br
http://web.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br
Bruno Cesar Ribas
2008-02-13 00:50:41 UTC
Post by Bruno Cesar Ribas
So, like I said, I made a simple test: I cloned a very small project[1] and
replicated it, just generating different owners and descriptions.
Just in time: the project is:
http://git.c3sl.ufpr.br/gitweb?p=chessd/bosh.git;a=summary
--
Bruno Ribas - ***@c3sl.ufpr.br
http://web.inf.ufpr.br/ribas
C3SL: http://www.c3sl.ufpr.br
J.H.
2008-02-13 00:57:30 UTC
Post by Bruno Cesar Ribas
Post by Jakub Narebski
Post by Bruno Cesar Ribas
I found these results very interesting, so instead of trying to think of a
new way to store gitweb config, we should think of a way to cache this
information.
First, the basis of any optimisation is finding the bottlenecks.
I think it was posted here some time ago that the pages generating the most
load are the projects list and the feeds.
Kernel.org even runs a modified version of gitweb with some caching
support; cgit (a git web interface in C) also has caching support.
Is this gitweb version for kernel.org available somewhere?
It's available from my git tree on kernel.org
http://git.kernel.org/?p=git/warthog9/gitweb.git;a=summary

or

git://git.kernel.org/pub/scm/git/warthog9/gitweb.git

Mind you, my performance in the non-cached state is not going to be any
better than normal gitweb; however, the performance on a cache hit is orders
of magnitude faster - though at a rather expensive cost: disk space. There is
currently something like 20G of disk being used on one of kernel.org's
machines to provide the cache (this does get flushed on occasion, I think),
but that is providing caching for everything that kernel.org has in its git
trees (255188 unique URLs currently). My code base is now horribly out of
date with respect to mainline, but it works and has been solid and reasonably
reliable (though I do know of two bugs in it right now that I need to track
down - one a failure of the script, and one an array-out-of-bounds error).

- John
Jakub Narebski
2008-02-13 12:17:46 UTC
Post by J.H.
Post by Bruno Cesar Ribas
Post by Jakub Narebski
Kernel.org even runs a modified version of gitweb with some caching
support; cgit (a git web interface in C) also has caching support.
Is this gitweb version for kernel.org available somewhere?
It's available from my git tree on kernel.org
http://git.kernel.org/?p=git/warthog9/gitweb.git;a=summary
or
git://git.kernel.org/pub/scm/git/warthog9/gitweb.git
Mind you my performance on the non-cache state is not going to be any
better than normal gitweb, however the performance on a cache-hit is
orders of magnitude faster - though at a rather expensive cost - disk
space. There is currently something like 20G of disk being used on one
of kernel.org's machines providing the cache (this does get flushed on
occasion - I think) but that is providing caching for everything that
kernel.org has in its git trees (255188 unique URLs currently). My
code base is now, horribly, out of date with respect to mainline but it
works and it's been solid and reasonably reliable (though I do know of
two bugs in it right now I need to track down - one with respect to a
failure of the script - and one that is an array out of bounds error)
BTW, did you consider using cgit (the caching git web interface in C) instead
of, or in addition to, gitweb? Freedesktop.org uses it side by side with
gitweb. I wonder how it would perform on kernel.org...

(Almost) every optimization should begin with profiling. Could you tell
us which gitweb pages are called most often, and perhaps which pages generate
the most load for kernel.org? How are new projects added (and old projects
deleted)? Do you control (i.e. can you add to, or add multiplexing to) the
update or post-receive hooks?

Without this data we could end up concentrating on things that don't matter.
BTW, I wonder if splitting the projects_list page would help...
--
Jakub Narebski
Poland
J.H.
2008-02-13 19:12:03 UTC
Post by Jakub Narebski
BTW, did you consider using cgit (the caching git web interface in C) instead
of, or in addition to, gitweb? Freedesktop.org uses it side by side with
gitweb. I wonder how it would perform on kernel.org...
When I branched and did the initial work for gitweb-caching, cgit had only
barely reached version 0.01, so putting something *that* new into production
on kernel.org didn't even remotely make sense. Since then the caching
modifications (along with a few other fixes and such) have proven to be quite
stable and have withstood the onslaught of users fairly well. I have toyed
with the idea of giving up on gitweb-caching (since I either need to redo it
to bring it closer to mainline gitweb, probably giving up on breaking it into
multiple files, or switch to something new), but the question that neither I
nor anyone else on the kernel.org admin staff has had time to investigate is
whether cgit uses the same URL paths. If so, it would be a simple drop-in
replacement, and that would appeal to us. If it doesn't, we can't use cgit
and will have to stick with gitweb or a direct derivative thereof.
Post by Jakub Narebski
(Almost) every optimization should begin with profiling. Could you tell
us which gitweb pages are most called and perhaps which pages generate
most load for kernel.org?
That would be correct, though when I did gitweb-caching the profiling was
blatantly obvious: with every single page request git was being called, git
was hammering the disk, and it was becoming increasingly obvious that running
git for every page load was completely impractical. I know git is fast - but
it's not *that* fast, and for certain things it is a bit abusive to the
system, requiring a lot of memory and chewing CPU or disk. In order of
badness for kernel.org: chewing memory, then disk, then CPU. Use up too much
memory and you force too much needed content out of RAM; chew up disk and you
make queries that are forced to disk take longer (and if you've chewed up too
much RAM this gets *lots* worse).

So the simplest and most obvious thing: take git out of the equation,
directly, for most calls. If you've ever seen the "Generating..." page on
kernel.org, that is a stalling mechanism I'm using to let git run in the
background and generate the page you're going to see. You'll notice it can
take several seconds for that to complete, and we are on *very* fast boxes -
now multiply that by hundreds of times a second and you'll start to
understand why the caching layer is saving us right now.

As for the most often hit pages - the front page is by far hit the most,
which should be no surprise to anyone, and it is by far the most abusive
since it has to query *every* project we have. After that things taper off as
people find the project they want and go looking for the data they are
interested in.
Post by Jakub Narebski
How are new projects added (and old projects
deleted)?
By and large it is left up to the users: if they don't want their tree
anymore they delete it (though I don't know of anyone who has), and if they
need another one they create it.
Post by Jakub Narebski
Do you control (i.e. can you add to, or add multiplexing to) the update
or post-receive hooks?
No. We do not want to control, in any way at all, the trees that people put
up on kernel.org. We just don't have the bandwidth to deal with that for
every single tree on kernel.org. Anything that would require us to change, or
to force a user to change, something in their git tree means we've already
lost. Making the caching layer 100% transparent to the git tree owners, and
generally speaking to the end user, keeps things very simple for us to deal
with.
Post by Jakub Narebski
Without this data we could end up concentrating on things that don't matter.
BTW, I wonder if splitting the projects_list page would help...
That would be bad - I know for a fact that there are people who go to
git.kernel.org and then search within the page for the things they want - so
changing this would probably cause a lot of confusion for minor gain at
this point.

- John 'Warthog9' Hawley
Jakub Narebski
2008-02-14 01:01:53 UTC
Post by J.H.
Post by Jakub Narebski
BTW, did you consider using cgit (the caching git web interface in C) instead
of, or in addition to, gitweb? Freedesktop.org uses it side by side with
gitweb. I wonder how it would perform on kernel.org...
When I branched and did the initial work for gitweb-caching, cgit had
only barely reached version 0.01, so putting something *that* new into
production on kernel.org didn't even remotely make sense.
If I remember correctly, cgit was _created_ in response to (or at least
around) a discussion on the git mailing list about kernel.org needing
caching for gitweb.

BTW. I have CC-ed CGit author, Lars Hjemli.
Post by J.H.
Since then
the caching modifications (along with a few other fixes and such) have
proven to be quite stable and have withstood the onslaught of users
fairly well. I have toyed with the idea of giving up on gitweb-caching
(since I either need to redo it to bring it closer to mainline gitweb,
and probably give up on breaking it up into multiple files or switch to
something new)
By the way, why did you split it into so many modules? I would have thought
that separating it into generic modules (for example, commit parsing),
HTML-generation modules (specific to gitweb) and a caching module would be
enough; or perhaps, for easier integration, just the main gitweb (core
version) plus a caching module.

I was thinking about adding caching to git.git gitweb, using code from your
fork, somewhere along the line... I guess I can move it earlier in the TODO
list, somewhere alongside the CSS cleanup, the cleanup of log-like views
generation, and making feed links depend on the page.
Post by J.H.
but the current question that I, and no one else on the
kernel.org admin staff has had time to investigate is does cgit use the
same url paths. If so it would be a simple drop-in replacement and that
would appeal to us. If it doesn't we can't use cgit and will have to
stick with gitweb or a direct derivative there-of.
Unfortunately cgit is not designed to be gitweb compatible; it is simpler
(which might be considered a good thing), doesn't support the multitude of
gitweb views, and unfortunately doesn't understand gitweb URLs.

On the other hand, having gitweb and cgit coexist on the same set of
repositories should be quite easy, as shown by the freedesktop.org folks:
http://gitweb.freedesktop.org and http://cgit.freedesktop.org
Post by J.H.
Post by Jakub Narebski
(Almost) every optimization should begin with profiling. Could you tell
us which gitweb pages are most called and perhaps which pages generate
most load for kernel.org?
That would be correct, though when I did up gitweb-caching the profiling
was blatantly obvious, with every single page request git was being
called, git was hammering the disk and it was becoming increasingly
obvious that going and running git for every page load was completely
impractical.
[bringing back old quote]
Post by J.H.
Post by Jakub Narebski
Post by J.H.
There is currently something like 20G of disk being used on one
of kernel.org's machines providing the cache (this does get flushed on
occasion - I think) but that is providing caching for everything that
kernel.org has in its git trees (255188 unique URLs currently).
What I meant here is that one should balance between not having a cache at
all (and spending all the CPU) and caching everything under the sun (and
spending all the HDD space). For that, one should know which pages are
requested most, so they can be cached by having a static page to serve;
which generate the most load or are requested less often, so perhaps only the
git command output would be cached; and which are rare enough that caching
them is a waste of disk space, and hints to caching proxies should be enough.

[...]
Post by J.H.
As for the most often hit pages - the front page is by far hit the most,
which should be no surprise to everyone, and it is by far the most
abusive since it has to query *every* project we have. After that things
taper off as people find the project they want and go looking for the
data they are interested in.
But which of those pages are requested most and generate the most load? The
'summary' page? 'rss' or 'atom' feeds? The 'tree' view? The README 'blob'?
Snapshots (if enabled)?
Post by J.H.
Post by Jakub Narebski
How are new projects added (and old projects
deleted)?
By and large - left up to the users - if they don't want their tree
anymore they delete it (though I don't know of anyone who has) if they
need another one - they create it.
Bummer. If projects were created by some script (as I think is the case for
git hosting facilities like repo.or.cz, GitHub, Gitorious or TucFamily), we
could update the projects listing file (so gitweb doesn't need to scan
directories), and perhaps even add some gitweb-specific hooks (add a
multiplexer plus hooks).
Post by J.H.
Post by Jakub Narebski
Do you control (i.e. can you add to, or add multiplexing to) the update
or post-receive hooks?
No. We do not want to control, in any way at all, the trees that
people put up on Kernel.org. We just don't have the bandwidth to deal
with that for every single tree on kernel.org. Anything that would
require us to go changing or forcing a user to change something in their
git tree means we've already lost. Taking the caching layer and making
it 100% transparent to the git tree's owners and generally speaking to
the end user makes things very simple for us to deal with.
That's bad, because the update / post-receive hook could be used, for
example, to invalidate the 'summary', 'rss' and 'atom' view caches, and
perhaps to regenerate the projects list page. The first request would
generate the cache, which would then be used until the hook script deleted it.
Post by J.H.
Post by Jakub Narebski
Without this data we could end up concentrating on things that don't matter.
BTW, I wonder if splitting the projects_list page would help...
That would be bad - I know for a fact there are people who will go to
git.kernel.org and then search on the page for the things they want - so
changing this would probably cause a lot of confusion for minor gain at
this point.
I was thinking about the first page being a page of categories, perhaps with
a "search projects" box. A page with so many projects is a bit unwieldy.

P.S. Do you make use of alternates, or do you leave that to the users to set up?
--
Jakub Narebski
Poland
J.H.
2008-02-14 22:43:53 UTC
Post by Jakub Narebski
Post by J.H.
Post by Jakub Narebski
BTW, did you consider using cgit (the caching git web interface in C) instead
of, or in addition to, gitweb? Freedesktop.org uses it side by side with
gitweb. I wonder how it would perform on kernel.org...
When I branched and did the initial work for gitweb-caching, cgit had
only barely reached version 0.01, so putting something *that* new into
production on kernel.org didn't even remotely make sense.
If I remember correctly, cgit was _created_ in response to (or at least
around) a discussion on the git mailing list about kernel.org needing
caching for gitweb.
BTW. I have CC-ed CGit author, Lars Hjemli.
I didn't realize it had come out of those discussions a couple of years ago,
though it's good to hear that it did. From what I have used of it, it does
seem to be as fast as gitweb-caching, and its caching doesn't seem to be
quite the sledgehammer that mine is.
Post by Jakub Narebski
Post by J.H.
Since then
the caching modifications (along with a few other fixes and such) have
proven to be quite stable and have withstood the onslaught of users
fairly well. I have toyed with the idea of giving up on gitweb-caching
(since I either need to redo it to bring it closer to mainline gitweb,
and probably give up on breaking it up into multiple files or switch to
something new)
By the way, why did you split into so many modules? I would think
that separating into generic modules (like for example commit parsing),
HTML generation modules (specific to gitweb) and a caching module;
perhaps for easier integration only main gitweb (core version)
and caching module would be enough.
I did it originally for a couple of reasons:

1) a single script that's almost 6000 lines long is a little hard to handle
in a single gulp. I can appreciate why it's done that way, mainly to simplify
installation, but...

2) ... I needed something to help me understand the flow of the code better -
ripping it apart was a good way to do that and to try to group similar
functions.

There is an obvious downside, in retrospect, to doing it this way - I can't
track upstream *nearly* as easily as I would like. That means the code
gitweb-caching is based on is about a year and a half old (it has had one
major update along the way, but that took me two days of manually applying
patches).
Post by Jakub Narebski
I was thinking about adding caching, using code from your fork,
to git.git gitweb, somewhere along the line... I guess I can move
it earlier in the TODO list, somewhere along CSS cleanup and log-like
views generation cleanup, and using feed links depending on page.
It's on my todo list as well: basically, rebase onto the current head and,
instead of breaking stuff up like I did originally, completely redo the tree
so that I can pull from trunk more easily. Considering I'll have some time to
spend on my OSS projects again in the next couple of weeks, I was going to
try to get this accomplished and back into my tree.
Post by Jakub Narebski
Post by J.H.
but the current question that I, and no one else on the
kernel.org admin staff has had time to investigate is does cgit use the
same url paths. If so it would be a simple drop-in replacement and that
would appeal to us. If it doesn't we can't use cgit and will have to
stick with gitweb or a direct derivative there-of.
Unfortunately cgit is not designed to be gitweb compatible; it is simpler
(which might be considered a good thing), doesn't support the multitude of
gitweb views, and unfortunately doesn't understand gitweb URLs.
Sadly, that makes it more or less unusable for us at this point. There are,
I'm sure, a number of links pointing back to kernel.org that it would be nice
not to break - if it's really felt that this doesn't matter I would consider
switching, but if gitweb plus my caching code works, it might be better for
me to try to get the code merged back into trunk rather than risk changing
infrastructure that's been in place for several years now.
Post by Jakub Narebski
On the other hand, having gitweb and cgit coexist on the same set of
repositories should be quite easy, as shown by the freedesktop.org folks:
http://gitweb.freedesktop.org and http://cgit.freedesktop.org
I'm not a big fan of maintaining multiple things that all accomplish the
same goal. I would rather devote the resources to maintaining one thing on
kernel.org than two. This prevents users from getting confused, and prevents
us, in the event of upgrades, from forgetting that one needs updating as well
as the other.
Post by Jakub Narebski
Post by J.H.
Post by Jakub Narebski
(Almost) every optimization should begin with profiling. Could you tell
us which gitweb pages are most called and perhaps which pages generate
most load for kernel.org?
That would be correct, though when I did up gitweb-caching the profiling
was blatantly obvious, with every single page request git was being
called, git was hammering the disk and it was becoming increasingly
obvious that going and running git for every page load was completely
impractical.
[bringing back old quote]
Post by J.H.
Post by Jakub Narebski
Post by J.H.
There is currently something like 20G of disk being used on one
of kernel.org's machines providing the cache (this does get flushed on
occasion - I think) but that is providing caching for everything that
kernel.org has in its git trees (255188 unique URLs currently).
What I meant here is that one should balance between not having a cache
at all (and spending all the CPU) and caching everything under the
sun (and spending all the HDD space). To that one should know which
pages are most requested, so they would be cached by having static
page to serve; which generate most load / are less requested, so
perhaps git commands output would be cached; and which are rare enough
that caching is waste of disk space, and hints to caching proxies
should be enough.
To a great extent, disk is cheap and load is expensive. While I agree it
would be worth spending CPU time to avoid caching some things, that's not the
way gitweb / git works. Gitweb forces calls down to disk whether it's
displaying the index page or showing a diff between two revisions. The
problem is that git, while fast, is very resource intensive, requiring a lot
of RAM and a lot of disk seeking - and on a very busy setup, disk seeking is
a killer. What you want to be able to do is take a file and serve it straight
out, not have to poke around to reassemble everything and then produce
output. This is one of the reasons why the index page is so painful - it
reassembles bits from *every* repository. It's not the CPU that's getting
hit, it's the disk and RAM.

So yes - while I agree caching can be expensive, in very high volume setups
you have to have caching of some sort. Caching proxies aren't smart enough
for something like gitweb either; having a caching layer directly in gitweb
makes a lot more sense - you know which pages need caching, which ones don't,
which may be hit harder than others, which can have a longer cache timeout,
and so on.
Post by Jakub Narebski
[...]
Post by J.H.
As for the most often hit pages - the front page is by far hit the most,
which should be no surprise to everyone, and it is by far the most
abusive since it has to query *every* project we have. After that things
taper off as people find the project they want and go looking for the
data they are interested in.
But which of those pages are requested most and generate the most load?
'summary' page? 'rss' or 'atom' feeds? 'tree' view? README 'blob'?
snapshot (if enabled)?
Right now I don't have explicit statistics on that, though it wouldn't be
hard to add an additional file or a small database of some sort to track
generation times. My gut feeling is that the index page is the worst
(particularly with the number of trees we have), followed by the summary
pages, and from there things fall off dramatically, as most pages after that
may not get hit often.
Post by Jakub Narebski
Post by J.H.
Post by Jakub Narebski
How are new projects added (and old projects
deleted)?
By and large - left up to the users - if they don't want their tree
anymore they delete it (though I don't know of anyone who has) if they
need another one - they create it.
Bummer. If projects were created by some script (like I think is
the case for git hosting facilities, like repo.or.cz, GitHub,
Gitorious or TucFamily) we could update projects listing file
(so gitweb doesn't need to scan directories), and perhaps even
add some gitweb-specific hooks (add multiplexer + hooks).
At this point the git tree is left up to the user and we have no intention
of changing that; we don't even force them to turn on the post-update hook
that deals with git-update-server-info.
Post by Jakub Narebski
Post by J.H.
Post by Jakub Narebski
Do you control (i.e. can you add to, or add multiplexing to) the update
or post-receive hooks?
No. We do not want to control, in any way at all, the trees that
people put up on Kernel.org. We just don't have the bandwidth to deal
with that for every single tree on kernel.org. Anything that would
require us to go changing or forcing a user to change something in their
git tree means we've already lost. Taking the caching layer and making
it 100% transparent to the git tree's owners and generally speaking to
the end user makes things very simple for us to deal with.
That's bad, because the update / post-receive hook could be used, for
example, to invalidate the 'summary', 'rss' and 'atom' view caches, and
perhaps to regenerate the projects list page. The first request would
generate the cache, which would then be used until the hook script deleted it.
They could be useful, but this is left completely up to the tree's owner: we
provide a location for them to publish their trees, and we don't want to
control or limit how they use those trees or what they do with them.
Post by Jakub Narebski
Post by J.H.
Post by Jakub Narebski
Without this data we could end up concentrating on things that don't matter.
BTW, I wonder if splitting the projects_list page would help...
That would be bad - I know for a fact there are people who will go to
git.kernel.org and then search on the page for the things they want - so
changing this would probably cause a lot of confusion for minor gain at
this point.
I was thinking about first page being page of categories, perhaps with
"search projects" box. The page with so many projects is a bit unwieldy.
P.S. Do you make use of alternates, or do you leave that to the users to set up?
Left up to the users; we suggest using alternates, but I'm sure there are
trees on kernel.org that could use them and don't.

- John 'Warthog9' Hawley
Jakub Narebski
2008-02-15 23:19:08 UTC
Post by J.H.
Post by Jakub Narebski
By the way, why did you split into so many modules? I would think
that separating into generic modules (like for example commit parsing),
HTML generation modules (specific to gitweb) and a caching module;
perhaps for easier integration only main gitweb (core version)
and caching module would be enough.
1) a single script that's almost 6000 lines long is a little hard to
handle in a single gulp. I can appreciate why it's being done that way,
mainly to simplify installation, but...
2) ... I needed something to help me understand the flow of code more -
ripping it apart was a good way to do it and try and group similar
functions.
If I remember correctly, the "great renaming" of subroutines and a bit of
code restructuring (moving subroutines so that similar ones sit together)
came later.
Post by J.H.
There is an obvious downside, in retrospect, to me doing it this way - I
can't track upstream *nearly* as easily as I would like. Which means
that the code gitweb-caching is based on is about a year and a half old
(it has had a major update along the way but that took me two days of
manually applying patches to get updated)
Post by Jakub Narebski
I was thinking about adding caching, using code from your fork,
to git.git gitweb, somewhere along the line... I guess I can move
it earlier in the TODO list, somewhere along CSS cleanup and log-like
views generation cleanup, and using feed links depending on page.
It's on my todo list as well, basically re-base to current head and
instead of breaking stuff up like I did in the original, completely redo
the tree so that I can pull from trunk easier. Considering I'll have
some time that I can spend on my OSS projects in the next couple of
weeks again - I was going to try and get this accomplished and back into
my tree.
Better you than me. I don't know if I'd have time for it, and obviously I
don't know the caching code as well.
Post by J.H.
Post by Jakub Narebski
Unfortunately cgit is not designed to be gitweb compatible; it is simpler
(which might be considered a good thing), doesn't support the multitude of
gitweb views, and unfortunately doesn't understand gitweb URLs.
Sadly - that makes it more or less unusable to us at this point. There
are, I'm sure, a number of links that point back to kernel.org that
would be nice to not break - if it's really felt that this isn't the
case I would consider switching, but if gitweb + my caching code works
it might be better for me to try and get the code merged back into trunk
and not risk changing infrastructure that's been in place for several
years now.
It would truly be nice if cgit had a compatibility mode, accepting gitweb
(or gitweb-like) URLs and returning a similar page. Or at least mod_rewrite
rules to rewrite gitweb URLs into equivalent cgit ones.
Post by J.H.
Post by Jakub Narebski
[...] one should balance between not having cache
at all (and spending all the CPU), to caching everything under the
sun (and spending all the HDD space). To that one should know which
pages are most requested, so they would be cached by having static
page to serve; which generate most load / are less requested, so
perhaps git commands output would be cached; and which are rare enough
that caching is waste of disk space, and hints to caching proxies
should be enough.
To a greater extent, disk is cheap, and load is expensive. While I
agree it would be worth spending cpu time to not cache things - thats
not the way gitweb / git works. Gitweb forces calls down to disk if
it's displaying the index page or showing a diff between two
revisions.
What I meant here was to have caching in gitweb for pages like the projects
list, a project's summary, or a project's main RSS/Atom feed on one hand, but
to just add Last-Modified: and (weak?) ETag: headers plus a one-year Expires
(or was half a year supposed to mean infinity according to the RFC?) for
immutable and, I think, rarely accessed pages, like the 'blob' view of a
given file at a given revision, the 'commit' view, or even perhaps the 'tree'
view (all for objects given by an immutable SHA-1 revision / object id).
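(To illustrate what that buys: a browser or caching proxy revalidating such
an immutable page would then get a cheap "304 Not Modified" instead of a
fully regenerated page. The URL and the hash below are only placeholders:)
8<----------------
# First request returns the page plus its validators...
curl -sI 'http://localhost/cgi-bin/gitweb.cgi?p=projeto1.git;a=commit;h=<sha1>'
# ...later requests send them back and should be answered with 304.
curl -sI -H 'If-None-Match: "<the ETag value from above>"' \
	'http://localhost/cgi-bin/gitweb.cgi?p=projeto1.git;a=commit;h=<sha1>'
8<----------------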

And in the middle ground we could save git command output in a kind of "git
cache", for example storing the update / change time for each project;
another example would be incremental blame output.
Post by J.H.
The problem comes in that git, while fast, is very resource
intensive, requiring a lot of ram and a lot of disk seeking - on a very
busy setup, disk seeking is a killer. What you want to be able to do
is take a file and just run it straight out, not have to poke around to
re-assemble everything and then output. This is one of the reasons why
the index page is so painful - it's reassembling bits from *every*
repository. It's not cpu that's getting hit, it's the disk and ram
that's getting hit.
True, the projects list, especially with such a large number of projects,
just has to be cached. It is a pity that due to hysterical raisins^W^W
historical reasons, also known as backwards compatibility, we cannot just put
up a projects search page, or a projects catalogue (projects divided into
categories), instead of a listing of all projects. Or at least divide the
projects list into pages... Though the last option is not that good, unless
you can somehow include the most commonly requested projects on the main
page.
Post by J.H.
So yes - while I agree caching can be expensive, in very high volume
setups you have to have caching of some sort. Caching proxies aren't
smart enough for something like gitweb either, having a caching layer
directly in gitweb makes a lot more sense - you know what pages need
caching, which ones don't, what may be hit harder than others, what you
can have a longer cache timeout, etc on.
I can agree with that.

On the other hand, you are duplicating effort that has already gone into
caching engines elsewhere: selecting what to cache, when to invalidate the
cache, when to prune / purge the cache, etc. But I guess that having gitweb
provide hints to caching engines in the form of Last-Modified: and ETag:
headers, and responding quickly to If-Modified-Since: and If-None-Match:
requests / HEAD requests, might not be enough; and responding to If-*
requests might not be easy - well, not easier than implementing caching
inside gitweb.
Post by J.H.
Post by Jakub Narebski
[...]
Post by J.H.
As for the most often hit pages - the front page is by far hit the most,
which should be no surprise to everyone, and it is by far the most
abusive since it has to query *every* project we have. After that things
taper off as people find the project they want and go looking for the
data they are interested in.
But what of those pages are most requested and generate most load?
'summary' page? 'rss' or 'atom' feeds? 'tree' view? README 'blob'?
snapshot (if enabled)?
Right now I don't have explicit statistics on that, though it wouldn't
be hard to add in an additional file or small database of some sort
that would track generation times.
Profiling websites. Debugging websites. I don't think it is easy...
Post by J.H.
My gut feeling is the index page is
the worst (particularly with the number of trees we have), followed by
the summary pages, and from there things will fall off dramatically as
most pages after that may not get hit often.
What about RSS feeds (as compared to summary page for example)?
Post by J.H.
Post by Jakub Narebski
Post by J.H.
Post by Jakub Narebski
How are new projects added (and old projects
deleted)?
By and large - left up to the users - if they don't want their tree
anymore they delete it (though I don't know of anyone who has) if they
need another one - they create it.
Bummer. If projects were created by some script (like I think is
the case for git hosting facilities, like repo.or.cz, GitHub,
Gitorious or TucFamily) we could update projects listing file
(so gitweb doesn't need to scan directories), and perhaps even
add some gitweb-specific hooks (add multiplexer + hooks).
At this point the git tree is left up to the user and we have no
intention of changing it, we don't even force them to turn on the
post-update hook that will deal with git-update-server-info.
[...]
Post by J.H.
They could be useful, but this is completely left up to the tree's
owner, we provide a location for them to publish their trees, we don't
want to control or limit how or what they do with those trees.
So is the projects list on kernel.org generated by scanning the filesystem
($projects_list unset, or set to a directory)?


I asked about this because, with projects added, renamed and deleted by a
script, you can generate / regenerate the projects list file whenever a
project changes. I guess that in theory you could watch the filesystem for
that...

If you could add gitweb's update / post-receive hook, you would be able to
update a file with the "last changed" information, and delete or regenerate
caches for output that depends on the tip of the current branch: project
summary, RSS feeds, etc.

But if it is truly "no can do", then you have to implement cache storage and
cache invalidation yourself...
--
Jakub Narebski
Poland