Discussion:
git-grep while excluding files in a blacklist
Dov Grobgeld
2012-01-17 09:14:59 UTC
Permalink
Does git-grep allow searching for a pattern in all files *except*
files matching a pattern. E.g. in our project we have multiple DLL's
in git, but when searching I would like to exclude these for speed. Is
that possible with git-grep?

Thanks,
Dov
Nguyen Thai Ngoc Duy
2012-01-17 09:19:49 UTC
Permalink
Post by Dov Grobgeld
Does git-grep allow searching for a pattern in all files *except*
files matching a pattern. E.g. in our project we have multiple DLL's
in git, but when searching I would like to exclude these for speed. Is
that possible with git-grep?
Not from command line, no. You can put "*.dll" to .gitignore file then
"git grep --exclude-standard".
--
Duy
Junio C Hamano
2012-01-17 20:09:36 UTC
Permalink
Post by Nguyen Thai Ngoc Duy
Post by Dov Grobgeld
Does git-grep allow searching for a pattern in all files *except*
files matching a pattern. E.g. in our project we have multiple DLL's
in git, but when searching I would like to exclude these for speed. Is
that possible with git-grep?
Not from command line, no. You can put "*.dll" to .gitignore file then
"git grep --exclude-standard".
No rush, but is this something we would eventually want to handle with the
negative pathspec?
Nguyen Thai Ngoc Duy
2012-01-18 01:24:03 UTC
Permalink
Post by Junio C Hamano
Post by Nguyen Thai Ngoc Duy
Post by Dov Grobgeld
Does git-grep allow searching for a pattern in all files *except*
files matching a pattern. E.g. in our project we have multiple DLL's
in git, but when searching I would like to exclude these for speed. Is
that possible with git-grep?
Not from command line, no. You can put "*.dll" to .gitignore file then
"git grep --exclude-standard".
No rush, but is this something we would eventually want to handle with the
negative pathspec?
Definitely. But because I'm stuck at adding "seen" feature from
match_pathspec_depth to tree_entry_interesting, that probably won't
happen this year. Adding "--exclude=<pattern>" to git-grep is a more
plausible option.
--
Duy
c***@gmail.com
2012-01-23 09:37:36 UTC
Permalink
Post by Nguyen Thai Ngoc Duy
Definitely. But because I'm stuck at adding "seen" feature from
match_pathspec_depth to tree_entry_interesting, that probably won't happen
this year. Adding "--exclude=<pattern>" to git-grep is a more plausible
option.
I think the .gitattributes mechanism is a better place to configure this, as the
files I with to exclude from grep are always the same files (test fixtures
mainly, minified library code also). I often want to make git-diff be quieter
about them too, so I'll be setting the -diff attribute already.

I've attached a patch, which adds the "grep" attribute, which can (for now) only
be unset. This has the same effect for me as my original patch [1], but should
also improve life for people with large blobs as described above.

In future we could extend the attribute to give a meaning to set values, but I'm
not yet sure what that would look like. We could also add an --exclude=<foo>
flag to grep for people who want to configure grep on a per-run basis, but I
think that is a much less common desire.

The failing test attached to this patch is a symptom of a larger issue with the
way that git-grep handles objects that are not at the root of the repository. A
more obvious symptom can be revealed by comparing the output of:

git grep int HEAD:./builtin

cd builtin; git grep int HEAD:./

The problem is that grep doesn't correctly separate the path from the revision
part of the spec. It's currently unobvious to me how to fix this but I hope
someone more familiar with the code (Nguyen or Junio) might be able to see a
way.

Yours,
Conrad

[1] http://thread.gmane.org/gmane.comp.version-control.git/179299

---8<---


To set -grep on an file in .gitattributes will cause that file to be
skipped completely while grepping. This can be used to reduce the number
of false positives when your repository contains files that are
uninteresting to you, for example test fixtures, dlls or minified source
code.

The other approach considered was to allow an --exclude flag to grep at runtime,
however that better serves the less common use-case of wanting to customise the
list of files per-invocation rather than statically.

Signed-off-by: Conrad Irwin <***@gmail.com>
---
Documentation/git-grep.txt | 7 +++++++
Documentation/gitattributes.txt | 9 +++++++++
builtin/grep.c | 8 ++++++++
grep.c | 21 +++++++++++++++++++++
grep.h | 1 +
t/t7810-grep.sh | 30 ++++++++++++++++++++++++++++++
6 files changed, 76 insertions(+), 0 deletions(-)

diff --git a/Documentation/git-grep.txt b/Documentation/git-grep.txt
index 6a8b1e3..7c74165 100644
--- a/Documentation/git-grep.txt
+++ b/Documentation/git-grep.txt
@@ -242,6 +242,13 @@ OPTIONS
If given, limit the search to paths matching at least one pattern.
Both leading paths match and glob(7) patterns are supported.

+ATTRIBUTES
+----------
+
+grep::
+ If the grep attribute is unset on a file using the linkgit:gitattributes[1]
+ mechanism, then that file will not be searched.
+
Examples
--------

diff --git a/Documentation/gitattributes.txt b/Documentation/gitattributes.txt
index a85b187..3ffcec7 100644
--- a/Documentation/gitattributes.txt
+++ b/Documentation/gitattributes.txt
@@ -869,6 +869,15 @@ If this attribute is not set or has an invalid value, the value of the
`gui.encoding` configuration variable is used instead
(See linkgit:git-config[1]).

+Configuring files to search
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+`grep`
+^^^^^^
+
+If the attribute `grep` is unset for a file then linkgit:git-grep[1]
+will ignore that file while searching for matches.
+

USING MACRO ATTRIBUTES
----------------------
diff --git a/builtin/grep.c b/builtin/grep.c
index 9ce064a..9f8dfc0 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -398,6 +398,10 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
struct strbuf pathbuf = STRBUF_INIT;
char *name;

+ if (!should_grep_path(opt, filename + tree_name_len)) {
+ return 0;
+ }
+
if (opt->relative && opt->prefix_length) {
quote_path_relative(filename + tree_name_len, -1, &pathbuf,
opt->prefix);
@@ -464,6 +468,10 @@ static int grep_file(struct grep_opt *opt, const char *filename)
struct strbuf buf = STRBUF_INIT;
char *name;

+ if (!should_grep_path(opt, filename)) {
+ return 0;
+ }
+
if (opt->relative && opt->prefix_length)
quote_path_relative(filename, -1, &buf, opt->prefix);
else
diff --git a/grep.c b/grep.c
index 486230b..e948576 100644
--- a/grep.c
+++ b/grep.c
@@ -1,5 +1,6 @@
#include "cache.h"
#include "grep.h"
+#include "attr.h"
#include "userdiff.h"
#include "xdiff-interface.h"

@@ -829,6 +830,26 @@ static inline void grep_attr_unlock(struct grep_opt *opt)
#define grep_attr_unlock(opt)
#endif

+static struct git_attr_check attr_check[1];
+static void setup_attr_check(void)
+{
+ if (attr_check[0].attr)
+ return; /* already done */
+ attr_check[0].attr = git_attr("grep");
+}
+int should_grep_path(struct grep_opt *opt, const char *name) {
+ int ret = 1;
+
+ grep_attr_lock(opt);
+ setup_attr_check();
+ git_check_attr(name, 1, attr_check);
+ if (ATTR_FALSE(attr_check[0].value))
+ ret = 0;
+ grep_attr_unlock(opt);
+
+ return ret;
+}
+
static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
{
xdemitconf_t *xecfg = opt->priv;
diff --git a/grep.h b/grep.h
index fb205f3..266002d 100644
--- a/grep.h
+++ b/grep.h
@@ -129,6 +129,7 @@ extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field
extern void compile_grep_patterns(struct grep_opt *opt);
extern void free_grep_patterns(struct grep_opt *opt);
extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int should_grep_path(struct grep_opt *opt, const char *name);

extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
extern int grep_threads_ok(const struct grep_opt *opt);
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 7ba5b16..c991518 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -871,4 +871,34 @@ test_expect_success 'mimic ack-grep --group' '
test_cmp expected actual
'

+test_expect_success 'with -grep attribute' '
+ echo "hello.c -grep" >.gitattributes &&
+ test_must_fail git grep printf &&
+ rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific file' '
+ echo "hello.c -grep" >.gitattributes &&
+ test_must_fail git grep printf hello.c &&
+ rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific revision' '
+ echo "hello.c -grep" >.gitattributes &&
+ test_must_fail git grep printf HEAD &&
+ rm .gitattributes
+'
+
+test_expect_success 'with -grep attribute on specific file/revision' '
+ echo "hello.c -grep" >.gitattributes &&
+ test_must_fail git grep printf HEAD hello.c &&
+ rm .gitattributes
+'
+
+test_expect_failure 'with -grep attribute on specific tree' '
+ echo "hello.c -grep" >.gitattributes &&
+ test_must_fail git grep printf HEAD:hello.c &&
+ rm .gitattributes
+'
+
test_done
--
1.7.9.rc2.1.g1fdd3
Junio C Hamano
2012-01-23 18:33:33 UTC
Permalink
Post by c***@gmail.com
---8<---
To set -grep on an file in .gitattributes will cause that file to be
skipped completely while grepping. This can be used to reduce the number
of false positives when your repository contains files that are
uninteresting to you, for example test fixtures, dlls or minified source
code.
Please reword this to describe the problem being solved first (why it
needs to be solved, in what situation you cannot live without the
feature), and then describe the approach you used to solve it.

Plain "grep" does this:

$ grep world hello.*
hello.c: printf("Hello, world.\n");
Binary file hello.o matches

in order to avoid uselessly spewing garbage at you while reminding you
that the command is not showing the whole truth and you _might_ want to
look into the elided file if you wanted to with "grep -a world hello.o".
Compared to this, it feels that the result your patch goes too far and
hides the truth a bit too much to my taste. Maybe it is just me.

Perhaps you, or all participants of your particular project, usually do
not want to see any grep hits from minified.js, but you may still want to
be able to say "I want to dig deeper and make sure I have copyright
strings in all files", for example. It is unclear how you envision to
support such a use case building on top of this patch.

Your "attributes only" is not an acceptable solution in the longer run,
even though it is a good first step (i.e. "attributes first and other
enhancement later"). There should be an easy way to get the best of both
worlds.
Post by c***@gmail.com
The other approach considered was to allow an --exclude flag to grep at
runtime, however that better serves the less common use-case of wanting
to customise the list of files per-invocation rather than statically.
I doubt that it is justifiable to call per-invocation "the less common".

By the way, if the uninteresting ones are dll and minified.js, I wonder
why it is insufficient to mark them binary, i.e. uninteresting for the
purpose of all textual operations not just grep but also for diff.

I am *not* going to ask why they are treated as source and tracked by git
to begin with.
Conrad Irwin
2012-01-23 22:59:15 UTC
Permalink
This post might be inappropriate. Click to display it.
Junio C Hamano
2012-01-24 06:59:45 UTC
Permalink
This post might be inappropriate. Click to display it.
Jeff King
2012-01-25 21:46:26 UTC
Permalink
Post by Junio C Hamano
I used to use this approach, hooking into the "diff" attribute directly to mark
a file as binary, however that was clearly a hack.
After thinking about this a bit more, I have to say I disagree that it is
a hack.
I kind of agree.

The biggest problem is that the name is wrong. The "diff.*.command"
option really is about generating a diff between two blobs of a certain
type. But "diff.*.textconv" and "diff.*.binary" are really just
attributes of the file, and may or may not have to do with generating a
diff. Ditto for diff.*.funcname, I think.

You argue, and I agree, that if we are talking about attributes of the
files and not diff-specific things, then other parts of git can and
should make use of that information.

So if this was all spelled:

$ cat .gitattributes
*.pdf filetype=pdf
$ cat .git/config
[filetype "pdf"]
binary = true
textconv = pdf2txt

I think it would be a no-brainer that those type attributes should apply
to "git grep".

So maybe the first step on this path would be to introduce something
like "filetype" as a new attribute, have "diff" respect its settings,
and recommend people set up filetypes as appropriate. Or maybe that just
makes things more confusing in the long run, and we are better off
simply accepting that the name is slightly misleading. But either way,
it seems clear that git should be respecting gitattributes at the very
least to mark files as binary (and I think we already use funcname
patterns; textconv is a slightly stickier subject, so I'll address that
below).

But what I'm not sure I agree with is that the idea of "I don't want to
include path X in my grep" maps to "just mark the file as binary". For
example, in git we carry around a lot of code in compat/ that comes from
other places. I generally don't want to see grep results from
compat/nedmalloc/, because that isn't git code.

But should I mark everything in compat/nedmalloc as binary? I don't
think so. I _do_ want to see changes in nedmalloc in "git log" or "git
diff". They don't bother me because they're infrequent, and we still
want to produce regular text patches for the list when they come up. But
because nedmalloc contains a lot of lines of code (even though they
don't change a lot), it happens to produce a lot of uninteresting
matches when grepping.
Post by Junio C Hamano
- The user has flexibility to set "diff" and "grep" independently, which
is an unnecessary complication [*1*]; and
In the nedmalloc case, if we are to have "grep" and "diff" attributes
that behave similarly, you potentially do want to set them differently.
It would be nice to be able to treat them differently in the cases you
wanted to, but not _have_ to do so. Attribute macros can almost
implement this. You could add "-grep" to binary. But you can't (as far
as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we
could extend the rules to allow macros that take an argument and pass it
to their constituent attributes.

However, I think this is the wrong road to go down. You would want
macros like this _if_ you have grep and diff attributes that basically
do the same thing (e.g., marking a file as binary for diff versus binary
for grep). But I think that's a wrong road to go down. More likely a
file is binary or it is not binary, and the problem is conflating "file
is binary" with "I do not usually want to grep this file".

I'd much rather see grep inherit diff's file attributes unconditionally
(whether we still call them "diff" or not), and add a grep attribute
that is about "usually this is worth grepping". And then have a
tri-state command-line option for grep to either include uninteresting
things, exclude them, or to give terse output for them (mention that
there were matches in foo.c, but not each one). Probably defaulting to
terse output.

That makes the case you presented work out of the box: things marked as
binary for diffing look binary to grep, and we give the usual terse
"binary file matches" output. The user doesn't have to do anything. For
more complex cases like nedmalloc, you can still achieve the "this is
text, but it's usually boring" behavior. And if you really want to do a
thorough grep, you can just "git grep --exclude=none".
Post by Junio C Hamano
So let's step back a bit and take a look at the handling of files for
which you do not want to see patch output and/or you do not want to see
grep hits, in a fictional but realisitic use scenario.
[...]
I think this is a nice user story, and it fits in with your suggested
git behavior. But I think there are other stories, too, like the
nedmalloc one. And it would be nice to make them work without hurting
the simplicity of the case you mentioned.
Post by Junio C Hamano
If you think about it this way from the very high level description of the
problem, aka, end user's point of view, it is fairly clear that tying the
"binary" attribute to "git grep" to allow us to override the built-in
buffer_is_binary() call you see in grep.c gives the most intuitive result,
without forcing the user to do anything more than they are already doing.
This is not a complaint about the core of your point, but rather an
aside that should be considered: how many people are really using the
binary macro attribute? Personally, I never use it, because when I mark
something with a "diff" attribute, it is because I am telling git about
a specific diff driver (usually textconv). Otherwise I don't bother
setting attributes at all, because git's binary detection tends to be
good. This leaves me without setting "-text", of course, but I don't
generally care because I don't do CRLF conversion at all.

So I think it is worth considering not just people setting "binary", but
how users of just "diff" (both "-diff" and "diff=foo") will want things
to work.
Post by Junio C Hamano
Suppose that this binary blob firmware came with an API manual formatted
in PDF, xyzzy.pdf, also supplied by the vendor. It is also kept in the
repository, but again, running textual diff between updated versions of
the PDF documentation would not be very useful. I however may have a
textconv filter defined for it to grab the text by running pdf2ascii.
Now if my "git show --textconv xyzzy.pdf" has an output line that says a
string "XYZZY API 2.0" was added to the current version, wouldn't it be
natural for me to expect that "git grep --textconv 'XYZZY API' xyzzy.pdf"
to find it [*4*]?
This is an interesting concept. As a user of textconv, I already have
some specialized grep tools for matching inside binary files (e.g., I
have a tool that greps within exif tags of images). But being able to do
so with "git grep", and even at an arbitrary revision, is kind of neat.

I would worry about turning it on by default, since the results could be
misleading. In particular, your pattern "foo" might be in the binary
file but not in the textconv'd version, leading you to think you had
found all instances of "foo" but had not (or much more subtle, things
like line breaks really matter during the conversion if you are going to
be using "grep -C").

Making it available by "--textconv" seems reasonable to me, though. The
only inconsistency is that it's on by default for "git show", but would
not be for "git grep".

Perhaps I am being overly paranoid on the "misleading" bit above. It
seems to me that grep has the room to be a lot more subtle, because an
omission from the output is considered "did not match". But you could
construct equally weird scenarios for "git show" (e.g., you changed
"foo" to "bar" but that part did not appear in the textconv portion.
Which is really a quality-of-implementation issue for your textconv
filter).
Post by Junio C Hamano
As an added bonus, if you truly want to omit _any_ hit from "git grep" for
your minified.js files, you can easily do so. Just define a textconv
filter that yields an empty string for them, and "grep --textconv" won't
produce any matches, not even "Binary file ... matches".
Clever. But then you will never ever see a diff for that file, either,
because we will consider all changes to be empty (actually, I didn't
check, but you may get the diff header without any content, similar to
the stat-dirty entries).

-Peff
Stephen Bash
2012-01-26 13:51:52 UTC
Permalink
----- Original Message -----
Sent: Wednesday, January 25, 2012 4:46:26 PM
Subject: Re: [PATCH] Don't search files with an unset "grep" attribute
... snip ...
$ cat .gitattributes
*.pdf filetype=pdf
$ cat .git/config
[filetype "pdf"]
binary = true
textconv = pdf2txt
I think it would be a no-brainer that those type attributes should
apply to "git grep".
Looking at this purely as a user, what difference/advantage would that bring versus

$ cat .gitattributes
*.pdf binary=true textconv=pdf2text

or

$ cat .gitattributes
[attr]pdf binary=true textconv=pdf2text
*.pdf pdf

(admittedly I have no clue if gitattributes actually supports anything like this)

I guess my point is as a user, I've gravitated to "gitattributes is about files in my repo, gitconfig is about Git's behavior" (though this is a grey area).

To partially answer my own question: one advantage of putting the filetype information in a config file is it allows system- and user-wide filetype settings. In my personal experience I've always handled that information on a per-repository basis, but that doesn't mean everyone would want to.

Thanks,
Stephen
Jeff King
2012-01-26 17:29:48 UTC
Permalink
Post by Stephen Bash
Post by Jeff King
$ cat .gitattributes
*.pdf filetype=pdf
$ cat .git/config
[filetype "pdf"]
binary = true
textconv = pdf2txt
Looking at this purely as a user, what difference/advantage would that bring versus
$ cat .gitattributes
*.pdf binary=true textconv=pdf2text
For "binary", probably not much. But for textconv, it is all about the
Post by Stephen Bash
To partially answer my own question: one advantage of putting the
filetype information in a config file is it allows system- and
user-wide filetype settings. In my personal experience I've always
handled that information on a per-repository basis, but that doesn't
mean everyone would want to.
Right. Setting things system-wide instead of per-repo is one advantage.
But more important is that attributes are not per-repo, but rather
"per-project". They get committed, and everybody who works on the
project shares them.

In your example, the gitattributes get committed, and the project is
mandating "you _will_ use pdf2text to view diffs of these files". But
that may not be appropriate for everybody who clones. Somebody may have
a different pdf-to-text converter. Somebody may simply have pdf2txt at a
different path, or need different options. Or somebody may want to skip
it altogether and use an external diff command, or even just see the
files as binary.

By splitting the information across the two files, the project gets to
say "this file is of type pdf", and then each user gets to decide "how
do I want to diff pdf files?"

-Peff
Michael Haggerty
2012-01-26 16:45:16 UTC
Permalink
Post by Jeff King
But what I'm not sure I agree with is that the idea of "I don't want to
include path X in my grep" maps to "just mark the file as binary".
Anybody who wants this policy can simply set

[attr]binary -diff -text -grep

If they want finer granularity, they can adjust the settings for
particular file types or for particular files.
Post by Jeff King
But should I mark everything in compat/nedmalloc as binary? I don't
think so. I _do_ want to see changes in nedmalloc in "git log" or "git
diff". They don't bother me because they're infrequent, and we still
want to produce regular text patches for the list when they come up. But
because nedmalloc contains a lot of lines of code (even though they
don't change a lot), it happens to produce a lot of uninteresting
matches when grepping.
I think decisions such as whether to include an imported module in "git
diff" output is a personal preference and should not be decided at the
level of the git project. The in-tree .gitattributes files should, by
and large, just *describe* the files and leave it to users to associate
policies with the tags (or at least make it possible for users to
override the policies) via .git/info/attributes. For example, the
repository could set an "external=nedmalloc" attribute on all files
under compat/nedmalloc, and users could themselves configure a macro
"[attr]external -diff -grep" (or maybe something like
"[attr]external=nedmalloc -diff -grep") if that is their preference.
Post by Jeff King
It would be nice to be able to treat them differently in the cases you
wanted to, but not _have_ to do so. Attribute macros can almost
implement this. You could add "-grep" to binary. But you can't (as far
as I know) do "macro=foo" to handle arbitrary diff drivers. I suspect we
could extend the rules to allow macros that take an argument and pass it
to their constituent attributes.
Is it really common to want to use the same argument on multiple macros
without also wanting to set other things specifically? If not, then
there is not much reason to complicate macros with argument support.

For example, I do something like

[attr]type-python type=python text diff=python check-ws
*.py type-python

[attr]type-makefile type=makefile text diff check-ws -check-tab
Makefile.* type-makefile

for the main file types in my repository, and it is not very cumbersome.

"type-python" and "type=python" seem redundant but they are not.
"type-python" is needed so that it can be used as a macro.
"type=python" makes it easier to inquire about the type of a file using
something like "git check-attr type -- PATH" rather than having to
inquire about each possible type-* attribute. It might be nice to
support a slightly extended macro definition syntax like

[attr]type=python text diff=python check-ws
*.py type=python

[attr]type=makefile text diff check-ws -check-tab
Makefile.* type=makefile

(i.e., macros that are only triggered for particular values of an
attribute).

Michael
--
Michael Haggerty
***@alum.mit.edu
http://softwareswirl.blogspot.com/
Jeff King
2012-01-27 06:35:03 UTC
Permalink
Post by Michael Haggerty
I think decisions such as whether to include an imported module in "git
diff" output is a personal preference and should not be decided at the
level of the git project.
You're right. I thought of it as an annotation that the project could
mark via .gitattributes, or the user could mark via .git/info/attributes.
But that is not following the right split of responsibility for
attributes and config. The attributes should annotate "this isn't really
part of the regular git code base" or "this is really part of the
nedmalloc codebase". And then the _config_ should say "when I am
grepping, I am not interested in nedmalloc". I.e.:

# mark a set of paths with an attribute
echo "compat/nedmalloc external" >>.gitattributes

# and then ignore that attribute for this grep
git grep --exclude-attr=external

# or for all greps
git config --add grep.exclude external

and git doesn't even have to care about what the attribute is called.
It's between the project and the user how they want to annotate their
files, and how they want to feed them to grep.

Or any other program, for that matter. I wonder if this could also be a
more powerful way of grouping files to be included or excluded from diff
pathspecs. Something like (and I'm just talking off the top of my head,
so there may be some syntactic conflicts here):

# annotate some files
cat >>.gitattributes <<-\EOF
t/t????-*.sh test-script
t/lib-*.sh test-script
t/test-lib.sh test-script
EOF

# and then consider the tagged files to be a group, and look only at
# that group
git log :attr:test-script

# ditto, but imagine we had the negative pathspecs Duy has proposed
git log :~attr:test-script

That seems kind of cool to me. But maybe it is getting into crazy
over-engineering. I like the idea that we don't need a new option to
grep or diff; rather it is simply a new syntax for mentioning paths.
Post by Michael Haggerty
The in-tree .gitattributes files should, by and large, just *describe*
the files and leave it to users to associate policies with the tags
(or at least make it possible for users to override the policies) via
.git/info/attributes. For example, the repository could set an
"external=nedmalloc" attribute on all files under compat/nedmalloc,
and users could themselves configure a macro "[attr]external -diff
-grep" (or maybe something like "[attr]external=nedmalloc -diff
-grep") if that is their preference.
So obviously I took what you were saying here and ran with it above. But
I do disagree with one thing here: the attributes should be giving some
tag to the paths, but the actual decision about whether to grep should
be part of the _config_. That's the usual split we have for all of the
other attributes, and I think it makes sense and has worked well.
Post by Michael Haggerty
Is it really common to want to use the same argument on multiple macros
without also wanting to set other things specifically? If not, then
there is not much reason to complicate macros with argument support.
I dunno. I admit my attribute usage tends to just match by extension,
and I generally only have one or two such lines.
Post by Michael Haggerty
For example, I do something like
[attr]type-python type=python text diff=python check-ws
*.py type-python
[attr]type-makefile type=makefile text diff check-ws -check-tab
Makefile.* type-makefile
for the main file types in my repository, and it is not very cumbersome.
I think it's not a big deal if you are making your own macros. I was
more concerned that people would want to use the "binary" macro to get
the "-grep" automagic, but could not do so because they don't want
"-diff", but rather "diff=foo".

Anyway, after reading your response and thinking on it more, I think
"-grep" is totally the wrong way to go. If the files are marked binary,
then grep should be respecting "-diff" or the "diff.*.binary" config. If
we want to do more advanced exclusion, then the right place for that is
the config file (or the weird :attr pathspec thing I mentioned above).
Post by Michael Haggerty
"type-python" and "type=python" seem redundant but they are not.
"type-python" is needed so that it can be used as a macro.
"type=python" makes it easier to inquire about the type of a file using
something like "git check-attr type -- PATH" rather than having to
inquire about each possible type-* attribute. It might be nice to
support a slightly extended macro definition syntax like
[attr]type=python text diff=python check-ws
*.py type=python
[attr]type=makefile text diff check-ws -check-tab
Makefile.* type=makefile
(i.e., macros that are only triggered for particular values of an
attribute).
I don't think there's any semantic reason why that is not workable. It's
simply not syntactically allowed at this point.

-Peff
Junio C Hamano
2012-02-01 08:01:41 UTC
Permalink
Post by Jeff King
Post by Junio C Hamano
I used to use this approach, hooking into the "diff" attribute directly to mark
a file as binary, however that was clearly a hack.
After thinking about this a bit more, I have to say I disagree that it is
a hack.
I kind of agree.
The biggest problem is that the name is wrong. The "diff.*.command"
option really is about generating a diff between two blobs of a certain
type. But "diff.*.textconv" and "diff.*.binary" are really just
attributes of the file, and may or may not have to do with generating a
diff. Ditto for diff.*.funcname, I think.
You argue, and I agree, that if we are talking about attributes of the
files and not diff-specific things, then other parts of git can and
should make use of that information.
$ cat .gitattributes
*.pdf filetype=pdf
$ cat .git/config
[filetype "pdf"]
binary = true
textconv = pdf2txt
I think it would be a no-brainer that those type attributes should apply
to "git grep".
I think this discussion has, instead of forking into two equally
interesting subthreads, veered to a more intellectually stimulating
tangent and we ended up losing focus.

Regardless of what to do with "I do not want to grep in these types of
files" and "I want textconv applied when grepping in these types", which
would be new attributes to implement two new features, I would like to see
us first concentrate on fixing the "binary" issue. When somebody tells us
"Your autodetection may screw it up, but this file is binary; just show
'Binary files differ.' when comparing." with "-diff" (or "binary"), we
should honor that when "git grep" decides if it should take the 'Binary
file matches' codepath. We currently do not, and it clearly is a bug.

This is especially made somewhat urgent because I do not want a half-baked
"two pathspecs" approach that only "git grep" knows about when we add the
support for "git grep --exclude-path=...".

We should have to teach the underlying machinery that matches pathspec
about negative pathspec entries only once. After we have done so, all the
callers, not just "git grep", should be able to take advantage of the
change by just learning to place negative pathspec entries in the "struct
pathspec" they pass to the machinery. Doing anything else will lead to
madness of adding ad-hoc "here we should further filter with the other
negative 'struct pathspec'" in each and every application.

But I suspect that it would not materialize anytime soon. And I also
suspect that the correct handling of 'Binary file matches', which is a
pure bugfix, should solve the original issue started these threads 90% in
practice.
Jeff King
2012-02-01 08:20:05 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
$ cat .gitattributes
*.pdf filetype=pdf
$ cat .git/config
[filetype "pdf"]
binary = true
textconv = pdf2txt
I think it would be a no-brainer that those type attributes should apply
to "git grep".
I think this discussion has, instead of forking into two equally
interesting subthreads, veered to a more intellectually stimulating
tangent and we ended up losing focus.
That's what I'm here for.
Post by Junio C Hamano
Regardless of what to do with "I do not want to grep in these types of
files" and "I want textconv applied when grepping in these types", which
would be new attributes to implement two new features, I would like to see
us first concentrate on fixing the "binary" issue. When somebody tells us
"Your autodetection may screw it up, but this file is binary; just show
'Binary files differ.' when comparing." with "-diff" (or "binary"), we
should honor that when "git grep" decides if it should take the 'Binary
file matches' codepath. We currently do not, and it clearly is a bug.
Right. It may have been lost in the verbosity of what I wrote in my
previous email, but I completely agree. With the caveat that one should
also respect "diff=foo" coupled with "diff.foo.binary = true" as making
something binary. But that is already handled transparently by the
userdiff.[ch] code (which seems like the obvious entry point for
grep to use for attribute lookup, and which we already use there for
funcname lookup).

The trivial-ish patch is below.
Post by Junio C Hamano
We should have to teach the underlying machinery that matches pathspec
about negative pathspec entries only once. After we have done so, all the
callers, not just "git grep", should be able to take advantage of the
change by just learning to place negative pathspec entries in the "struct
pathspec" they pass to the machinery. Doing anything else will lead to
madness of adding ad-hoc "here we should further filter with the other
negative 'struct pathspec'" in each and every application.
Yes, I agree.
Post by Junio C Hamano
But I suspect that it would not materialize anytime soon. And I also
suspect that the correct handling of 'Binary file matches', which is a
pure bugfix, should solve the original issue started these threads 90% in
practice.
Also agree. Let's fix the bug and then give it some time to see whether
people really want more explicit exclusions.

Here's the bug-fix patch. Not quite ready for inclusion, as it obviously
needs tests and a commit message. Also, we can cache the result of the
userdiff lookup so the funcname code doesn't have to look it up again.

---
diff --git a/grep.c b/grep.c
index b29d09c..d7ab054 100644
--- a/grep.c
+++ b/grep.c
@@ -960,6 +960,15 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
fwrite(buf, size, 1, stdout);
}

+static int grep_buffer_is_binary(const char *path,
+ char *buf, unsigned long size)
+{
+ struct userdiff_driver *drv = userdiff_find_by_path(path);
+ if (drv && drv->binary != -1)
+ return drv->binary;
+ return buffer_is_binary(buf, size);
+}
+
static int grep_buffer_1(struct grep_opt *opt, const char *name,
char *buf, unsigned long size, int collect_hits)
{
@@ -994,11 +1003,11 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,

switch (opt->binary) {
case GREP_BINARY_DEFAULT:
- if (buffer_is_binary(buf, size))
+ if (grep_buffer_is_binary(name, buf, size))
binary_match_only = 1;
break;
case GREP_BINARY_NOMATCH:
- if (buffer_is_binary(buf, size))
+ if (grep_buffer_is_binary(name, buf, size))
return 0; /* Assume unmatch */
break;
case GREP_BINARY_TEXT:
Jeff King
2012-02-01 09:10:09 UTC
Permalink
Post by Jeff King
Here's the bug-fix patch. Not quite ready for inclusion, as it obviously
needs tests and a commit message. Also, we can cache the result of the
userdiff lookup so the funcname code doesn't have to look it up again.
Actually, it's a little bit more complicated. I was looking at a
slightly old version of grep.c. Since 0579f91 (grep: enable threading
with -p and -W using lazy attribute lookup, 2011-12-12), the lookup
happens in lots of sub-functions, and locking is required.

So this is what the patch looks like with proper locking and caching of
the looked-up driver. It's quite messy because the cached driver pointer
has to get passed around quite a bit. And I'm not sure it buys that much
in practice. The cost of attribute lookup _is_ noticeable (which I'll
discuss below), but funcname lookup only happens when we get a grep hit.
So unless you are searching for something extremely common, you're only
going to do a lookup very occasionally (compared to the load of actually
searching through the files). So all of the messiness and caching may
not be worth the effort, as I wasn't able to measure a performance gain.

But there's more. Respecting binary attributes does mean looking up
attributes for _every_ file. And that has a noticeable impact. My
best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.392s.
Yuck.

Part of the problem, I suspect, is that the attribute lookup code is
optimized for locality. We only unwind as much of the stack as we need,
so looking at "foo/bar/baz.c" after "foo/bar/bleep.c" is much cheaper
than looking at "some/other/directory.c". But with threaded grep, that
locality is likely lost, as we are mixing up attribute requests from
different threads.

Given that binary lookup means we need every file's gitattribute, it
might be better to look them up serially at the beginning of the
program, and then pass the resulting userdiff driver to grep_buffer
along with each path.

---
diff --git a/grep.c b/grep.c
index 486230b..3ca840a 100644
--- a/grep.c
+++ b/grep.c
@@ -829,15 +829,28 @@ static inline void grep_attr_unlock(struct grep_opt *opt)
#define grep_attr_unlock(opt)
#endif

-static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
+static struct userdiff_driver *get_cached_userdiff(struct grep_opt *opt,
+ const char *path,
+ struct userdiff_driver **drv)
{
- xdemitconf_t *xecfg = opt->priv;
- if (xecfg && !xecfg->find_func) {
- struct userdiff_driver *drv;
+ if (!*drv) {
grep_attr_lock(opt);
- drv = userdiff_find_by_path(name);
+ *drv = userdiff_find_by_path(path);
+ if (!*drv)
+ *drv = userdiff_find_by_name("default");
grep_attr_unlock(opt);
- if (drv && drv->funcname.pattern) {
+ }
+ return *drv;
+}
+
+static int match_funcname(struct grep_opt *opt, const char *name,
+ struct userdiff_driver **drv_p,
+ char *bol, char *eol)
+{
+ xdemitconf_t *xecfg = opt->priv;
+ if (xecfg && !xecfg->find_func) {
+ struct userdiff_driver *drv = get_cached_userdiff(opt, name, drv_p);
+ if (drv->funcname.pattern) {
const struct userdiff_funcname *pe = &drv->funcname;
xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
} else {
@@ -859,6 +872,7 @@ static int match_funcname(struct grep_opt *opt, const char *name, char *bol, cha
}

static void show_funcname_line(struct grep_opt *opt, const char *name,
+ struct userdiff_driver **drv_p,
char *buf, char *bol, unsigned lno)
{
while (bol > buf) {
@@ -871,20 +885,21 @@ static void show_funcname_line(struct grep_opt *opt, const char *name,
if (lno <= opt->last_shown)
break;

- if (match_funcname(opt, name, bol, eol)) {
+ if (match_funcname(opt, name, drv_p, bol, eol)) {
show_line(opt, bol, eol, name, lno, '=');
break;
}
}
}

-static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
- char *bol, char *end, unsigned lno)
+static void show_pre_context(struct grep_opt *opt, const char *name,
+ struct userdiff_driver **drv_p,
+ char *buf, char *bol, char *end, unsigned lno)
{
unsigned cur = lno, from = 1, funcname_lno = 0;
int funcname_needed = !!opt->funcname;

- if (opt->funcbody && !match_funcname(opt, name, bol, end))
+ if (opt->funcbody && !match_funcname(opt, name, drv_p, bol, end))
funcname_needed = 2;

if (opt->pre_context < lno)
@@ -900,7 +915,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
while (bol > buf && bol[-1] != '\n')
bol--;
cur--;
- if (funcname_needed && match_funcname(opt, name, bol, eol)) {
+ if (funcname_needed && match_funcname(opt, name, drv_p, bol, eol)) {
funcname_lno = cur;
funcname_needed = 0;
}
@@ -908,7 +923,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,

/* We need to look even further back to find a function signature. */
if (opt->funcname && funcname_needed)
- show_funcname_line(opt, name, buf, bol, cur);
+ show_funcname_line(opt, name, drv_p, buf, bol, cur);

/* Back forward. */
while (cur < lno) {
@@ -983,6 +998,17 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
fwrite(buf, size, 1, stdout);
}

+static int grep_buffer_is_binary(struct grep_opt *opt,
+ const char *path,
+ char *buf, unsigned long size,
+ struct userdiff_driver **drv_p)
+{
+ struct userdiff_driver *drv = get_cached_userdiff(opt, path, drv_p);
+ if (drv && drv->binary != -1)
+ return drv->binary;
+ return buffer_is_binary(buf, size);
+}
+
static int grep_buffer_1(struct grep_opt *opt, const char *name,
char *buf, unsigned long size, int collect_hits)
{
@@ -996,6 +1022,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
int show_function = 0;
enum grep_context ctx = GREP_CONTEXT_HEAD;
xdemitconf_t xecfg;
+ struct userdiff_driver *drv = NULL;

if (!opt->output)
opt->output = std_output;
@@ -1017,11 +1044,11 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,

switch (opt->binary) {
case GREP_BINARY_DEFAULT:
- if (buffer_is_binary(buf, size))
+ if (grep_buffer_is_binary(opt, name, buf, size, &drv))
binary_match_only = 1;
break;
case GREP_BINARY_NOMATCH:
- if (buffer_is_binary(buf, size))
+ if (grep_buffer_is_binary(opt, name, buf, size, &drv))
return 0; /* Assume unmatch */
break;
case GREP_BINARY_TEXT:
@@ -1099,16 +1126,16 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
* pre-context lines, we would need to show them.
*/
if (opt->pre_context || opt->funcbody)
- show_pre_context(opt, name, buf, bol, eol, lno);
+ show_pre_context(opt, name, &drv, buf, bol, eol, lno);
else if (opt->funcname)
- show_funcname_line(opt, name, buf, bol, lno);
+ show_funcname_line(opt, name, &drv, buf, bol, lno);
show_line(opt, bol, eol, name, lno, ':');
last_hit = lno;
if (opt->funcbody)
show_function = 1;
goto next_line;
}
- if (show_function && match_funcname(opt, name, bol, eol))
+ if (show_function && match_funcname(opt, name, &drv, bol, eol))
show_function = 0;
if (show_function ||
(last_hit && lno <= last_hit + opt->post_context)) {
Conrad Irwin
2012-02-01 09:28:47 UTC
Permalink
Post by Jeff King
Actually, it's a little bit more complicated. I was looking at a
slightly old version of grep.c. Since 0579f91 (grep: enable threading
with -p and -W using lazy attribute lookup, 2011-12-12), the lookup
happens in lots of sub-functions, and locking is required.
Heh, you just beat me to it.
Post by Jeff King
But there's more. Respecting binary attributes does mean looking up
attributes for _every_ file. And that has a noticeable impact. My
best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.39=
2s.
Post by Jeff King
Yuck.
The first time I introduced this behaviour[1], I made it conditional
on a preference =97 those who wanted "good" grep could set the
preference, while those who wanted "fast" grep could not. I think
that's not a good idea, though if the performance issues are
show-stoppers, I'd suggest the opposite preference (so speed-freaks
can disable the checks).

Tests from [1] included below in case they're still useful (they pass
with your change)

[1] http://article.gmane.org/gmane.comp.version-control.git/179299/matc=
h=3Dgrep
---

diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..4d94461 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,23 @@ test_expect_success 'git grep y<NUL>x a' "
test_must_fail git grep -f f a
"

+test_expect_success 'git -c grep.binaryFiles=3D1 grep ina a' "
+ echo 'a diff' > .gitattributes &&
+ printf 'binaryQfile' | q_to_nul >a &&
+ echo 'a:binaryQfile' | q_to_nul >expect &&
+ git -c grep.binaryFiles=3D1 grep ina a > actual &&
+ rm .gitattributes &&
+ test_cmp expect actual
+"
+test_expect_success 'git -c grep.binaryFiles=3D1 grep tex t' "
+ echo 'text' > t &&
+ git add t &&
+ echo 't -diff' > .gitattributes &&
+ echo Binary file t matches >expect &&
+ git -c grep.binaryFiles=3D1 grep tex t >actual &&
+ rm .gitattributes &&
+ test_cmp expect actual
+"
+
+
test_done
Jeff King
2012-02-01 22:14:37 UTC
Permalink
Post by Conrad Irwin
Post by Jeff King
But there's more. Respecting binary attributes does mean looking up
attributes for _every_ file. And that has a noticeable impact. My
best-of-five for "git grep foo" on linux-2.6 went from 0.302s to 0.=
392s.
Post by Conrad Irwin
Post by Jeff King
Yuck.
=20
The first time I introduced this behaviour[1], I made it conditional
on a preference =E2=80=94 those who wanted "good" grep could set the
preference, while those who wanted "fast" grep could not. I think
that's not a good idea, though if the performance issues are
show-stoppers, I'd suggest the opposite preference (so speed-freaks
can disable the checks).
I've been able to get somewhat better performance by hoisting the
attribute lookup into the parent thread. That means it happens in order
(which lets the attr code's stack optimizations work), and there's no
lock contention.

I'll post finished patches with numbers in a few minutes.
Post by Conrad Irwin
Tests from [1] included below in case they're still useful (they pass
with your change)
Thanks, I'll include them.

-Peff
Jeff King
2012-02-01 23:20:27 UTC
Permalink
Post by Jeff King
The first time I introduced this behaviour[1], I made it conditiona=
l
Post by Jeff King
on a preference =E2=80=94 those who wanted "good" grep could set th=
e
Post by Jeff King
preference, while those who wanted "fast" grep could not. I think
that's not a good idea, though if the performance issues are
show-stoppers, I'd suggest the opposite preference (so speed-freaks
can disable the checks).
=20
I've been able to get somewhat better performance by hoisting the
attribute lookup into the parent thread. That means it happens in ord=
er
Post by Jeff King
(which lets the attr code's stack optimizations work), and there's no
lock contention.
=20
I'll post finished patches with numbers in a few minutes.
OK, here they are. After playing with some options, I'm satisfied this
is a sane way to do it. I don't think it's worth having a config option=
=2E
There is a measurable slowdown, but it's simply not that big.

[1/2]: grep: let grep_buffer callers specify a binary flag
[2/2]: grep: respect diff attributes for binary-ness

There are a few optimizations I didn't do that you could put on top:

1. When "-a" is given, we can avoid the attribute lookup altogether.

2. When "-I" is given, we can actually check attributes _before_
loading the file or blob into memory. This can help with very larg=
e
binaries.

3. When "-I" is given but we have no attribute, we can stream the
beginning of the file or blob to check for binary-ness, and then
avoid loading the whole thing if it turns out to be binary.

I think (1) and (2) should be easy. Doing (3) is a little messier,
because binary detection happens inside grep_buffer, but we can hoist i=
t
out. However, for large files, it might be nice to have a streaming gre=
p
interface anyway, and (3) could be part of that.

-Peff
Junio C Hamano
2012-02-02 02:03:51 UTC
Permalink
Post by Jeff King
1. When "-a" is given, we can avoid the attribute lookup altogether.
Correct.
Post by Jeff King
2. When "-I" is given, we can actually check attributes _before_
loading the file or blob into memory. This can help with very large
binaries.
Nice.
Post by Jeff King
... However, for large files, it might be nice to have a streaming grep
interface anyway, and (3) could be part of that.
Even nicer ;-)
Jeff King
2012-02-01 23:21:09 UTC
Permalink
The caller of grep_buffer may have extra information about
whether a buffer is binary or not (e.g., from configuration).
Let's give them a chance to pass along that information and
override our binary auto-detection.

Callers can still pass "-1" to get the regular
auto-detection (and all callers are converted to do this,
meaning there should be no behavior change yet).

We could maintain source compatibility for callers by adding
a new "grep_buffer_with_flags" and leaving "grep_buffer" as
a wrapper that always passes "-1". But there are only 5
callers of grep_buffer, and only 1 of those (grepping commit
buffers) will not be converted to pass something useful in
the next patch. So it's simpler to just add a "-1" there.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/grep.c | 8 ++++----
grep.c | 23 ++++++++++++++++-------
grep.h | 2 +-
revision.c | 1 +
4 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 9ce064a..e328316 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -221,14 +221,14 @@ static void *run(void *arg)
void* data = load_sha1(w->identifier, &sz, w->name);

if (data) {
- hit |= grep_buffer(opt, w->name, data, sz);
+ hit |= grep_buffer(opt, w->name, -1, data, sz);
free(data);
}
} else if (w->type == WORK_FILE) {
size_t sz;
void* data = load_file(w->identifier, &sz);
if (data) {
- hit |= grep_buffer(opt, w->name, data, sz);
+ hit |= grep_buffer(opt, w->name, -1, data, sz);
free(data);
}
} else {
@@ -421,7 +421,7 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
if (!data)
hit = 0;
else
- hit = grep_buffer(opt, name, data, sz);
+ hit = grep_buffer(opt, name, -1, data, sz);

free(data);
free(name);
@@ -483,7 +483,7 @@ static int grep_file(struct grep_opt *opt, const char *filename)
if (!data)
hit = 0;
else
- hit = grep_buffer(opt, name, data, sz);
+ hit = grep_buffer(opt, name, -1, data, sz);

free(data);
free(name);
diff --git a/grep.c b/grep.c
index 486230b..e547db2 100644
--- a/grep.c
+++ b/grep.c
@@ -983,8 +983,16 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
fwrite(buf, size, 1, stdout);
}

+static int grep_buffer_is_binary(char *buf, unsigned long size, int flag)
+{
+ if (flag == -1)
+ flag = buffer_is_binary(buf, size);
+ return flag;
+}
+
static int grep_buffer_1(struct grep_opt *opt, const char *name,
- char *buf, unsigned long size, int collect_hits)
+ int is_binary, char *buf, unsigned long size,
+ int collect_hits)
{
char *bol = buf;
unsigned long left = size;
@@ -1017,11 +1025,11 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,

switch (opt->binary) {
case GREP_BINARY_DEFAULT:
- if (buffer_is_binary(buf, size))
+ if (grep_buffer_is_binary(buf, size, is_binary))
binary_match_only = 1;
break;
case GREP_BINARY_NOMATCH:
- if (buffer_is_binary(buf, size))
+ if (grep_buffer_is_binary(buf, size, is_binary))
return 0; /* Assume unmatch */
break;
case GREP_BINARY_TEXT:
@@ -1182,23 +1190,24 @@ static int chk_hit_marker(struct grep_expr *x)
}
}

-int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+int grep_buffer(struct grep_opt *opt, const char *name, int is_binary,
+ char *buf, unsigned long size)
{
/*
* we do not have to do the two-pass grep when we do not check
* buffer-wide "all-match".
*/
if (!opt->all_match)
- return grep_buffer_1(opt, name, buf, size, 0);
+ return grep_buffer_1(opt, name, is_binary, buf, size, 0);

/* Otherwise the toplevel "or" terms hit a bit differently.
* We first clear hit markers from them.
*/
clr_hit_marker(opt->pattern_expression);
- grep_buffer_1(opt, name, buf, size, 1);
+ grep_buffer_1(opt, name, is_binary, buf, size, 1);

if (!chk_hit_marker(opt->pattern_expression))
return 0;

- return grep_buffer_1(opt, name, buf, size, 0);
+ return grep_buffer_1(opt, name, is_binary, buf, size, 0);
}
diff --git a/grep.h b/grep.h
index fb205f3..8447e4c 100644
--- a/grep.h
+++ b/grep.h
@@ -128,7 +128,7 @@ extern void append_grep_pattern(struct grep_opt *opt, const char *pat, const cha
extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field, const char *);
extern void compile_grep_patterns(struct grep_opt *opt);
extern void free_grep_patterns(struct grep_opt *opt);
-extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int grep_buffer(struct grep_opt *opt, const char *name, int is_binary, char *buf, unsigned long size);

extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
extern int grep_threads_ok(const struct grep_opt *opt);
diff --git a/revision.c b/revision.c
index c97d834..3dcd968 100644
--- a/revision.c
+++ b/revision.c
@@ -2150,6 +2150,7 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
return 1;
return grep_buffer(&opt->grep_filter,
NULL, /* we say nothing, not even filename */
+ -1,
commit->buffer, strlen(commit->buffer));
}
--
1.7.9.3.gc3fce1.dirty
Junio C Hamano
2012-02-02 00:47:38 UTC
Permalink
Post by Jeff King
The caller of grep_buffer may have extra information about
whether a buffer is binary or not (e.g., from configuration).
Let's give them a chance to pass along that information and
override our binary auto-detection.
Hrm, I would have expected a patch that turns "const char *name" into a
structure that has name and drv as its members, so that later we can tell
the function more about the nature of the contents. Or a separate pointer
to drv in place of your "binary" flag word.

I am not saying that your patch is wrong. It was just somewhat unexpected
that "binary" is the only additional thing we want to tell the function.
Jeff King
2012-02-02 00:52:09 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
The caller of grep_buffer may have extra information about
whether a buffer is binary or not (e.g., from configuration).
Let's give them a chance to pass along that information and
override our binary auto-detection.
Hrm, I would have expected a patch that turns "const char *name" into a
structure that has name and drv as its members, so that later we can tell
the function more about the nature of the contents. Or a separate pointer
to drv in place of your "binary" flag word.
Hmm. Yeah, I would be OK with that, as it's really encapsulating the
idea of "stuff we want to tell grep about this buffer".

What I really didn't want to do was pass the userdiff driver directly,
as that feels way too much like an implementation detail (and while it
can be used to avoid further lookups, it doesn't seem to make a
difference in practice -- see the following patch). And passing it
around became a big messy chore.

But if it were a "struct grep_context" that encapsulated those things, I
think it would be much nicer (it could even carry a pointer to "struct
grep_opt" in it, making the code _more_ pleasant to read, not less).

I'll take a look at re-working it that way.

-Peff
Jeff King
2012-02-02 08:17:47 UTC
Permalink
[+cc Thomas, as I am mangling some of his recent work with my
refactoring]
Post by Jeff King
Post by Junio C Hamano
Hrm, I would have expected a patch that turns "const char *name" into a
structure that has name and drv as its members, so that later we can tell
the function more about the nature of the contents. Or a separate pointer
to drv in place of your "binary" flag word.
[...]
I'll take a look at re-working it that way.
Thanks for a dose of sanity. The result turned out much easier to read
(and explain in the commit messages, as it was simple to break into
smaller commits). In particular, the "don't read binary-marked files at
all with -I" optimization became very natural.

I implemented all of the other optimizations I mentioned except the
"only stream the first few bytes when auto-detecting binary-ness" one.
However, it should be easy to do on top of these changes. I need to
re-visit the similar change to diff_filespec_is_binary, and I'll do both
at the same time.

The patches are:

[1/9]: grep: make locking flag global
[2/9]: grep: move sha1-reading mutex into low-level code
[3/9]: grep: refactor the concept of "grep source" into an object
[4/9]: convert git-grep to use grep_source interface
[5/9]: grep: drop grep_buffer's "name" parameter
[6/9]: grep: cache userdiff_driver in grep_source

These are all refactoring that should have no behavior change.

[7/9]: grep: respect diff attributes for binary-ness

This is the point of the series. :)

[8/9]: grep: load file data after checking binary-ness
[9/9]: grep: pre-load userdiff drivers when threaded

And these two are simple optimizations.

-Peff
Jeff King
2012-02-02 08:18:29 UTC
Permalink
The low-level grep code traditionally didn't care about
threading, as it doesn't do any threading itself and didn't
call out to other non-thread-safe code. That changed with
0579f91 (grep: enable threading with -p and -W using lazy
attribute lookup, 2011-12-12), which pushed the lookup of
funcname attributes (which is not thread-safe) into the
low-level grep code.

As a result, the low-level code learned about a new global
"grep_attr_mutex" to serialize access to the attribute code.
A multi-threaded caller (e.g., builtin/grep.c) is expected
to initialize the mutex and set "use_threads" in the
grep_opt structure. The low-level code only uses the lock if
use_threads is set.

However, putting the use_threads flag into the grep_opt
struct is not the most logical place. Whether threading is
in use is not something that matters for each call to
grep_buffer, but is instead global to the whole program
(i.e., if any thread is doing multi-threaded grep, every
other thread, even if it thinks it is doing its own
single-threaded grep, would need to use the locking). In
practice, this distinction isn't a problem for us, because
the only user of multi-threaded grep is "git-grep", which
does nothing except call grep.

This patch turns the opt->use_threads flag into a global
flag. More important than the nit-picking semantic argument
above is that this means that the locking functions don't
need to actually have access to a grep_opt to know whether
to lock. Which in turn can make adding new locks simpler, as
we don't need to pass around a grep_opt.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/grep.c | 4 ++--
grep.c | 18 ++++++++++--------
grep.h | 2 +-
3 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 5c2ae94..06983f9 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -259,6 +259,7 @@ static void start_threads(struct grep_opt *opt)
pthread_cond_init(&cond_add, NULL);
pthread_cond_init(&cond_write, NULL);
pthread_cond_init(&cond_result, NULL);
+ grep_use_locks = 1;

for (i = 0; i < ARRAY_SIZE(todo); i++) {
strbuf_init(&todo[i].out, 0);
@@ -307,6 +308,7 @@ static int wait_all(void)
pthread_cond_destroy(&cond_add);
pthread_cond_destroy(&cond_write);
pthread_cond_destroy(&cond_result);
+ grep_use_locks = 0;

return hit;
}
@@ -1030,8 +1032,6 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
use_threads = 0;
#endif

- opt.use_threads = use_threads;
-
#ifndef NO_PTHREADS
if (use_threads) {
if (!(opt.name_only || opt.unmatch_name_only || opt.count)
diff --git a/grep.c b/grep.c
index 486230b..7a67c2f 100644
--- a/grep.c
+++ b/grep.c
@@ -807,26 +807,28 @@ static void show_line(struct grep_opt *opt, char *bol, char *eol,
}

#ifndef NO_PTHREADS
+int grep_use_locks;
+
/*
* This lock protects access to the gitattributes machinery, which is
* not thread-safe.
*/
pthread_mutex_t grep_attr_mutex;

-static inline void grep_attr_lock(struct grep_opt *opt)
+static inline void grep_attr_lock(void)
{
- if (opt->use_threads)
+ if (grep_use_locks)
pthread_mutex_lock(&grep_attr_mutex);
}

-static inline void grep_attr_unlock(struct grep_opt *opt)
+static inline void grep_attr_unlock(void)
{
- if (opt->use_threads)
+ if (grep_use_locks)
pthread_mutex_unlock(&grep_attr_mutex);
}
#else
-#define grep_attr_lock(opt)
-#define grep_attr_unlock(opt)
+#define grep_attr_lock()
+#define grep_attr_unlock()
#endif

static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
@@ -834,9 +836,9 @@ static int match_funcname(struct grep_opt *opt, const char *name, char *bol, cha
xdemitconf_t *xecfg = opt->priv;
if (xecfg && !xecfg->find_func) {
struct userdiff_driver *drv;
- grep_attr_lock(opt);
+ grep_attr_lock();
drv = userdiff_find_by_path(name);
- grep_attr_unlock(opt);
+ grep_attr_unlock();
if (drv && drv->funcname.pattern) {
const struct userdiff_funcname *pe = &drv->funcname;
xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
diff --git a/grep.h b/grep.h
index fb205f3..3653bb3 100644
--- a/grep.h
+++ b/grep.h
@@ -116,7 +116,6 @@ struct grep_opt {
int show_hunk_mark;
int file_break;
int heading;
- int use_threads;
void *priv;

void (*output)(struct grep_opt *opt, const void *data, size_t size);
@@ -138,6 +137,7 @@ extern int grep_threads_ok(const struct grep_opt *opt);
* Mutex used around access to the attributes machinery if
* opt->use_threads. Must be initialized/destroyed by callers!
*/
+extern int grep_use_locks;
extern pthread_mutex_t grep_attr_mutex;
#endif
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:18:41 UTC
Permalink
The multi-threaded git-grep code needs to serialize access
to the thread-unsafe read_sha1_file call. It does this with
a mutex that is local to builtin/grep.c.

Let's instead push this down into grep.c, where it can be
used by both builtin/grep.c and grep.c. This will let us
safely teach the low-level grep.c code tricks that involve
reading from the object db.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/grep.c | 29 ++++++-----------------------
grep.c | 6 ++++++
grep.h | 17 +++++++++++++++++
3 files changed, 29 insertions(+), 23 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 06983f9..f4402fa 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -85,21 +85,6 @@ static inline void grep_unlock(void)
pthread_mutex_unlock(&grep_mutex);
}

-/* Used to serialize calls to read_sha1_file. */
-static pthread_mutex_t read_sha1_mutex;
-
-static inline void read_sha1_lock(void)
-{
- if (use_threads)
- pthread_mutex_lock(&read_sha1_mutex);
-}
-
-static inline void read_sha1_unlock(void)
-{
- if (use_threads)
- pthread_mutex_unlock(&read_sha1_mutex);
-}
-
/* Signalled when a new work_item is added to todo. */
static pthread_cond_t cond_add;

@@ -254,7 +239,7 @@ static void start_threads(struct grep_opt *opt)
int i;

pthread_mutex_init(&grep_mutex, NULL);
- pthread_mutex_init(&read_sha1_mutex, NULL);
+ pthread_mutex_init(&grep_read_mutex, NULL);
pthread_mutex_init(&grep_attr_mutex, NULL);
pthread_cond_init(&cond_add, NULL);
pthread_cond_init(&cond_write, NULL);
@@ -303,7 +288,7 @@ static int wait_all(void)
}

pthread_mutex_destroy(&grep_mutex);
- pthread_mutex_destroy(&read_sha1_mutex);
+ pthread_mutex_destroy(&grep_read_mutex);
pthread_mutex_destroy(&grep_attr_mutex);
pthread_cond_destroy(&cond_add);
pthread_cond_destroy(&cond_write);
@@ -313,8 +298,6 @@ static int wait_all(void)
return hit;
}
#else /* !NO_PTHREADS */
-#define read_sha1_lock()
-#define read_sha1_unlock()

static int wait_all(void)
{
@@ -376,9 +359,9 @@ static void *lock_and_read_sha1_file(const unsigned char *sha1, enum object_type
{
void *data;

- read_sha1_lock();
+ grep_read_lock();
data = read_sha1_file(sha1, type, size);
- read_sha1_unlock();
+ grep_read_unlock();
return data;
}

@@ -617,10 +600,10 @@ static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
struct strbuf base;
int hit, len;

- read_sha1_lock();
+ grep_read_lock();
data = read_object_with_reference(obj->sha1, tree_type,
&size, NULL);
- read_sha1_unlock();
+ grep_read_unlock();

if (!data)
die(_("unable to read tree (%s)"), sha1_to_hex(obj->sha1));
diff --git a/grep.c b/grep.c
index 7a67c2f..db58a29 100644
--- a/grep.c
+++ b/grep.c
@@ -826,6 +826,12 @@ static inline void grep_attr_unlock(void)
if (grep_use_locks)
pthread_mutex_unlock(&grep_attr_mutex);
}
+
+/*
+ * Same as git_attr_mutex, but protecting the thread-unsafe object db access.
+ */
+pthread_mutex_t grep_read_mutex;
+
#else
#define grep_attr_lock()
#define grep_attr_unlock()
diff --git a/grep.h b/grep.h
index 3653bb3..4f1b025 100644
--- a/grep.h
+++ b/grep.h
@@ -139,6 +139,23 @@ extern int grep_threads_ok(const struct grep_opt *opt);
*/
extern int grep_use_locks;
extern pthread_mutex_t grep_attr_mutex;
+extern pthread_mutex_t grep_read_mutex;
+
+static inline void grep_read_lock(void)
+{
+ if (grep_use_locks)
+ pthread_mutex_lock(&grep_read_mutex);
+}
+
+static inline void grep_read_unlock(void)
+{
+ if (grep_use_locks)
+ pthread_mutex_unlock(&grep_read_mutex);
+}
+
+#else
+#define grep_read_lock()
+#define grep_read_unlock()
#endif

#endif
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:19:28 UTC
Permalink
The main interface to the low-level grep code is
grep_buffer, which takes a pointer to a buffer and a size.
This is convenient and flexible (we use it to grep commit
bodies, files on disk, and blobs by sha1), but it makes it
hard to pass extra information about what we are grepping
(either for correctness, like overriding binary
auto-detection, or for optimizations, like lazily loading
blob contents).

Instead, let's encapsulate the idea of a "grep source",
including the buffer, its size, and where the data is coming
from. This is similar to the diff_filespec structure used by
the diff code (unsurprising, since future patches will
implement some of the same optimizations found there).

The diffstat is slightly scarier than the actual patch
content. Most of the modified lines are simply replacing
access to raw variables with their counterparts that are now
in a "struct grep_source". Most of the added lines were
taken from builtin/grep.c, which partially abstracted the
idea of grep sources (for file vs sha1 sources).

Instead of dropping the now-redundant code, this patch
leaves builtin/grep.c using the traditional grep_buffer
interface (which now wraps the grep_source interface). That
makes it easy to test that there is no change of behavior
(yet).

Signed-off-by: Jeff King <***@peff.net>
---
grep.c | 198 +++++++++++++++++++++++++++++++++++++++++++++++++++++-----------
grep.h | 22 +++++++
2 files changed, 186 insertions(+), 34 deletions(-)

diff --git a/grep.c b/grep.c
index db58a29..8204ca2 100644
--- a/grep.c
+++ b/grep.c
@@ -837,13 +837,13 @@ pthread_mutex_t grep_read_mutex;
#define grep_attr_unlock()
#endif

-static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
+static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bol, char *eol)
{
xdemitconf_t *xecfg = opt->priv;
if (xecfg && !xecfg->find_func) {
struct userdiff_driver *drv;
grep_attr_lock();
- drv = userdiff_find_by_path(name);
+ drv = userdiff_find_by_path(gs->name);
grep_attr_unlock();
if (drv && drv->funcname.pattern) {
const struct userdiff_funcname *pe = &drv->funcname;
@@ -866,33 +866,33 @@ static int match_funcname(struct grep_opt *opt, const char *name, char *bol, cha
return 0;
}

-static void show_funcname_line(struct grep_opt *opt, const char *name,
- char *buf, char *bol, unsigned lno)
+static void show_funcname_line(struct grep_opt *opt, struct grep_source *gs,
+ char *bol, unsigned lno)
{
- while (bol > buf) {
+ while (bol > gs->buf) {
char *eol = --bol;

- while (bol > buf && bol[-1] != '\n')
+ while (bol > gs->buf && bol[-1] != '\n')
bol--;
lno--;

if (lno <= opt->last_shown)
break;

- if (match_funcname(opt, name, bol, eol)) {
- show_line(opt, bol, eol, name, lno, '=');
+ if (match_funcname(opt, gs, bol, eol)) {
+ show_line(opt, bol, eol, gs->name, lno, '=');
break;
}
}
}

-static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
+static void show_pre_context(struct grep_opt *opt, struct grep_source *gs,
char *bol, char *end, unsigned lno)
{
unsigned cur = lno, from = 1, funcname_lno = 0;
int funcname_needed = !!opt->funcname;

- if (opt->funcbody && !match_funcname(opt, name, bol, end))
+ if (opt->funcbody && !match_funcname(opt, gs, bol, end))
funcname_needed = 2;

if (opt->pre_context < lno)
@@ -901,14 +901,14 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
from = opt->last_shown + 1;

/* Rewind. */
- while (bol > buf &&
+ while (bol > gs->buf &&
cur > (funcname_needed == 2 ? opt->last_shown + 1 : from)) {
char *eol = --bol;

- while (bol > buf && bol[-1] != '\n')
+ while (bol > gs->buf && bol[-1] != '\n')
bol--;
cur--;
- if (funcname_needed && match_funcname(opt, name, bol, eol)) {
+ if (funcname_needed && match_funcname(opt, gs, bol, eol)) {
funcname_lno = cur;
funcname_needed = 0;
}
@@ -916,7 +916,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,

/* We need to look even further back to find a function signature. */
if (opt->funcname && funcname_needed)
- show_funcname_line(opt, name, buf, bol, cur);
+ show_funcname_line(opt, gs, bol, cur);

/* Back forward. */
while (cur < lno) {
@@ -924,7 +924,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,

while (*eol != '\n')
eol++;
- show_line(opt, bol, eol, name, cur, sign);
+ show_line(opt, bol, eol, gs->name, cur, sign);
bol = eol + 1;
cur++;
}
@@ -991,11 +991,10 @@ static void std_output(struct grep_opt *opt, const void *buf, size_t size)
fwrite(buf, size, 1, stdout);
}

-static int grep_buffer_1(struct grep_opt *opt, const char *name,
- char *buf, unsigned long size, int collect_hits)
+static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int collect_hits)
{
- char *bol = buf;
- unsigned long left = size;
+ char *bol;
+ unsigned long left;
unsigned lno = 1;
unsigned last_hit = 0;
int binary_match_only = 0;
@@ -1023,13 +1022,16 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
}
opt->last_shown = 0;

+ if (grep_source_load(gs) < 0)
+ return 0;
+
switch (opt->binary) {
case GREP_BINARY_DEFAULT:
- if (buffer_is_binary(buf, size))
+ if (buffer_is_binary(gs->buf, gs->size))
binary_match_only = 1;
break;
case GREP_BINARY_NOMATCH:
- if (buffer_is_binary(buf, size))
+ if (buffer_is_binary(gs->buf, gs->size))
return 0; /* Assume unmatch */
break;
case GREP_BINARY_TEXT:
@@ -1043,6 +1045,8 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,

try_lookahead = should_lookahead(opt);

+ bol = gs->buf;
+ left = gs->size;
while (left) {
char *eol, ch;
int hit;
@@ -1091,14 +1095,14 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
if (opt->status_only)
return 1;
if (opt->name_only) {
- show_name(opt, name);
+ show_name(opt, gs->name);
return 1;
}
if (opt->count)
goto next_line;
if (binary_match_only) {
opt->output(opt, "Binary file ", 12);
- output_color(opt, name, strlen(name),
+ output_color(opt, gs->name, strlen(gs->name),
opt->color_filename);
opt->output(opt, " matches\n", 9);
return 1;
@@ -1107,23 +1111,23 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
* pre-context lines, we would need to show them.
*/
if (opt->pre_context || opt->funcbody)
- show_pre_context(opt, name, buf, bol, eol, lno);
+ show_pre_context(opt, gs, bol, eol, lno);
else if (opt->funcname)
- show_funcname_line(opt, name, buf, bol, lno);
- show_line(opt, bol, eol, name, lno, ':');
+ show_funcname_line(opt, gs, bol, lno);
+ show_line(opt, bol, eol, gs->name, lno, ':');
last_hit = lno;
if (opt->funcbody)
show_function = 1;
goto next_line;
}
- if (show_function && match_funcname(opt, name, bol, eol))
+ if (show_function && match_funcname(opt, gs, bol, eol))
show_function = 0;
if (show_function ||
(last_hit && lno <= last_hit + opt->post_context)) {
/* If the last hit is within the post context,
* we need to show this line.
*/
- show_line(opt, bol, eol, name, lno, '-');
+ show_line(opt, bol, eol, gs->name, lno, '-');
}

next_line:
@@ -1141,7 +1145,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
return 0;
if (opt->unmatch_name_only) {
/* We did not see any hit, so we want to show this */
- show_name(opt, name);
+ show_name(opt, gs->name);
return 1;
}

@@ -1155,7 +1159,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
*/
if (opt->count && count) {
char buf[32];
- output_color(opt, name, strlen(name), opt->color_filename);
+ output_color(opt, gs->name, strlen(gs->name), opt->color_filename);
output_sep(opt, ':');
snprintf(buf, sizeof(buf), "%u\n", count);
opt->output(opt, buf, strlen(buf));
@@ -1190,23 +1194,149 @@ static int chk_hit_marker(struct grep_expr *x)
}
}

-int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+int grep_source(struct grep_opt *opt, struct grep_source *gs)
{
/*
* we do not have to do the two-pass grep when we do not check
* buffer-wide "all-match".
*/
if (!opt->all_match)
- return grep_buffer_1(opt, name, buf, size, 0);
+ return grep_source_1(opt, gs, 0);

/* Otherwise the toplevel "or" terms hit a bit differently.
* We first clear hit markers from them.
*/
clr_hit_marker(opt->pattern_expression);
- grep_buffer_1(opt, name, buf, size, 1);
+ grep_source_1(opt, gs, 1);

if (!chk_hit_marker(opt->pattern_expression))
return 0;

- return grep_buffer_1(opt, name, buf, size, 0);
+ return grep_source_1(opt, gs, 0);
+}
+
+int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+{
+ struct grep_source gs;
+ int r;
+
+ grep_source_init(&gs, GREP_SOURCE_BUF, name, NULL);
+ gs.buf = buf;
+ gs.size = size;
+
+ r = grep_source(opt, &gs);
+
+ grep_source_clear(&gs);
+ return r;
+}
+
+void grep_source_init(struct grep_source *gs, enum grep_source_type type,
+ const char *name, const void *identifier)
+{
+ gs->type = type;
+ gs->name = name ? xstrdup(name) : NULL;
+ gs->buf = NULL;
+ gs->size = 0;
+
+ switch (type) {
+ case GREP_SOURCE_FILE:
+ gs->identifier = xstrdup(identifier);
+ break;
+ case GREP_SOURCE_SHA1:
+ gs->identifier = xmalloc(20);
+ memcpy(gs->identifier, identifier, 20);
+ break;
+ case GREP_SOURCE_BUF:
+ gs->identifier = NULL;
+ }
+}
+
+void grep_source_clear(struct grep_source *gs)
+{
+ free(gs->name);
+ gs->name = NULL;
+ free(gs->identifier);
+ gs->identifier = NULL;
+ grep_source_clear_data(gs);
+}
+
+void grep_source_clear_data(struct grep_source *gs)
+{
+ switch (gs->type) {
+ case GREP_SOURCE_FILE:
+ case GREP_SOURCE_SHA1:
+ free(gs->buf);
+ gs->buf = NULL;
+ gs->size = 0;
+ break;
+ case GREP_SOURCE_BUF:
+ /* leave user-provided buf intact */
+ break;
+ }
+}
+
+static int grep_source_load_sha1(struct grep_source *gs)
+{
+ enum object_type type;
+
+ grep_read_lock();
+ gs->buf = read_sha1_file(gs->identifier, &type, &gs->size);
+ grep_read_unlock();
+
+ if (!gs->buf)
+ return error(_("'%s': unable to read %s"),
+ gs->name,
+ sha1_to_hex(gs->identifier));
+ return 0;
+}
+
+static int grep_source_load_file(struct grep_source *gs)
+{
+ const char *filename = gs->identifier;
+ struct stat st;
+ char *data;
+ size_t size;
+ int i;
+
+ if (lstat(filename, &st) < 0) {
+ err_ret:
+ if (errno != ENOENT)
+ error(_("'%s': %s"), filename, strerror(errno));
+ return -1;
+ }
+ if (!S_ISREG(st.st_mode))
+ return -1;
+ size = xsize_t(st.st_size);
+ i = open(filename, O_RDONLY);
+ if (i < 0)
+ goto err_ret;
+ data = xmalloc(size + 1);
+ if (st.st_size != read_in_full(i, data, size)) {
+ error(_("'%s': short read %s"), filename, strerror(errno));
+ close(i);
+ free(data);
+ return -1;
+ }
+ close(i);
+ data[size] = 0;
+
+ gs->buf = data;
+ gs->size = size;
+ return 0;
+}
+
+int grep_source_load(struct grep_source *gs)
+{
+ if (gs->buf)
+ return 0;
+
+ switch (gs->type) {
+ case GREP_SOURCE_FILE:
+ return grep_source_load_file(gs);
+ case GREP_SOURCE_SHA1:
+ return grep_source_load_sha1(gs);
+ case GREP_SOURCE_BUF:
+ return gs->buf ? 0 : -1;
+ }
+ die("BUG: invalid grep_source type");
}
diff --git a/grep.h b/grep.h
index 4f1b025..e386ca4 100644
--- a/grep.h
+++ b/grep.h
@@ -129,6 +129,28 @@ extern void compile_grep_patterns(struct grep_opt *opt);
extern void free_grep_patterns(struct grep_opt *opt);
extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);

+struct grep_source {
+ char *name;
+
+ enum grep_source_type {
+ GREP_SOURCE_SHA1,
+ GREP_SOURCE_FILE,
+ GREP_SOURCE_BUF,
+ } type;
+ void *identifier;
+
+ char *buf;
+ unsigned long size;
+};
+
+void grep_source_init(struct grep_source *gs, enum grep_source_type type,
+ const char *name, const void *identifier);
+int grep_source_load(struct grep_source *gs);
+void grep_source_clear_data(struct grep_source *gs);
+void grep_source_clear(struct grep_source *gs);
+
+int grep_source(struct grep_opt *opt, struct grep_source *gs);
+
extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
extern int grep_threads_ok(const struct grep_opt *opt);
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:19:37 UTC
Permalink
The grep_source interface (as opposed to grep_buffer) will
eventually gives us a richer interface for telling the
low-level grep code about our buffers. Eventually this will
lead to things like better binary-file handling. For now, it
lets us drop a lot of now-redundant code.

The conversion is mostly straight-forward. One thing to note
is that the memory ownership rules for "struct grep_source"
are different than the "struct work_item" found here (the
former will copy things like the filename, rather than
taking ownership). Therefore you will also see some slight
tweaking of when filename buffers are released.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/grep.c | 142 +++++++++-----------------------------------------------
1 files changed, 23 insertions(+), 119 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index f4402fa..bc85a20 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -29,25 +29,12 @@ static int use_threads = 1;
#define THREADS 8
static pthread_t threads[THREADS];

-static void *load_sha1(const unsigned char *sha1, unsigned long *size,
- const char *name);
-static void *load_file(const char *filename, size_t *sz);
-
-enum work_type {WORK_SHA1, WORK_FILE};
-
/* We use one producer thread and THREADS consumer
* threads. The producer adds struct work_items to 'todo' and the
* consumers pick work items from the same array.
*/
struct work_item {
- enum work_type type;
- char *name;
-
- /* if type == WORK_SHA1, then 'identifier' is a SHA1,
- * otherwise type == WORK_FILE, and 'identifier' is a NUL
- * terminated filename.
- */
- void *identifier;
+ struct grep_source source;
char done;
struct strbuf out;
};
@@ -98,7 +85,8 @@ static pthread_cond_t cond_result;

static int skip_first_line;

-static void add_work(enum work_type type, char *name, void *id)
+static void add_work(enum grep_source_type type, const char *name,
+ const void *id)
{
grep_lock();

@@ -106,9 +94,7 @@ static void add_work(enum work_type type, char *name, void *id)
pthread_cond_wait(&cond_write, &grep_mutex);
}

- todo[todo_end].type = type;
- todo[todo_end].name = name;
- todo[todo_end].identifier = id;
+ grep_source_init(&todo[todo_end].source, type, name, id);
todo[todo_end].done = 0;
strbuf_reset(&todo[todo_end].out);
todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
@@ -136,21 +122,6 @@ static struct work_item *get_work(void)
return ret;
}

-static void grep_sha1_async(struct grep_opt *opt, char *name,
- const unsigned char *sha1)
-{
- unsigned char *s;
- s = xmalloc(20);
- memcpy(s, sha1, 20);
- add_work(WORK_SHA1, name, s);
-}
-
-static void grep_file_async(struct grep_opt *opt, char *name,
- const char *filename)
-{
- add_work(WORK_FILE, name, xstrdup(filename));
-}
-
static void work_done(struct work_item *w)
{
int old_done;
@@ -177,8 +148,7 @@ static void work_done(struct work_item *w)

write_or_die(1, p, len);
}
- free(w->name);
- free(w->identifier);
+ grep_source_clear(&w->source);
}

if (old_done != todo_done)
@@ -201,25 +171,8 @@ static void *run(void *arg)
break;

opt->output_priv = w;
- if (w->type == WORK_SHA1) {
- unsigned long sz;
- void* data = load_sha1(w->identifier, &sz, w->name);
-
- if (data) {
- hit |= grep_buffer(opt, w->name, data, sz);
- free(data);
- }
- } else if (w->type == WORK_FILE) {
- size_t sz;
- void* data = load_file(w->identifier, &sz);
- if (data) {
- hit |= grep_buffer(opt, w->name, data, sz);
- free(data);
- }
- } else {
- assert(0);
- }
-
+ hit |= grep_source(opt, &w->source);
+ grep_source_clear_data(&w->source);
work_done(w);
}
free_grep_patterns(arg);
@@ -365,23 +318,10 @@ static void *lock_and_read_sha1_file(const unsigned char *sha1, enum object_type
return data;
}

-static void *load_sha1(const unsigned char *sha1, unsigned long *size,
- const char *name)
-{
- enum object_type type;
- void *data = lock_and_read_sha1_file(sha1, &type, size);
-
- if (!data)
- error(_("'%s': unable to read %s"), name, sha1_to_hex(sha1));
-
- return data;
-}
-
static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
const char *filename, int tree_name_len)
{
struct strbuf pathbuf = STRBUF_INIT;
- char *name;

if (opt->relative && opt->prefix_length) {
quote_path_relative(filename + tree_name_len, -1, &pathbuf,
@@ -391,87 +331,51 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
strbuf_addstr(&pathbuf, filename);
}

- name = strbuf_detach(&pathbuf, NULL);
-
#ifndef NO_PTHREADS
if (use_threads) {
- grep_sha1_async(opt, name, sha1);
+ add_work(GREP_SOURCE_SHA1, pathbuf.buf, sha1);
+ strbuf_release(&pathbuf);
return 0;
} else
#endif
{
+ struct grep_source gs;
int hit;
- unsigned long sz;
- void *data = load_sha1(sha1, &sz, name);
- if (!data)
- hit = 0;
- else
- hit = grep_buffer(opt, name, data, sz);

- free(data);
- free(name);
- return hit;
- }
-}
+ grep_source_init(&gs, GREP_SOURCE_SHA1, pathbuf.buf, sha1);
+ strbuf_release(&pathbuf);
+ hit = grep_source(opt, &gs);

-static void *load_file(const char *filename, size_t *sz)
-{
- struct stat st;
- char *data;
- int i;
-
- if (lstat(filename, &st) < 0) {
- err_ret:
- if (errno != ENOENT)
- error(_("'%s': %s"), filename, strerror(errno));
- return NULL;
- }
- if (!S_ISREG(st.st_mode))
- return NULL;
- *sz = xsize_t(st.st_size);
- i = open(filename, O_RDONLY);
- if (i < 0)
- goto err_ret;
- data = xmalloc(*sz + 1);
- if (st.st_size != read_in_full(i, data, *sz)) {
- error(_("'%s': short read %s"), filename, strerror(errno));
- close(i);
- free(data);
- return NULL;
+ grep_source_clear(&gs);
+ return hit;
}
- close(i);
- data[*sz] = 0;
- return data;
}

static int grep_file(struct grep_opt *opt, const char *filename)
{
struct strbuf buf = STRBUF_INIT;
- char *name;

if (opt->relative && opt->prefix_length)
quote_path_relative(filename, -1, &buf, opt->prefix);
else
strbuf_addstr(&buf, filename);
- name = strbuf_detach(&buf, NULL);

#ifndef NO_PTHREADS
if (use_threads) {
- grep_file_async(opt, name, filename);
+ add_work(GREP_SOURCE_FILE, buf.buf, filename);
+ strbuf_release(&buf);
return 0;
} else
#endif
{
+ struct grep_source gs;
int hit;
- size_t sz;
- void *data = load_file(filename, &sz);
- if (!data)
- hit = 0;
- else
- hit = grep_buffer(opt, name, data, sz);

- free(data);
- free(name);
+ grep_source_init(&gs, GREP_SOURCE_FILE, buf.buf, filename);
+ strbuf_release(&buf);
+ hit = grep_source(opt, &gs);
+
+ grep_source_clear(&gs);
return hit;
}
}
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:20:10 UTC
Permalink
Before the grep_source interface existed, grep_buffer was
used by two types of callers:

1. Ones which pulled a file into a buffer, and then wanted
to supply the file's name for the output (i.e.,
git grep).

2. Ones which really just wanted to grep a buffer (i.e.,
git log --grep).

Callers in set (1) should now be using grep_source. Callers
in set (2) always pass NULL for the "name" parameter of
grep_buffer. We can therefore get rid of this now-useless
parameter.

Signed-off-by: Jeff King <***@peff.net>
---
This one isn't necessary, obviously, but I think it's a nice clean-up
after the last two patches.

grep.c | 4 ++--
grep.h | 2 +-
revision.c | 1 -
3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/grep.c b/grep.c
index 8204ca2..2a3fe7c 100644
--- a/grep.c
+++ b/grep.c
@@ -1215,12 +1215,12 @@ int grep_source(struct grep_opt *opt, struct grep_source *gs)
return grep_source_1(opt, gs, 0);
}

-int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size)
+int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size)
{
struct grep_source gs;
int r;

- grep_source_init(&gs, GREP_SOURCE_BUF, name, NULL);
+ grep_source_init(&gs, GREP_SOURCE_BUF, NULL, NULL);
gs.buf = buf;
gs.size = size;

diff --git a/grep.h b/grep.h
index e386ca4..8bf3001 100644
--- a/grep.h
+++ b/grep.h
@@ -127,7 +127,7 @@ extern void append_grep_pattern(struct grep_opt *opt, const char *pat, const cha
extern void append_header_grep_pattern(struct grep_opt *, enum grep_header_field, const char *);
extern void compile_grep_patterns(struct grep_opt *opt);
extern void free_grep_patterns(struct grep_opt *opt);
-extern int grep_buffer(struct grep_opt *opt, const char *name, char *buf, unsigned long size);
+extern int grep_buffer(struct grep_opt *opt, char *buf, unsigned long size);

struct grep_source {
char *name;
diff --git a/revision.c b/revision.c
index c97d834..819ff01 100644
--- a/revision.c
+++ b/revision.c
@@ -2149,7 +2149,6 @@ static int commit_match(struct commit *commit, struct rev_info *opt)
if (!opt->grep_filter.pattern_list && !opt->grep_filter.header_list)
return 1;
return grep_buffer(&opt->grep_filter,
- NULL, /* we say nothing, not even filename */
commit->buffer, strlen(commit->buffer));
}
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:20:43 UTC
Permalink
Right now, grep only uses the userdiff_driver for one thing:
looking up funcname patterns for "-p" and "-W". As new uses
for userdiff drivers are added to the grep code, we want to
minimize attribute lookups, which can be expensive.

It might seem at first that this would also optimize multiple
lookups when the funcname pattern for a file is needed
multiple times. However, the compiled funcname pattern is
already cached in struct grep_opt's "priv" member, so
multiple lookups are already suppressed.

Signed-off-by: Jeff King <***@peff.net>
---
grep.c | 22 ++++++++++++++++------
grep.h | 4 ++++
2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/grep.c b/grep.c
index 2a3fe7c..bb18569 100644
--- a/grep.c
+++ b/grep.c
@@ -841,12 +841,9 @@ static int match_funcname(struct grep_opt *opt, struct grep_source *gs, char *bo
{
xdemitconf_t *xecfg = opt->priv;
if (xecfg && !xecfg->find_func) {
- struct userdiff_driver *drv;
- grep_attr_lock();
- drv = userdiff_find_by_path(gs->name);
- grep_attr_unlock();
- if (drv && drv->funcname.pattern) {
- const struct userdiff_funcname *pe = &drv->funcname;
+ grep_source_load_driver(gs);
+ if (gs->driver->funcname.pattern) {
+ const struct userdiff_funcname *pe = &gs->driver->funcname;
xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
} else {
xecfg = opt->priv = NULL;
@@ -1237,6 +1234,7 @@ void grep_source_init(struct grep_source *gs, enum grep_source_type type,
gs->name = name ? xstrdup(name) : NULL;
gs->buf = NULL;
gs->size = 0;
+ gs->driver = NULL;

switch (type) {
case GREP_SOURCE_FILE:
@@ -1340,3 +1338,15 @@ int grep_source_load(struct grep_source *gs)
}
die("BUG: invalid grep_source type");
}
+
+void grep_source_load_driver(struct grep_source *gs)
+{
+ if (gs->driver)
+ return;
+
+ grep_attr_lock();
+ gs->driver = userdiff_find_by_path(gs->name);
+ if (!gs->driver)
+ gs->driver = userdiff_find_by_name("default");
+ grep_attr_unlock();
+}
diff --git a/grep.h b/grep.h
index 8bf3001..73b28c2 100644
--- a/grep.h
+++ b/grep.h
@@ -9,6 +9,7 @@ typedef int pcre_extra;
#endif
#include "kwset.h"
#include "thread-utils.h"
+#include "userdiff.h"

enum grep_pat_token {
GREP_PATTERN,
@@ -141,6 +142,8 @@ struct grep_source {

char *buf;
unsigned long size;
+
+ struct userdiff_driver *driver;
};

void grep_source_init(struct grep_source *gs, enum grep_source_type type,
@@ -148,6 +151,7 @@ void grep_source_init(struct grep_source *gs, enum grep_source_type type,
int grep_source_load(struct grep_source *gs);
void grep_source_clear_data(struct grep_source *gs);
void grep_source_clear(struct grep_source *gs);
+void grep_source_load_driver(struct grep_source *gs);

int grep_source(struct grep_opt *opt, struct grep_source *gs);
--
1.7.9.3.gc3fce1.dirty
Junio C Hamano
2012-02-02 18:34:07 UTC
Permalink
Post by Jeff King
- grep_attr_lock();
- drv = userdiff_find_by_path(gs->name);
- grep_attr_unlock();
- if (drv && drv->funcname.pattern) {
- const struct userdiff_funcname *pe = &drv->funcname;
+ grep_source_load_driver(gs);
+ if (gs->driver->funcname.pattern) {
+ const struct userdiff_funcname *pe = &gs->driver->funcname;
When we load driver, gs->driver gets at least "default" driver, so we no
longer need to check for drv != NULL as we used to? Is that the reason
for the slight difference here?
Post by Jeff King
@@ -1237,6 +1234,7 @@ void grep_source_init(struct grep_source *gs, enum grep_source_type type,
gs->name = name ? xstrdup(name) : NULL;
gs->buf = NULL;
gs->size = 0;
+ gs->driver = NULL;
switch (type) {
Jeff King
2012-02-02 19:37:56 UTC
Permalink
Post by Junio C Hamano
Post by Jeff King
- grep_attr_lock();
- drv = userdiff_find_by_path(gs->name);
- grep_attr_unlock();
- if (drv && drv->funcname.pattern) {
- const struct userdiff_funcname *pe = &drv->funcname;
+ grep_source_load_driver(gs);
+ if (gs->driver->funcname.pattern) {
+ const struct userdiff_funcname *pe = &gs->driver->funcname;
When we load driver, gs->driver gets at least "default" driver, so we no
longer need to check for drv != NULL as we used to? Is that the reason
for the slight difference here?
Yes, exactly.

We could just leave gs->driver NULL instead of looking up "default", and
then use NULL to signal to the calling code that defaults should be
used. But NULL is interpreted by grep_source_load_driver as "we did not
look up the driver yet", so the common case of "no driver" would mean we
accidentally do the lookup multiple times. The diff_filespec code uses
the same convention to solve the same problem.

Speaking of which, there was some notion in my mind that a "grep_source"
and a "diff_filespec" were very similar objects, and that we could
possibly unify the implementations. I decided against that route with
this series, as it would have involved pretty heavy refactoring of the
diff code to prevent a fairly small amount of code duplication.

-Peff
Jeff King
2012-02-02 08:21:02 UTC
Permalink
There is currently no way for users to tell git-grep that a
particular path is or is not a binary file; instead, grep
always relies on its auto-detection (or the user specifying
"-a" to treat all binary-looking files like text).

This patch teaches git-grep to use the same attribute lookup
that is used by git-diff. We could add a new "grep" flag,
but that is unnecessarily complex and unlikely to be useful.
Despite the name, the "-diff" attribute (or "diff=foo" and
the associated diff.foo.binary config option) are really
about describing the contents of the path. It's simply
historical that diff was the only thing that cared about
these attributes in the past.

And if this simple approach turns out to be insufficient, we
still have a backwards-compatible path forward: we can add a
separate "grep" attribute, and fall back to respecting
"diff" if it is unset.

Signed-off-by: Jeff King <***@peff.net>
---
grep.c | 16 ++++++++++++++--
grep.h | 1 +
t/t7008-grep-binary.sh | 24 ++++++++++++++++++++++++
3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/grep.c b/grep.c
index bb18569..a50d161 100644
--- a/grep.c
+++ b/grep.c
@@ -1024,11 +1024,11 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle

switch (opt->binary) {
case GREP_BINARY_DEFAULT:
- if (buffer_is_binary(gs->buf, gs->size))
+ if (grep_source_is_binary(gs))
binary_match_only = 1;
break;
case GREP_BINARY_NOMATCH:
- if (buffer_is_binary(gs->buf, gs->size))
+ if (grep_source_is_binary(gs))
return 0; /* Assume unmatch */
break;
case GREP_BINARY_TEXT:
@@ -1350,3 +1350,15 @@ void grep_source_load_driver(struct grep_source *gs)
gs->driver = userdiff_find_by_name("default");
grep_attr_unlock();
}
+
+int grep_source_is_binary(struct grep_source *gs)
+{
+ grep_source_load_driver(gs);
+ if (gs->driver->binary != -1)
+ return gs->driver->binary;
+
+ if (!grep_source_load(gs))
+ return buffer_is_binary(gs->buf, gs->size);
+
+ return 0;
+}
diff --git a/grep.h b/grep.h
index 73b28c2..36e49d8 100644
--- a/grep.h
+++ b/grep.h
@@ -152,6 +152,7 @@ int grep_source_load(struct grep_source *gs);
void grep_source_clear_data(struct grep_source *gs);
void grep_source_clear(struct grep_source *gs);
void grep_source_load_driver(struct grep_source *gs);
+int grep_source_is_binary(struct grep_source *gs);

int grep_source(struct grep_opt *opt, struct grep_source *gs);

diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..fd6410f 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,28 @@ test_expect_success 'git grep y<NUL>x a' "
test_must_fail git grep -f f a
"

+test_expect_success 'grep respects binary diff attribute' '
+ echo text >t &&
+ git add t &&
+ echo t:text >expect &&
+ git grep text t >actual &&
+ test_cmp expect actual &&
+ echo "t -diff" >.gitattributes &&
+ echo "Binary file t matches" >expect &&
+ git grep text t >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'grep respects not-binary diff attribute' '
+ echo binQary | q_to_nul >b &&
+ git add b &&
+ echo "Binary file b matches" >expect &&
+ git grep bin b >actual &&
+ test_cmp expect actual &&
+ echo "b diff" >.gitattributes &&
+ echo "b:binQary" >expect &&
+ git grep bin b | nul_to_q >actual &&
+ test_cmp expect actual
+'
+
test_done
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:21:11 UTC
Permalink
Usually we load each file to grep into memory, check whether
it's binary, and then either grep it (the default) or not
(if "-I" was given).

In the "-I" case, we can skip loading the file entirely if
it is marked as binary via gitattributes. On my giant
3-gigabyte media repository, doing "git grep -I foo" went
from:

real 0m0.712s
user 0m0.044s
sys 0m4.780s

to:

real 0m0.026s
user 0m0.016s
sys 0m0.020s

Obviously this is an extreme example. The repo is almost
entirely binary files, and you can see that we spent all of
our time asking the kernel to read() the data. However, with
a cold disk cache, even avoiding a few binary files can have
an impact.

Signed-off-by: Jeff King <***@peff.net>
---
grep.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index a50d161..3821400 100644
--- a/grep.c
+++ b/grep.c
@@ -1019,9 +1019,6 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle
}
opt->last_shown = 0;

- if (grep_source_load(gs) < 0)
- return 0;
-
switch (opt->binary) {
case GREP_BINARY_DEFAULT:
if (grep_source_is_binary(gs))
@@ -1042,6 +1039,9 @@ static int grep_source_1(struct grep_opt *opt, struct grep_source *gs, int colle

try_lookahead = should_lookahead(opt);

+ if (grep_source_load(gs) < 0)
+ return 0;
+
bol = gs->buf;
left = gs->size;
while (left) {
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:24:28 UTC
Permalink
The low-level grep_source code will automatically load the
userdiff driver to see whether a file is binary. However,
when we are threaded, it will load the drivers in a
non-deterministic order, handling each one as its assigned
thread happens to be scheduled.

Meanwhile, the attribute lookup code (which underlies the
userdiff driver lookup) is optimized to handle paths in
sequential order (because they tend to share the same
gitattributes files). Multi-threading the lookups destroys
the locality and makes this optimization less effective.

We can fix this by pre-loading the userdiff driver in the
main thread, before we hand off the file to a worker thread.
My best-of-five for "git grep foo" on the linux-2.6
repository went from:

real 0m0.391s
user 0m1.708s
sys 0m0.584s

to:

real 0m0.360s
user 0m1.576s
sys 0m0.572s

Not a huge speedup, but it's quite easy to do. The only
trick is that we shouldn't perform this optimization if "-a"
was used, in which case we won't bother checking whether
the files are binary at all.

Signed-off-by: Jeff King <***@peff.net>
---
The speedup is especially unimpressive when you consider that it won't
grow as the grep load grows. This is a pretty fast grep. If you used a
real regex, the whole thing would take even longer, and you will still
only be shaving off a few tens of milliseconds. So I wouldn't be
heart-broken if this patch was dropped. I included it because it's easy
to do, and maybe somebody with a slower machine would find the absolute
time difference more noticeable.

builtin/grep.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index bc85a20..9fc3e95 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -85,8 +85,8 @@ static pthread_cond_t cond_result;

static int skip_first_line;

-static void add_work(enum grep_source_type type, const char *name,
- const void *id)
+static void add_work(struct grep_opt *opt, enum grep_source_type type,
+ const char *name, const void *id)
{
grep_lock();

@@ -95,6 +95,8 @@ static void add_work(enum grep_source_type type, const char *name,
}

grep_source_init(&todo[todo_end].source, type, name, id);
+ if (opt->binary != GREP_BINARY_TEXT)
+ grep_source_load_driver(&todo[todo_end].source);
todo[todo_end].done = 0;
strbuf_reset(&todo[todo_end].out);
todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
@@ -333,7 +335,7 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,

#ifndef NO_PTHREADS
if (use_threads) {
- add_work(GREP_SOURCE_SHA1, pathbuf.buf, sha1);
+ add_work(opt, GREP_SOURCE_SHA1, pathbuf.buf, sha1);
strbuf_release(&pathbuf);
return 0;
} else
@@ -362,7 +364,7 @@ static int grep_file(struct grep_opt *opt, const char *filename)

#ifndef NO_PTHREADS
if (use_threads) {
- add_work(GREP_SOURCE_FILE, buf.buf, filename);
+ add_work(opt, GREP_SOURCE_FILE, buf.buf, filename);
strbuf_release(&buf);
return 0;
} else
--
1.7.9.3.gc3fce1.dirty
Jeff King
2012-02-02 08:30:09 UTC
Permalink
Post by Jeff King
I implemented all of the other optimizations I mentioned except the
"only stream the first few bytes when auto-detecting binary-ness" one.
However, it should be easy to do on top of these changes. I need to
re-visit the similar change to diff_filespec_is_binary, and I'll do both
at the same time.
Oh, and I didn't even think about implementing streaming grep. The
context-finding code relies on being able to backtrack through the file
in memory. We _could_ implement streaming only for binary files (i.e.,
when we will just print "Binary file foo matches"). However, I suspect
people with big binary files will want to be using "-I" anyway, so as to
avoid even pulling the data from disk at all.

We might eventually want to add a config-option version of "-I" for
people who have repositories of mixed source code and large binary
assets.

-Peff
Thomas Rast
2012-02-02 11:00:47 UTC
Permalink
Post by Jeff King
[+cc Thomas, as I am mangling some of his recent work with my
refactoring]
Mangling? I think it all looks very good.

My original plan was to make use_threads git-global, instead of
grep-global (and shift responsibility to the subsystems instead of their
users), but that's just me and the patches aren't ready yet.
--
Thomas Rast
trast@{inf,student}.ethz.ch
Jeff King
2012-02-02 11:07:19 UTC
Permalink
Post by Thomas Rast
My original plan was to make use_threads git-global, instead of
grep-global (and shift responsibility to the subsystems instead of their
users), but that's just me and the patches aren't ready yet.
Yeah, having just dug into the threading code in grep a bit, I agree
that would be a saner approach. The locking is all bolted-on, so you end
up with these weird contracts between code, like the low-level grep code
asking anybody who might be multi-threading it to initialize the mutexes
to cover access to a totally different subsystem. I'd much rather each
subsystem just take care of itself.

-Peff
Junio C Hamano
2012-02-02 18:39:24 UTC
Permalink
... The result turned out much easier to read
(and explain in the commit messages, as it was simple to break into
smaller commits)....
Indeed the series is very nicely done ;-)

Thanks.
Pete Wyckoff
2012-02-04 19:22:52 UTC
Permalink
I took a look at this series. It's nice. My worry was that the
extra open() of non-existent .gitattributes files in all the
directories would cause performance problems across networked
filesystems like NFS.

My usual (non-public) repository has order:

100k files
10k directories

and no files marked as binary. The grep string is such that it
is disk-bound, and not expected to match in any file (or binary):
"time ~/src/git/bin-wrappers/git grep unfindable-string".

With your change, there are 10k new open() calls looking for
.gitattributes in each directory, all of which return ENOENT.
This turns out to have an insignificant impact on performance due
to the much bigger time sink of stat()-ing all the files.

I think this happens to be true because the gitattributes lookups
run in parallel to all the file stat work, as the main thread
dispatches file work while doing its own gitattributes lookups.

It could be plausible that deep directory structures with few
grep-able files will suffer with this change. For example, many
big binary blobs in deep directory hierarchies, but also some
useful files here and there.

One could argue that with the use of .gitattributes to specify
which blobs should not be searched, this series makes this faster
by not having to to read the binary blobs at all. And I'd be
okay with that.

Just FYI that there may be a performance impact on certain
repositories.

-- Pete
Jeff King
2012-02-04 23:18:25 UTC
Permalink
Post by Pete Wyckoff
I took a look at this series. It's nice. My worry was that the
extra open() of non-existent .gitattributes files in all the
directories would cause performance problems across networked
filesystems like NFS.
Yeah, I was able to measure a small slow-down on a quick grep even with
a warm cache. So it does take some extra effort, but I think the
correctness is worth it (and note that the slow down is in the tens of
milliseconds if you have a reasonable stat()).

If people have big trees on NFS (or some other slow-stat system) where
these lookups are actually a problem, I'd rather see a global option to
disable .gitattributes lookups for both diff and grep (i.e., a "trust
me, I'm not using gitattributes, and don't bother with stat" flag). In
practice, though, I think such a thing is not necessary because the
stat() is local to the file being examined (e.g., for "foo/bar/baz", we
look only at "foo/bar/.gitattributes", "foo/.gitattributes", and
".gitattributes", without having to touch other parts of the tree).

Anyway, thanks for doing some performance testing. More data is always
good.
Post by Pete Wyckoff
It could be plausible that deep directory structures with few
grep-able files will suffer with this change. For example, many
big binary blobs in deep directory hierarchies, but also some
useful files here and there.
One could argue that with the use of .gitattributes to specify
which blobs should not be searched, this series makes this faster
by not having to to read the binary blobs at all. And I'd be
okay with that.
Yes, exactly. I think this will end up being a big win for such cases,
because the cost of loading even one large binary file from disk will
dwarf all of the stats. But it does depend on people marking their
binaries and using "-I".

-Peff

Jeff King
2012-02-01 23:21:29 UTC
Permalink
There is currently no way for users to tell git-grep that a
particular path is or is not a binary file; instead, grep
always relies on its auto-detection (or the user specifying
"-a" to treat all binary-looking files like text).

This patch teaches git-grep to use the same attribute lookup
that is used by git-diff. We could add a new "grep" flag,
but that is unnecessarily complex and unlikely to be useful.
Despite the name, the "-diff" attribute (or "diff=foo" and
the associated diff.foo.binary config option) are really
about describing the contents of the path. It's simply
historical that diff was the only thing that cared about
these attributes in the past.

And if this simple approach turns out to be insufficient, we
still have a backwards-compatible path forward: we can add a
separate "grep" attribute, and fall back to respecting
"diff" if it is unset.

There are a few things worth nothing about the
implementation.

One is that the attribute lookup happens outside of the
grep.[ch] interface (i.e., outside of grep_buffer). We could
do it at a lower level, which would be slightly more
convenient for callers. However, this interacts badly with
threading. The attribute-lookup code performs best when
lookup order matches the filesystem order (so looking up
"a/b/c" is cheaper if we just looked up "a/b/d" than if we
just did "x/y/z"). Because we issue many simultaneous
requests to grep_buffer, performing the attribute lookup at
that level will cause requests from unrelated paths to
interleave, and we lose the locality that makes the lookup
optimization work.

Instead, in the threaded case we check the attributes as
they are added to the work queue, meaning they are looked up
in the optimal order (in the single threaded case, this is a
non-issue, as we process the files serially in the optimal
order).

Here are a few numbers showing the difference. The first is
a best-of-five time for "git grep foo" on the linux-2.6 repo
before this patch (on a 4-core HT processor, using 8
threads):

real 0m0.306s
user 0m1.512s
sys 0m0.412s

Now here's the time for the same operation with a trial
implementation looking up attributes in grep_buffer,
showing a 31% slowdown:

real 0m0.401s
user 0m1.760s
sys 0m0.636s

And here's the same operation with this patch, with only an
11% slowdown:

real 0m0.339s
user 0m1.444s
sys 0m0.584s

Note that while the percentages are big, the absolute
numbers are pretty small. In particular, this is a very
inexpensive grep to do. A more complex regex should have the
same absolute slowdown from the attribute lookup, but it
would be a much smaller percentage of the total processing
time. So doing it this way is not a huge win, but it does
help on small greps.

The second issue worth noting is that while we do a full
attribute lookup, we pass along only the binary flag to
grep_buffer. When the "-p" flag is given to grep, we will
actually look up the same attributes to find funcname
patterns of matches. We could pass along the pointer to the
userdiff driver for reuse.

However, it's not worth doing this. It clutters the code, as
the driver has to be passed through a large number of helper
functions (and pollutes the grep_buffer interface with
userdiff code). And in my tests, it didn't actually improve
performance. Because we only have to look up the attribute
for a grep hit, in most cases we will only do the funcname
lookup for a small subset of files. The cost of the extra
lookups turns out to be negligible.

Signed-off-by: Jeff King <***@peff.net>
---
builtin/grep.c | 26 ++++++++++++++++++++++----
t/t7008-grep-binary.sh | 24 ++++++++++++++++++++++++
2 files changed, 46 insertions(+), 4 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index e328316..bb38804 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -42,6 +42,7 @@ enum work_type {WORK_SHA1, WORK_FILE};
struct work_item {
enum work_type type;
char *name;
+ int is_binary;

/* if type == WORK_SHA1, then 'identifier' is a SHA1,
* otherwise type == WORK_FILE, and 'identifier' is a NUL
@@ -113,6 +114,14 @@ static pthread_cond_t cond_result;

static int skip_first_line;

+static int path_is_binary(const char *path)
+{
+ struct userdiff_driver *drv = userdiff_find_by_path(path);
+ if (drv)
+ return drv->binary;
+ return -1;
+}
+
static void add_work(enum work_type type, char *name, void *id)
{
grep_lock();
@@ -123,6 +132,11 @@ static void add_work(enum work_type type, char *name, void *id)

todo[todo_end].type = type;
todo[todo_end].name = name;
+
+ pthread_mutex_lock(&grep_attr_mutex);
+ todo[todo_end].is_binary = path_is_binary(name);
+ pthread_mutex_unlock(&grep_attr_mutex);
+
todo[todo_end].identifier = id;
todo[todo_end].done = 0;
strbuf_reset(&todo[todo_end].out);
@@ -221,14 +235,16 @@ static void *run(void *arg)
void* data = load_sha1(w->identifier, &sz, w->name);

if (data) {
- hit |= grep_buffer(opt, w->name, -1, data, sz);
+ hit |= grep_buffer(opt, w->name, w->is_binary,
+ data, sz);
free(data);
}
} else if (w->type == WORK_FILE) {
size_t sz;
void* data = load_file(w->identifier, &sz);
if (data) {
- hit |= grep_buffer(opt, w->name, -1, data, sz);
+ hit |= grep_buffer(opt, w->name, w->is_binary,
+ data, sz);
free(data);
}
} else {
@@ -421,7 +437,8 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
if (!data)
hit = 0;
else
- hit = grep_buffer(opt, name, -1, data, sz);
+ hit = grep_buffer(opt, name, path_is_binary(name),
+ data, sz);

free(data);
free(name);
@@ -483,7 +500,8 @@ static int grep_file(struct grep_opt *opt, const char *filename)
if (!data)
hit = 0;
else
- hit = grep_buffer(opt, name, -1, data, sz);
+ hit = grep_buffer(opt, name, path_is_binary(name),
+ data, sz);

free(data);
free(name);
diff --git a/t/t7008-grep-binary.sh b/t/t7008-grep-binary.sh
index 917a264..fd6410f 100755
--- a/t/t7008-grep-binary.sh
+++ b/t/t7008-grep-binary.sh
@@ -99,4 +99,28 @@ test_expect_success 'git grep y<NUL>x a' "
test_must_fail git grep -f f a
"

+test_expect_success 'grep respects binary diff attribute' '
+ echo text >t &&
+ git add t &&
+ echo t:text >expect &&
+ git grep text t >actual &&
+ test_cmp expect actual &&
+ echo "t -diff" >.gitattributes &&
+ echo "Binary file t matches" >expect &&
+ git grep text t >actual &&
+ test_cmp expect actual
+'
+
+test_expect_success 'grep respects not-binary diff attribute' '
+ echo binQary | q_to_nul >b &&
+ git add b &&
+ echo "Binary file b matches" >expect &&
+ git grep bin b >actual &&
+ test_cmp expect actual &&
+ echo "b diff" >.gitattributes &&
+ echo "b:binQary" >expect &&
+ git grep bin b | nul_to_q >actual &&
+ test_cmp expect actual
+'
+
test_done
--
1.7.9.3.gc3fce1.dirty
Junio C Hamano
2012-02-01 16:28:46 UTC
Permalink
Post by Jeff King
Part of the problem, I suspect, is that the attribute lookup code is
optimized for locality. We only unwind as much of the stack as we need,
so looking at "foo/bar/baz.c" after "foo/bar/bleep.c" is much cheaper
than looking at "some/other/directory.c". But with threaded grep, that
locality is likely lost, as we are mixing up attribute requests from
different threads.
Given that binary lookup means we need every file's gitattribute, it
might be better to look them up serially at the beginning of the
program, and then pass the resulting userdiff driver to grep_buffer
along with each path.
Yeah, that was my impression when the performance of threaded grep was
discussed, which was before this "let's honor binary attribute".
Continue reading on narkive:
Loading...