GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep;

Reported by: Vincent Lefevre <vincent <at> vinc17.net>

Date: Fri, 12 Sep 2014 01:26:02 UTC

Severity: normal

Done: Paul Eggert <eggert <at> cs.ucla.edu>

Bug is archived. No further changes may be made.



Report forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 12 Sep 2014 01:26:02 GMT)

Acknowledgement sent to Vincent Lefevre <vincent <at> vinc17.net>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 12 Sep 2014 01:26:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: bug-grep <at> gnu.org
Subject: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 12 Sep 2014 03:24:49 +0200
With the patch that fixes bug 18266, grep -P works again on binary
files (with invalid UTF-8 sequences), but it is now significantly
slower than old versions (which could yield undefined behavior).

Timings with the Debian packages on my personal svn working copy
(binary + text files):

2.18-2   0.9s with -P, 0.4s without -P
2.20-3  11.6s with -P, 0.4s without -P

In this example, that's a 13x slowdown! Though the performance issue
would be better fixed in libpcre3, I suppose that fix is not so simple
and won't happen any time soon. Things could be done in grep:

1. Ignore -P when the pattern would have the same meaning without -P
   (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
   at least for the simplest cases).

2. Call PCRE in the C locale when this is equivalent.

3. Transform invalid bytes to null bytes in-place before the PCRE
   call. This changes the current semantics, but:
   * the semantics of invalid bytes have never been specified, AFAIK;
   * the best *practical* behavior may not be the current one
     (I personally prefer to be able to match invalid bytes, just
     like one can match top-bit-set characters in the C locale, and
     seeing such invalid bytes as equivalent to null bytes would
     not be a problem for most users, IMHO -- this could also be
     made configurable).
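
To make idea 3 concrete, here is a minimal C sketch, assuming the
modern RFC 3629 notion of validity (stricter than the old utf-8(7)
table quoted later in this thread); utf8_seq_len and nul_out_invalid
are illustrative names, not code that grep ships:

#include <stddef.h>

/* Length of the valid UTF-8 sequence starting at S (at most N bytes),
   or 0 if S[0] starts no valid sequence.  Rejects overlong forms,
   surrogates, and code points above U+10FFFF.  */
static size_t
utf8_seq_len (unsigned char const *s, size_t n)
{
  size_t len, i;
  if (n == 0)
    return 0;
  if (s[0] < 0x80)
    return 1;                                    /* ASCII */
  if (s[0] < 0xC2)
    return 0;                                    /* continuation or overlong */
  len = s[0] < 0xE0 ? 2 : s[0] < 0xF0 ? 3 : s[0] < 0xF5 ? 4 : 0;
  if (len == 0 || n < len)
    return 0;
  for (i = 1; i < len; i++)
    if ((s[i] & 0xC0) != 0x80)
      return 0;                                  /* bad continuation byte */
  if (len == 3 && s[0] == 0xE0 && s[1] < 0xA0) return 0;  /* overlong */
  if (len == 3 && s[0] == 0xED && s[1] > 0x9F) return 0;  /* surrogate */
  if (len == 4 && s[0] == 0xF0 && s[1] < 0x90) return 0;  /* overlong */
  if (len == 4 && s[0] == 0xF4 && s[1] > 0x8F) return 0;  /* > U+10FFFF */
  return len;
}

/* Idea 3: before the PCRE call, overwrite in place every byte that is
   not part of a valid UTF-8 sequence with NUL.  */
static void
nul_out_invalid (unsigned char *buf, size_t n)
{
  size_t i = 0;
  while (i < n)
    {
      size_t len = utf8_seq_len (buf + i, n - i);
      if (len)
        i += len;
      else
        buf[i++] = '\0';
    }
}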

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 12 Sep 2014 02:54:02 GMT)

Message #8 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 11 Sep 2014 19:53:23 -0700
Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>     (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>     at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but they 
would require grep to parse and analyze the Perl regular expression.  I 
don't know the PCRE syntax and it would take some time to write a 
parser.  And even if I wrote one, the next PCRE release would likely 
change the syntax.  It sounds very painful to maintain.

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>     call. This changes the current semantics, but:
>     * the semantics of invalid bytes have never been specified, AFAIK;
>     * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how invalid 
bytes are treated by other matchers.  And would have undesirable 
practical effects, e.g., the pattern 'a..*b' would match data that would 
look like "ab" on many screens (because the null byte would vanish). 
It's a real kludge that will bite users.

Even if we went along with the kludge, grep does not know what bytes 
PCRE considers to be invalid without invoking PCRE, which is what it's 
doing now.  (Yes, PCRE says it's parsing UTF-8, but there are different 
ways to do that and they don't all agree.)  I suppose grep could 
reengineer libpcre's internals, to exactly duplicate the algorithm that 
libpcre uses to decide when bytes are invalid (except to do it 10X 
faster :-), but then that'd be another thing to maintain in parallel 
with libpcre.

All of these changes sound like a lot of work, which nobody is willing 
to do.

Here's a different idea.  How about invoking grep with the 
--binary-files=without-match option?  This should avoid much of the 
libpcre performance problem, without having to change 'grep'.




Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Fri, 12 Sep 2014 06:30:02 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 12 Sep 2014 08:21:02 GMT)

Message #13 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 12 Sep 2014 10:20:08 +0200
On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >Things could be done in grep:
> >
> >1. Ignore -P when the pattern would have the same meaning without -P
> >    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> >    at least for the simplest cases).
> >
> >2. Call PCRE in the C locale when this is equivalent.
> 
> I had already considered these ideas along with several others, but they
> would require grep to parse and analyze the Perl regular expression.  I
> don't know the PCRE syntax and it would take some time to write a parser.
> And even if I wrote one, the next PCRE release would likely change the
> syntax.  It sounds very painful to maintain.

I think that (1) is rather simple, even though the optimization could
be missed on some patterns: ERE and PCRE have a large common subset
with identical meaning. The pattern could be examined left to right
and would consist of:
  - Normal characters.
  - ".", "^" at the beginning, "$" (alone) at the end.
  - [] with normal characters inside.
  - "*", "+", "?", and the "{...}" form, not followed by one of "*+?{".
  - "|" and "(" not followed by one of those 4 characters.
  - "\" followed by one of ".^$[*+?{".
  - Some "\" + letter sequences could be recognised as well.

Something like that (I haven't checked carefully); see the sketch
below. There could be another option to enable or disable such an
optimization.
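
A rough C sketch of such a left-to-right scan, erring toward rejecting
anything doubtful; the function name and the exact subset it accepts
are illustrative only, and a real version would need the careful
vetting alluded to above:

#include <stdbool.h>
#include <string.h>

/* Idea 1: return true if PAT lies in a (conservative) subset on which
   POSIX ERE and PCRE agree, so "grep -P PAT" could fall back to the
   faster "grep -E PAT".  */
static bool
ere_pcre_compatible (char const *pat)
{
  char const *p;
  for (p = pat; *p; p++)
    switch (*p)
      {
      case '\\':                      /* escaped literal punctuation only */
        if (!p[1] || !strchr (".^$[](){}*+?|\\", p[1]))
          return false;
        p++;
        break;

      case '[':                       /* bracket expr: plain characters only */
        p++;
        if (*p == '^') p++;
        if (*p == ']') p++;           /* leading ] is literal */
        for (; *p != ']'; p++)
          if (!*p || *p == '[' || *p == '\\')
            return false;             /* no [:classes:], no escapes */
        break;

      case '*': case '+': case '?':
        if (p[1] && strchr ("*+?{", p[1]))
          return false;               /* lazy/possessive forms differ */
        break;

      case '{':                       /* only the {m}, {m,}, {m,n} forms */
        {
          char const *q = p + 1;
          if (!('0' <= *q && *q <= '9'))
            return false;
          while ('0' <= *q && *q <= '9')
            q++;
          if (*q == ',')
            while ('0' <= *++q && *q <= '9')
              continue;
          if (*q != '}' || (q[1] && strchr ("*+?{", q[1])))
            return false;
          p = q;
        }
        break;

      case '|': case '(':
        if (p[1] && strchr ("*+?{", p[1]))
          return false;               /* e.g. "(?...)" groups, "(*VERB)" */
        break;

      default:
        break;                        /* ordinary character */
      }
  return true;
}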

> >3. Transform invalid bytes to null bytes in-place before the PCRE
> >    call. This changes the current semantics, but:
> >    * the semantics of invalid bytes have never been specified, AFAIK;
> >    * the best *practical* behavior may not be the current one
> 
> As we've already discussed, this would be incompatible with how invalid
> bytes are treated by other matchers.

The same thing could be done with other matchers (in an optional way).

> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish). It's a real kludge that will
> bite users.

But this is already the case:

$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches

The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".

> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)

It would be a bug in PCRE. Parsing UTF-8 is standard. This is
summarized by:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(from the Linux utf-8(7) man page), everything else being invalid.

Note that the pcre_exec(3) man page even says:

    PCRE_NO_UTF8_CHECK  Do not check the subject for UTF-8 validity

assuming that the check can be done on the user's side, i.e. in a
standard way.
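
To illustrate the division of labor that option implies, here is a
small sketch against the PCRE1 API (pcre_compile, pcre_exec, PCRE_UTF8,
and PCRE_NO_UTF8_CHECK are the real interface; the subject string and
the pared-down error handling are illustrative):

#include <pcre.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  char const *err;
  int erroffset, ovec[30], rc;
  char const subject[] = "Lef\xc3\xa8vre";       /* "Lefèvre", valid UTF-8 */

  pcre *re = pcre_compile ("Lef.vre", PCRE_UTF8, &err, &erroffset, NULL);
  if (!re)
    return 1;

  /* The caller has already verified that SUBJECT is valid UTF-8, so
     PCRE's own per-call scan can be skipped.  Passing invalid bytes
     with PCRE_NO_UTF8_CHECK is undefined behavior -- the crux of this
     bug report.  */
  rc = pcre_exec (re, NULL, subject, (int) strlen (subject), 0,
                  PCRE_NO_UTF8_CHECK, ovec, 30);
  if (rc >= 0)
    printf ("match at bytes %d..%d\n", ovec[0], ovec[1]);
  else
    printf ("no match (rc=%d)\n", rc);
  pcre_free (re);
  return 0;
}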

> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.

I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 12 Sep 2014 16:49:01 GMT)

Message #16 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 12 Sep 2014 09:48:08 -0700
Vincent Lefevre wrote:

> I think that (1) is rather simple

You may think it simple for the REs you're interested in, but someone 
else might say "hey! that doesn't cover the REs *I'm* interested in!". 
Solving the problem in general is nontrivial.

> But this is already the case:

I was assuming the case where the input data contains an encoding error 
(not a null byte) that is transformed to a null byte before the user 
sees it.

Really, this null-byte-replacement business would be just too weird.  I 
don't see it as a viable general-purpose solution.

> Parsing UTF-8 is standard.

It's a standard that keeps evolving; different releases of libpcre have 
done it differently, and I expect things to continue to evolve.  It's 
not something I would want to maintain separately from libpcre itself.

Have you investigated why libpcre is so *slow* when doing UTF-8 
checking?  Why would libpcre be 10x slower than grep's checking by 
hand?!?  I don't get it.  Surely there's a simple fix on the libpcre side.

> I often want to take binary files into account

In those cases I suggest using a unibyte C locale.  This should solve 
the performance problem.  Really, unibyte is the way to go here; it's 
gonna be faster for large binary scanning no matter what is done about 
this UTF-8 business.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 12 Sep 2014 22:09:01 GMT)

Message #19 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 13 Sep 2014 00:08:55 +0200
On 2014-09-12 09:48:08 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I think that (1) is rather simple
> 
> You may think it simple for the REs you're interested in, but someone else
> might say "hey! that doesn't cover the REs *I'm* interested in!". Solving
> the problem in general is nontrivial.

This is still better than no optimization at all.

> >But this is already the case:
> 
> I was assuming the case where the input data contains an encoding error (not
> a null byte) that is transformed to a null byte before the user sees it.
> 
> Really, this null-byte-replacement business would be just too weird.  I
> don't see it as a viable general-purpose solution.

Anyway, since the problem can exist with null bytes, it needs to be
solved for null bytes. But this is also already the case:

$ printf "a\0b\n" | grep -a 'a..*b'
a^@b

(where the "^@" is in reverse video). So, the only "issue" would
be that

$ printf "a\x91b\n" | grep -a 'a..*b'

would output "a^@b" instead of... possibly something worse. Indeed,
outputting invalid UTF-8 sequences to the terminal is bad. Ideally
you would output "a<91>b" with "<91>" in reverse video, at some
price (this would be slower).

Now, if the behavior is chosen by an option, the user would be aware
of the meaning of the output, so that this won't really matter.

> >Parsing UTF-8 is standard.
> 
> It's a standard that keeps evolving, different releases of libpcre
> have done it differently, and I expect things to continue to evolve.

Could you give some reference? IMHO, this looks more like a bug.

Anyway, UTF-8 sequences that are valid today will still be valid in
the future. The only possible change is that new sequences become
valid in the future. So, the only possible problem is that such new
sequences would be converted to null bytes while this shouldn't be
done. This doesn't introduce undefined behavior, just a different
behavior (note that this difference would also exist between two
libpcre versions, thus not a big problem, and this will be fixable).

> Have you investigated why libpcre is so *slow* when doing UTF-8 checking?

AFAIK, this is not due to libpcre's UTF-8 checking, otherwise it would
be very slow on valid text files too. I suppose that this is due
to the many retries from the pcresearch.c code on binary files (the
line is split into many sublines, often consisting of a single
byte), i.e. the problem is on the grep side. I don't see how this
could be solved except by doing the UTF-8 check on the grep side.

> >I often want to take binary files into account
> 
> In those cases I suggest using a unibyte C locale.

But I still want "." to match a single (valid) UTF-8 character.
Well, using the C locale on binary files and UTF-8 on text files
might be acceptable. But how can one do that with a recursive
grep?

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 13 Sep 2014 01:00:04 GMT)

Message #22 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 12 Sep 2014 17:59:41 -0700
Vincent Lefevre wrote:

> This is still better than no optimization at all.

We'd have to see; not every optimization is worth the trouble.

> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.

It'd be better not to need a new grep option simply to work around a 
libpcre performance bug.

> Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity of 
UTF-8 string".  My impression is that the actual history of behavior 
changes is more complicated than what that simple summary would suggest.

> This doesn't introduce undefined behavior, just a different
> behavior

Again, it'd be better if grep Just Worked.

> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, often consisting of a single
> byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can 
contain encoding errors.  This does not mean "the problem is on the grep 
side".

> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary 
data, without the need for prescreening the data.  That's the 
fundamental problem here.

>>> I often want to take binary files into account
>>
>> In those cases I suggest using a unibyte C locale.
>
> I still want "." to match a single (valid) UTF-8 character.

How about this idea instead?  Use a unibyte C locale, and write a 
unibyte regular expression, call it C, that matches a single valid UTF-8 character 
(using whatever definition you like for UTF-8).  Then, you can use . to 
match single bytes and C to match characters.  This gives you all the 
power you need, without the slowdown due to UTF-8 processing, a slowdown 
that will be inevitable no matter how we change grep or libpcre.
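
A sketch of this idea with POSIX regcomp(3): the string below spells
out "one valid UTF-8 character" (RFC 3629 flavor) as a unibyte ERE,
with the raw bytes written as octal escapes.  It assumes a glibc-style
C locale where bracket ranges such as [\302-\337] order by byte value;
utf8_char and the little test program are illustrative:

#include <locale.h>
#include <regex.h>
#include <stdio.h>

/* One valid UTF-8 character as a unibyte ERE (NUL cannot appear in a
   C string, so the one-byte branch starts at \001).  */
static char const utf8_char[] =
  "([\001-\177]"                                  /* U+0001..U+007F */
  "|[\302-\337][\200-\277]"                       /* U+0080..U+07FF */
  "|\340[\240-\277][\200-\277]"                   /* U+0800..U+0FFF */
  "|[\341-\354\356\357][\200-\277][\200-\277]"    /* U+1000..U+CFFF, U+E000..U+FFFF */
  "|\355[\200-\237][\200-\277]"                   /* U+D000..U+D7FF, no surrogates */
  "|\360[\220-\277][\200-\277][\200-\277]"        /* U+10000..U+3FFFF */
  "|[\361-\363][\200-\277][\200-\277][\200-\277]" /* U+40000..U+FFFFF */
  "|\364[\200-\217][\200-\277][\200-\277])";      /* U+100000..U+10FFFF */

int
main (void)
{
  regex_t re;
  setlocale (LC_ALL, "C");                        /* unibyte processing */
  if (regcomp (&re, utf8_char, REG_EXTENDED))
    return 1;
  /* "\303\250" is the UTF-8 encoding of "è": one character, one match.  */
  printf ("%s\n", regexec (&re, "\303\250", 0, NULL, 0) ? "no" : "yes");
  regfree (&re);
  return 0;
}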




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Wed, 17 Sep 2014 01:44:02 GMT)

Message #25 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 18454 <at> debbugs.gnu.org, Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Tue, 16 Sep 2014 18:43:26 -0700
[Message part 1 (text/plain, inline)]
I worked on this some more, and came up with the attached patches 
proposed against the current grep Savannah master (commit 
9ea9254ea58456b84ed2f0c1481ca91cdd325bf7).

For years I've been wanting to write that last patch and I finally got 
around to it.  It improves grep -P's performance by a factor of 1.2 
trillion on one (admittedly artificial) benchmark.  I hope its 1 ZB/s 
scan rate is some kind of record.  The last patch probably won't help 
your test cases, though I hope the other patches do help somewhat.
[0001-grep-refactor-binary-vs-unknown-vs-text-flags-for-cl.patch (text/plain, attachment)]
[0002-grep-z-no-longer-considers-200-to-be-binary-data.patch (text/plain, attachment)]
[0003-grep-non-text-bytes-in-binary-data-may-be-treated-as.patch (text/plain, attachment)]
[0004-grep-minor-P-speedup-with-jit_stack.patch (text/plain, attachment)]
[0005-grep-improve-P-performance-in-typical-cases.patch (text/plain, attachment)]
[0006-grep-skip-past-holes-efficiently.patch (text/plain, attachment)]

Added tag(s) patch. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 17 Sep 2014 01:45:02 GMT)

Severity set to 'normal' from 'wishlist' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Wed, 17 Sep 2014 01:49:01 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Wed, 17 Sep 2014 04:59:02 GMT)

Message #32 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Tue, 16 Sep 2014 21:57:53 -0700
On Tue, Sep 16, 2014 at 6:43 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> I worked on this some more, and came up with the attached patches proposed
> against the current grep Savannah master (commit
> 9ea9254ea58456b84ed2f0c1481ca91cdd325bf7).
>
> For years I've been wanting to write that last patch and I finally got
> around to it.  It improves grep -P's performance by a factor of 1.2 trillion
> on one (admittedly artificial) benchmark.  I hope its 1 ZB/s scan rate is
> some kind of record.  The last patch probably won't help your test cases,
> though I hope the other patches do help somewhat.

Awesome :-)  I found time to look through all but the 5th.
Slightly surprised that 4/6 makes a measurable performance
difference (didn't check), but moving away from file-scoped
is an improvement in any case.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Wed, 17 Sep 2014 05:19:01 GMT)

Message #35 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Tue, 16 Sep 2014 22:18:29 -0700
Jim Meyering wrote:
> Slightly surprised that 4/6 makes a measurable performance
> difference (didn't check),

It's not measurable.  I should have written it up as a cleanup more than 
as a speedup.

I did like breaking the nominal ZB/s barrier, though, in patch 6/6. 
That's waaaay more than the total throughput of the Internet.  It tops 
even the hypothetical throughput if one recruited the entire US freight 
industry to move nothing but MicroSD cards, which xkcd estimates would 
get only 0.06 ZB/s or so.

http://what-if.xkcd.com/31/




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Wed, 17 Sep 2014 14:19:02 GMT)

Message #38 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Wed, 17 Sep 2014 23:17:50 +0900
Thanks for the many improvements.  I applied the six patches to grep and
tried to compile it, but after the sixth patch, I received a 'SEEK_DATA'
undeclared error.  I looked for SEEK_DATA on CentOS 5.10, but I couldn't
find it in the standard header files (glibc 2.5.1) or in the gnulib files.

==
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I..  -I../lib -I../lib    -g -O2 -MT
searchutils.o -MD -MP -MF $depbase.Tpo -c -o searchutils.o  searchutils.c && \
mv -f $depbase.Tpo $depbase.Po
grep.c: In function 'fillbuf':
grep.c:718: error: 'SEEK_DATA' undeclared (first use in this function)
grep.c:718: error: (Each undeclared identifier is reported only once
grep.c:718: error: for each function it appears in.)
Makefile:1309: recipe for target 'grep.o' failed
make[2]: *** [grep.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/b/grep-2.20/src'
Makefile:1238: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/b/grep-2.20'
Makefile:1179: recipe for target 'all' failed
make: *** [all] Error 2





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Wed, 17 Sep 2014 14:26:02 GMT)

Message #41 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Wed, 17 Sep 2014 08:25:24 -0600
[Message part 1 (text/plain, inline)]
On 09/17/2014 08:17 AM, Norihiro Tanaka wrote:
> Thanks for the many improvements.  I applied the six patches to grep and
> tried to compile it, but after the sixth patch, I received a 'SEEK_DATA'
> undeclared error.  I looked for SEEK_DATA on CentOS 5.10, but I couldn't
> find it in the standard header files (glibc 2.5.1) or in the gnulib files.
> 

It should be fairly easy for gnulib to fake SEEK_DATA/SEEK_HOLE (by
treating all files as non-sparse).  I guess we haven't needed to do that
before now, because other GNU clients of this interface (such as
coreutils and tar) have been doing conditional compilation.
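
A sketch of that fallback; the helper names are illustrative rather
than gnulib's actual API, and the no-SEEK_DATA branch skips details
such as failing with ENXIO past end-of-file:

#include <sys/stat.h>
#include <unistd.h>

/* First data at or after OFFSET.  Without SEEK_DATA, every offset
   inside the file counts as data.  */
static off_t
next_data (int fd, off_t offset)
{
#ifdef SEEK_DATA
  return lseek (fd, offset, SEEK_DATA);
#else
  (void) fd;
  return offset;
#endif
}

/* First hole at or after OFFSET.  Without SEEK_HOLE, the only "hole"
   is the virtual one every file has at EOF.  */
static off_t
next_hole (int fd, off_t offset)
{
#ifdef SEEK_HOLE
  return lseek (fd, offset, SEEK_HOLE);
#else
  struct stat st;
  (void) offset;
  return fstat (fd, &st) == 0 ? st.st_size : -1;
#endif
}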

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Wed, 17 Sep 2014 19:56:02 GMT)

Message #44 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Wed, 17 Sep 2014 12:55:50 -0700
[Message part 1 (text/plain, inline)]
Thanks for reporting that; I forgot that the code defaulted SEEK_HOLE 
but not SEEK_DATA.  The first attached patch should fix it.  The second 
one should improve performance further on Solaris for files that end in 
holes.
[0001-grep-port-to-platforms-lacking-SEEK_DATA.patch (text/x-patch, attachment)]
[0002-grep-speed-up-processing-of-holes-before-EOF-on-Sola.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Thu, 18 Sep 2014 06:02:01 GMT)

Message #47 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 18454 <at> debbugs.gnu.org
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Wed, 17 Sep 2014 23:00:46 -0700
I've installed all the patches mentioned so far.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Thu, 18 Sep 2014 08:34:02 GMT)

Message #50 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Santiago Ruano Rincón <santiago <at> debian.org>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 18 Sep 2014 10:33:27 +0200
On 17/09/14 at 23:00, Paul Eggert wrote:
> I've installed all the patches mentioned so far.
> 

I've successfully built the latest commit
(f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any
performance boost. Rather the opposite.

Comparing with Debian's grep 2.20-3, which includes your first patch to solve
this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:

grep -P asdf /usr/bin/*  12,42s user 0,12s system 99% cpu 12,545 total
src/grep -P asdf /usr/bin/*  14,37s user 0,12s system 99% cpu 14,492 total

Note that basic grep also slows down:

grep asdf /usr/bin/*  0,22s user 0,16s system 99% cpu 0,382 total
src/grep asdf /usr/bin/*  1,26s user 0,12s system 99% cpu 1,384 total

Cheers, and thanks for your work,

Santiago




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Thu, 18 Sep 2014 19:38:01 GMT)

Message #53 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Santiago Ruano Rincón <santiago <at> debian.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 18 Sep 2014 12:36:57 -0700
On Thu, Sep 18, 2014 at 1:33 AM, Santiago Ruano Rincón
<santiago <at> debian.org> wrote:
> On 17/09/14 at 23:00, Paul Eggert wrote:
>> I've installed all the patches mentioned so far.
>>
>
> I've successfully built the latest commit
> (f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any
> performance boost. Rather the opposite.
>
> Comparing with Debian's grep 2.20-3, which includes your first patch to solve
> this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:
>
> grep -P asdf /usr/bin/*  12,42s user 0,12s system 99% cpu 12,545 total
> src/grep -P asdf /usr/bin/*  14,37s user 0,12s system 99% cpu 14,492 total
>
> Note that basic grep also slows down:
>
> grep asdf /usr/bin/*  0,22s user 0,16s system 99% cpu 0,382 total
> src/grep asdf /usr/bin/*  1,26s user 0,12s system 99% cpu 1,384 total

Thank you for running timing comparisons.

Once I verified that I had no large, sparse files in my grep working directory,
I ran the same test there (du -sh . reports 176M, du --app -sh . reports 139M).

The following shows a performance regression when searching files
like those in my grep working directory.
The new grep (v2.20-46-gf6de00f) takes 2.5x longer than 2.20.14.
This is with a hot cache (best of several runs) on an
Intel(R) Xeon(R) CPU E5-2660, compiled with gcc-5.x.

$ diff -u <(env time grep -r asdf . 2>&1) \
          <(PATH=src:$PATH env time grep -r asdf . 2>&1)
--- /proc/self/fd/11    2014-09-18 12:07:43.169721947 -0700
+++ /proc/self/fd/12    2014-09-18 12:07:43.169721947 -0700
@@ -1,3 +1,3 @@
 ./src/grep.c:               printf 'asdfqwerzxcv\rASDF\tZXCV\n'
-0.08user 0.10system 0:00.18elapsed 100%CPU (0avgtext+0avgdata 6256maxresident)k
-0inputs+0outputs (0major+670minor)pagefaults 0swaps
+0.40user 0.11system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 5328maxresident)k
+0inputs+0outputs (0major+634minor)pagefaults 0swaps

It looks like most of the difference is the result of
commit cd36abd46c5e0768606979ea75a51732062f5624,
"grep: treat a file as binary if its prefix contains encoding errors",
with its new,
locale-sensitive "is_binary" test. I saw the above timing difference
even with LC_ALL=C, so one quick fix would be to skip the use of
mbrlen when possible.
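
A sketch of that quick fix; looks_like_text is an illustrative
stand-in for grep's actual classification code, not a quote of it:

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Does BUF hold N bytes of locale-valid text containing no NULs?  */
static bool
looks_like_text (char const *buf, size_t n)
{
  size_t i;
  mbstate_t st;

  if (MB_CUR_MAX == 1)                /* C and other unibyte locales: */
    return !memchr (buf, '\0', n);    /* a plain byte scan suffices */

  memset (&st, 0, sizeof st);
  for (i = 0; i < n; )
    {
      size_t len = mbrlen (buf + i, n - i, &st);
      if (len == (size_t) -1)
        return false;                 /* encoding error */
      if (len == (size_t) -2)
        return true;                  /* incomplete char at buffer end */
      if (len == 0)
        return false;                 /* NUL byte: treat as binary */
      i += len;
    }
  return true;
}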




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 19 Sep 2014 16:08:02 GMT)

Message #56 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Santiago Ruano Rincón <santiago <at> debian.org>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 19 Sep 2014 09:06:47 -0700
On Thu, Sep 18, 2014 at 12:36 PM, Jim Meyering <jim <at> meyering.net> wrote:
> It looks like most of the difference is the result of
> commit cd36abd46c5e0768606979ea75a51732062f5624,
> "grep: treat a file as binary if its prefix contains encoding errors",

Hi Paul,

I found that the above commit induces a large performance hit.
Over 50x in this example:

  seq 99999999 > k
  LC_ALL=C diff -u \
    <(PATH=.bin/2.20-31:$PATH env time -f %e grep asdf k 2>&1) \
    <(PATH=.bin/2.20-32:$PATH env time -f %e grep asdf k 2>&1)
  ...
  -0.21
  +11.47

The problem is that the new function is processing all of
the input, not just a prefix.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sun, 21 Sep 2014 21:24:01 GMT)

Message #59 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: bug-grep <at> gnu.org
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sun, 21 Sep 2014 08:46:39 +0200 (CEST)
Hi,

I am the developer of the JIT compiler in PCRE. I frequently check the discussions about PCRE and found this comment here on bug-grep <at> gnu.org:

> There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening
> the data. That's the fundamental problem here. 

This requires too much effort with no benefit. Reasons:

- what should you do if you encounter an invalid UTF-8 opcode: ignore it? decode it to some random value? For example, what should happen if you find a stray 0xe9? Does it match \xe9? Everybody has a different opinion about handling invalid UTF opcodes, and this would lead to never-ending arguing on pcre-dev.

- the bigger problem is performance. Handling invalid UTF codes requires a lot of extra checks and kills many optimizations. For example, when we encounter a 0xc5, we know that the input buffer has at least one more byte, so we do not check the input buffer size. We also assume that the highest 2 bits of the second byte are 10, and do not check this when we decode that character. This would also kill other optimizations, like the Boyer-Moore-like search in JIT. The major problem is that everybody would suffer this performance regression, including those who pass valid UTF strings.

For these reasons, such a change will never happen.

But there are alternatives.

* The best solution is multi-threaded grepping: one thread reads file data and replaces or removes invalid UTF-8 sequences; the other thread runs PCRE on the filtered data. Alternatively, you can convert everything to UTF-32 and use pcre32.

* The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum-UTF-character-length - 1 (6 in UTF-8, 1 in UTF-16) zeroes after the buffer, we could guarantee that PCRE does not crash and does not enter infinite loops. Nothing else is guaranteed; i.e., if you search /ab/ and the invalid UTF sequence contains ab, this might not be found (or might be found with the interpreter but not with JIT, or vice versa). If you use pcre32, there is no need for any extra byte extension.

Regards,
Zoltan





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 00:24:01 GMT)

Message #62 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, Santiago Ruano Rincón <santiago <at> debian.org>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 25 Sep 2014 17:23:02 -0700
[Message part 1 (text/plain, inline)]
Thanks for looking into that.  The attached patches solve those 
performance problems for me.
[0001-grep-scan-for-valid-multibyte-strings-more-quickly.patch (text/plain, attachment)]
[0002-grep-don-t-check-extensively-for-invalid-prefix-byte.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 01:20:02 GMT)

Message #65 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>, 
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 25 Sep 2014 18:19:20 -0700
Zoltán, thanks for your comments on this subject.  Some thoughts and 
suggestions:

> - what should you do if you encounter an invalid UTF-8 opcode

Do whatever plain 'grep' does, which is what the glibc regular 
expression matcher does.  If I recall correctly, an encoding error in 
the pattern matches the same encoding error in the string.  It shouldn't 
be that complicated.

> Everybody has a different opinion about handling invalid UTF opcodes

I doubt whether users would care all that much, so long as the default 
is reasonable.  We don't get complaints about it with 'grep', anyway. 
But if it's a real problem in the PCRE world, you could provide 
compile-time or run-time options to satisfy the different opinions.

> everybody would suffer this performance regression, including those who pass valid UTF strings.

I don't see why.  libpcre can continue with its current implementation, 
for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
that's not a problem.  The problem is the case where users pass 
possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
a slow implementation for this case, and this slow implementation's 
performance should be improvable without affecting the performance for 
the PCRE_NO_UTF8_CHECK case.

> * The best solution is multi-threaded grepping

That would chew up CPU resources unnecessarily, by requiring two passes 
over the input: one for checking UTF-8, the other for doing the actual 
match.  Granted, it might be faster in real time than what we have now, 
but overall it'd probably be more expensive (e.g., more energy 
consumption), and this doesn't sound promising.

> * The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum-UTF-character-length - 1 (6 in UTF-8, 1 in UTF-16) zeroes after the buffer, we could guarantee that PCRE does not crash and does not enter infinite loops. Nothing else is guaranteed

That doesn't sound like a win, I'm afraid.  The use case that prompted 
this bug report is someone using 'grep -r' to search for strings like 
'foobar' in binary data, and this use case would not work with this 
suggested solution.


I'm hoping that the recent set of changes to 'grep' lessens the urgency 
of improving libpcre.  On my platform (Fedora 20 x86-64) Jim Meyering's 
benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep 
-P is 6.4x slower than plain grep, and that with the latest experimental 
grep (including the patches I just posted in 
<http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep. 
 So it's plausible that the latest set of fixes is good enough, in the 
sense that, sure, PCRE is slower, but it's always been slower and if 
that used to be good enough then it should still be good enough.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 06:38:03 GMT) Full text and rfc822 format available.

Message #68 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
Hi Paul,

thank you for the feedback.

>I doubt whether users would care all that much, so long as the default 
>is reasonable.  We don't get complaints about it with 'grep', anyway. 
>But if it's a real problem in the PCRE world, you could provide 
>compile-time or run-time options to satisfy the different opinions.

The situation is worse :( "Reasonable" has a different meaning for everybody.

Just consider these two examples, where \x9c is an incorrectly encoded unicode codepoint:

/(?<=\x9c)#/

Does it match \xd5\x9c# starting from #? Noticing errors during a backward scan is complicated.

/[\x9c-\x{ffff}]/

What does this range define, exactly? What kinds of invalid and valid UTF byte sequences are inside (and outside) its bounds?

Caseless matching is also another question: does /\xe9/ match \xc3\x89, or the invalid UTF byte sequence \xc9? In general, UTF defines several character properties. What unicode properties does an invalid codepoint have?

Believe me, depending on their needs, everybody has different answers to these questions. We don't want to force the view of one group onto others, since PCRE is a library. You have much more freedom to define any behavior, since grep is an end-user program.

>I don't see why.  libpcre can continue with its current implementation, 
>for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
>that's not a problem.  The problem is the case where users pass 
>possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
>a slow implementation for this case, and this slow implementation's 
>performance should be improvable without affecting the performance for 
>the PCRE_NO_UTF8_CHECK case.

Regarding performance, this comes from the interpreter:

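/* Note added for context: GETUTF8 assumes c already holds the first
   byte and eptr points at it; the sequence length is inferred from
   c's high bits, and the continuation bytes are OR'd in with no
   buffer-length, 10xxxxxx, or overlong-form checks anywhere.  */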
#define GETUTF8(c, eptr) \
    { \
    if ((c & 0x20) == 0) \
      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
    else if ((c & 0x10) == 0) \
      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
    else if ((c & 0x08) == 0) \
      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
      ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
    else if ((c & 0x04) == 0) \
      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
          (eptr[4] & 0x3f); \
    else \
      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
    }

Imagine if you needed to add buffer-end and other bit checks. Furthermore, Unicode expects that any character be encoded with the least number of bytes. More checks. You also need to check the current mode. Of course we have several macros similar to this (for performance reasons), and there are code paths where we have assumptions about valid UTF strings. This would increase complexity a lot: we would need a lot of extra regression tests, a correct JIT implementation, and so on. This would also kill optimizations. For example, if you define a character range where all characters are two bytes long, JIT cleverly detects this and uses a fast path to discard any non-two-byte UTF codepoints.

The question is, who would be willing to do this work.

>That would chew up CPU resources unnecessarily, by requiring two passes 
>over the input: one for checking UTF-8, the other for doing the actual 
>match.  Granted, it might be faster in real time than what we have now, 
>but overall it'd probably be more expensive (e.g., more energy 
>consumption), and this doesn't sound promising.

Yeah but you could add a flag to enable this :) I feel this would be much less work than the former.

>That doesn't sound like a win, I'm afraid.  The use case that prompted 
>this bug report is someone using 'grep -r' to search for strings like 
>'foobar' in binary data, and this use case would not work with this 
>suggested solution.

In this case, I would simply disable UTF-8 decoding. You could still search for UTF characters by their byte encoding (i.e. searching for \xe9 as \xc3\xa9).

Regards,
Zoltan





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 06:41:02 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 08:49:02 GMT)

Message #74 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 26 Sep 2014 01:48:00 -0700
Zoltán Herczeg wrote:

> Just consider these two examples, where \x9c is an incorrectly encoded unicode codepoint:
>
> /(?<=\x9c)#/
>
> Does it match \xd5\x9c# starting from #?

No, because the input does not contain a \x9c encoding error.  Encoding errors 
match only themselves, not parts of other characters.  That is how the glibc 
matchers behave, and it's what users expect.

> Noticing errors during a backward scan is complicated.

It's doable, and it's the right thing to do.

> /[\x9c-\x{ffff}]/
>
> What does this range define, exactly?

Range expressions have implementation-defined semantics in POSIX.  For PCRE you 
can do what you like.  I suggest mapping encoding-error bytes into characters 
outside the Unicode range; that's what Emacs does, I think, and it's simple and 
easy to explain to users.  It's not a big deal either way.

> What kinds of invalid and valid UTF byte sequences are inside (and outside) its bounds?

Just treat encoding-error bytes like everything else.  In effect, extend the 
encoding to allow any byte sequence, and add a few "characters" outside the 
Unicode range, one for each invalid UTF-8 byte.
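
A sketch of that conceptual model; the 0x110000 + byte mapping is one
possible out-of-range choice rather than anything grep or PCRE
defines, and utf8_seq_len is the validator sketched near the top of
this thread:

#include <stddef.h>
#include <stdint.h>

size_t utf8_seq_len (unsigned char const *s, size_t n);  /* sketch above */

/* Decode one "character": a valid sequence yields its code point; an
   encoding error consumes exactly one byte and yields a value just
   past the Unicode range, so each invalid byte matches only itself.  */
static uint32_t
decode_extended (unsigned char const *s, size_t n, size_t *len)
{
  static uint32_t const first_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
  size_t i, l = utf8_seq_len (s, n);
  uint32_t c;

  if (l == 0)
    {
      *len = 1;
      return 0x110000 + s[0];        /* error byte -> out-of-range "char" */
    }
  c = s[0] & first_mask[l];
  for (i = 1; i < l; i++)
    c = (c << 6) | (s[i] & 0x3F);
  *len = l;
  return c;
}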

> Caseless matching is also another question: does /\xe9/ match \xc3\x89, or the invalid UTF byte sequence \xc9?

Sorry, I don't quite follow, but encoding errors aren't letters and don't have 
case.  They match only themselves.

> What unicode properties does an invalid codepoint have?

The minimal ones.

> depending on their needs, everybody has different answers to these questions.

That's fine.  Just implement reasonable defaults, and provide options if people 
have needs that differ from the defaults.  That's easier for libpcre than for 
grep, since libpcre users (who are programmers) can reasonably be expected to be 
more sophisticated about this sort of thing than grep users (who are not 
necessarily programmers).

> Imagine if you would need to add buffer end and other bit checks.

Of course it will be more expensive to check for UTF-8 as you go, than to assume 
the input is valid UTF-8.  But again, we're not talking about the 
PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking 
about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the 
input is valid UTF-8, and currently does so inefficiently.  In the 
non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go, 
than to have a prepass that checks for UTF-8.  This is because the prepass must 
be stupid (it must check the entire input buffer) whereas the matcher can be 
smart (it often can do its work without checking the entire input buffer).  This 
is one reason libpcre is slower than the glibc matchers.

Obviously it would be some work to build a libpcre that runs faster in the 
non-PCRE_NO_UTF8_CHECK case, without hurting performance in the 
PCRE_NO_UTF8_CHECK case.  But it could be done, if someone had the time to do it.

> The question is, who would be willing to do this work.

Not me.  :-)

>> That would chew up CPU resources unnecessarily

> Yeah but you could add a flag to enable this :)

I'm not sure it'd be popular to add a --drain-battery option to grep. :)

>> The use case that prompted
>> this bug report is someone using 'grep -r' to search for strings like
>> 'foobar' in binary data, and this use case would not work with this
>> suggested solution.
>
> In this case, I would simply disable UTF-8 decoding.

I suggested that already, but the user (e.g., see the last paragraph of 
<http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated 
UTF-8 patterns in binary data.  For example, I expect the user wants the pattern 
'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file.  So he can't 
simply use unibyte processing.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 08:49:02 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 18:05:02 GMT)

Message #80 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 26 Sep 2014 20:04:33 +0200 (CEST)
Hi,

this is a very interesting discussion.

>> /(?<=\x9c)#/
>>
>> Does it match \xd5\x9c# starting from #?
>
>No, because the input does not contain a \x9c encoding error.  Encoding errors 
>match only themselves, not parts of other characters.  That is how the glibc 
>matchers behave, and it's what users expect.

Why is \xc9 part of another character? It depends on how you interpret \xd5. And this was just a simple example.

>> Noticing errors during a backward scan is complicated.
>
>It's doable, and it's the right thing to do.

The problem is, you do it some way, and others need something else. Just think about the example above.

>Range expressions have implementation-defined semantics in POSIX.  For PCRE you 
>can do what you like.  I suggest mapping encoding-error bytes into characters 
>outside the Unicode range; that's what Emacs does, I think, and it's simple and 
>easy to explain to users.  It's not a big deal either way.

This mapping idea is clever. Basically invalid codepoints are converted to something valid.

>> What kinds of invalid and valid UTF byte sequences are inside (and outside) its bounds?
>
>Just treat encoding-error bytes like everything else.  In effect, extend the 
>encoding to allow any byte sequence, and add a few "characters" outside the 
>Unicode range, one for each invalid UTF-8 byte.

In other words, \xc9 really is an encoding error (since it is an invalid UTF-8 byte following another invalid UTF-8 byte). This is what I said from the beginning: depending on the context, people choose different interpretations of handling UTF fragments. Usually they choose what is more convenient from that viewpoint. But if you put all the pieces together, the result is full of contradictions.

>Sorry, I don't quite follow, but encoding errors aren't letters and don't have 
>case.  They match only themselves.

Not necessarily. It depends on your mapping: if more than one invalid UTF fragment is mapped to the same codepoint, they will match. Especially when you define a range of characters.

> > What unicode properties does an invalid codepoint have?
>
>The minimal ones.

We could use the same flags as for characters between \x{d800}–\x{dfff}.

>> The question is, who would be willing to do this work.
>
>Not me.  :-)

I know this would be a lot of work, and I have doubts that slowing down PCRE would increase grep performance. Regardless, if somebody is willing to work on this, I can help. Please keep in mind that PCRE1 is considered done, and our efforts are limited to bug fixing. We are currently busy with PCRE2, and such a big change could only go there.

>I'm not sure it'd be popular to add a --drain-battery option to grep. :)

I don't think this really matters in performance-hungry desktop or server environments. On a phone, you likely don't need this feature.

>I suggested that already, but the user (e.g., see the last paragraph of 
><http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated 
>UTF-8 patterns in binary data.  For example, I expect the user wants the pattern 
>'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file.  So he can't 
>simply use unibyte processing.

This is exactly the use case where filtering is needed. His input is a combination of binary and UTF data, and he needs matches only in the UTF part.

Regards,
Zoltan





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 18:06:01 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 19:21:01 GMT)

Message #86 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 26 Sep 2014 12:20:45 -0700
On 09/26/2014 11:04 AM, Zoltán Herczeg wrote:
> this is a very interesting discussion.

Yes, I have a lot of other things I'm *supposed* to be doing, but this 
thread is more fun....

>>> /(?<=\x9c)#/
>>>
>>> Does it match \xd5\x9c# starting from #?
>> No, because the input does not contain a \x9c encoding error.  Encoding errors
>> match only themselves, not parts of other characters.  That is how the glibc
>> matchers behave, and it's what users expect.
> Why is \xc9 part of another character? It depends on how you interpret \xd5.

Sorry, I assume you meant \x9c here?  Anyway, the point is that 
conceptually you walk through the input byte sequence left-to-right, 
converting it to characters as you go, and if you encounter an encoding 
error in the process you convert the error to the corresponding 
"character" outside the Unicode range.  You then do all matching against 
the converted sequence.  So there is no question about interpretation: 
it's the left-to-right interpretation.  This simple and easy-to-explain 
approach is used by grep's other matchers, by Emacs, etc.

Obviously you don't want to *implement* it the way I described; instead, 
you want to convert on-the-fly, lazily.  But whatever optimizations you 
do, you do consistently with the conceptual model.

> The problem is, you do it some way, and others need something else.

In practice, the simple approach explained above works well enough to 
satisfy the vast majority of users.  It's conceivable some special cases 
in the PCRE world would have trouble fitting into this model, but to be 
honest I expect this won't be a problem, and that there won't be any 
serious conceptual issues here, though admittedly there will be some 
nontrivial programming effort.
> I have doubts that slowing down PCRE would increase grep performance.

Again, the proposed change should not slow down libpcre.  It should 
speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre 
could use exactly the same code it has now, so performance would be 
unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should 
typically be faster than it is now, because it would avoid unnecessary 
UTF-8 validation for the parts of the input string that it does not examine.


> This is exactly the use case where filtering is needed. His input is a 
> combination of binary and UTF data, and he needs matches only in the 
> UTF part.

Filtering would not be needed if libpcre were like grep's other matchers 
and simply worked with arbitrary binary data.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 26 Sep 2014 19:25:02 GMT)

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 27 Sep 2014 17:53:01 GMT)

Message #92 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Santiago Ruano Rincón <santiago <at> debian.org>,
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 27 Sep 2014 10:52:15 -0700
[Message part 1 (text/plain, inline)]
On Thu, Sep 25, 2014 at 5:23 PM, Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Thanks for looking into that.  The attached patches solve those performance
> problems for me.

I've pushed this follow-up patch to suppress a new warning:
[0001-maint-suppress-a-false-positive-Wcast-align-warning.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 27 Sep 2014 18:17:02 GMT)

Message #95 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 27 Sep 2014 20:16:45 +0200 (CEST)
Hi,

>Sorry, I assume you meant \x9c here?  Anyway, the point is that 
>conceptually you walk through the input byte sequence left-to-right, 
>converting it to characters as you go, and if you encounter an encoding 
>error in the process you convert the error to the corresponding 
>"character" outside the Unicode range.  You then do all matching against 
>the converted sequence.  So there is no question about interpretation: 
>it's the left-to-right interpretation.  This simple and easy-to-explain 
>approach is used by grep's other matchers, by Emacs, etc.

This was one of my proposals: we need a converter before we run PCRE. To be more precise, we likely need several converters, and users can select the appropriate one for their use case.

>Obviously you don't want to *implement* it the way I described; instead, 
>you want to convert on-the-fly, lazily.  But whatever optimizations you 
>do, you do consistently with the conceptual model.

I would implement it exactly as you described. PCRE is a complex backtracking engine: we need to decode the input in forward and backward directions from any starting position, and several characters are parsed multiple times, depending on the pattern and input. We also employ many optimizations to make this as fast as possible, especially in JIT. The overhead of decoding invalid characters accumulates for every character, regardless of whether it is valid or not.

>In practice, the simple approach explained above works well enough to 
>satisfy the vast majority of users.  It's conceivable some special cases 
>in the PCRE world would have trouble fitting into this model, but to be 
>honest I expect this won't be a problem, and that there won't be any 
>serious conceptual issues here, though admittedly there will be some 
>nontrivial programming effort.

The approach might sound simple, but its side effects are non-trivial. For example, if we implemented it the way suggested before, the user you quoted would not be satisfied. He said 'I still want "." to match a single (valid) UTF-8 character.' The dot character matches anything but newline, and according to you, invalid code points should have a "minimal" type, so they would match.

In the regex world, matching performance is the key aspect of an engine, and this is our focus in PCRE. But every advantage has a trade-off; in PCRE, this is source code complexity. A "simple" change like this would require a major redesign of the engine.

>Again, the proposed change should not slow down libpcre.  It should 
>speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre 
>could use exactly the same code it has now, so performance would be 
>unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should 
>typically be faster than it is now, because it would avoid unnecessary 
>UTF-8 validation for the parts of the input string that it does not examine.

Partial matching was invented exactly for this purpose: you can divide the input into small chunks, filter them, and perform matches. By the way, how is partial matching affected by your proposed solution? What should happen if the starting offset is inside an otherwise valid UTF-8 character?

>Filtering would not be needed if libpcre were like grep's other matchers 
>and simply worked with arbitrary binary data.

This might be efficient for engines that scan the input only in the forward direction and read every character once. That is not true for PCRE.

Regards,
Zoltan





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 27 Sep 2014 18:25:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 27 Sep 2014 20:55:02 GMT) Full text and rfc822 format available.

Message #101 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 27 Sep 2014 13:54:24 -0700
Zoltán Herczeg wrote:
> He said 'I still want "." to match a single (valid) UTF-8 character.'

That's what the GNU matchers do, yes.  '.' does not match an invalid byte.  It's 
a reasonable default.  If you have some users who want '.' to match an invalid 
byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users 
who want '.' to match newline.  That being said, I doubt whether users will care 
enough to need such a flag.  (After all, they're evidently not caring *now*, as 
libpcre can't search such data at *all*.)

> In the regex world, matching performance is the key aspect of an engine

Absolutely.  That's why we're having this discussion: libpcre is slow when 
matching binary data.

> A "simple" change like this would require a major redesign of the engine.

It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

> What should happen if the starting offset is inside an otherwise valid UTF-8 character?

The same thing that would happen if an input file started with the tail end of a 
UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already; 
it's not a problem.

>> Filtering would not be needed if libpcre were like grep's other matchers
>> and simply worked with arbitrary binary data.
>
> This might be efficient for engines that scan the input only in the forward
> direction and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do 
that.  It just takes more implementation work, that's all.  It's not rocket 
science to go backwards through a UTF-8 string and to catch decoding errors as 
you go.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 27 Sep 2014 20:56:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 27 Sep 2014 22:37:01 GMT) Full text and rfc822 format available.

Message #107 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>
Cc: Santiago Ruano Rincón <santiago <at> debian.org>,
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 27 Sep 2014 15:36:37 -0700
[Message part 1 (text/plain, inline)]
Jim Meyering wrote:

> I've pushed this follow-up patch to suppress a new warning:

Thanks, I expect I didn't get that warning because I built on x86-64, which 
allows unaligned accesses so GCC doesn't complain.  (Incidentally, I had already 
tried modifying the code to exploit the fact that unaligned accesses are OK on 
x86ish platforms, but that made the word-by-word loop go slower, so no dice.)

Too bad GCC isn't smart enough to notice that the pointer must be aligned.  It 
strikes me that this problem must come up elsewhere, and that it's worth writing 
a macro to encapsulate the situation.  I pushed the attached follow-up patch, 
which is an attempt to move in that direction.
[0001-maint-generalize-the-Wcast-align-fix.patch (text/plain, attachment)]
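
(For readers who haven't seen the trick: a sketch of the kind of macro meant
here, assuming GCC's diagnostic pragmas and statement expressions; the macro
name is illustrative, and the pushed patch may differ:)

/* Cast PTR to TYPE while locally silencing GCC's -Wcast-align, using
   _Pragma so that no bare #pragma lines appear inside function bodies.
   The statement expression keeps the push/pop scoped to this one cast.  */
#define CAST_ALIGNED(type, ptr)                            \
  (__extension__ ({                                        \
    _Pragma ("GCC diagnostic push")                        \
    _Pragma ("GCC diagnostic ignored \"-Wcast-align\"")    \
    type castp_ = (type) (ptr);                            \
    _Pragma ("GCC diagnostic pop")                         \
    castp_;                                                \
  }))

Usage would look like, e.g.: uint32_t *w = CAST_ALIGNED (uint32_t *, buf);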

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sun, 28 Sep 2014 10:12:01 GMT) Full text and rfc822 format available.

Message #110 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sun, 28 Sep 2014 12:11:16 +0200 (CEST)
>> In the regex world, matching performance is the key aspect of an engine
>
>Absolutely.  That's why we're having this discussion: libpcre is slow when 
>matching binary data.

For me the question is whether binary search needs to be supported at the PCRE level. There are other questions like this. People ask for string replacement support in PCRE from time to time. It can be implemented, of course; we could add a whole string handling API to PCRE. But we feel this is outside the scope of PCRE. Any string management library can use PCRE for string replacement; we don't need another one.

Binary matching is a similar thing. PCRE is used by some (I think closed-source) projects for network data filtering, which is obviously binary data. They use some kind of pre-filtering, data rearranging, and partial matching to efficiently check TCP stream data (without waiting for the whole stream to arrive).

>> A "simple" change like this would require a major redesign of the engine.
>
>It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

Anything can be done as an algorithmic solution; that was never a question. The question is whether it is worth doing at the PCRE level or at a higher level. Perl/PCRE is all about text processing: characters which have meanings, types, sub-types, and other case(s). Unicode is also about defining characters, not binary data. UTF-8 is an encoding format, not a mapping of random bytes to characters.

If this task were trivial, I wouldn't mind doing it myself. But it seems this task is about destroying what we have built so far: a lot of extra checks to process invalid bytes, a large code size increase, and removing a lot of optimizations. The result might be much slower than using clever pre-filtering.

>> What should happen if the starting offset is inside an otherwise valid UTF-8 character?
>
>The same thing that would happen if an input file started with the tail end of a 
>UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already; 
>it's not a problem.

The question was about intermediate start offsets. You have a 100 byte long buffer, and you start matching from byte 50. That byte is in the middle of a valid UTF-8 character. Your pattern starts with an invalid character, which would match that UTF-8 fragment. You said invalid UTF-8 characters match only themselves, not parts of other characters. That means a lot of extra checks again, preparing for the worst case.

>> This might be efficient for engines that scan the input only in the forward
> > direction and read every character once.
>
>It can also be efficient for matchers, like grep's, that don't necessarily do 
>that.  It just takes more implementation work, that's all.  It's not rocket 
>science to go backwards through a UTF-8 string and to catch decoding errors as 
>you go.

My problem is the large number of "ifs" you need to execute. Let's compare the current and the proposed solution.

unsigned char *c_ptr; /* Current string position. */

Current:
  if (c_ptr == input_start)
    return FAIL;
  c_ptr--;
  while ((*c_ptr & 0xc0) == 0x80)
    c_ptr--;

Proposed solution:
  if (c_ptr == input_start)
    return FAIL;
  c_ptr--;
  unsigned char *saved_c_ptr = c_ptr; /* We need to save the starting position, losing a CPU register for that. */
  while ((*c_ptr & 0xc0) == 0x80) {
    if (c_ptr == input_start)
      return FAIL;
    c_ptr--;
  }

  /* We moved back a lot; we don't know where we are. Check the character length. */
  int length = utf_length[*c_ptr]; /* Another lost register. The compiler's life is difficult. */
  if (c_ptr + length != saved_c_ptr + 1)
    c_ptr = saved_c_ptr;
  else {
    /* We need to check whether the character is encoded in the minimum number of bytes. */
    if (length == 1) {
      /* Great, nothing to do. */
    }
    else if (length == 2) {
      if (*c_ptr < 0xc2) /* Character is <= 127, can be encoded in a single byte. */
        c_ptr = saved_c_ptr;
    } else if (length == 3) {
      if (*c_ptr == 0xe0 && c_ptr[1] < 0xa0) /* Character is < 0x800, can be encoded in fewer bytes. */
        c_ptr = saved_c_ptr;
    } else
      ....
  }

For me this is way too many checks, and it affects compiler optimizations too much.

Regards,
Zoltan





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sun, 28 Sep 2014 10:12:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sun, 28 Sep 2014 15:10:03 GMT) Full text and rfc822 format available.

Message #116 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sun, 28 Sep 2014 08:09:33 -0700
Zoltán Herczeg wrote:

> For me the question is whether binary search needs to be supported at the PCRE level.

It's purely a performance question.  GNU grep already uses libpcre to search 
binary data, and it works now.  It's just slow, that's all.  I'm willing to live
with this, and tell users "Sorry, but libpcre is not designed to search binary 
data quickly; if you want speed then don't use grep's -P option."  If you're 
willing to live with this too, we're done.

> removing a lot of optimizations.

You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case. 
 Keep them all.  It should be just as fast as before.  The idea is to have one
matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another 
matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it 
goes).  The former matcher will be just as fast as now, and the latter matcher 
will be faster than what libpcre has now.  I readily concede that this will 
require some nontrivial coding, but I don't concede that it will remove 
optimizations or make libpcre slower.  It should make libpcre faster; that's the 
point.

> You have a 100 byte long buffer, and you start matching from byte 50.

Grep already does that sort of thing.  And it's smart enough to start matching 
only at character boundaries.  It's not libpcre's job to worry about this; the 
caller can worry about it.

> For me this is way too many checks, and it affects compiler optimizations too much.

The code you posted could be made faster than that; among other things there 
should not be an unbounded backward scan.  And even the code you posted would 
often be faster than what's in libpcre now.  That early UTF-8 validity prepass 
is a killer.
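
(A sketch of what a bounded scan can look like, exploiting the fact that a
valid UTF-8 sequence has at most 3 continuation bytes; illustrative only,
not proposed libpcre code:)

/* Step back from P (START < P) toward the start of the previous
   character: skip at most 3 continuation bytes.  If the byte reached
   is not a valid lead byte for the distance travelled, the input is
   invalid there, and the caller can fall back to treating the byte
   just before P as a one-byte pseudo-character.  */
static unsigned char const *
prev_char_start (unsigned char const *start, unsigned char const *p)
{
  unsigned char const *q = p - 1;
  int i;
  for (i = 0; i < 3 && start < q && (*q & 0xc0) == 0x80; i++)
    q--;
  return q;
}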





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sun, 28 Sep 2014 15:11:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sun, 28 Sep 2014 22:08:01 GMT) Full text and rfc822 format available.

Message #122 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Santiago Ruano Rincón <santiago <at> debian.org>,
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sun, 28 Sep 2014 15:06:36 -0700
Nice! I didn't know about _Pragma. It's much better to encapsulate
that, keeping the #pragma directives out of function bodies.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Tue, 30 Sep 2014 18:12:02 GMT) Full text and rfc822 format available.

Message #125 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Zoltán Herczeg <hzmester <at> freemail.hu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: bug-grep <at> gnu.org, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Tue, 30 Sep 2014 20:10:58 +0200 (CEST)
Hi,

>It's purely a performance question.  GNU grep already uses libpcre to search 
>binary data, and it works now.  It's just slow, that's all.  I'm willing to live 
>with this, and tell users "Sorry, but libpcre is not designed to search binary 
>data quickly; if you want speed then don't use grep's -P option."  If you're 
>willing to live with this too, we're done.

Yes, PCRE is not designed for matching binary data as UTF-8: too much complexity for too little gain. Normal (non-UTF) search can be used on binary data without limitations.

>Grep already does that sort of thing.  And it's smart enough to start matching 
>only at character boundaries.  It's not libpcre's job to worry about this; the 
>caller can worry about it.

Thank you for bringing this up. I don't see any point in reimplementing what is already there. However, if PCRE said it supported UTF-8 matching in binary data, it would have to actually do so, because "what is there" depends on the environment. This is clearly the best argument for why the environment is responsible for handling the binary part of the data. Most environments need some kind of validation, and we would just duplicate code. It is good to hear that everything is already in grep; perhaps a few more lines are needed to do it in a thread.

>The code you posted could be made faster than that; among other things there 
>should not be an unbounded backward scan.  And even the code you posted would 
>often be faster than what's in libpcre now.  That early UTF-8 validity prepass 
>is a killer.

I would recommend disabling it. Its only purpose is to return early for invalid buffers. I am sure grep already knows when a buffer is invalid, since it scans the buffer.

Regards,
Zoltan





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Tue, 30 Sep 2014 18:15:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Tue, 30 Sep 2014 19:40:01 GMT) Full text and rfc822 format available.

Message #131 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Zoltán Herczeg <hzmester <at> freemail.hu>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Tue, 30 Sep 2014 12:39:17 -0700
On 09/30/2014 11:10 AM, Zoltán Herczeg wrote:
>
>> Grep already does that sort of thing.  And it's smart enough to start matching
>> only at character boundaries.  It's not libpcre's job to worry about this; the
>> caller can worry about it.
> Thank you for bringing this up. I don't see any point in reimplementing what is already there.

Sorry, it sounds like my earlier comment was unclear.  GNU grep is smart 
enough to start matching at character boundaries without checking the 
validity of the input data.  This helps it run faster.  However, because 
libpcre requires a validity prepass, grep -P must slow down and do the 
validity check one way or another.  Grep does this only when libpcre is 
used, and that's one reason grep -P is slower than plain grep.

It's not a question of duplicating code: grep already has code to 
validate binary data.  It's a question of performance. Requiring a 
prepass for validity checking is typically slower (or takes more energy, 
or whatever) than checking validity on the fly.  And in many cases going 
multithreaded would just make matters worse.

I can understand that you don't want to take on the burden of making a 
nontrivial libpcre performance improvement.  Also, I hope 'grep -P' 
performance, though not great, is good enough now to satisfy most 
users.  So perhaps we should just give the topic a rest.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 28 Nov 2014 03:00:03 GMT) Full text and rfc822 format available.

Message #134 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: 18454 <at> debbugs.gnu.org
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Fri, 28 Nov 2014 03:59:18 +0100
[Message part 1 (text/plain, inline)]
On binary files, it seems that testing the UTF-8 sequences in
pcresearch.c is faster than asking pcre_exec to do that (because
of the retry I assume); see attached patch. It actually checks
UTF-8 only if an invalid sequence was already found by pcre_exec,
assuming that pcre_exec can check the validity of a valid text
file in a faster way.
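
(A sketch of that retry strategy, assuming pcre_exec reports
PCRE_ERROR_BADUTF8 on lines containing an invalid sequence; the
valid_utf8_prefix helper and the fixed ovector size are hypothetical,
and the attached patch may well differ:)

#include <pcre.h>

enum { NSUB = 300 };

/* Hypothetical helper: number of leading bytes of LINE that form
   valid UTF-8.  */
extern int valid_utf8_prefix (char const *line, int len);

static int
match_line (pcre *cre, pcre_extra *extra, char const *line, int len)
{
  int sub[NSUB];
  /* Try the whole line first; pcre_exec validates UTF-8 itself, and
     is assumed to be fast on lines that really are valid text.  */
  int e = pcre_exec (cre, extra, line, len, 0, 0, sub, NSUB);
  if (e == PCRE_ERROR_BADUTF8)
    /* Only now pay for our own scan: retry on the valid prefix, with
       the costly recheck disabled.  */
    e = pcre_exec (cre, extra, line, valid_utf8_prefix (line, len),
                   0, PCRE_NO_UTF8_CHECK, sub, NSUB);
  return e;
}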

On some file similar to PDF (test 1):

Before: 1.77s
After:  1.38s

But now, the main problem is the many calls to pcre_exec. Indeed, if I replace
the non-ASCII bytes by \n with:

  LC_ALL=C tr \\200-\\377 \\n

(now, one has a valid file but with many short lines), the grep -P time
is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
with:

  LC_ALL=C tr \\200-\\377 \\000

the grep -P time is 0.30s (test 3), thus it is much faster.

Note also that libpcre is much slower than normal grep on simple words,
but on "a[0-9]b", it can be faster:

          grep      PCRE   PCRE+patch
test 1    4.31      1.90      1.53
test 2    0.18      1.61      1.63
test 3    3.28      0.39      0.39

With grep, I wonder why test 2 is much faster.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
[grep221-pcresearch.patch (text/plain, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 28 Nov 2014 14:33:02 GMT) Full text and rfc822 format available.

Message #137 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 28 Nov 2014 23:31:49 +0900
On Fri, 28 Nov 2014 03:59:18 +0100
Vincent Lefevre <vincent <at> vinc17.net> wrote:

> On binary files, it seems that testing the UTF-8 sequences in
> pcresearch.c is faster than asking pcre_exec to do that (because
> of the retry I assume); see attached patch. It actually checks
> UTF-8 only if an invalid sequence was already found by pcre_exec,
> assuming that pcre_exec can check the validity of a valid text
> file in a faster way.
> 
> On some file similar to PDF (test 1):
> 
> Before: 1.77s
> After:  1.38s
> 
> But now, the main problem is the many calls to pcre_exec. Indeed, if I replace
> the non-ASCII bytes by \n with:
> 
>   LC_ALL=C tr \\200-\\377 \\n
> 
> (now, one has a valid file but with many short lines), the grep -P time
> is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
> with:
> 
>   LC_ALL=C tr \\200-\\377 \\000
> 
> the grep -P time is 0.30s (test 3), thus it is much faster.
> 
> Note also that libpcre is much slower than normal grep on simple words,
> but on "a[0-9]b", it can be faster:
> 
>           grep      PCRE   PCRE+patch
> test 1    4.31      1.90      1.53
> test 2    0.18      1.61      1.63
> test 3    3.28      0.39      0.39
> 
> With grep, I wonder why test 2 is much faster.
> 
> -- 
> Vincent Lefevre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Thanks for the patch.  However, it seems to me that valid_utf() in PCRE
also considers 5- and 6-byte characters.

IMHO, we should assume that grep doesn't know how valid_utf() checks an
input text; we only know that PCRE checks whether an input text is valid
UTF-8 or not, so that even if PCRE changes the behaviour of valid_utf(),
grep should still work.

If we do not check for invalid UTF-8 characters with valid_utf8() in
advance, grep may dump core with PCRE_NO_UTF8_CHECK.
See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586

So we cannot avoid checking for invalid UTF-8 characters with valid_utf8().
Furthermore, we must perform the check the way PCRE expects, but grep does
not know how PCRE checks for invalid UTF-8 characters, because of the
assumption above.





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 28 Nov 2014 15:51:01 GMT) Full text and rfc822 format available.

Message #140 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 28 Nov 2014 16:50:29 +0100
On 2014-11-28 23:31:49 +0900, Norihiro Tanaka wrote:
> Thanks for the patch.  However, it seems to me that valid_utf() in PCRE
> also considers 5- and 6-byte characters.

In any case, even if PCRE considers these sequences as valid UTF-8,
they shouldn't match because they are not part of Unicode (if they
can match, this would be a bug in libpcre). My patch considers that
these sequences do not match, which is consistent with the expected
behavior.

> IMHO, we should assume that grep doesn't know how valid_utf() checks an
> input text; we only know that PCRE checks whether an input text is valid
> UTF-8 or not, so that even if PCRE changes the behaviour of valid_utf(),
> grep should still work.
> 
> If we do not check for invalid UTF-8 characters with valid_utf8() in
> advance, grep may dump core with PCRE_NO_UTF8_CHECK.
> See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586
> 
> So we cannot avoid checking for invalid UTF-8 characters with valid_utf8().
> Furthermore, we must perform the check the way PCRE expects, but grep does
> not know how PCRE checks for invalid UTF-8 characters, because of the
> assumption above.

What matters is whether a sequence corresponds to a valid UTF-8
encoded Unicode character. My patch ensures that pcre_exec is called
on a string with only such characters, which implies that this is
also valid UTF-8 for PCRE (whether Unicode validity is also considered
in valid_utf8() or not). So, there's no valid reason why grep would
crash under such a condition.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 29 Nov 2014 02:59:02 GMT) Full text and rfc822 format available.

Message #143 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 29 Nov 2014 11:58:48 +0900
On Fri, 28 Nov 2014 16:50:29 +0100
Vincent Lefevre <vincent <at> vinc17.net> wrote:
> What matters is whether a sequence corresponds to a valid UTF-8
> encoded Unicode character. My patch ensures that pcre_exec is called
> on a string with only such characters, which implies that this is
> also valid UTF-8 for PCRE (whether Unicode validity is also considered
> in valid_utf8() or not). So, there's no valid reason why grep would
> crash under such a condition.

It seems that PCRE treats e.g. the following characters as invalid.  It means
we should not pass these characters to pcre_exec with the PCRE_NO_UTF8_CHECK
option.

  0xE0 0xC2 0xFF
  0xED 0xA0 0xFF
  0xF0 0xBF 0xFF 0xFF
  0xF4 0xBF 0xBF 0xBF






Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Thu, 18 Dec 2014 13:47:01 GMT) Full text and rfc822 format available.

Message #146 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Thu, 18 Dec 2014 14:45:58 +0100
Sorry for the late reply.

On 2014-11-29 11:58:48 +0900, Norihiro Tanaka wrote:
> On Fri, 28 Nov 2014 16:50:29 +0100
> Vincent Lefevre <vincent <at> vinc17.net> wrote:
> > What matters is whether a sequence corresponds to a valid UTF-8
> > encoded Unicode character. My patch ensures that pcre_exec is called
> > on a string with only such characters, which implies that this is
> > also valid UTF-8 for PCRE (whether Unicode validity is also considered
> > in valid_utf8() or not). So, there's no valid reason why grep would
> > crash under such a condition.
> 
> It seems that PCRE treats e.g. the following characters as invalid.  It means
> we should not pass these characters to pcre_exec with the PCRE_NO_UTF8_CHECK
> option.
> 
>   0xE0 0xC2 0xFF
>   0xED 0xA0 0xFF
>   0xF0 0xBF 0xFF 0xFF

If I'm not mistaken, these first three are also treated as invalid by
my patch (and should be treated as invalid by any tool).

>   0xF4 0xBF 0xBF 0xBF

(corresponding to U+13FFFF, which is larger than U+10FFFF).

Well, I followed some comment in the grep source, which is currently
incorrect.

pcreunicode(3) specifies that it follows RFC 3629, and that only
values in the range U+0 to U+10FFFF, excluding the surrogate area,
are allowed. I'll try to update my patch. But IMHO, it would be
better to get PCRE improved, and I have opened a bug:

  http://bugs.exim.org/show_bug.cgi?id=1554

BTW,

  printf "\xF4\xBF\xBF\xBF\n" | grep .

finds a match, and this appears to be a bug (grep should follow
the current standard).

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Fri, 19 Dec 2014 14:01:02 GMT) Full text and rfc822 format available.

Message #149 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 19 Dec 2014 23:00:38 +0900
On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <vincent <at> vinc17.net> wrote:
> > 
> >   0xE0 0xC2 0xFF
> >   0xED 0xA0 0xFF
> >   0xF0 0xBF 0xFF 0xFF
> 
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).

I got them from pcre_valid_utf8(), but I made some mistakes.  They are
as follows.

  0xE0 0xAF 0xBF
  0xED 0xA0 0xBF
  0xF0 0x8F 0xBF 0xBF

By the way, they correspond to the following code in pcre_valid_utf8().

    if (c == 0xe0 && (d & 0x20) == 0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR14;
      }

    ........

    if (c == 0xf0 && (d & 0x30) == 0)
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR13;
      }

> BTW,
> 
>   printf "\xF4\xBF\xBF\xBF\n" | grep .
> 
> finds a match, and this appears to be a bug (grep should follow
> the current standard).

I also see it as a bug, as you say.  mbrlen() in glibc returns (size_t) -1
for the sequence.





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 00:14:02 GMT) Full text and rfc822 format available.

Message #152 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 20 Dec 2014 01:13:39 +0100
On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote:
> I got them from pcre_valid_utf8(), but I made some mistakes.  They are
> as follows.
> 
>   0xE0 0xAF 0xBF

This one is valid UTF-8 and corresponds to the code point U+0BFF, and
the following matches:

$ printf "\xE0\xAF\xBF\n" | grep -P .
௿

>   0xED 0xA0 0xBF

OK, this is in the surrogate area, and it doesn't match with PCRE.

>   0xF0 0x8F 0xBF 0xBF

This one is an overlong encoding: it decodes to (0xF0 & 0x07) << 18 |
(0x8F & 0x3F) << 12 | (0xBF & 0x3F) << 6 | (0xBF & 0x3F) = 0xFFFF, i.e.
U+FFFF encoded in four bytes instead of the minimal three, so it is invalid.

> > BTW,
> > 
> >   printf "\xF4\xBF\xBF\xBF\n" | grep .
> > 
> > finds a match, and this appears to be a bug (grep should follow
> > the current standard).
> 
> I also see it as a bug, as you say.  mbrlen() in glibc returns (size_t) -1
> for the sequence.

Ditto with:

  printf "\xED\xA0\xBF\n" | grep .

(surrogate area).

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 01:24:02 GMT) Full text and rfc822 format available.

Message #155 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: 18454 <at> debbugs.gnu.org
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Date: Sat, 20 Dec 2014 02:23:27 +0100
On 2014-09-12 03:24:49 +0200, Vincent Lefevre wrote:
> Timings with the Debian packages on my personal svn working copy
> (binary + text files):
> 
> 2.18-2   0.9s with -P, 0.4s without -P
> 2.20-3  11.6s with -P, 0.4s without -P

I've done another test on a large PDF file. Let's forget grep 2.18,
which is indeed too buggy (I could reproduce a buffer overflow). But
let's compare with pcregrep, using the "zzz" pattern:

Debian grep 2.20-3      6.64s (with -P)
Upstream grep 2.21      5.39s (with -P)
Debian pcregrep 8.35    0.71s

In all cases, PCRE is used, but pcregrep is much faster than grep -P.

(Note: on this example, "grep" alone is much faster than pcregrep,
but this is not related to the invalid encoding, and depending on
the pattern, either grep or PCRE can be significantly faster.)

So, perhaps the right method would be to do what pcregrep does,
even though "grep -P" can currently be a bit faster than pcregrep in
some cases.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 01:32:01 GMT) Full text and rfc822 format available.

Message #158 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: 18454 <at> debbugs.gnu.org
Cc: Vincent Lefevre <vincent <at> vinc17.net>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 20 Dec 2014 10:31:46 +0900
On Fri, 19 Dec 2014 23:00:38 +0900
Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> I also see it as a bug, as you say.  mbrlen() in glibc returns (size_t) -1
> for the sequence.

$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
Binary file (standard input) matches
$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
$

regex also behaves the same as grep -G; e.g. sed, which uses only regex,
returns the line.  Therefore, I think that a character in the surrogate
area matching a period with grep -G is not a bug, although the behavior
might not obey the standard.

$ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'

By the way, mbrlen() returns (size_t) -1 for the character.

OTOH, if a character in the surrogate area does not match a period in
PCRE, I think that the character should also not match a period in grep -P.





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 01:36:02 GMT) Full text and rfc822 format available.

Message #161 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: bug-grep <at> gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 19 Dec 2014 17:35:13 -0800
On 12/19/2014 05:23 PM, Vincent Lefevre wrote:
> So, perhaps the right method would be to do what pcregrep does,

What does pcregrep do?




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 02:14:02 GMT) Full text and rfc822 format available.

Message #164 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Vincent Lefevre <vincent <at> vinc17.net>
To: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 20 Dec 2014 03:13:39 +0100
On 2014-12-20 10:31:46 +0900, Norihiro Tanaka wrote:
> On Fri, 19 Dec 2014 23:00:38 +0900
> Norihiro Tanaka <noritnk <at> kcn.ne.jp> wrote:
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
> Binary file (standard input) matches
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
> $
> 
> regex also behaves the same as grep -G; e.g. sed, which uses only regex,
> returns the line.  Therefore, I think that a character in the surrogate
> area matching a period with grep -G is not a bug, although the behavior
> might not obey the standard.
> 
> $ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'
> 
> By the way, mbrlen() returns (size_t) -1 for the character.

IMHO, both grep and sed should be fixed to obey RFC 3629, which
specifies UTF-8. And other tools too (iconv...).

> OTOH, if a character in the surrogate area does not match a period in
> PCRE, I think that the character should also not match a period in grep -P.

I agree.

-- 
Vincent Lefèvre <vincent <at> vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 02:32:01 GMT) Full text and rfc822 format available.

Message #167 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Vincent Lefevre <vincent <at> vinc17.net>, Norihiro Tanaka <noritnk <at> kcn.ne.jp>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Fri, 19 Dec 2014 18:31:05 -0800
On 12/19/2014 06:13 PM, Vincent Lefevre wrote:
> both grep and sed should be fixed to obey RFC 3629

Shouldn't this be done in the C library code?  If mbrlen does the right 
thing, grep and sed should do the right thing.




Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 02:46:02 GMT) Full text and rfc822 format available.

Message #170 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Vincent Lefevre <vincent <at> vinc17.net>
Cc: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 20 Dec 2014 11:45:15 +0900
On Sat, 20 Dec 2014 02:23:27 +0100
Vincent Lefevre <vincent <at> vinc17.net> wrote:

> Debian grep 2.20-3      6.64s (with -P)
> Upstream grep 2.21      5.39s (with -P)
> Debian pcregrep 8.35    0.71s

Did you use pcregrep --utf-8?  You should use pcregrep --utf-8 to
compare.  By the way, pcregrep --utf-8 does not support binary files:
if pcregrep finds 20 errors, it exits without reading the input text
to the end.

$ yes src/grep | head -1000 | xargs cat > big_grep
$ ls -l big_grep
-rw-r--r--. 1 staff users 611453000 Dec 20 11:30 big_grep
$ time -p env LC_ALL=en_US.utf8 src/grep -P test big_grep
real 10.16
user 10.09
sys 0.07
$ time -p pcregrep --buffer-size=65536 test big_grep
real 1.50
user 1.41
sys 0.09
$ time -p pcregrep --buffer-size=65536 --utf-8 test big_grep 2>&1 | tail -1
pcregrep: Too many errors - abandoned.
real 0.00
user 0.00
sys 0.00
$ pcregrep --version
pcregrep version 8.36 2014-09-26





Information forwarded to bug-grep <at> gnu.org:
bug#18454; Package grep. (Sat, 20 Dec 2014 02:58:02 GMT) Full text and rfc822 format available.

Message #173 received at 18454 <at> debbugs.gnu.org (full text, mbox):

From: Norihiro Tanaka <noritnk <at> kcn.ne.jp>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Vincent Lefevre <vincent <at> vinc17.net>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Sat, 20 Dec 2014 11:57:39 +0900
On Fri, 19 Dec 2014 18:31:05 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> If mbrlen does the right thing, grep and sed should do the right thing.

mbrlen() already does the right thing.  So perhaps they depend on the
behavior of regex.  Even if so, I think that it should also be fixed
in the C library.

cat <<EOF |
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

int
main ()
{
  setlocale (LC_ALL, "");
  mbstate_t mbs = { 0 };
  char s[] = { 0xED, 0xA0, 0xBF };
  size_t len = mbrlen (s, 3, &mbs);
  printf ("mbrlen = %d\n", len);
  exit (EXIT_SUCCESS);
}
EOF
gcc -xc - && ./a.out





Removed tag(s) patch. Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Mon, 15 Aug 2016 22:54:02 GMT) Full text and rfc822 format available.

Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>:
You have taken responsibility. (Wed, 24 Nov 2021 03:37:04 GMT) Full text and rfc822 format available.

Notification sent to Vincent Lefevre <vincent <at> vinc17.net>:
bug acknowledged by developer. (Wed, 24 Nov 2021 03:37:04 GMT) Full text and rfc822 format available.

Message #180 received at 18454-done <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: 18454-done <at> debbugs.gnu.org
Cc: Santiago Ruano Rincón <santiago <at> debian.org>,
 Norihiro Tanaka <noritnk <at> kcn.ne.jp>, Jim Meyering <jim <at> meyering.net>,
 Vincent Lefevre <vincent <at> vinc17.net>,
 Zoltán Herczeg <hzmester <at> freemail.hu>,
 Eric Blake <eblake <at> redhat.com>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Date: Tue, 23 Nov 2021 19:36:11 -0800
On 9/30/14 12:39, Paul Eggert wrote:

> GNU grep is smart 
> enough to start matching at character boundaries without checking the 
> validity of the input data.  This helps it run faster.  However, because 
> libpcre requires a validity prepass, grep -P must slow down and do the 
> validity check one way or another.  Grep does this only when libpcre is 
> used, and that's one reason grep -P is slower than plain grep.

Now that Grep master on Savannah has been changed to use PCRE2 instead 
of PCRE, the 'grep -P' performance problem seems to have been fixed, in 
that the following commands now take about the same amount of time:

grep -P zzzyyyxxx 10840.pdf
pcre2grep -U zzzyyyxxx 10840.pdf

where the file is from <http://research.nhm.org/pdfs/10840/10840.pdf>. 
Formerly, 'grep -P' was about 10x slower on this test.

My guess is that the grep -P performance boost comes from bleeding-edge 
grep using PCRE2's PCRE2_MATCH_INVALID_UTF option.
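
(For reference, a minimal sketch of what using that option looks like; this
is my assumption about the mechanism, not grep's actual code, and it needs
PCRE2 >= 10.34:)

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int
main (void)
{
  int errcode;
  PCRE2_SIZE erroff;
  /* PCRE2_MATCH_INVALID_UTF makes matching treat invalid UTF-8 as
     non-matching gaps instead of failing, so no validity prepass of
     the subject is needed.  */
  pcre2_code *re = pcre2_compile ((PCRE2_SPTR) "zzzyyyxxx",
                                  PCRE2_ZERO_TERMINATED,
                                  PCRE2_UTF | PCRE2_MATCH_INVALID_UTF,
                                  &errcode, &erroff, NULL);
  if (!re)
    return 1;
  pcre2_match_data *md = pcre2_match_data_create_from_pattern (re, NULL);
  PCRE2_SPTR subject = (PCRE2_SPTR) "abc\xff zzzyyyxxx";  /* 14 bytes */
  int rc = pcre2_match (re, subject, 14, 0, 0, md, NULL);
  printf ("rc = %d\n", rc);  /* >= 0 means a match was found */
  pcre2_match_data_free (md);
  pcre2_code_free (re);
  return 0;
}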

I'm closing this old bug report <https://bugs.gnu.org/18454>. We can 
always reopen it if there are still performance issues that I've missed.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 22 Dec 2021 12:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 87 days ago.
