GNU bug report logs - #18454
Improve performance when -P (PCRE) is used in UTF-8 locales


Package: grep; Reported by: Vincent Lefevre <vincent@HIDDEN>; dated Fri, 12 Sep 2014 01:26:02 UTC; Maintainer for grep is bug-grep@HIDDEN.
Removed tag(s) patch. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 02:57:42 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 21:57:42 2014
Received: from localhost ([127.0.0.1]:52056 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y2AEg-0005Zd-F4
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:57:42 -0500
Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:51148)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1Y2AEe-0005ZT-GR
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:57:41 -0500
Received: from imp01 (mailgw5.kcn.ne.jp [61.86.15.231])
 by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 92BE4C8009
 for <18454 <at> debbugs.gnu.org>; Sat, 20 Dec 2014 11:57:38 +0900 (JST)
Received: from mail06.kcn.ne.jp ([61.86.6.185]) by imp01 with bizsmtp
 id Vexe1p00J3zXHqt01exeBR; Sat, 20 Dec 2014 11:57:38 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.71] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail06.kcn.ne.jp (Postfix) with ESMTPA id 4788D1BF0091;
 Sat, 20 Dec 2014 11:57:38 +0900 (JST)
Date: Sat, 20 Dec 2014 11:57:39 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <5494DF69.8010509@HIDDEN>
References: <20141220021339.GN32684@HIDDEN>
 <5494DF69.8010509@HIDDEN>
Message-Id: <20141220115738.F2DC.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.0 (/)

On Fri, 19 Dec 2014 18:31:05 -0800
Paul Eggert <eggert@HIDDEN> wrote:

> If mbrlen does the right thing, grep and sed should do the right thing.

mbrlen() already does the right thing.  So perhaps grep and sed depend on
the behavior of regex.  Even so, I think that this should also be fixed
in the C library.

cat <<EOF |
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>
#include <locale.h>

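int
main (void)
{
  /* Use the locale from the environment; run this in a UTF-8 locale.  */
  setlocale (LC_ALL, "");
  mbstate_t mbs = { 0 };
  /* UTF-8-style encoding of U+D83F, a surrogate code point.  */
  char s[] = { (char) 0xED, (char) 0xA0, (char) 0xBF };
  size_t len = mbrlen (s, sizeof s, &mbs);
  /* mbrlen returns size_t; cast so that (size_t) -1 prints as -1.  */
  printf ("mbrlen = %ld\n", (long) len);
  exit (EXIT_SUCCESS);
}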
EOF
gcc -xc - && ./a.out
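
The same check can be run over the other byte sequences discussed in this bug. The following is only a sketch: the names mbrlen_fresh and print_mbrlen_results are made up for illustration, and the results depend on the locale selected by the caller, so no output is shown.

```c
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: call mbrlen on the first n bytes of s, starting
   from the initial conversion state.  The result depends on the locale
   that the caller has selected with setlocale.  */
size_t
mbrlen_fresh (const char *s, size_t n)
{
  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  return mbrlen (s, n, &mbs);
}

/* Print the mbrlen result for each sequence from this thread.  */
void
print_mbrlen_results (void)
{
  static const char *const seqs[] = {
    "\xE0\xAF\xBF",       /* U+0BFF, valid */
    "\xED\xA0\xBF",       /* surrogate area */
    "\xF0\x8F\xBF\xBF",   /* overlong 4-byte form */
    "\xF4\xBF\xBF\xBF",   /* above U+10FFFF */
  };
  for (size_t i = 0; i < sizeof seqs / sizeof seqs[0]; i++)
    printf ("sequence %zu: mbrlen = %ld\n", i,
            (long) mbrlen_fresh (seqs[i], strlen (seqs[i])));
}
```

To use it, call setlocale (LC_ALL, "") from <locale.h> in main, then print_mbrlen_results (), and run the program in a UTF-8 locale.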





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 02:45:20 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 21:45:20 2014
Received: from localhost ([127.0.0.1]:52048 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y2A2h-0005Ha-Nn
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:45:20 -0500
Received: from mailgw04.kcn.ne.jp ([61.86.7.211]:36098)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1Y2A2f-0005HQ-Az
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:45:18 -0500
Received: from imp03 (mailgw7.kcn.ne.jp [61.86.15.238])
 by mailgw04.kcn.ne.jp (Postfix) with ESMTP id 624FF6C12C7
 for <18454 <at> debbugs.gnu.org>; Sat, 20 Dec 2014 11:45:15 +0900 (JST)
Received: from mail07.kcn.ne.jp ([61.86.6.186]) by imp03 with bizsmtp
 id VelF1p00B40oyB901elFvo; Sat, 20 Dec 2014 11:45:15 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.71] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail07.kcn.ne.jp (Postfix) with ESMTPA id 0E76DD5009D;
 Sat, 20 Dec 2014 11:45:15 +0900 (JST)
Date: Sat, 20 Dec 2014 11:45:15 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Vincent Lefevre <vincent@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <20141220012326.GA2678@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <20141220012326.GA2678@HIDDEN>
Message-Id: <20141220114515.F2D4.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

On Sat, 20 Dec 2014 02:23:27 +0100
Vincent Lefevre <vincent@HIDDEN> wrote:

> Debian grep 2.20-3      6.64s (with -P)
> Upstream grep 2.21      5.39s (with -P)
> Debian pcregrep 8.35    0.71s

Did you use pcregrep --utf-8?  You should use pcregrep --utf-8 for a fair
comparison.  By the way, pcregrep --utf-8 does not support binary files:
once pcregrep has found 20 errors, it exits without reading the rest of
the input.

$ yes src/grep | head -1000 | xargs cat > big_grep
$ ls -l big_grep
-rw-r--r--. 1 staff users 611453000 Dec 20 11:30 big_grep
$ time -p env LC_ALL=en_US.utf8 src/grep -P test big_grep
real 10.16
user 10.09
sys 0.07
$ time -p pcregrep --buffer-size=65536 test big_grep
real 1.50
user 1.41
sys 0.09
$ time -p pcregrep --buffer-size=65536 --utf-8 test big_grep 2>&1 | tail -1
pcregrep: Too many errors - abandoned.
real 0.00
user 0.00
sys 0.00
$ pcregrep --version
pcregrep version 8.36 2014-09-26





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 02:31:18 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 21:31:18 2014
Received: from localhost ([127.0.0.1]:52044 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y29p8-0004xC-FG
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:31:18 -0500
Received: from smtp.cs.ucla.edu ([131.179.128.62]:47377)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1Y29p7-0004x5-BX
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:31:17 -0500
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id AF888A60011;
 Fri, 19 Dec 2014 18:31:16 -0800 (PST)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id LRs8tzbXU-+R; Fri, 19 Dec 2014 18:31:08 -0800 (PST)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 17FEFA60001;
 Fri, 19 Dec 2014 18:31:08 -0800 (PST)
Message-ID: <5494DF69.8010509@HIDDEN>
Date: Fri, 19 Dec 2014 18:31:05 -0800
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: Vincent Lefevre <vincent@HIDDEN>, Norihiro Tanaka <noritnk@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20141218134558.GQ3818@HIDDEN>	<20141219230038.CE8D.27F6AC2D@HIDDEN>	<20141220103146.F2C0.27F6AC2D@HIDDEN>
 <20141220021339.GN32684@HIDDEN>
In-Reply-To: <20141220021339.GN32684@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

On 12/19/2014 06:13 PM, Vincent Lefevre wrote:
> both grep and sed should be fixed to obey RFC 3629

Shouldn't this be done in the C library code?  If mbrlen does the right 
thing, grep and sed should do the right thing.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 02:13:43 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 21:13:43 2014
Received: from localhost ([127.0.0.1]:52039 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y29Y6-0004Wf-Pf
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:13:43 -0500
Received: from ioooi.vinc17.net ([92.243.22.117]:35555)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1Y29Y4-0004WW-4l
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 21:13:40 -0500
Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128])
 by ioooi.vinc17.net (Postfix) with ESMTPSA id A282A444;
 Sat, 20 Dec 2014 03:13:39 +0100 (CET)
Received: by xvii.vinc17.org (Postfix, from userid 1000)
 id 4420E21A07A; Sat, 20 Dec 2014 03:13:39 +0100 (CET)
Date: Sat, 20 Dec 2014 03:13:39 +0100
From: Vincent Lefevre <vincent@HIDDEN>
To: Norihiro Tanaka <noritnk@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20141220021339.GN32684@HIDDEN>
References: <20141218134558.GQ3818@HIDDEN>
 <20141219230038.CE8D.27F6AC2D@HIDDEN>
 <20141220103146.F2C0.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20141220103146.F2C0.27F6AC2D@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6371-vl-r75100 (2014-11-04)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

On 2014-12-20 10:31:46 +0900, Norihiro Tanaka wrote:
> On Fri, 19 Dec 2014 23:00:38 +0900
> Norihiro Tanaka <noritnk@HIDDEN> wrote:
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
> Binary file (standard input) matches
> $ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
> $
> 
> regex also behaves the same as grep -G; e.g. sed, which uses only regex,
> prints the line.  Therefore, I think that a character in the surrogate
> area matching a period with grep -G is not a bug, although the behavior
> might not obey a standard.
> 
> $ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'
> 
> By the way, mbrlen() returns (size_t) -1 for the character.

IMHO, both grep and sed should be fixed to obey RFC 3629, which
specifies UTF-8. And other tools too (iconv...).

> OTOH, if a character in the surrogate area does not match a period in
> PCRE, I think that the character should not match a period with grep -P
> either.

I agree.

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 20 Dec 2014 01:35:52 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 20:35:52 2014
Received: from localhost ([127.0.0.1]:52032 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y28xU-0003ao-6x
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:35:52 -0500
Received: from eggs.gnu.org ([208.118.235.92]:48436)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1Y28xS-0003ag-Be
 for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:35:51 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1Y28xI-0001Cq-8l
 for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:35:50 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:41402)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1Y28xH-0001Ck-WA
 for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:35:40 -0500
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50101)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1Y28xA-00004y-FZ
 for bug-grep@HIDDEN; Fri, 19 Dec 2014 20:35:39 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1Y28x2-00018f-Lo
 for bug-grep@HIDDEN; Fri, 19 Dec 2014 20:35:32 -0500
Received: from smtp.cs.ucla.edu ([131.179.128.62]:35618)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1Y28x2-00017e-GF
 for bug-grep@HIDDEN; Fri, 19 Dec 2014 20:35:24 -0500
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 0A26139E8018
 for <bug-grep@HIDDEN>; Fri, 19 Dec 2014 17:35:16 -0800 (PST)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id 0Fuf5Wdx2vRp for <bug-grep@HIDDEN>;
 Fri, 19 Dec 2014 17:35:13 -0800 (PST)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 40982A60088
 for <bug-grep@HIDDEN>; Fri, 19 Dec 2014 17:35:13 -0800 (PST)
Message-ID: <5494D251.5050403@HIDDEN>
Date: Fri, 19 Dec 2014 17:35:13 -0800
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.3.0
MIME-Version: 1.0
To: bug-grep@HIDDEN
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <20141220012326.GA2678@HIDDEN>
In-Reply-To: <20141220012326.GA2678@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit

On 12/19/2014 05:23 PM, Vincent Lefevre wrote:
> So perhaps the right method would be to do what pcregrep does,

What does pcregrep do?




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 01:31:52 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 20:31:51 2014
Received: from localhost ([127.0.0.1]:52028 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y28tb-0003Up-Ih
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:31:51 -0500
Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:43532)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1Y28tY-0003Ue-AB
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:31:49 -0500
Received: from imp03 (mailgw7.kcn.ne.jp [61.86.15.238])
 by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 1392FC8009
 for <18454 <at> debbugs.gnu.org>; Sat, 20 Dec 2014 10:31:46 +0900 (JST)
Received: from mail09.kcn.ne.jp ([61.86.6.188]) by imp03 with bizsmtp
 id VdXm1p00343QJrh01dXmGP; Sat, 20 Dec 2014 10:31:46 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.71] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail09.kcn.ne.jp (Postfix) with ESMTPA id DF0D51BD00C3;
 Sat, 20 Dec 2014 10:31:45 +0900 (JST)
Date: Sat, 20 Dec 2014 10:31:46 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <20141219230038.CE8D.27F6AC2D@HIDDEN>
References: <20141218134558.GQ3818@HIDDEN>
 <20141219230038.CE8D.27F6AC2D@HIDDEN>
Message-Id: <20141220103146.F2C0.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>

On Fri, 19 Dec 2014 23:00:38 +0900
Norihiro Tanaka <noritnk@HIDDEN> wrote:
> I also see it as a bug, as you say.  mbrlen() in glibc returns (size_t) -1
> for the sequence.

$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -G .
Binary file (standard input) matches
$ printf "\xED\xA0\xBF\n" | LC_ALL=en_US.utf8 src/grep -P .
$

regex also behaves the same as grep -G; e.g. sed, which uses only regex,
prints the line.  Therefore, I think that a character in the surrogate
area matching a period with grep -G is not a bug, although the behavior
might not obey a standard.

$ printf "\xED\xA0\xBF\n" | LANG=en_US.utf8 sed -ne '/./p'

By the way, mbrlen() returns (size_t) -1 for the character.

OTOH, if a character in the surrogate area does not match a period in
PCRE, I think that the character should not match a period with grep -P
either.
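
The regex behavior can also be checked without grep or sed by calling the POSIX regex functions directly. The following is a minimal sketch under that assumption; the name dot_matches is made up for illustration, and the result for non-ASCII bytes depends on the C library and on the locale in effect.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the POSIX basic regular expression "."
   matches somewhere in the NUL-terminated string s, 0 if it does not, and
   -1 if the pattern cannot be compiled.  The result depends on the locale
   that is in effect (set via setlocale) when the function is called.  */
int
dot_matches (const char *s)
{
  regex_t re;
  if (regcomp (&re, ".", REG_NOSUB) != 0)
    return -1;
  int rc = regexec (&re, s, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

After setlocale (LC_ALL, "en_US.utf8"), calling dot_matches ("\xED\xA0\xBF") shows whether the C library's regex accepts the surrogate sequence as a character.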





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 01:23:31 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 20:23:30 2014
Received: from localhost ([127.0.0.1]:52024 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y28lW-0003Hw-HV
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:23:30 -0500
Received: from ioooi.vinc17.net ([92.243.22.117]:35543)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1Y28lU-0003Hn-Hg
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 20:23:29 -0500
Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128])
 by ioooi.vinc17.net (Postfix) with ESMTPSA id 730DE444;
 Sat, 20 Dec 2014 02:23:27 +0100 (CET)
Received: by xvii.vinc17.org (Postfix, from userid 1000)
 id 25FB221A07A; Sat, 20 Dec 2014 02:23:27 +0100 (CET)
Date: Sat, 20 Dec 2014 02:23:27 +0100
From: Vincent Lefevre <vincent@HIDDEN>
To: 18454 <at> debbugs.gnu.org
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Message-ID: <20141220012326.GA2678@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20140912012449.GB18162@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6371-vl-r75100 (2014-11-04)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454

On 2014-09-12 03:24:49 +0200, Vincent Lefevre wrote:
> Timings with the Debian packages on my personal svn working copy
> (binary + text files):
> 
> 2.18-2   0.9s with -P, 0.4s without -P
> 2.20-3  11.6s with -P, 0.4s without -P

I've done another test on a large PDF file. Let's forget grep 2.18,
which is indeed too buggy (I could reproduce a buffer overflow). But
let's compare with pcregrep, using the "zzz" pattern:

Debian grep 2.20-3      6.64s (with -P)
Upstream grep 2.21      5.39s (with -P)
Debian pcregrep 8.35    0.71s

In all cases, PCRE is used, but pcregrep is much faster than grep -P.

(Note: on this example, "grep" alone is much faster than pcregrep,
but this is not related to the invalid encoding, and depending on
the pattern, either grep or PCRE can be significantly faster.)

So perhaps the right method would be to do what pcregrep does,
even though "grep -P" can currently be a bit faster than pcregrep in
some cases.

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 20 Dec 2014 00:13:44 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 19:13:44 2014
Received: from localhost ([127.0.0.1]:51991 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y27fz-0000Cs-NW
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 19:13:43 -0500
Received: from ioooi.vinc17.net ([92.243.22.117]:35531)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1Y27fw-0000Ci-Su
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 19:13:41 -0500
Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128])
 by ioooi.vinc17.net (Postfix) with ESMTPSA id 83E7E444;
 Sat, 20 Dec 2014 01:13:39 +0100 (CET)
Received: by xvii.vinc17.org (Postfix, from userid 1000)
 id 3FC6921A07A; Sat, 20 Dec 2014 01:13:39 +0100 (CET)
Date: Sat, 20 Dec 2014 01:13:39 +0100
From: Vincent Lefevre <vincent@HIDDEN>
To: Norihiro Tanaka <noritnk@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20141220001339.GJ32684@HIDDEN>
References: <20141129115848.6DF7.27F6AC2D@HIDDEN>
 <20141218134558.GQ3818@HIDDEN>
 <20141219230038.CE8D.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20141219230038.CE8D.27F6AC2D@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6371-vl-r75100 (2014-11-04)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

On 2014-12-19 23:00:38 +0900, Norihiro Tanaka wrote:
> I got them from pcre_valid_utf8(), but I made some mistakes.  The correct
> sequences are as follows.
> 
>   0xE0 0xAF 0xBF

This one is valid UTF-8 and corresponds to the code point U+0BFF, and
the following matches:

$ printf "\xE0\xAF\xBF\n" | grep -P .
௿

>   0xED 0xA0 0xBF

OK, this is in the surrogate area, and it doesn't match with PCRE.

>   0xF0 0x8F 0xBF 0xBF

This is an overlong 4-byte encoding of U+FFFF (the payload bits give
0xFFFF, not a value above U+10FFFF), so it is invalid.

> > BTW,
> > 
> >   printf "\xF4\xBF\xBF\xBF\n" | grep .
> > 
> > finds a match, and this appears to be a bug (grep should follow
> > the current standard).
> 
> I also see it as a bug, as you say.  mbrlen() in glibc returns (size_t) -1
> for the sequence.

Ditto with:

  printf "\xED\xA0\xBF\n" | grep .

(surrogate area).
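
The values carried by these byte sequences can be checked by extracting the payload bits without any validation. This is only an illustrative sketch; the function name utf8_raw_value is made up here and is not part of grep or PCRE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode the payload bits of one UTF-8-style sequence
   WITHOUT validating it, so that invalid input can be inspected.  Returns
   the raw value, or UINT32_MAX if the lead byte or length is unusable.  */
uint32_t
utf8_raw_value (const unsigned char *s, size_t n)
{
  if (n == 0)
    return UINT32_MAX;
  unsigned char c = s[0];
  size_t len;
  uint32_t v;
  if (c < 0x80)
    { len = 1; v = c; }
  else if ((c & 0xE0) == 0xC0)
    { len = 2; v = c & 0x1F; }
  else if ((c & 0xF0) == 0xE0)
    { len = 3; v = c & 0x0F; }
  else if ((c & 0xF8) == 0xF0)
    { len = 4; v = c & 0x07; }
  else
    return UINT32_MAX;          /* continuation byte or lead above F7 */
  if (n < len)
    return UINT32_MAX;          /* truncated sequence */
  /* Each following byte contributes its low six bits.  */
  for (size_t i = 1; i < len; i++)
    v = (v << 6) | (s[i] & 0x3F);
  return v;
}
```

With this helper, 0xE0 0xAF 0xBF gives 0x0BFF, 0xED 0xA0 0xBF gives 0xD83F (inside the surrogate range U+D800..U+DFFF), 0xF0 0x8F 0xBF 0xBF gives 0xFFFF (an overlong form), and 0xF4 0xBF 0xBF 0xBF gives 0x13FFFF (above U+10FFFF).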

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 19 Dec 2014 14:00:46 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Dec 19 09:00:46 2014
Received: from localhost ([127.0.0.1]:50945 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y1y6n-0001w4-FV
	for submit <at> debbugs.gnu.org; Fri, 19 Dec 2014 09:00:46 -0500
Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:40695)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1Y1y6i-0001vs-8I
 for 18454 <at> debbugs.gnu.org; Fri, 19 Dec 2014 09:00:41 -0500
Received: from imp02 (mailgw6.kcn.ne.jp [61.86.15.232])
 by mailgw06.kcn.ne.jp (Postfix) with ESMTP id E0820C8004
 for <18454 <at> debbugs.gnu.org>; Fri, 19 Dec 2014 23:00:37 +0900 (JST)
Received: from mail06.kcn.ne.jp ([61.86.6.185]) by imp02 with bizsmtp
 id VS0d1p00g3zXHqt01S0d3R; Fri, 19 Dec 2014 23:00:37 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.68] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail06.kcn.ne.jp (Postfix) with ESMTPA id B37C51BF0021;
 Fri, 19 Dec 2014 23:00:37 +0900 (JST)
Date: Fri, 19 Dec 2014 23:00:38 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Vincent Lefevre <vincent@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <20141218134558.GQ3818@HIDDEN>
References: <20141129115848.6DF7.27F6AC2D@HIDDEN>
 <20141218134558.GQ3818@HIDDEN>
Message-Id: <20141219230038.CE8D.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

On Thu, 18 Dec 2014 14:45:58 +0100
Vincent Lefevre <vincent@HIDDEN> wrote:
> > 
> >   0xE0 0xC2 0xFF
> >   0xED 0xA0 0xFF
> >   0xF0 0xBF 0xFF 0xFF
> 
> If I'm not mistaken, these first three are also treated as invalid by
> my patch (and should be treated as invalid by any tool).

I got them from pcre_valid_utf8(), but I made some mistakes.  They should
be as follows.

  0xE0 0xAF 0xBF
  0xED 0xA0 0xBF
  0xF0 0x8F 0xBF 0xBF

By the way, they correspond to the following code in pcre_valid_utf8().

    if (c == 0xe0 && (d & 0x20) == 0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR16;
      }
    if (c == 0xed && d >= 0xa0)
      {
      *erroroffset = (int)(p - string) - 2;
      return PCRE_UTF8_ERR14;
      }

    ........

    if (c == 0xf0 && (d & 0x30) == 0)
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR17;
      }
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
      {
      *erroroffset = (int)(p - string) - 3;
      return PCRE_UTF8_ERR13;
      }
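
To make it easier to try these checks on individual sequences, here is a
minimal sketch in C.  The helper name second_byte_check() is hypothetical
and not part of PCRE; it copies only the four conditions quoted above and
returns the number of the PCRE_UTF8_ERRnn code that would fire (0 if none
does).  It ignores everything else pcre_valid_utf8() does, such as checking
the lead-byte length and the later continuation bytes.

```c
#include <stdio.h>

/* Hypothetical helper: applies only the second-byte checks quoted above.
   Returns nn for PCRE_UTF8_ERRnn, or 0 if none of these checks fires. */
static int second_byte_check(const unsigned char *s)
{
    unsigned char c = s[0];   /* lead byte */
    unsigned char d = s[1];   /* second byte */

    if (c == 0xe0 && (d & 0x20) == 0)
        return 16;            /* overlong 3-byte form */
    if (c == 0xed && d >= 0xa0)
        return 14;            /* surrogate range U+D800..U+DFFF */
    if (c == 0xf0 && (d & 0x30) == 0)
        return 17;            /* overlong 4-byte form */
    if (c > 0xf4 || (c == 0xf4 && d > 0x8f))
        return 13;            /* value above U+10FFFF */
    return 0;
}
```

With this sketch, 0xED 0xA0 0xBF gives 14 and 0xF0 0x8F 0xBF 0xBF gives 17.
Note that 0xE0 0xAF 0xBF gives 0, because 0xAF & 0x20 is nonzero; a sequence
such as 0xE0 0x9F 0xBF would be needed to reach PCRE_UTF8_ERR16.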

> BTW,
> 
>   printf "\xF4\xBF\xBF\xBF\n" | grep .
> 
> finds a match, and this appears to be a bug (grep should follow
> the current standard).

I also think it is a bug, as you say.  mbrlen() in glibc returns (size_t) -1
for the sequence.





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 18 Dec 2014 13:46:03 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Dec 18 08:46:03 2014
Received: from localhost ([127.0.0.1]:49553 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Y1bP1-0007VA-4G
	for submit <at> debbugs.gnu.org; Thu, 18 Dec 2014 08:46:03 -0500
Received: from ypig.lip.ens-lyon.fr ([140.77.13.48]:54810)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1Y1bOy-0007Ud-BA
 for 18454 <at> debbugs.gnu.org; Thu, 18 Dec 2014 08:46:01 -0500
Received: from vlefevre by ypig.lip.ens-lyon.fr with local (Exim 4.84)
 (envelope-from <vincent@HIDDEN>)
 id 1Y1bOw-0008O3-7E; Thu, 18 Dec 2014 14:45:58 +0100
Date: Thu, 18 Dec 2014 14:45:58 +0100
From: Vincent Lefevre <vincent@HIDDEN>
To: Norihiro Tanaka <noritnk@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20141218134558.GQ3818@HIDDEN>
References: <20141128233148.7418.27F6AC2D@HIDDEN>
 <20141128155029.GB8207@HIDDEN>
 <20141129115848.6DF7.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20141129115848.6DF7.27F6AC2D@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6371-vl-r75100 (2014-11-04)
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

Sorry for the late reply.

On 2014-11-29 11:58:48 +0900, Norihiro Tanaka wrote:
> On Fri, 28 Nov 2014 16:50:29 +0100
> Vincent Lefevre <vincent@HIDDEN> wrote:
> > What matters is whether a sequence corresponds to a valid UTF-8
> > encoded Unicode character. My patch ensures that pcre_exec is called
> > on a string with only such characters, which implies that this is
> > also valid UTF-8 for PCRE (whether Unicode validity is also considered
> > in valid_utf8() or not). So, there's no valid reason why grep would
> > crash under such a condition.
> 
> It seems that PCRE treats e.g. the following characters as invalid.  It means
> we should not pass these characters to pcre_exec with the PCRE_NO_UTF8_CHECK
> option.
> 
>   0xE0 0xC2 0xFF
>   0xED 0xA0 0xFF
>   0xF0 0xBF 0xFF 0xFF

If I'm not mistaken, these first three are also treated as invalid by
my patch (and should be treated as invalid by any tool).

>   0xF4 0xBF 0xBF 0xBF

(corresponding to U+0013ffff).

Well, I followed some comment in the grep source, which is currently
incorrect.

pcreunicode(3) specifies that it follows RFC 3629, and that only
values in the range U+0 to U+10FFFF, excluding the surrogate area,
are allowed. I'll try to update my patch. But IMHO, it would be
better to get PCRE improved, and I had opened a bug:

  http://bugs.exim.org/show_bug.cgi?id=1554
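
For reference, the arithmetic behind U+0013ffff and the RFC 3629 range can
be sketched in C as below.  Both helper names are hypothetical: decode4()
just combines the payload bits of a 4-byte form without validating anything,
and in_rfc3629_range() tests the range described above (U+0 to U+10FFFF,
excluding the surrogate area).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a 4-byte UTF-8 form
   (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx).  No validation is done here. */
static uint32_t decode4(const unsigned char *s)
{
    return ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3f) << 12)
         | ((uint32_t)(s[2] & 0x3f) << 6)
         |  (uint32_t)(s[3] & 0x3f);
}

/* Hypothetical helper: true if cp is in U+0..U+10FFFF and is not a
   surrogate (U+D800..U+DFFF). */
static bool in_rfc3629_range(uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

Here 0xF4 0xBF 0xBF 0xBF decodes to 0x13FFFF, which is out of range, while
0xF4 0x8F 0xBF 0xBF decodes to 0x10FFFF, the largest allowed value.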

BTW,

  printf "\xF4\xBF\xBF\xBF\n" | grep .

finds a match, and this appears to be a bug (grep should follow
the current standard).

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 29 Nov 2014 02:58:55 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Nov 28 21:58:55 2014
Received: from localhost ([127.0.0.1]:48803 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XuYFK-00082c-Nf
	for submit <at> debbugs.gnu.org; Fri, 28 Nov 2014 21:58:54 -0500
Received: from mailgw05.kcn.ne.jp ([61.86.7.212]:55810)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1XuYFH-00082O-Ds
 for 18454 <at> debbugs.gnu.org; Fri, 28 Nov 2014 21:58:52 -0500
Received: from imp01 (mailgw5.kcn.ne.jp [61.86.15.231])
 by mailgw05.kcn.ne.jp (Postfix) with ESMTP id EB3FB67C18
 for <18454 <at> debbugs.gnu.org>; Sat, 29 Nov 2014 11:58:48 +0900 (JST)
Received: from mail09.kcn.ne.jp ([61.86.6.188]) by imp01 with bizsmtp
 id MEyo1p00V43QJrh01EyonM; Sat, 29 Nov 2014 11:58:48 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.56] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail09.kcn.ne.jp (Postfix) with ESMTPA id CCCD81BD0097;
 Sat, 29 Nov 2014 11:58:48 +0900 (JST)
Date: Sat, 29 Nov 2014 11:58:48 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Vincent Lefevre <vincent@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <20141128155029.GB8207@HIDDEN>
References: <20141128233148.7418.27F6AC2D@HIDDEN>
 <20141128155029.GB8207@HIDDEN>
Message-Id: <20141129115848.6DF7.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.0 (/)


On Fri, 28 Nov 2014 16:50:29 +0100
Vincent Lefevre <vincent@HIDDEN> wrote:
> What matters is whether a sequence corresponds to a valid UTF-8
> encoded Unicode character. My patch ensures that pcre_exec is called
> on a string with only such characters, which implies that this is
> also valid UTF-8 for PCRE (whether Unicode validity is also considered
> in valid_utf8() or not). So, there's no valid reason why grep would
> crash under such a condition.

It seems that PCRE treats e.g. the following characters as invalid.  It means
we should not pass these characters to pcre_exec with the PCRE_NO_UTF8_CHECK
option.

  0xE0 0xC2 0xFF
  0xED 0xA0 0xFF
  0xF0 0xBF 0xFF 0xFF
  0xF4 0xBF 0xBF 0xBF






Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 28 Nov 2014 15:50:35 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Nov 28 10:50:35 2014
Received: from localhost ([127.0.0.1]:48574 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XuNoZ-0005tp-5V
	for submit <at> debbugs.gnu.org; Fri, 28 Nov 2014 10:50:35 -0500
Received: from ioooi.vinc17.net ([92.243.22.117]:60239)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1XuNoU-0005te-Q1
 for 18454 <at> debbugs.gnu.org; Fri, 28 Nov 2014 10:50:31 -0500
Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128])
 by ioooi.vinc17.net (Postfix) with ESMTPSA id AD5791E1;
 Fri, 28 Nov 2014 16:50:29 +0100 (CET)
Received: by xvii.vinc17.org (Postfix, from userid 1000)
 id 300C121A07A; Fri, 28 Nov 2014 16:50:29 +0100 (CET)
Date: Fri, 28 Nov 2014 16:50:29 +0100
From: Vincent Lefevre <vincent@HIDDEN>
To: Norihiro Tanaka <noritnk@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20141128155029.GB8207@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <20141128025918.GA26989@HIDDEN>
 <20141128233148.7418.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20141128233148.7418.27F6AC2D@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6365-vl-r59709 (2014-09-07)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.0 (/)

On 2014-11-28 23:31:49 +0900, Norihiro Tanaka wrote:
> Thanks for the patch.  However, it seems to me that valid_utf() in PCRE
> also considers 5- and 6-byte characters.

In any case, even if PCRE considers these sequences as valid UTF-8,
they shouldn't match because they are not part of Unicode (if they
can match, this would be a bug in libpcre). My patch considers that
these sequences do not match, which is consistent with the expected
behavior.

> IMHO, we should assume that grep does not know how valid_utf() checks an
> input text, although we know that PCRE checks whether an input text is
> valid UTF-8 or not, so that grep keeps working even when PCRE changes the
> behaviour of valid_utf().
> 
> If we do not check for invalid UTF-8 characters with valid_utf8() in
> advance, grep may dump core with PCRE_NO_UTF8_CHECK.
> See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586
> 
> So we cannot avoid checking for invalid UTF-8 characters with valid_utf8().
> Furthermore, we must perform the check as PCRE expects, but grep does
> not know how PCRE checks for invalid UTF-8 characters, due to the above
> assumption.

What matters is whether a sequence corresponds to a valid UTF-8
encoded Unicode character. My patch ensures that pcre_exec is called
on a string with only such characters, which implies that this is
also valid UTF-8 for PCRE (whether Unicode validity is also considered
in valid_utf8() or not). So, there's no valid reason why grep would
crash under such a condition.
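
To illustrate the idea (this is only a sketch, not the code from my patch),
a strict check could look like the following C function.  The name
valid_utf8_prefix() is hypothetical.  It returns the length of the longest
prefix that is well-formed UTF-8 in the RFC 3629 sense, so that only this
prefix would be handed to pcre_exec with PCRE_NO_UTF8_CHECK.

```c
#include <stddef.h>

/* True if b is a UTF-8 continuation byte (10xxxxxx). */
static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Hypothetical helper: length in bytes of the longest prefix of s[0..n)
   that is well-formed UTF-8 (no overlong forms, no surrogates, nothing
   above U+10FFFF). */
static size_t valid_utf8_prefix(const unsigned char *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        unsigned char lo = 0x80, hi = 0xBF;  /* allowed range of byte 2 */
        size_t len, k;

        if (c < 0x80) { i++; continue; }     /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) len = 2;
        else if (c >= 0xE0 && c <= 0xEF) {
            len = 3;
            if (c == 0xE0) lo = 0xA0;        /* reject overlong forms */
            if (c == 0xED) hi = 0x9F;        /* reject surrogates */
        } else if (c >= 0xF0 && c <= 0xF4) {
            len = 4;
            if (c == 0xF0) lo = 0x90;        /* reject overlong forms */
            if (c == 0xF4) hi = 0x8F;        /* reject > U+10FFFF */
        } else break;                        /* 0x80..0xC1, 0xF5..0xFF */

        /* Truncated sequence, or second byte outside its allowed range. */
        if (n - i < len || s[i + 1] < lo || s[i + 1] > hi) break;
        /* Remaining bytes must be plain continuation bytes. */
        for (k = 2; k < len && is_cont(s[i + k]); k++)
            ;
        if (k < len) break;
        i += len;
    }
    return i;
}
```

If the returned length equals the buffer length, the whole buffer is valid;
otherwise the byte at that offset starts an encoding error and has to be
skipped by the caller.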

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 28 Nov 2014 14:32:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Nov 28 09:32:04 2014
Received: from localhost ([127.0.0.1]:48164 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XuMaZ-0003lv-S0
	for submit <at> debbugs.gnu.org; Fri, 28 Nov 2014 09:32:04 -0500
Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:49356)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1XuMaN-0003lQ-8L
 for 18454 <at> debbugs.gnu.org; Fri, 28 Nov 2014 09:31:55 -0500
Received: from imp02 (mailgw6.kcn.ne.jp [61.86.15.232])
 by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 116B5C800C
 for <18454 <at> debbugs.gnu.org>; Fri, 28 Nov 2014 23:31:49 +0900 (JST)
Received: from mail06.kcn.ne.jp ([61.86.6.185]) by imp02 with bizsmtp
 id M2Xp1p0033zXHqt012XpHh; Fri, 28 Nov 2014 23:31:49 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.56] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail06.kcn.ne.jp (Postfix) with ESMTPA id A435A1BF0091;
 Fri, 28 Nov 2014 23:31:48 +0900 (JST)
Date: Fri, 28 Nov 2014 23:31:49 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Vincent Lefevre <vincent@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <20141128025918.GA26989@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <20141128025918.GA26989@HIDDEN>
Message-Id: <20141128233148.7418.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.0 (/)

On Fri, 28 Nov 2014 03:59:18 +0100
Vincent Lefevre <vincent@HIDDEN> wrote:

> On binary files, it seems that testing the UTF-8 sequences in
> pcresearch.c is faster than asking pcre_exec to do that (because
> of the retry I assume); see attached patch. It actually checks
> UTF-8 only if an invalid sequence was already found by pcre_exec,
> assuming that pcre_exec can check the validity of a valid text
> file in a faster way.
> 
> On some file similar to PDF (test 1):
> 
> Before: 1.77s
> After:  1.38s
> 
> But now, the main problem is the many pcre_exec. Indeed, if I replace
> the non-ASCII bytes by \n with:
> 
>   LC_ALL=C tr \\200-\\377 \\n
> 
> (now, one has a valid file but with many short lines), the grep -P time
> is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
> with:
> 
>   LC_ALL=C tr \\200-\\377 \\000
> 
> the grep -P time is 0.30s (test 3), thus it is much faster.
> 
> Note also that libpcre is much slower than normal grep on simple words,
> but on "a[0-9]b", it can be faster:
> 
>           grep      PCRE   PCRE+patch
> test 1    4.31      1.90      1.53
> test 2    0.18      1.61      1.63
> test 3    3.28      0.39      0.39
> 
> With grep, I wonder why test 2 is much faster.
> 
> -- 
> Vincent Lefevre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
> 100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
> Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

Thanks for the patch.  However, it seems to me that valid_utf() in PCRE
also considers 5- and 6-byte characters.

IMHO, we should assume that grep does not know how valid_utf() checks an
input text, although we know that PCRE checks whether an input text is
valid UTF-8 or not, so that grep keeps working even when PCRE changes the
behaviour of valid_utf().

If we do not check for invalid UTF-8 characters with valid_utf8() in
advance, grep may dump core with PCRE_NO_UTF8_CHECK.
See http://debbugs.gnu.org/cgi/bugreport.cgi?bug=16586

So we cannot avoid checking for invalid UTF-8 characters with valid_utf8().
Furthermore, we must perform the check as PCRE expects, but grep does
not know how PCRE checks for invalid UTF-8 characters, due to the above
assumption.





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 28 Nov 2014 02:59:24 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Nov 27 21:59:24 2014
Received: from localhost ([127.0.0.1]:48012 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XuBmF-0006T9-RK
	for submit <at> debbugs.gnu.org; Thu, 27 Nov 2014 21:59:24 -0500
Received: from ioooi.vinc17.net ([92.243.22.117]:60142)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1XuBmC-0006Sz-Fk
 for 18454 <at> debbugs.gnu.org; Thu, 27 Nov 2014 21:59:21 -0500
Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128])
 by ioooi.vinc17.net (Postfix) with ESMTPSA id 5C6191E1;
 Fri, 28 Nov 2014 03:59:19 +0100 (CET)
Received: by xvii.vinc17.org (Postfix, from userid 1000)
 id 6D1B821A07A; Fri, 28 Nov 2014 03:59:18 +0100 (CET)
Date: Fri, 28 Nov 2014 03:59:18 +0100
From: Vincent Lefevre <vincent@HIDDEN>
To: 18454 <at> debbugs.gnu.org
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Message-ID: <20141128025918.GA26989@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="zYM0uCDKw75PZbzx"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20140912012449.GB18162@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6365-vl-r59709 (2014-09-07)
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 18454
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.0 (/)


--zYM0uCDKw75PZbzx
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

On binary files, it seems that testing the UTF-8 sequences in
pcresearch.c is faster than asking pcre_exec to do that (because
of the retry I assume); see attached patch. It actually checks
UTF-8 only if an invalid sequence was already found by pcre_exec,
assuming that pcre_exec can check the validity of a valid text
file in a faster way.

On some file similar to PDF (test 1):

Before: 1.77s
After:  1.38s

But now, the main problem is the many pcre_exec. Indeed, if I replace
the non-ASCII bytes by \n with:

  LC_ALL=C tr \\200-\\377 \\n

(now, one has a valid file but with many short lines), the grep -P time
is 1.52s (test 2). And if I replace the non-ASCII bytes by null bytes
with:

  LC_ALL=C tr \\200-\\377 \\000

the grep -P time is 0.30s (test 3), thus it is much faster.

Note also that libpcre is much slower than normal grep on simple words,
but on "a[0-9]b", it can be faster:

          grep      PCRE   PCRE+patch
test 1    4.31      1.90      1.53
test 2    0.18      1.61      1.63
test 3    3.28      0.39      0.39

With grep, I wonder why test 2 is much faster.

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)

--zYM0uCDKw75PZbzx
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="grep221-pcresearch.patch"

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 5451029..6bff1e4 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -38,6 +38,8 @@ static pcre_extra *extra;
 # endif
 #endif
 
+#define INVALID(C) (to_uchar (C) < 0x80 || to_uchar (C) > 0xbf)
+
 /* Table, indexed by ! (flag & PCRE_NOTBOL), of whether the empty
    string matches when that flag is used.  */
 static int empty_match[2];
@@ -156,6 +158,7 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
   char const *line_start = buf;
   int e = PCRE_ERROR_NOMATCH;
   char const *line_end;
+  int invalid = 0;
 
   /* If the input type is unknown, the caller is still testing the
      input, which means the current buffer cannot contain encoding
@@ -212,25 +215,54 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
           if (multiline)
             options |= PCRE_NO_UTF8_CHECK;
 
-          e = pcre_exec (cre, extra, p, search_bytes, 0,
-                         options, sub, NSUB);
-          if (e != PCRE_ERROR_BADUTF8)
+          int valid_bytes = search_bytes;
+          if (invalid)
             {
-              if (0 < e && multiline && sub[1] - sub[0] != 0)
+              /* At least an encoding error was found. Other such errors
+                 are likely to occur, and detecting them here is faster
+                 in average than relying on pcre.  */
+              options |= PCRE_NO_UTF8_CHECK;
+              char const *p2 = p;
+              while (p2 != line_end)
                 {
-                  char const *nl = memchr (p + sub[0], eolbyte,
-                                           sub[1] - sub[0]);
-                  if (nl)
+                  unsigned char c = p2[0];
+                  size_t len =
+                    c < 0x80 ? 1 :
+                    c < 0xc2 || c > 0xf7 || INVALID(p2[1]) ? 0 :
+                    c < 0xe0 ? 2 : INVALID(p2[2]) ? 0 :
+                    c < 0xf0 ? 3 : INVALID(p2[3]) ? 0 : 4;
+                  if (len == 0)
                     {
-                      /* This match crosses a line boundary; reject it.  */
-                      p += sub[0];
-                      line_end = nl;
-                      continue;
+                      valid_bytes = p2 - p;
+                      break;
                     }
+                  p2 += len;
                 }
-              break;
             }
-          int valid_bytes = sub[0];
+
+          if (valid_bytes == search_bytes)
+            {
+              e = pcre_exec (cre, extra, p, search_bytes, 0,
+                             options, sub, NSUB);
+              if (e != PCRE_ERROR_BADUTF8)
+                {
+                  if (0 < e && multiline && sub[1] - sub[0] != 0)
+                    {
+                      char const *nl = memchr (p + sub[0], eolbyte,
+                                               sub[1] - sub[0]);
+                      if (nl)
+                        {
+                          /* This match crosses a line boundary; reject it.  */
+                          p += sub[0];
+                          line_end = nl;
+                          continue;
+                        }
+                    }
+                  break;
+                }
+              invalid = 1;
+              valid_bytes = sub[0];
+            }
 
           /* Try to match the string before the encoding error.
              Again, handle the empty-match case specially, for speed.  */

--zYM0uCDKw75PZbzx--




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 30 Sep 2014 19:39:25 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Sep 30 15:39:25 2014
Received: from localhost ([127.0.0.1]:56681 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XZ3Ge-0005Eq-J7
	for submit <at> debbugs.gnu.org; Tue, 30 Sep 2014 15:39:24 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:60820)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XZ3Gc-0005Ei-Lw
 for 18454 <at> debbugs.gnu.org; Tue, 30 Sep 2014 15:39:23 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 5604B39E801B;
 Tue, 30 Sep 2014 12:39:21 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id c1Gh7cps-811; Tue, 30 Sep 2014 12:39:18 -0700 (PDT)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 2DE1439E8018;
 Tue, 30 Sep 2014 12:39:18 -0700 (PDT)
Message-ID: <542B06E5.8040501@HIDDEN>
Date: Tue, 30 Sep 2014 12:39:17 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <freemail.20140930201058.51110.3@HIDDEN>
In-Reply-To: <freemail.20140930201058.51110.3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

On 09/30/2014 11:10 AM, Zoltán Herczeg wrote:
>
>> Grep already does that sort of thing.  And it's smart enough to start matching
>> only at character boundaries.  It's not libpcre's job to worry about this; the
>> caller can worry about it.
> Thank you for bringing this up. I don't see any point of reimplementing what is already there.

Sorry, it sounds like my earlier comment was unclear.  GNU grep is smart 
enough to start matching at character boundaries without checking the 
validity of the input data.  This helps it run faster.  However, because 
libpcre requires a validity prepass, grep -P must slow down and do the 
validity check one way or another.  Grep does this only when libpcre is 
used, and that's one reason grep -P is slower than plain grep.

It's not a question of duplicating code: grep already has code to 
validate binary data.  It's a question of performance. Requiring a 
prepass for validity checking is typically slower (or takes more energy, 
or whatever) than checking validity on the fly.  And in many cases going 
multithreaded would just make matters worse.

I can understand that you don't want to take on the burden of making a 
nontrivial libpcre performance improvement.  Also, I hope 'grep -P' 
performance, though not great, is good enough now to satisfy most 
users.  So perhaps we should just give the topic a rest.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 30 Sep 2014 18:14:53 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Sep 30 14:14:53 2014
Received: from localhost ([127.0.0.1]:56625 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XZ1wq-00037W-JD
	for submit <at> debbugs.gnu.org; Tue, 30 Sep 2014 14:14:53 -0400
Received: from eggs.gnu.org ([208.118.235.92]:38015)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XZ1wo-00037N-1m
 for submit <at> debbugs.gnu.org; Tue, 30 Sep 2014 14:14:50 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XZ1wd-0007EJ-Kj
 for submit <at> debbugs.gnu.org; Tue, 30 Sep 2014 14:14:49 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: **
X-Spam-Status: No, score=2.7 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 MALFORMED_FREEMAIL,UNPARSEABLE_RELAY autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:42029)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XZ1wd-0007EF-IY
 for submit <at> debbugs.gnu.org; Tue, 30 Sep 2014 14:14:39 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:39711)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XZ1wV-0007cu-9u
 for bug-grep@HIDDEN; Tue, 30 Sep 2014 14:14:39 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XZ1wM-0007CS-Ek
 for bug-grep@HIDDEN; Tue, 30 Sep 2014 14:14:31 -0400
Received: from iwiw01d.mail.t-online.hu ([84.2.42.53]:50647
 helo=fmxout01.freemail.hu) by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XZ1wM-0007Bz-8E
 for bug-grep@HIDDEN; Tue, 30 Sep 2014 14:14:22 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by fmxout01.freemail.hu (Postfix) with SMTP id CE7253C52
 for <bug-grep@HIDDEN>; Tue, 30 Sep 2014 20:10:58 +0200 (CEST)
Received: (qmail 32285 invoked by uid 151); 30 Sep 2014 20:10:58 +0200
Received: from 195.228.245.211 (HELO fmxmldata07.freemail.hu) (91.83.55.54)
 by fmx24.freemail.hu with SMTP; 30 Sep 2014 20:10:58 +0200
Received: from webmail by smtp gw id s8UIAwBf051116;
 Tue, 30 Sep 2014 20:10:58 +0200 (CEST)
Date: Tue, 30 Sep 2014 20:10:58 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <542824AD.8090501@HIDDEN>
Message-ID: <freemail.20140930201058.51110.3@HIDDEN>
X-Originating-IP: [91.83.55.54]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -3.1 (---)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Hi,

>It's purely a performance question.  GNU grep already uses libpcre to search 
>binary data, and it works now.  It's just slow, that's all.  I'm willing to live 
>with this, and tell users "Sorry, but libpcre is not designed to search binary 
>data quickly; if you want speed then don't use grep's -P option."  If you're 
>willing to live with this too, we're done.

Yes, PCRE is not designed for matching binary data as UTF. Too much complexity for too little gain. Normal search can be used on binary data without limitations.

>Grep already does that sort of thing.  And it's smart enough to start matching 
>only at character boundaries.  It's not libpcre's job to worry about this; the 
>caller can worry about it.

Thank you for bringing this up. I don't see any point in reimplementing what is already there. However, if PCRE says it supports UTF matching in binary data, then it really should, because "what is already there" depends on the environment. This is clearly the best argument for making the environment responsible for handling the binary part of the data. Most environments need some kind of validation anyway, and we would just duplicate code. It is good to hear that everything is already in grep; perhaps a few more lines are needed to do it in a thread.

>The code you posted could be made faster than that; among other things there 
>should not be an unbounded backward scan.  And even the code you posted would 
>often be faster than what's in libpcre now.  That early UTF-8 validity prepass 
>is a killer.

I would recommend disabling it. Its only purpose is to return early for invalid buffers. I am sure grep already knows whether a buffer is invalid, since it scans the buffer.

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 28 Sep 2014 22:07:01 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 28 18:07:01 2014
Received: from localhost ([127.0.0.1]:54612 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XYMcP-0007jd-4v
	for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 18:07:01 -0400
Received: from mail-la0-f49.google.com ([209.85.215.49]:45807)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <meyering@HIDDEN>) id 1XYMcM-0007jT-6H
 for 18454 <at> debbugs.gnu.org; Sun, 28 Sep 2014 18:06:59 -0400
Received: by mail-la0-f49.google.com with SMTP id ge10so2414070lab.36
 for <18454 <at> debbugs.gnu.org>; Sun, 28 Sep 2014 15:06:56 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type;
 bh=8nZi/ItILd6TGncGD8eRXVZqVxmAHSboHPWuZ67/mQc=;
 b=s1NHATUPfEovvYZL7dngSYHthmLO+YZYUoWlivcX/Ll1QJOz/CN9EjSmY9RlyPcdu4
 R9GEayfVKgnX6R9NuToGvgLc5KKNtMjk1FVC9qyptlS2XwBJlEEHzrHdqTFdBnDM4m4l
 FpGFcNtPuDm5PBqpuKQZFT6TgA0wD1e4qt5/4BFHJqlg7j5grhgOX8u2QEIAjJbVssNN
 F9tZt56SMLEP83JcT6853ipkwtXdYy5BsIHclKuju5gSi6k7nL1BEQMoSjGLHGKnhxWQ
 xAAU1psOIZ91HfMcl9AS81sWBlTmNBge+sCgnoeu5E6/UYGe154KW81CvY3t9IIcskkz
 cADQ==
X-Received: by 10.152.27.66 with SMTP id r2mr5239439lag.84.1411942016891; Sun,
 28 Sep 2014 15:06:56 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.25.23.89 with HTTP; Sun, 28 Sep 2014 15:06:36 -0700 (PDT)
In-Reply-To: <54273BF5.2060605@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN> <20140918083327.GA16324@nomada>
 <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
 <CA+8g5KHdnaEB=yYF5Kp7XCd7GgvjL73HeoOHNEPCAqy0KPs6+w@HIDDEN>
 <5424B1E6.8090502@HIDDEN>
 <CA+8g5KEcQ_Y-raQRmoiyz=SVMH-gSizMVdYutUYt6J33YMPzVQ@HIDDEN>
 <54273BF5.2060605@HIDDEN>
From: Jim Meyering <jim@HIDDEN>
Date: Sun, 28 Sep 2014 15:06:36 -0700
X-Google-Sender-Auth: --ULz_nY4WxVw-w0nvEVbPZk7v8
Message-ID: <CA+8g5KGy1v94bBvDSh_YZ4Bi2yUcwZMApNOs1XyoRcaJ+eMGsg@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
Content-Type: text/plain; charset=ISO-8859-1
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 18454
Cc: =?ISO-8859-1?Q?Santiago_Ruano_Rinc=F3n?= <santiago@HIDDEN>,
 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.7 (/)

Nice! I didn't know about _Pragma. It's much better to encapsulate
that, keeping the #pragma directives out of function bodies.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 28 Sep 2014 15:09:46 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 28 11:09:46 2014
Received: from localhost ([127.0.0.1]:54465 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XYG6c-0005mQ-8K
	for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 11:09:46 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:33496)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XYG6a-0005mG-HS
 for 18454 <at> debbugs.gnu.org; Sun, 28 Sep 2014 11:09:45 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id EC78E39E8014;
 Sun, 28 Sep 2014 08:09:42 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id oB9og1yzqvxt; Sun, 28 Sep 2014 08:09:33 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id CFFF439E8012;
 Sun, 28 Sep 2014 08:09:33 -0700 (PDT)
Message-ID: <542824AD.8090501@HIDDEN>
Date: Sun, 28 Sep 2014 08:09:33 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <freemail.20140928121116.83173.3@HIDDEN>
In-Reply-To: <freemail.20140928121116.83173.3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -2.9 (--)
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -2.9 (--)

Zoltán Herczeg wrote:

> For me the question is whether binary search needs to be supported at the PCRE level.

It's purely a performance question.  GNU grep already uses libpcre to search 
binary data, and it works now.  It's just slow, that's all.  I'm willing to live 
with this, and tell users "Sorry, but libpcre is not designed to search binary 
data quickly; if you want speed then don't use grep's -P option."  If you're 
willing to live with this too, we're done.

> removing a lot of optimizations.

You shouldn't need to remove any optimizations for the PCRE_NO_UTF8_CHECK case. 
  Keep them all.  It should be just as fast as before.  The idea is to have one 
matcher for the PCRE_NO_UTF8_CHECK case (one that works much as now) and another 
matcher for the non-PCRE_NO_UTF8_CHECK case (one that checks validity as it 
goes).  The former matcher will be just as fast as now, and the latter matcher 
will be faster than what libpcre has now.  I readily concede that this will 
require some nontrivial coding, but I don't concede that it will remove 
optimizations or make libpcre slower.  It should make libpcre faster; that's the 
point.
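
As a rough illustration of the two-matcher idea, and not a description of libpcre's internals, the difference between the two paths might look like the following C sketch. The names step_unchecked and step_checked are made up for this example, and treating an invalid byte as a one-byte unit is an assumption about how "checks validity as it goes" could behave.

```c
#include <assert.h>
#include <stddef.h>

/* Both functions are hypothetical illustrations, not libpcre code. */

/* Fast path: the caller guarantees well-formed UTF-8, so the lead byte
   alone determines the character length. */
size_t step_unchecked(const unsigned char *p)
{
    if (*p < 0x80) return 1;
    if (*p < 0xE0) return 2;
    if (*p < 0xF0) return 3;
    return 4;
}

/* Checking path: validate the sequence at p while stepping over it.
   Returns its length if well-formed; otherwise returns 1, so the
   offending byte is consumed as a unit of its own (an assumption). */
size_t step_checked(const unsigned char *p, const unsigned char *end)
{
    assert(p < end);
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range of the 2nd byte */
    size_t n;

    if (*p < 0x80) return 1;
    if (*p >= 0xC2 && *p <= 0xDF) {
        n = 2;
    } else if (*p >= 0xE0 && *p <= 0xEF) {
        n = 3;
        if (*p == 0xE0) lo = 0xA0;        /* overlong encodings */
        if (*p == 0xED) hi = 0x9F;        /* UTF-16 surrogates */
    } else if (*p >= 0xF0 && *p <= 0xF4) {
        n = 4;
        if (*p == 0xF0) lo = 0x90;        /* overlong encodings */
        if (*p == 0xF4) hi = 0x8F;        /* above U+10FFFF */
    } else {
        return 1;                         /* cannot start a sequence */
    }
    if ((size_t)(end - p) < n) return 1;  /* truncated at end of buffer */
    if (p[1] < lo || p[1] > hi) return 1;
    for (size_t k = 2; k < n; k++)
        if ((p[k] & 0xC0) != 0x80) return 1;
    return n;
}
```

The unchecked path stays exactly as cheap as a lead-byte lookup; all of the extra comparisons live in the second function, which is only used when the caller has not promised valid input.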

> You have a 100 byte long buffer, and you start matching from byte 50.

Grep already does that sort of thing.  And it's smart enough to start matching 
only at character boundaries.  It's not libpcre's job to worry about this; the 
caller can worry about it.
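
A minimal sketch of what "start matching only at character boundaries" can mean on the caller's side, assuming the buffer holds valid UTF-8. The helper name next_char_boundary is hypothetical and this is not grep's actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from grep: advance `off` to the next
   offset that does not point at a UTF-8 continuation byte (10xxxxxx).
   Returns len if only continuation bytes remain. */
size_t next_char_boundary(const unsigned char *buf, size_t len, size_t off)
{
    assert(off <= len);
    while (off < len && (buf[off] & 0xC0) == 0x80)
        off++;
    return off;
}
```

With the "100 byte buffer, start at byte 50" case, a caller would pass 50 as `off` and hand the returned offset to the matcher, so the matcher itself never sees a start position inside a character.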

> For me this is far too many checks, and it affects compiler optimizations too much.

The code you posted could be made faster than that; among other things there 
should not be an unbounded backward scan.  And even the code you posted would 
often be faster than what's in libpcre now.  That early UTF-8 validity prepass 
is a killer.





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 28 Sep 2014 10:11:21 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 28 06:11:21 2014
Received: from localhost ([127.0.0.1]:54102 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XYBRp-0005XL-4x
	for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 06:11:21 -0400
Received: from iwiw02d.mail.t-online.hu ([84.2.42.67]:46041)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XYBRm-0005X7-8F
 for 18454 <at> debbugs.gnu.org; Sun, 28 Sep 2014 06:11:19 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by iwiw02d.mail.t-online.hu (Postfix) with SMTP id 869E44875EF
 for <18454 <at> debbugs.gnu.org>; Sun, 28 Sep 2014 12:11:26 +0200 (CEST)
Received: (qmail 78673 invoked by uid 151); 28 Sep 2014 12:11:16 +0200
Received: from 195.228.245.211 (HELO fmxmldata04.freemail.hu) (193.226.212.27)
 by fmx24.freemail.hu with SMTP; 28 Sep 2014 12:11:16 +0200
Received: from webmail by smtp gw id s8SABGei083174;
 Sun, 28 Sep 2014 12:11:16 +0200 (CEST)
Date: Sun, 28 Sep 2014 12:11:16 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <54272400.1020704@HIDDEN>
Message-ID: <freemail.20140928121116.83173.3@HIDDEN>
X-Originating-IP: [193.226.212.27]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-Spam-Score: 0.6 (/)
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

>> In the regex world, matching performance is the key aspect of an engine
>
>Absolutely.  That's why we're having this discussion: libpcre is slow when 
>matching binary data.

For me the question is whether binary search needs to be supported at the PCRE level. There are other questions like this. People ask for string replacement support in PCRE from time to time. It can be implemented of course; we could add a whole string handling API to PCRE. But we feel this is outside the scope of PCRE. Any string management library can use PCRE for string replacement, so we don't need another one.

Binary matching is a similar thing. PCRE is used by some (I think closed source) projects for network data filtering, which is obviously binary data. They use some kind of pre-filtering, data arranging and partial matching to efficiently check TCP stream data (without waiting for the whole stream to arrive).

>> A "simple" change like this would require a major redesign of the engine.
>
>It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

Anything that has an algorithmic solution can be done; this was never in question. The question is whether it is worth doing at the PCRE level or at a higher level. Perl/PCRE is all about text processing: characters which have meanings, types, sub-types, other case(s). Unicode is also about defining characters, not binary data. UTF is an encoding format, not a mapping of random bytes to characters.

If this task were trivial, I wouldn't mind doing it myself. But it seems this task is about destroying what we have built so far: a lot of extra checks to process invalid bytes, a large code size increase, and the removal of a lot of optimizations. The result might be much slower than using clever pre-filtering.

>> What should happen, if the starting offset is inside an otherwise valid UTF character?
>
>The same thing that would happen if an input file started with the tail end of a 
>UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already; 
>it's not a problem.

The question was about intermediate start offsets. You have a 100 byte long buffer, and you start matching from byte 50. That byte is part of a valid UTF character. Your pattern starts with an invalid character, which matches that UTF fragment. You said invalid UTF characters match only themselves, not parts of other characters. That means a lot of extra checks again, preparing for the worst case.

>> This might be efficient for engines which scan the input only in the forward direction
>> and read every character once.
>
>It can also be efficient for matchers, like grep's, that don't necessarily do 
>that.  It just takes more implementation work, that's all.  It's not rocket 
>science to go backwards through a UTF-8 string and to catch decoding errors as 
>you go.

My problem is the large number of "ifs" you need to execute. Let's compare the current and the proposed solution.

unsigned char *c_ptr; /* Current string position. */

Current:
  if (c_ptr == input_start)
    return FAIL;
  c_ptr--;
  while ((*c_ptr & 0xc0) == 0x80)
    c_ptr--;

Proposed solution:
  if (c_ptr == input_start)
    return FAIL;
  c_ptr--;
  unsigned char *saved_c_ptr = c_ptr; /* We need to save the starting position, losing a CPU register for that. */
  while ((*c_ptr & 0xc0) == 0x80) {
    if (c_ptr == input_start)
      return FAIL;
    c_ptr--;
  }

  /* We moved back a lot, and we don't know where we are. Check the character length. */
  int length = utf_length[*c_ptr]; /* Another lost register. A compiler's life is difficult. */
  if (c_ptr + length != saved_c_ptr + 1)
    c_ptr = saved_c_ptr;
  else {
    /* We need to check whether the character is encoded in the minimum number of bytes. */
    if (length == 1) {
      /* Great, nothing to do. */
    }
    else if (length == 2) {
      if (*c_ptr < 0xc2) /* Character is <= 127, can be encoded in a single byte. */
        c_ptr = saved_c_ptr;
    } else if (length == 3) {
      if (*c_ptr == 0xe0 && c_ptr[1] < 0xa0) /* Character is < 0x800, can be encoded in fewer bytes. */
        c_ptr = saved_c_ptr;
    } else
      ....
  }

For me this is far too many checks, and it affects compiler optimizations too much.
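
For reference, a self-contained C sketch of a validating backward step, written so the number of checks can be counted concretely. It is only an illustration under stated assumptions: the names lead_length and utf8_prev are hypothetical, the backward scan is capped at three continuation bytes, and an invalid byte is assumed to count as a one-byte unit. It is not code from PCRE or grep.

```c
#include <assert.h>
#include <stddef.h>

/* Length implied by a lead byte, or 0 if the byte cannot start a
   well-formed sequence (continuation bytes, 0xC0, 0xC1, 0xF5..0xFF). */
static size_t lead_length(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Step back one character from p (requires start < p).  Returns the
   start of the previous well-formed character, or p - 1 when the bytes
   before p do not form one, so an invalid byte is a unit by itself. */
const unsigned char *utf8_prev(const unsigned char *start,
                               const unsigned char *p)
{
    assert(start < p);
    const unsigned char *last = p - 1;
    const unsigned char *q = last;

    /* Look back over at most 3 continuation bytes: the scan is bounded. */
    for (int i = 0; i < 3 && q > start && (*q & 0xC0) == 0x80; i++)
        q--;

    size_t len = lead_length(*q);
    if (len == 0 || q + len != p)
        return last;              /* not a lead byte, or wrong length */

    /* Bytes q+1 .. last are known continuation bytes; only the second
       byte needs a range check for overlong, surrogate and > U+10FFFF. */
    if (len == 3 && ((*q == 0xE0 && q[1] < 0xA0) ||
                     (*q == 0xED && q[1] > 0x9F)))
        return last;
    if (len == 4 && ((*q == 0xF0 && q[1] < 0x90) ||
                     (*q == 0xF4 && q[1] > 0x8F)))
        return last;
    return q;
}
```

For an ASCII byte the loop body never runs and only the first comparison in lead_length executes; the range checks are reached only for three- and four-byte sequences.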

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 28 Sep 2014 10:11:59 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 28 06:11:59 2014
Received: from localhost ([127.0.0.1]:54105 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XYBSQ-0005YB-NW
	for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 06:11:59 -0400
Received: from eggs.gnu.org ([208.118.235.92]:39126)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XYBSN-0005Y2-Vi
 for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 06:11:56 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XYBSD-000432-C3
 for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 06:11:55 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: **
X-Spam-Status: No, score=2.4 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 MALFORMED_FREEMAIL,UNPARSEABLE_RELAY autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:55316)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XYBSD-00042r-8k
 for submit <at> debbugs.gnu.org; Sun, 28 Sep 2014 06:11:45 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40841)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XYBRz-0004Xv-T8
 for bug-grep@HIDDEN; Sun, 28 Sep 2014 06:11:40 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XYBRr-00041T-JE
 for bug-grep@HIDDEN; Sun, 28 Sep 2014 06:11:31 -0400
Received: from iwiw02d.mail.t-online.hu ([84.2.42.67]:46040)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XYBRr-000417-9u
 for bug-grep@HIDDEN; Sun, 28 Sep 2014 06:11:23 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by iwiw02d.mail.t-online.hu (Postfix) with SMTP id 86BB94875F2
 for <bug-grep@HIDDEN>; Sun, 28 Sep 2014 12:11:26 +0200 (CEST)
Received: (qmail 78673 invoked by uid 151); 28 Sep 2014 12:11:16 +0200
Received: from 195.228.245.211 (HELO fmxmldata04.freemail.hu) (193.226.212.27)
 by fmx24.freemail.hu with SMTP; 28 Sep 2014 12:11:16 +0200
Received: from webmail by smtp gw id s8SABGei083174;
 Sun, 28 Sep 2014 12:11:16 +0200 (CEST)
Date: Sun, 28 Sep 2014 12:11:16 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <54272400.1020704@HIDDEN>
Message-ID: <freemail.20140928121116.83173.3@HIDDEN>
X-Originating-IP: [193.226.212.27]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.4 (----)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

>> In the regex world, matching performance is the key aspect of an engine
>
>Absolutely.  That's why we're having this discussion: libpcre is slow when 
>matching binary data.

For me, the question is whether binary search needs to be supported at the PCRE level. There are other questions like this. People ask for string replacement support in PCRE from time to time. It could be implemented, of course; we could add a whole string-handling API to PCRE. But we feel this is outside the scope of PCRE. Any string management library can use PCRE for string replacement, so we don't need another one.

Binary matching is a similar thing. PCRE is used by some (I think closed-source) projects for network data filtering, which obviously involves binary data. They use some kind of pre-filtering, data arranging, and partial matching to check TCP stream data efficiently (without waiting for the whole stream to arrive).

>> A "simple" change like this would require a major redesign of the engine.
>
>It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

Anything that has an algorithmic solution can be done; that was never in question. The question is whether it is worth doing at the PCRE level or at a higher level. Perl/PCRE is all about text processing: characters that have meanings, types, sub-types, and other case(s). Unicode is also about defining characters, not binary data. UTF is an encoding format, not a mapping of random bytes to characters.

If this task were trivial, I wouldn't mind doing it myself. But it seems this task is about destroying what we have built so far: a lot of extra checks to process invalid bytes, a large increase in code size, and the removal of a lot of optimizations. The result might be much slower than using clever pre-filtering.

>> What should happen, if the starting offset is inside an otherwise valid UTF character?
>
>The same thing that would happen if an input file started with the tail end of a 
>UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already; 
>it's not a problem.

The question was about intermediate start offsets. You have a 100-byte buffer, and you start matching from byte 50. That byte is part of a valid UTF character. Your pattern starts with an invalid character, which matches that UTF fragment. You said invalid UTF characters match only themselves, not parts of other characters. Again, a lot of extra checks, preparing for the worst case.

>> This might be efficient for engines which scans the input only forward direction
>> and read every character once.
>
>It can also be efficient for matchers, like grep's, that don't necessarily do 
>that.  It just takes more implementation work, that's all.  It's not rocket 
>science to go backwards through a UTF-8 string and to catch decoding errors as 
>you go.

My problem is the large number of "ifs" you need to execute. Let's compare the current and the proposed solutions.

unsigned char *c_ptr; /* Current string position. */

Current:
  if (c_ptr == input_start)
    return FAIL;
  c_ptr--;
  while ((*c_ptr & 0xc0) == 0x80)
    c_ptr--;

Proposed solution:
  if (c_ptr == input_start)
    return FAIL;
  c_ptr--;
  unsigned char *saved_c_ptr = c_ptr; /* We need to save the starting position, losing a CPU register for that. */
  while ((*c_ptr & 0xc0) == 0x80) {
    if (c_ptr == input_start)
      return FAIL;
    c_ptr--;
  }

  /* We moved back a lot, and we don't know where we are. Check the character length. */
  int length = utf_length[*c_ptr]; /* Another lost register. The compiler's life is difficult. */
  if (c_ptr + length != saved_c_ptr + 1)
    c_ptr = saved_c_ptr;
  else {
    /* We need to check whether the character is encoded in the minimum number of bytes. */
    if (length == 1) {
      /* Great, nothing to do. */
    }
    else if (length == 2) {
      if (*c_ptr < 0xc2) /* Character is <= 127, can be encoded in a single byte. */
        c_ptr = saved_c_ptr;
    } else if (length == 3) {
      if (*c_ptr == 0xe0 && c_ptr[1] < 0xa0) /* Character is < 0x800, can be encoded in fewer bytes. */
        c_ptr = saved_c_ptr;
    } else
      ....
  }

For me this is far too many checks, and it interferes with compiler optimizations too much.
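
For reference, the checks in the proposed solution can be written out as a self-contained C function. This is only a sketch under assumptions: the names utf8_lead_length and utf8_prev are hypothetical and are not PCRE code, and every byte that is not part of a well-formed sequence is treated as a one-byte unit, as discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Expected sequence length for a lead byte, or 0 if the byte cannot
   start a well-formed sequence (continuation byte, 0xC0, 0xC1, or
   anything above 0xF4). */
static int utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 0;
}

/* Step back one unit from p, where start < p.  Returns the start of
   the well-formed UTF-8 sequence that ends just before p, or p - 1
   when the bytes before p do not form one (a lone invalid byte). */
static const unsigned char *
utf8_prev(const unsigned char *start, const unsigned char *p)
{
    assert(start < p);
    const unsigned char *last = p - 1;
    const unsigned char *q = last;
    int back = 0;

    /* Walk back over at most three continuation bytes. */
    while (q > start && (*q & 0xC0) == 0x80 && back < 3) {
        q--;
        back++;
    }

    /* The lead byte must announce exactly the bytes we walked over. */
    int len = utf8_lead_length(*q);
    if (len != back + 1)
        return last;

    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (len == 3) {
        if (q[0] == 0xE0 && q[1] < 0xA0) return last;
        if (q[0] == 0xED && q[1] > 0x9F) return last;
    } else if (len == 4) {
        if (q[0] == 0xF0 && q[1] < 0x90) return last;
        if (q[0] == 0xF4 && q[1] > 0x8F) return last;
    }
    return q;
}
```

Calling utf8_prev (buf, p) repeatedly from the end of a buffer should visit the same unit boundaries as a left-to-right decode that consumes one byte per error, at the cost of the extra branches counted above.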

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 22:36:48 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 27 18:36:48 2014
Received: from localhost ([127.0.0.1]:53995 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XY0bf-0005l2-5U
	for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 18:36:47 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:38366)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XY0bc-0005ks-5C
 for 18454 <at> debbugs.gnu.org; Sat, 27 Sep 2014 18:36:45 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 2C70A39E8015;
 Sat, 27 Sep 2014 15:36:43 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id AXgGeRZCknfA; Sat, 27 Sep 2014 15:36:38 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 13C6039E8011;
 Sat, 27 Sep 2014 15:36:38 -0700 (PDT)
Message-ID: <54273BF5.2060605@HIDDEN>
Date: Sat, 27 Sep 2014 15:36:37 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: Jim Meyering <jim@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN> <20140918083327.GA16324@nomada>
 <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
 <CA+8g5KHdnaEB=yYF5Kp7XCd7GgvjL73HeoOHNEPCAqy0KPs6+w@HIDDEN>
 <5424B1E6.8090502@HIDDEN>
 <CA+8g5KEcQ_Y-raQRmoiyz=SVMH-gSizMVdYutUYt6J33YMPzVQ@HIDDEN>
In-Reply-To: <CA+8g5KEcQ_Y-raQRmoiyz=SVMH-gSizMVdYutUYt6J33YMPzVQ@HIDDEN>
Content-Type: multipart/mixed; boundary="------------070206070402090608040206"
X-Spam-Score: -3.2 (---)
X-Debbugs-Envelope-To: 18454
Cc: =?UTF-8?B?U2FudGlhZ28gUnVhbm8gUmluY8Ozbg==?= <santiago@HIDDEN>,
 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.2 (---)

This is a multi-part message in MIME format.
--------------070206070402090608040206
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Jim Meyering wrote:

> I've pushed this follow-up patch to suppress a new warning:

Thanks, I expect I didn't get that warning because I built on x86-64, which 
allows unaligned accesses so GCC doesn't complain.  (Incidentally, I had already 
tried modifying the code to exploit the fact that unaligned accesses are OK on 
x86ish platforms, but that made the word-by-word loop go slower, so no dice.)

Too bad GCC isn't smart enough to notice that the pointer must be aligned.  It 
strikes me that this problem must come up elsewhere, and that it's worth writing 
a macro to encapsulate the situation.  I pushed the attached follow-up patch, 
which is an attempt to move in that direction.
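
To illustrate the kind of word-by-word loop being discussed, here is a minimal sketch, not the code from the attached patch. The names skip_ascii and hibyte_mask are used here only for illustration, the function takes an explicit end pointer, and it loads each word with memcpy, which sidesteps the pointer cast (and therefore the -Wcast-align warning) instead of wrapping the cast in a macro.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintmax_t uword;

/* Return a pointer to the first byte in [p, end) whose high bit is
   set, or end if there is no such byte. */
static const char *
skip_ascii(const char *p, const char *end)
{
    /* UINTMAX_MAX / 0xFF is 0x0101...01, so this has 0x80 in every byte. */
    const uword hibyte_mask = (UINTMAX_MAX / 0xFF) * 0x80;

    /* Byte by byte until p is aligned for uword. */
    for (; p < end && (uintptr_t) p % sizeof (uword) != 0; p++)
        if (*p & 0x80)
            return p;

    /* Word by word while a whole word remains. */
    while ((size_t) (end - p) >= sizeof (uword)) {
        uword w;
        memcpy(&w, p, sizeof w);   /* a plain load; no pointer cast needed */
        if (w & hibyte_mask)
            break;
        p += sizeof w;
    }

    /* Byte by byte through the final word or the tail. */
    for (; p < end; p++)
        if (*p & 0x80)
            return p;
    return end;
}
```

Whether the memcpy form is as fast as an aligned cast depends on the compiler and target, so it would need measuring before drawing any conclusion.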

--------------070206070402090608040206
Content-Type: text/plain; charset=UTF-8;
 name="0001-maint-generalize-the-Wcast-align-fix.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="0001-maint-generalize-the-Wcast-align-fix.patch"

RnJvbSA2MTMzYjhlMDBhODg3NmFhYTY5Y2U1MWQyZmUyNWExNzA0MGFmZTFhIE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBTYXQsIDI3IFNlcCAyMDE0IDE1OjMxOjEyIC0wNzAwClN1YmplY3Q6IFtQQVRD
SF0gbWFpbnQ6IGdlbmVyYWxpemUgdGhlIC1XY2FzdC1hbGlnbiBmaXgKCiogc3JjL2dyZXAu
YyAoQ0FTVF9BTElHTkVEKTogTmV3IG1hY3JvLgooc2tpcF9lYXN5X2J5dGVzKTogVXNlIGl0
LgotLS0KIHNyYy9ncmVwLmMgfCAyNCArKysrKysrKysrKysrKysrLS0tLS0tLS0KIDEgZmls
ZSBjaGFuZ2VkLCAxNiBpbnNlcnRpb25zKCspLCA4IGRlbGV0aW9ucygtKQoKZGlmZiAtLWdp
dCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IDIwN2JkZWEuLmJiNWJhMWMgMTAw
NjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAuYwpAQCAtNDY5LDYgKzQ2OSwy
MSBAQCBpbml0X2Vhc3lfZW5jb2RpbmcgKHZvaWQpCiAgICAgZWFzeV9lbmNvZGluZyAmPSBt
YmNsZW5fY2FjaGVbaV0gPT0gMTsKIH0KIAorLyogQSBjYXN0IHRvIFRZUEUgb2YgVkFMLiAg
VXNlIHRoaXMgd2hlbiBUWVBFIGlzIGEgcG9pbnRlciB0eXBlLCBWQUwKKyAgIGlzIHByb3Bl
cmx5IGFsaWduZWQgZm9yIFRZUEUsIGFuZCAnZ2NjIC1XY2FzdC1hbGlnbicgY2Fubm90IGlu
ZmVyCisgICB0aGUgYWxpZ25tZW50IGFuZCB3b3VsZCBvdGhlcndpc2UgY29tcGxhaW4gYWJv
dXQgdGhlIGNhc3QuICAqLworI2lmIDQgPCBfX0dOVUNfXyArICg2IDw9IF9fR05VQ19NSU5P
Ul9fKQorIyBkZWZpbmUgQ0FTVF9BTElHTkVEKHR5cGUsIHZhbCkgICAgICAgICAgICAgICAg
ICAgICAgICAgICBcCisgICAgKHsgX190eXBlb2ZfXyAodmFsKSB2YWxfID0gdmFsOyAgICAg
ICAgICAgICAgICAgICAgICAgIFwKKyAgICAgICBfUHJhZ21hICgiR0NDIGRpYWdub3N0aWMg
cHVzaCIpICAgICAgICAgICAgICAgICAgICAgXAorICAgICAgIF9QcmFnbWEgKCJHQ0MgZGlh
Z25vc3RpYyBpZ25vcmVkIFwiLVdjYXN0LWFsaWduXCIiKSBcCisgICAgICAgKHR5cGUpIHZh
bF87ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIFwKKyAgICAgICBf
UHJhZ21hICgiR0NDIGRpYWdub3N0aWMgcG9wIikgICAgICAgICAgICAgICAgICAgICAgXAor
ICAgIH0pCisjZWxzZQorIyBkZWZpbmUgQ0FTVF9BTElHTkVEKHR5cGUsIHZhbCkgKCh0eXBl
KSAodmFsKSkKKyNlbmRpZgorCiAvKiBBbiB1bnNpZ25lZCB0eXBlIHN1aXRhYmxlIGZvciBm
YXN0IG1hdGNoaW5nLiAgKi8KIHR5cGVkZWYgdWludG1heF90IHV3b3JkOwogCkBAIC00OTYs
MTUgKzUxMSw4IEBAIHNraXBfZWFzeV9ieXRlcyAoY2hhciBjb25zdCAqYnVmKQogICBmb3Ig
KHAgPSBidWY7ICh1aW50cHRyX3QpIHAgJSBzaXplb2YgKHV3b3JkKSAhPSAwOyBwKyspCiAg
ICAgaWYgKCpwICYgSElCWVRFKQogICAgICAgcmV0dXJuIHA7Ci0KLSNwcmFnbWEgR0NDIGRp
YWdub3N0aWMgcHVzaAotI3ByYWdtYSBHQ0MgZGlhZ25vc3RpYyBpZ25vcmVkICItV2Nhc3Qt
YWxpZ24iCi0gIC8qIFdlIGhhdmUgYWxpZ25lZCBQIHRvIGEgdXdvcmQgYm91bmRhcnksIHNv
IHdlIGNhbiBzYWZlbHkKLSAgICAgdGVsbCBnY2MgdG8gc3VwcHJlc3MgaXRzIGNhc3QtYWxp
Z25tZW50IHdhcm5pbmcuICAqLwotICBmb3IgKHMgPSAodXdvcmQgY29uc3QgKikgcDsgISAo
KnMgJiBoaWJ5dGVfbWFzayk7IHMrKykKKyAgZm9yIChzID0gQ0FTVF9BTElHTkVEICh1d29y
ZCBjb25zdCAqLCBwKTsgISAoKnMgJiBoaWJ5dGVfbWFzayk7IHMrKykKICAgICBjb250aW51
ZTsKLSNwcmFnbWEgR0NDIGRpYWdub3N0aWMgcG9wCi0KICAgZm9yIChwID0gKGNoYXIgY29u
c3QgKikgczsgISAoKnAgJiBISUJZVEUpOyBwKyspCiAgICAgY29udGludWU7CiAgIHJldHVy
biBwOwotLSAKMS45LjMKCg==
--------------070206070402090608040206--




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 27 Sep 2014 20:55:27 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 27 16:55:27 2014
Received: from localhost ([127.0.0.1]:53969 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXz1Z-0003MF-Ty
	for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 16:55:26 -0400
Received: from eggs.gnu.org ([208.118.235.92]:40246)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XXz1W-0003M4-0v
 for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 16:55:22 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXz1M-0007ld-2o
 for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 16:55:21 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([208.118.235.17]:34594)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXz1L-0007kW-VF
 for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 16:55:11 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41937)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXz19-0001Vk-9k
 for bug-grep@HIDDEN; Sat, 27 Sep 2014 16:55:06 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXz11-0007Xy-MJ
 for bug-grep@HIDDEN; Sat, 27 Sep 2014 16:54:59 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:54354)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXz11-0007Wk-Er
 for bug-grep@HIDDEN; Sat, 27 Sep 2014 16:54:51 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 7355639E8014;
 Sat, 27 Sep 2014 13:54:38 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id XCXEgSYvvXcj; Sat, 27 Sep 2014 13:54:29 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id A3C6739E8011;
 Sat, 27 Sep 2014 13:54:29 -0700 (PDT)
Message-ID: <54272400.1020704@HIDDEN>
Date: Sat, 27 Sep 2014 13:54:24 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <freemail.20140927201645.67744.3@HIDDEN>
In-Reply-To: <freemail.20140927201645.67744.3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 208.118.235.17
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.0 (----)

Zoltán Herczeg wrote:
> He said 'I still want "." to match a single (valid) UTF-8 character.'

That's what the GNU matchers do, yes.  '.' does not match an invalid byte.  It's
a reasonable default.  If you have some users who want '.' to match an invalid
byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users
who want '.' to match newline.  That being said, I doubt whether users will care
enough to need such a flag.  (After all, they're evidently not caring *now*, as
libpcre can't search such data at *all*.)

> In the regex world, matching performance is the key aspect of an engine

Absolutely.  That's why we're having this discussion: libpcre is slow when
matching binary data.

> A "simple" change like this would require a major redesign of the engine.

It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

> What should happen, if the starting offset is inside an otherwise valid UTF character?

The same thing that would happen if an input file started with the tail end of a
UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already;
it's not a problem.

>> Filtering would not be needed if libpcre were like grep's other matchers
>> and simply worked with arbitrary binary data.
>
> This might be efficient for engines which scans the input only forward direction
> and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do
that.  It just takes more implementation work, that's all.  It's not rocket
science to go backwards through a UTF-8 string and to catch decoding errors as
you go.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 20:54:47 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 27 16:54:47 2014
Received: from localhost ([127.0.0.1]:53965 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXz0w-0003Kn-Hc
	for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 16:54:47 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:35712)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XXz0p-0003KW-MY
 for 18454 <at> debbugs.gnu.org; Sat, 27 Sep 2014 16:54:41 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 7355639E8014;
 Sat, 27 Sep 2014 13:54:38 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id XCXEgSYvvXcj; Sat, 27 Sep 2014 13:54:29 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id A3C6739E8011;
 Sat, 27 Sep 2014 13:54:29 -0700 (PDT)
Message-ID: <54272400.1020704@HIDDEN>
Date: Sat, 27 Sep 2014 13:54:24 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <freemail.20140927201645.67744.3@HIDDEN>
In-Reply-To: <freemail.20140927201645.67744.3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -3.2 (---)
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.2 (---)

Zoltán Herczeg wrote:
> He said 'I still want "." to match a single (valid) UTF-8 character.'

That's what the GNU matchers do, yes.  '.' does not match an invalid byte.  It's 
a reasonable default.  If you have some users who want '.' to match an invalid 
byte, you can add a flag for them, just as there's a PCRE_DOTALL flag for users 
who want '.' to match newline.  That being said, I doubt whether users will care 
enough to need such a flag.  (After all, they're evidently not caring *now*, as 
libpcre can't search such data at *all*.)

> In the regex world, matching performance is the key aspect of an engine

Absolutely.  That's why we're having this discussion: libpcre is slow when 
matching binary data.

> A "simple" change like this would require a major redesign of the engine.

It'd be nontrivial, yes.  But it's clearly doable.  (Not that I'm volunteering....)

> What should happen, if the starting offset is inside an otherwise valid UTF character?

The same thing that would happen if an input file started with the tail end of a 
UTF-8 sequence.  The leading bytes are invalid.  'grep' deals with this already; 
it's not a problem.

>> Filtering would not be needed if libpcre were like grep's other matchers
>> and simply worked with arbitrary binary data.
>
> This might be efficient for engines which scans the input only forward direction
> and read every character once.

It can also be efficient for matchers, like grep's, that don't necessarily do 
that.  It just takes more implementation work, that's all.  It's not rocket 
science to go backwards through a UTF-8 string and to catch decoding errors as 
you go.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 27 Sep 2014 18:24:43 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 27 14:24:43 2014
Received: from localhost ([127.0.0.1]:53894 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXwfi-0006ib-Ia
	for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 14:24:43 -0400
Received: from eggs.gnu.org ([208.118.235.92]:49775)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XXwfg-0006iS-0H
 for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 14:24:40 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXwfO-0006uz-FO
 for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 14:24:39 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: *
X-Spam-Status: No, score=1.4 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 MALFORMED_FREEMAIL,UNPARSEABLE_RELAY autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:53870)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXwfO-0006ty-CV
 for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 14:24:22 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50707)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXwYI-00081W-PL
 for bug-grep@HIDDEN; Sat, 27 Sep 2014 14:17:11 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXwYA-00057n-FD
 for bug-grep@HIDDEN; Sat, 27 Sep 2014 14:17:02 -0400
Received: from iwiw02d.mail.t-online.hu ([84.2.42.67]:14608)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXwYA-000566-5N
 for bug-grep@HIDDEN; Sat, 27 Sep 2014 14:16:54 -0400
Received: from fmx25.freemail.hu (fmx25.freemail.hu [195.228.245.75])
 by iwiw02d.mail.t-online.hu (Postfix) with SMTP id DD634487101
 for <bug-grep@HIDDEN>; Sat, 27 Sep 2014 20:16:54 +0200 (CEST)
Received: (qmail 38138 invoked by uid 151); 27 Sep 2014 20:16:45 +0200
Received: from 195.228.245.211 (HELO fmxmldata01.freemail.hu) (91.82.212.146)
 by fmx25.freemail.hu with SMTP; 27 Sep 2014 20:16:45 +0200
Received: from webmail by smtp gw id s8RIGjrK067745;
 Sat, 27 Sep 2014 20:16:45 +0200 (CEST)
Date: Sat, 27 Sep 2014 20:16:45 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <5425BC8D.9040305@HIDDEN>
Message-ID: <freemail.20140927201645.67744.3@HIDDEN>
X-Originating-IP: [91.82.212.146]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.4 (----)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Hi,

>Sorry, I assume you meant \x9c here?  Anyway, the point is that 
>conceptually you walk through the input byte sequence left-to-right, 
>converting it to characters as you go, and if you encounter an encoding 
>error in the process you convert the error to the corresponding 
>"character" outside the Unicode range.  You then do all matching against 
>the converted sequence.  So there is no question about interpretation: 
>it's the left-to-right interpretation.  This simple and easy-to-explain 
>approach is used by grep's other matchers, by Emacs, etc.

This was one of my proposals: we need a converter before we run PCRE. To be more precise, we likely need several converters, and users can select the appropriate one for their use case.
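
As a rough illustration of such a converter, the following C sketch decodes a byte buffer left to right and maps every byte that is not part of a well-formed UTF-8 sequence to a value above the Unicode range. The names decode_one, utf8_decode_lossless and INVALID_BASE, and the particular mapping INVALID_BASE + byte, are assumptions made for this example; neither grep nor PCRE is claimed to work this way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping: an invalid byte b becomes INVALID_BASE + b,
   which lies just above the last Unicode code point, U+10FFFF. */
#define INVALID_BASE 0x110000u

/* Decode one unit from s[0..n), n > 0.  Stores the code point (or the
   out-of-range stand-in for an invalid byte) in *out and returns the
   number of bytes consumed. */
static size_t
decode_one(const unsigned char *s, size_t n, uint32_t *out)
{
    unsigned char b = s[0];
    size_t len = 0;
    uint32_t cp = 0, min = 0;

    if (b < 0x80) {
        *out = b;
        return 1;
    }
    if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }

    int ok = len != 0 && n >= len;
    for (size_t i = 1; ok && i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            ok = 0;
        else
            cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (ok && (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)))
        ok = 0;

    if (!ok) {
        *out = INVALID_BASE + b;   /* consume only the offending byte */
        return 1;
    }
    *out = cp;
    return len;
}

/* Convert the whole buffer; out must have room for n entries.
   Returns the number of units written. */
static size_t
utf8_decode_lossless(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; )
        i += decode_one(s + i, n - i, &out[k++]);
    return k;
}
```

Because each invalid byte gets its own distinct value, the original bytes can be recovered from the converted sequence, but the matcher then has to decide how patterns such as "." treat those extra values.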

>Obviously you don't want to *implement* it the way I described; instead, 
>you want to convert on-the-fly, lazily.  But whatever optimizations you 
>do, you do consistently with the conceptual model.

I would implement it exactly as you described. PCRE is a complex backtracking engine: we need to decode the input in both forward and backward directions from any starting position, and several characters are parsed multiple times, depending on the pattern and input. We also employ many optimizations to make this as fast as possible, especially in JIT. The overhead of decoding invalid characters accumulates for every character, regardless of whether it is valid or not.

>In practice, the simple approach explained above works well enough to 
>satisfy the vast majority of users.  It's conceivable some special cases 
>in the PCRE world would have trouble fitting into this model, but to be 
>honest I expect this won't be a problem, and that there won't be any 
>serious conceptual issues here, though admittedly there will be some 
>nontrivial programming effort.

The approach might sound simple, but its side effects are non-trivial. For example, if we implemented it the way suggested before, the person you quoted would not be satisfied. He said 'I still want "." to match a single (valid) UTF-8 character.' The dot character matches anything but newline. According to you, the invalid code points should have a "minimal" type, so they would match.

In the regex world, matching performance is the key aspect of an engine, and this is our focus in PCRE. But every advantage has a trade-off. In PCRE, this is source code complexity. A "simple" change like this would require a major redesign of the engine.

>Again, the proposed change should not slow down libpcre.  It should 
>speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre 
>could use exactly the same code it has now, so performance would be 
>unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should 
>typically be faster than it is now, because it would avoid unnecessary 
>UTF-8 validation for the parts of the input string that it does not examine.

Partial matching was invented exactly for this purpose. You can divide the input into small chunks, filter them, and perform matches. By the way, how is partial matching affected by your proposed solution? What should happen, if the starting offset is inside an otherwise valid UTF character?
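
One small piece of such chunking can be sketched in C: deciding where to cut a chunk so that a multi-byte character is not split across two chunks. This is only an illustration with a hypothetical helper name, utf8_complete_prefix; it is not PCRE's partial matching interface, and it does not validate the chunk beyond looking at the last few bytes.

```c
#include <stddef.h>

/* Return the length of the longest prefix of buf[0..n) that does not
   end in the middle of a multi-byte sequence.  The remaining bytes,
   if any, are an incomplete tail to be carried over to the next chunk. */
static size_t
utf8_complete_prefix(const unsigned char *buf, size_t n)
{
    /* A lead byte can be at most three bytes before the end of an
       incomplete sequence, so look back no further than that. */
    for (size_t back = 1; back <= 3 && back <= n; back++) {
        unsigned char b = buf[n - back];
        if ((b & 0xC0) == 0x80)
            continue;                   /* continuation byte: keep looking */
        /* Bytes the lead byte announces (1 for ASCII). */
        size_t need = b >= 0xF0 ? 4 : b >= 0xE0 ? 3 : b >= 0xC0 ? 2 : 1;
        return need > back ? n - back : n;
    }
    return n;   /* empty, or only continuation bytes in range: do not cut */
}
```

A caller would match or filter the first utf8_complete_prefix (buf, n) bytes and prepend the leftover tail to the next chunk, so a start offset never lands inside a character that was well-formed in the original stream.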

>Filtering would not be needed if libpcre were like grep's other matchers 
>and simply worked with arbitrary binary data.

This might be efficient for engines that scan the input only in the forward direction and read every character once. This is not true for PCRE.

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 18:16:53 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 27 14:16:53 2014
Received: from localhost ([127.0.0.1]:53890 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXwY7-0006WR-W6
	for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 14:16:52 -0400
Received: from iwiw01d.mail.t-online.hu ([84.2.42.53]:36244
 helo=fmxout01.freemail.hu) by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XXwY3-0006WG-N4
 for 18454 <at> debbugs.gnu.org; Sat, 27 Sep 2014 14:16:49 -0400
Received: from fmx25.freemail.hu (fmx25.freemail.hu [195.228.245.75])
 by fmxout01.freemail.hu (Postfix) with SMTP id 420AE361C
 for <18454 <at> debbugs.gnu.org>; Sat, 27 Sep 2014 20:16:45 +0200 (CEST)
Received: (qmail 38138 invoked by uid 151); 27 Sep 2014 20:16:45 +0200
Received: from 195.228.245.211 (HELO fmxmldata01.freemail.hu) (91.82.212.146)
 by fmx25.freemail.hu with SMTP; 27 Sep 2014 20:16:45 +0200
Received: from webmail by smtp gw id s8RIGjrK067745;
 Sat, 27 Sep 2014 20:16:45 +0200 (CEST)
Date: Sat, 27 Sep 2014 20:16:45 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <5425BC8D.9040305@HIDDEN>
Message-ID: <freemail.20140927201645.67744.3@HIDDEN>
X-Originating-IP: [91.82.212.146]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-Spam-Score: 0.6 (/)
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

Hi,

>Sorry, I assume you meant \x9c here?  Anyway, the point is that 
>conceptually you walk through the input byte sequence left-to-right, 
>converting it to characters as you go, and if you encounter an encoding 
>error in the process you convert the error to the corresponding 
>"character" outside the Unicode range.  You then do all matching against 
>the converted sequence.  So there is no question about interpretation: 
>it's the left-to-right interpretation.  This simple and easy-to-explain 
>approach is used by grep's other matchers, by Emacs, etc.

This was one of my proposals: we need a converter before we run PCRE. To be more precise, we likely need several converters, and users can select the appropriate one for their use case.

>Obviously you don't want to *implement* it the way I described; instead, 
>you want to convert on-the-fly, lazily.  But whatever optimizations you 
>do, you do consistently with the conceptual model.

I would implement it exactly as you described. PCRE is a complex backtracking engine: we need to decode the input in both forward and backward directions from any starting position, and several characters are parsed multiple times depending on the pattern and input. We also employ many optimizations to make this as fast as possible, especially in the JIT. The overhead of handling invalid characters accumulates for every character, regardless of whether it is valid or not.

>In practice, the simple approach explained above works well enough to 
>satisfy the vast majority of users.  It's conceivable some special cases 
>in the PCRE world would have trouble fitting into this model, but to be 
>honest I expect this won't be a problem, and that there won't be any 
>serious conceptual issues here, though admittedly there will be some 
>nontrivial programming effort.

The approach might sound simple, but its side effects are non-trivial. For example, if we implemented it the way suggested before, the person you quoted would not be satisfied. He said 'I still want "." to match a single (valid) UTF-8 character.' The dot matches any character except newline. According to you, invalid code points should have a "minimal" type, so they would match too.

In the regex world, matching performance is the key aspect of an engine, and this is our focus in PCRE. But every advantage has a trade-off. In PCRE, it is source code complexity. A "simple" change like this would require a major redesign of the engine.

>Again, the proposed change should not slow down libpcre.  It should 
>speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre 
>could use exactly the same code it has now, so performance would be 
>unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should 
>typically be faster than it is now, because it would avoid unnecessary 
>UTF-8 validation for the parts of the input string that it does not examine.

Partial matching was invented exactly for this purpose. You can divide the input into small chunks, filter them, and perform matches. Btw, how is partial matching affected by your proposed solution? What should happen if the starting offset is inside an otherwise valid UTF character?

>Filtering would not be needed if libpcre were like grep's other matchers 
>and simply worked with arbitrary binary data.

This might be efficient for engines that scan the input only in the forward direction and read every character once. This is not true for PCRE.

Regards,
Zoltan
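
The backward-decoding concern raised in this message can be made concrete with a small sketch. This is not PCRE code: `prev_char_start` is a hypothetical helper, and it assumes the buffer holds valid UTF-8. It shows the cheap case, where stepping back one character means skipping at most three continuation bytes (bit pattern 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.  Given an offset POS into BUF
   (POS > 0), return the offset at which the character ending just
   before POS starts.  Assumes BUF is valid UTF-8.  */
static size_t
prev_char_start (const unsigned char *buf, size_t pos)
{
  assert (buf != NULL && pos > 0);
  size_t i = pos - 1;
  int steps = 0;

  /* Step back over continuation bytes; a UTF-8 character has at most
     three of them after its lead byte.  */
  while (i > 0 && steps < 3 && (buf[i] & 0xC0) == 0x80)
    {
      i--;
      steps++;
    }
  return i;
}
```

For the bytes \xd5 \x9c '#', stepping back from offset 2 lands on offset 0, so the character before '#' is the two-byte sequence \xd5\x9c and not a lone \x9c. Once invalid input is allowed, a backward step would also have to verify that the byte it lands on is a lead byte that really claims those continuation bytes, which is the extra per-character work the message is worried about.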





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 27 Sep 2014 17:52:41 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Sep 27 13:52:41 2014
Received: from localhost ([127.0.0.1]:53877 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXwAi-0005ug-3d
	for submit <at> debbugs.gnu.org; Sat, 27 Sep 2014 13:52:40 -0400
Received: from mail-lb0-f169.google.com ([209.85.217.169]:49791)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <meyering@HIDDEN>) id 1XXwAf-0005uR-1w
 for 18454 <at> debbugs.gnu.org; Sat, 27 Sep 2014 13:52:38 -0400
Received: by mail-lb0-f169.google.com with SMTP id u10so2603011lbd.14
 for <18454 <at> debbugs.gnu.org>; Sat, 27 Sep 2014 10:52:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type;
 bh=RGLCmnsLsHk8uv6nuL4NnEuS+GB66112nKROZVrIAWw=;
 b=AQ9mqXdsz6t909OH5GtrbuQnIxNPP+OXexqXT7ZQcS8B5wBcJDbyp0q/++m0Xlz4tb
 hVCFiZV55KFHP5efxCrQaOuGXruaQWQdSWcTgSwHSRNwJ7JH0CSnaAYgLGbUvfXmHriz
 aCoO96VkzqXfeH5DsPIDMnRf7ysKM6OKau8SBFMlwLW+UkO9w3GK8kyR7qp09XadLeYH
 /IzN86F0GYS2hsNtZ2g9fgrX8ypJFT2VjhGepJV49QJ02Bp+OId797Jq0iJ9/gnwlGfG
 X1LrjP9jR1Lu3NVnkgO+oq6fcAJYH4nYdQXuoXKyOFLjDaEt8V5Ckcyx/sve3JoNZlgJ
 VBnQ==
X-Received: by 10.152.87.193 with SMTP id ba1mr4895393lab.83.1411840355593;
 Sat, 27 Sep 2014 10:52:35 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.25.23.89 with HTTP; Sat, 27 Sep 2014 10:52:15 -0700 (PDT)
In-Reply-To: <5424B1E6.8090502@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN> <20140918083327.GA16324@nomada>
 <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
 <CA+8g5KHdnaEB=yYF5Kp7XCd7GgvjL73HeoOHNEPCAqy0KPs6+w@HIDDEN>
 <5424B1E6.8090502@HIDDEN>
From: Jim Meyering <jim@HIDDEN>
Date: Sat, 27 Sep 2014 10:52:15 -0700
X-Google-Sender-Auth: dU-j5j3A9z-jgSEzlbx10klNN5A
Message-ID: <CA+8g5KEcQ_Y-raQRmoiyz=SVMH-gSizMVdYutUYt6J33YMPzVQ@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
Content-Type: multipart/mixed; boundary=001a11c345c2e00fbe05040fb2e8
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 18454
Cc: =?ISO-8859-1?Q?Santiago_Ruano_Rinc=F3n?= <santiago@HIDDEN>,
 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.7 (/)

--001a11c345c2e00fbe05040fb2e8
Content-Type: text/plain; charset=ISO-8859-1

On Thu, Sep 25, 2014 at 5:23 PM, Paul Eggert <eggert@HIDDEN> wrote:
> Thanks for looking into that.  The attached patches solve those performance
> problems for me.

I've pushed this follow-up patch to suppress a new warning:

--001a11c345c2e00fbe05040fb2e8
Content-Type: application/octet-stream; 
	name="0001-maint-suppress-a-false-positive-Wcast-align-warning.patch"
Content-Disposition: attachment; 
	filename="0001-maint-suppress-a-false-positive-Wcast-align-warning.patch"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_i0l99gju2

From 3f99ff01eabd07d460cd0c08602aeb34e38e0d2b Mon Sep 17 00:00:00 2001
From: Jim Meyering <meyering@HIDDEN>
Date: Sat, 27 Sep 2014 09:44:47 -0700
Subject: [PATCH] maint: suppress a false-positive -Wcast-align warning

Building with --enable-gcc-warnings and gcc-4.9.1 would provoke this:
  grep.c:499:12: error: cast from 'const char *' to 'const uword *'\
      (aka 'const unsigned long *') increases required alignment from\
      1 to 8 [-Werror,-Wcast-align]
    for (s = (uword const *) p; ! (*s & hibyte_mask); s++)
	     ^~~~~~~~~~~~~~~~
* src/grep.c (skip_easy_bytes): Use a pragma to suppress
gcc's false-positive cast-alignment warning.
---
 src/grep.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/grep.c b/src/grep.c
index 046f17f..207bdea 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -496,8 +496,15 @@ skip_easy_bytes (char const *buf)
   for (p = buf; (uintptr_t) p % sizeof (uword) != 0; p++)
     if (*p & HIBYTE)
       return p;
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wcast-align"
+  /* We have aligned P to a uword boundary, so we can safely
+     tell gcc to suppress its cast-alignment warning.  */
   for (s = (uword const *) p; ! (*s & hibyte_mask); s++)
     continue;
+#pragma GCC diagnostic pop
+
   for (p = (char const *) s; ! (*p & HIBYTE); p++)
     continue;
   return p;
-- 
2.1.0

--001a11c345c2e00fbe05040fb2e8--
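
For readers unfamiliar with the technique the patched function relies on, here is a minimal sketch of a word-at-a-time scan for the first byte with its high bit set. It is not grep's `skip_easy_bytes`: the name `first_high_byte` and the explicit length parameter are assumptions made for illustration, and the sketch loads words with `memcpy` instead of the pointer cast used in the patch, so the -Wcast-align question does not arise here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not grep's skip_easy_bytes.  Return a pointer to
   the first byte in BUF[0..LEN) whose high bit is set, or BUF + LEN if
   there is none.  */
static const char *
first_high_byte (const char *buf, size_t len)
{
  assert (buf != NULL);
  const char *p = buf;
  const char *end = buf + len;

  /* A word with 0x80 in every byte, e.g. 0x8080808080808080.  */
  const uintptr_t hibyte_mask = (uintptr_t) -1 / 0xff * 0x80;

  /* Byte by byte until P reaches a word boundary, mirroring the
     structure of the patched function.  */
  for (; p < end && (uintptr_t) p % sizeof (uintptr_t) != 0; p++)
    if (*p & 0x80)
      return p;

  /* Word by word while a whole word remains and no high bit is set.  */
  while ((size_t) (end - p) >= sizeof (uintptr_t))
    {
      uintptr_t w;
      memcpy (&w, p, sizeof w);
      if (w & hibyte_mask)
        break;
      p += sizeof w;
    }

  /* Byte by byte for the tail, or to locate the byte inside the word.  */
  for (; p < end; p++)
    if (*p & 0x80)
      return p;
  return end;
}
```

The middle loop tests a whole word per iteration, which is why plain ASCII input can be skipped quickly before any multibyte handling is needed.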




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 26 Sep 2014 19:25:01 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 26 15:25:01 2014
Received: from localhost ([127.0.0.1]:53303 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXb8W-0008Jd-2X
	for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 15:25:00 -0400
Received: from eggs.gnu.org ([208.118.235.92]:59165)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XXb8T-0008JV-CF
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 15:24:58 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXb8J-0001Xf-8G
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 15:24:57 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([208.118.235.17]:58602)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXb59-0000o0-M3
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 15:21:31 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60208)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXb4t-0008AA-HQ
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 15:21:26 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXb4f-0000gr-Ai
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 15:21:15 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:39141)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <eggert@HIDDEN>) id 1XXb4f-0000eA-16
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 15:21:01 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 199E6A60001;
 Fri, 26 Sep 2014 12:20:55 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id SiFITjrpXlTK; Fri, 26 Sep 2014 12:20:46 -0700 (PDT)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 4CEB439E8011;
 Fri, 26 Sep 2014 12:20:46 -0700 (PDT)
Message-ID: <5425BC8D.9040305@HIDDEN>
Date: Fri, 26 Sep 2014 12:20:45 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <freemail.20140926200433.55424.3@HIDDEN>
In-Reply-To: <freemail.20140926200433.55424.3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 208.118.235.17
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.0 (----)

On 09/26/2014 11:04 AM, Zoltán Herczeg wrote:
> this is a very interesting discussion.

Yes, I have a lot of other things I'm *supposed* to be doing, but this
thread is more fun....

>>> /(?<=\x9c)#/
>>>
>>> Does it match \xd5\x9c# starting from #?
>> No, because the input does not contain a \x9c encoding error.  Encoding errors
>> match only themselves, not parts of other characters.  That is how the glibc
>> matchers behave, and it's what users expect.
> Why \xc9 is part of another character? It depends how you interpret \xd5.

Sorry, I assume you meant \x9c here?  Anyway, the point is that
conceptually you walk through the input byte sequence left-to-right,
converting it to characters as you go, and if you encounter an encoding
error in the process you convert the error to the corresponding
"character" outside the Unicode range.  You then do all matching against
the converted sequence.  So there is no question about interpretation:
it's the left-to-right interpretation.  This simple and easy-to-explain
approach is used by grep's other matchers, by Emacs, etc.

Obviously you don't want to *implement* it the way I described; instead,
you want to convert on-the-fly, lazily.  But whatever optimizations you
do, you do consistently with the conceptual model.

> The problem is, you do it some way, and others need something else.

In practice, the simple approach explained above works well enough to
satisfy the vast majority of users.  It's conceivable some special cases
in the PCRE world would have trouble fitting into this model, but to be
honest I expect this won't be a problem, and that there won't be any
serious conceptual issues here, though admittedly there will be some
nontrivial programming effort.
.
> I have doubts that slowing down PCRE would increase grep performance.

Again, the proposed change should not slow down libpcre.  It should
speed it up.  That's the point.  In the PCRE_NO_UTF8_CHECK case, libpcre
could use exactly the same code it has now, so performance would be
unaffected.  And in the non-PCRE_NO_UTF8_CHECK case, libpcre should
typically be faster than it is now, because it would avoid unnecessary
UTF-8 validation for the parts of the input string that it does not examine.


> This is exactly the use case where filtering is needed. His input is a
> combination of binary and UTF data, and he needs matches only in the
> UTF part. Regards, Zoltan

Filtering would not be needed if libpcre were like grep's other matchers
and simply worked with arbitrary binary data.
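
As an illustration of the left-to-right conceptual model described in this message, here is a minimal sketch of a decoder that turns each encoding-error byte into a "character" above the Unicode range. It is not code from grep, glibc, Emacs, or libpcre. The name `decode_one` and the mapping `ERROR_BASE + byte` are assumptions chosen for the example; the thread does not specify which out-of-range values would be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping: encoding-error byte B becomes ERROR_BASE + B,
   just above the last Unicode code point, U+10FFFF.  */
#define ERROR_BASE 0x110000u

/* Decode one character from S[0..N), N > 0, scanning left to right.
   Store its code point in *CP and return the number of bytes consumed.
   An encoding error consumes exactly one byte.  */
static size_t
decode_one (const unsigned char *s, size_t n, uint32_t *cp)
{
  assert (s != NULL && n > 0 && cp != NULL);
  unsigned char b = s[0];
  size_t need = 0;           /* continuation bytes expected */
  uint32_t c = b, min = 0;   /* MIN rejects overlong forms */

  if (b < 0x80)
    {
      *cp = b;
      return 1;
    }
  if ((b & 0xE0) == 0xC0)
    { need = 1; c = b & 0x1F; min = 0x80; }
  else if ((b & 0xF0) == 0xE0)
    { need = 2; c = b & 0x0F; min = 0x800; }
  else if ((b & 0xF8) == 0xF0)
    { need = 3; c = b & 0x07; min = 0x10000; }

  /* NEED == 0 here means a stray continuation or invalid lead byte.  */
  int ok = need != 0 && need < n;
  for (size_t i = 1; ok && i <= need; i++)
    {
      ok = (s[i] & 0xC0) == 0x80;
      c = (c << 6) | (s[i] & 0x3F);
    }
  if (ok && c >= min && c <= 0x10FFFF && !(c >= 0xD800 && c <= 0xDFFF))
    {
      *cp = c;
      return need + 1;
    }
  *cp = ERROR_BASE + b;
  return 1;
}
```

Under this sketch the input \xd5\x9c# decodes to U+055C followed by '#', so it contains no \x9c error character and /(?<=\x9c)#/ would not match, whereas the input \x9c# decodes to the error character for \x9c followed by '#'.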




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 26 Sep 2014 18:05:19 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 26 14:05:19 2014
Received: from localhost ([127.0.0.1]:53266 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXZtO-0006D7-CQ
	for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 14:05:19 -0400
Received: from eggs.gnu.org ([208.118.235.92]:39159)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XXZtL-0006Co-Sn
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 14:05:16 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXZt6-0003RJ-UJ
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 14:05:10 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: *
X-Spam-Status: No, score=1.4 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 MALFORMED_FREEMAIL,UNPARSEABLE_RELAY autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:46686)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXZt6-0003QP-Rf
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 14:05:00 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:40816)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXZsv-00079J-5F
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 14:04:55 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXZsm-0003Nw-OY
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 14:04:49 -0400
Received: from iwiw01d.mail.t-online.hu ([84.2.42.53]:42439
 helo=fmxout01.freemail.hu) by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXZsm-0003Mr-E2
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 14:04:40 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by fmxout01.freemail.hu (Postfix) with SMTP id 8694C416B
 for <bug-grep@HIDDEN>; Fri, 26 Sep 2014 20:04:33 +0200 (CEST)
Received: (qmail 27675 invoked by uid 151); 26 Sep 2014 20:04:33 +0200
Received: from 195.228.245.211 (HELO fmxmldata06.freemail.hu) (79.120.253.101)
 by fmx24.freemail.hu with SMTP; 26 Sep 2014 20:04:33 +0200
Received: from webmail by smtp gw id s8QI4X7j055427;
 Fri, 26 Sep 2014 20:04:33 +0200 (CEST)
Date: Fri, 26 Sep 2014 20:04:33 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <54252840.2020409@HIDDEN>
Message-ID: <freemail.20140926200433.55424.3@HIDDEN>
X-Originating-IP: [79.120.253.101]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.4 (----)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Hi,

this is a very interesting discussion.

>> /(?<=\x9c)#/
>>
>> Does it match \xd5\x9c# starting from #?
>
>No, because the input does not contain a \x9c encoding error.  Encoding errors
>match only themselves, not parts of other characters.  That is how the glibc
>matchers behave, and it's what users expect.

Why \xc9 is part of another character? It depends how you interpret \xd5. And this was just a simple example.

>> Noticing errors during a backward scan is complicated.
>
>It's doable, and it's the right thing to do.

The problem is, you do it some way, and others need something else. Just think about the example above.

>Range expressions have implementation-defined semantics in POSIX.  For PCRE you
>can do what you like.  I suggest mapping encoding-error bytes into characters
>outside the Unicode range; that's what Emacs does, I think, and it's simple and
>easy to explain to users.  It's not a big deal either way.

This mapping idea is clever. Basically invalid codepoints are converted to something valid.

>> What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?
>
>Just treat encoding-error bytes like everything else.  In effect, extend the
>encoding to allow any byte sequence, and add a few "characters" outside the
>Unicode range, one for each invalid UTF-8 byte.

In other words, \xc9 really is an encoding error (since it is an invalid UTF-8 byte, following another invalid UTF-8 byte). This is what I said from the beginning, depending on the context, people choose different interpretations of handling UTF fragments. Usually they choose what is more convenient from that viewpoint. But if you put all pieces together, the result is full of contradictions.

>Sorry, I don't quite follow, but encoding errors aren't letters and don't have
>case.  They match only themselves.

Not necessarily. It depends on your mapping: if more than one invalid UTF fragment is mapped to the same codepoint, they will match. Especially when you define range of characters.

> > What unicode properties does an invalid codepoint have?
>
>The minimal ones.

We could use the same flags as for characters between \x{d800}–\x{dfff}

>> The question is, who would be willing to do this work.
>
>Not me.  :-)

I know this would be a lot of work. And I have doubts that slowing down PCRE would increase grep performance. Regardless, if somebody is willing to work on this, I can help. Please keep in mind that PCRE1 is considered done, and our efforts are limited to bugfixing. We are currently busy with PCRE2, and such a big change could only go there.

>I'm not sure it'd be popular to add a --drain-battery option to grep. :)

I don't think on performance hungry desktop or server environments this really matters. On phone, you likely don't need this feature.

>I suggested that already, but the user (e.g., see the last paragraph of
><http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated
>UTF-8 patterns in binary data.  For example, I expect the user wants the pattern
>'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file.  So he can't
>simply use unibyte processing.

This is exactly the use case where filtering is needed. His input is a combination of binary and UTF data, and he needs matches only in the UTF part.

Regards,
Zoltan




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 18:04:39 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 26 14:04:39 2014
Received: from localhost ([127.0.0.1]:53262 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXZsk-0006BY-9N
	for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 14:04:38 -0400
Received: from iwiw03d.mail.t-online.hu ([84.2.42.68]:15763)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XXZsh-0006BO-AW
 for 18454 <at> debbugs.gnu.org; Fri, 26 Sep 2014 14:04:36 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by iwiw03d.mail.t-online.hu (Postfix) with SMTP id 171D24EA6A8
 for <18454 <at> debbugs.gnu.org>; Fri, 26 Sep 2014 20:04:26 +0200 (CEST)
Received: (qmail 27675 invoked by uid 151); 26 Sep 2014 20:04:33 +0200
Received: from 195.228.245.211 (HELO fmxmldata06.freemail.hu) (79.120.253.101)
 by fmx24.freemail.hu with SMTP; 26 Sep 2014 20:04:33 +0200
Received: from webmail by smtp gw id s8QI4X7j055427;
 Fri, 26 Sep 2014 20:04:33 +0200 (CEST)
Date: Fri, 26 Sep 2014 20:04:33 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <54252840.2020409@HIDDEN>
Message-ID: <freemail.20140926200433.55424.3@HIDDEN>
X-Originating-IP: [79.120.253.101]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
X-Spam-Score: 0.6 (/)
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

Hi,

this is a very interesting discussion.

>> /(?<=\x9c)#/
>>
>> Does it match \xd5\x9c# starting from #?
>
>No, because the input does not contain a \x9c encoding error.  Encoding errors
>match only themselves, not parts of other characters.  That is how the glibc
>matchers behave, and it's what users expect.

Why is \x9c part of another character? It depends on how you interpret \xd5. And this was just a simple example.

>> Noticing errors during a backward scan is complicated.
>
>It's doable, and it's the right thing to do.

The problem is, you do it one way, and others need something else. Just think about the example above.

>Range expressions have implementation-defined semantics in POSIX.  For PCRE you
>can do what you like.  I suggest mapping encoding-error bytes into characters
>outside the Unicode range; that's what Emacs does, I think, and it's simple and
>easy to explain to users.  It's not a big deal either way.

This mapping idea is clever. Basically, invalid codepoints are converted to something valid.

>> What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?
>
>Just treat encoding-error bytes like everything else.  In effect, extend the
>encoding to allow any byte sequence, and add a few "characters" outside the
>Unicode range, one for each invalid UTF-8 byte.

In other words, \x9c really is an encoding error (since it is an invalid UTF-8 byte, following another invalid UTF-8 byte). This is what I said from the beginning: depending on the context, people choose different interpretations of how to handle UTF fragments. Usually they choose what is more convenient from that viewpoint. But if you put all the pieces together, the result is full of contradictions.

>Sorry, I don't quite follow, but encoding errors aren't letters and don't have
>case.  They match only themselves.

Not necessarily. It depends on your mapping: if more than one invalid UTF fragment is mapped to the same codepoint, they will match. Especially when you define a range of characters.

> > What unicode properties does an invalid codepoint have?
>
>The minimal ones.

We could use the same flags as for characters between \x{d800}–\x{dfff}.

>> The question is, who would be willing to do this work.
>
>Not me.  :-)

I know this would be a lot of work. And I have doubts that slowing down PCRE would increase grep performance. Regardless, if somebody is willing to work on this, I can help. Please keep in mind that PCRE1 is considered done, and our efforts are limited to bugfixing. We are currently busy with PCRE2, and such a big change could only go there.

>I'm not sure it'd be popular to add a --drain-battery option to grep. :)

I don't think this really matters on performance-hungry desktop or server environments. On a phone, you likely don't need this feature.

>I suggested that already, but the user (e.g., see the last paragraph of
><http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated
>UTF-8 patterns in binary data.  For example, I expect the user wants the pattern
>'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file.  So he can't
>simply use unibyte processing.

This is exactly the use case where filtering is needed. His input is a combination of binary and UTF data, and he needs matches only in the UTF part.

Regards,
Zoltan




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 08:48:11 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 26 04:48:11 2014
Received: from localhost ([127.0.0.1]:52719 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXRCE-0007Fi-DC
	for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 04:48:10 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:52730)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XXRCA-0007FY-HP
 for 18454 <at> debbugs.gnu.org; Fri, 26 Sep 2014 04:48:07 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 70FE639E8012;
 Fri, 26 Sep 2014 01:48:05 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id 4eoyP0fY7-XT; Fri, 26 Sep 2014 01:48:00 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 8E0AC39E801B;
 Fri, 26 Sep 2014 01:48:00 -0700 (PDT)
Message-ID: <54252840.2020409@HIDDEN>
Date: Fri, 26 Sep 2014 01:48:00 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <freemail.20140926083640.6725.3@HIDDEN>
In-Reply-To: <freemail.20140926083640.6725.3@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

Zoltán Herczeg wrote:

> Just consider these two examples, where \x9c is an incorrectly encoded unicode codepoint:
>
> /(?<=\x9c)#/
>
> Does it match \xd5\x9c# starting from #?

No, because the input does not contain a \x9c encoding error.  Encoding errors 
match only themselves, not parts of other characters.  That is how the glibc 
matchers behave, and it's what users expect.

> Noticing errors during a backward scan is complicated.

It's doable, and it's the right thing to do.

> /[\x9c-\x{ffff}]/
>
> What does this range define exactly?

Range expressions have implementation-defined semantics in POSIX.  For PCRE you 
can do what you like.  I suggest mapping encoding-error bytes into characters 
outside the Unicode range; that's what Emacs does, I think, and it's simple and 
easy to explain to users.  It's not a big deal either way.

> What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?

Just treat encoding-error bytes like everything else.  In effect, extend the 
encoding to allow any byte sequence, and add a few "characters" outside the 
Unicode range, one for each invalid UTF-8 byte.
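
To illustrate the mapping described above, here is a minimal C sketch. It is not code from libpcre, glibc, or Emacs: the function name decode_or_map and the base value RAW_BYTE_BASE are assumptions chosen for the example, and any base above 0x10FFFF would serve equally well.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base for "encoding-error byte" pseudo-characters.
   Anything above 0x10FFFF (the last Unicode code point) works here. */
#define RAW_BYTE_BASE 0x110000u

/* Decode one item from s[0..len-1] (len must be > 0).
   A well-formed sequence yields its code point; any byte that does not
   start a well-formed sequence yields RAW_BYTE_BASE + byte.
   Returns the number of bytes consumed. */
static size_t decode_or_map(const unsigned char *s, size_t len, uint32_t *cp)
{
    unsigned char b = s[0];
    size_t need, i;
    uint32_t c, min;

    if (b < 0x80) { *cp = b; return 1; }

    if ((b & 0xe0) == 0xc0)      { need = 1; c = b & 0x1f; min = 0x80; }
    else if ((b & 0xf0) == 0xe0) { need = 2; c = b & 0x0f; min = 0x800; }
    else if ((b & 0xf8) == 0xf0) { need = 3; c = b & 0x07; min = 0x10000; }
    else goto bad;                      /* stray continuation or invalid lead */

    if (len <= need) goto bad;          /* sequence would run past the end */
    for (i = 1; i <= need; i++) {
        if ((s[i] & 0xc0) != 0x80) goto bad;
        c = (c << 6) | (s[i] & 0x3f);
    }
    /* Reject overlong forms, out-of-range values, and surrogates. */
    if (c < min || c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff)) goto bad;

    *cp = c;
    return need + 1;

bad:
    *cp = RAW_BYTE_BASE + b;            /* the byte "matches only itself" */
    return 1;
}
```

Under this sketch, the bytes \xd5\x9c decode to the single code point U+055C, while a lone \x9c becomes the pseudo-character RAW_BYTE_BASE + 0x9c. The two never compare equal, which is why the lookbehind example would not match.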

> Caseless matching is another question: does /\xe9/ match \xc3\x89 or the invalid UTF byte sequence \xc9?

Sorry, I don't quite follow, but encoding errors aren't letters and don't have 
case.  They match only themselves.

 > What unicode properties does an invalid codepoint have?

The minimal ones.

> depending on their needs, everybody has different answers to these questions.

That's fine.  Just implement reasonable defaults, and provide options if people 
have needs that differ from the defaults.  That's easier for libpcre than for 
grep, since libpcre users (who are programmers) can reasonably be expected to be 
more sophisticated about this sort of thing than grep users (who are not 
necessarily programmers).

> Imagine if you would need to add buffer end and other bit checks.

Of course it will be more expensive to check for UTF-8 as you go, than to assume 
the input is valid UTF-8.  But again, we're not talking about the 
PCRE_NO_UTF8_CHECK case where libpcre can assume valid UTF-8; we're talking 
about the non-PCRE_NO_UTF8_CHECK case, where libpcre must check whether the 
input is valid UTF-8, and currently does so inefficiently.  In the 
non-PCRE_NO_UTF8_CHECK case, it's often cheaper to check for UTF-8 as you go, 
than to have a prepass that checks for UTF-8.  This is because the prepass must 
be stupid (it must check the entire input buffer) whereas the matcher can be 
smart (it often can do its work without checking the entire input buffer).  This 
is one reason libpcre is slower than the glibc matchers.
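
The cost difference can be sketched in C as follows. This is an illustration only, not libpcre code: utf8_prepass and find_first_byte are hypothetical names, and the prepass is deliberately simplified (it checks lead/continuation structure but not overlong forms or surrogates). Each function reports how many bytes it had to look at.

```c
#include <stddef.h>
#include <string.h>

/* Simplified structural prepass: walks the buffer until the first
   malformed sequence or the end.  Stores the number of bytes accepted
   in *examined and returns 1 if the whole buffer passed, else 0. */
static int utf8_prepass(const unsigned char *s, size_t len, size_t *examined)
{
    size_t i = 0, k, need;
    int ok = 1;

    while (i < len) {
        unsigned char b = s[i];
        if (b < 0x80)                need = 0;
        else if ((b & 0xe0) == 0xc0) need = 1;
        else if ((b & 0xf0) == 0xe0) need = 2;
        else if ((b & 0xf8) == 0xf0) need = 3;
        else { ok = 0; break; }                 /* invalid lead byte */

        if (len - i <= need) { ok = 0; break; } /* runs past the buffer end */
        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xc0) != 0x80)
                break;
        if (k <= need) { ok = 0; break; }       /* missing continuation */
        i += need + 1;
    }
    *examined = i;
    return ok;
}

/* A matcher that stops at the first hit: bytes after the match are
   never touched.  Returns the match offset or -1. */
static long find_first_byte(const unsigned char *s, size_t len,
                            unsigned char ch, size_t *examined)
{
    const unsigned char *p = memchr(s, ch, len);
    *examined = p ? (size_t)(p - s) + 1 : len;
    return p ? (long)(p - s) : -1;
}
```

For a valid buffer, the prepass always reports the full length, whereas the search reports only the bytes up to and including the first match.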

Obviously it would be some work to build a libpcre that runs faster in the 
non-PCRE_NO_UTF8_CHECK case, without hurting performance in the 
PCRE_NO_UTF8_CHECK case.  But it could be done, if someone had the time to do it.

> The question is, who would be willing to do this work.

Not me.  :-)

>> That would chew up CPU resources unnecessarily

> Yeah but you could add a flag to enable this :)

I'm not sure it'd be popular to add a --drain-battery option to grep. :)

>> The use case that prompted
>> this bug report is someone using 'grep -r' to search for strings like
>> 'foobar' in binary data, and this use case would not work with this
>> suggested solution.
>
> In this case, I would simply disable UTF-8 decoding.

I suggested that already, but the user (e.g., see the last paragraph of 
<http://bugs.gnu.org/18454#19>) says he wants to check for more-complicated 
UTF-8 patterns in binary data.  For example, I expect the user wants the pattern 
'Lef.vre' to match the UTF-8 string 'Lefèvre' in a binary file.  So he can't 
simply use unibyte processing.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 26 Sep 2014 06:40:31 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 26 02:40:31 2014
Received: from localhost ([127.0.0.1]:52514 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXPCh-00040A-0L
	for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 02:40:31 -0400
Received: from eggs.gnu.org ([208.118.235.92]:49023)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XXPCd-0003zz-4E
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 02:40:28 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXPCS-0006mO-RZ
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 02:40:27 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: *
X-Spam-Status: No, score=1.4 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 MALFORMED_FREEMAIL,UNPARSEABLE_RELAY autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([208.118.235.17]:33840)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXPCS-0006kT-PE
 for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 02:40:16 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50016)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXP9G-0004d6-R4
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 02:37:07 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXP98-0005jo-Ib
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 02:36:58 -0400
Received: from iwiw02d.mail.t-online.hu ([84.2.42.67]:30797)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XXP98-0005iZ-7u
 for bug-grep@HIDDEN; Fri, 26 Sep 2014 02:36:50 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by iwiw02d.mail.t-online.hu (Postfix) with SMTP id A8693488958
 for <bug-grep@HIDDEN>; Fri, 26 Sep 2014 08:36:50 +0200 (CEST)
Received: (qmail 98182 invoked by uid 151); 26 Sep 2014 08:36:40 +0200
Received: from 195.228.245.211 (HELO fmxmldata02.freemail.hu) (160.114.36.201)
 by fmx24.freemail.hu with SMTP; 26 Sep 2014 08:36:40 +0200
Received: from webmail by smtp gw id s8Q6aerH006726;
 Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
Date: Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <5424BF18.7030809@HIDDEN>
Message-ID: <freemail.20140926083640.6725.3@HIDDEN>
X-Originating-IP: [160.114.36.201]
X-HTTP-User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 208.118.235.17
X-Spam-Score: -3.5 (---)
X-Debbugs-Envelope-To: submit
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Hi Paul,

thank you for the feedback.

>I doubt whether users would care all that much, so long as the default 
>is reasonable.  We don't get complaints about it with 'grep', anyway. 
>But if it's a real problem in the PCRE world, you could provide 
>compile-time or run-time options to satisfy the different opinions.

The situation is worse :( Reasonable has a different meaning for everybody.

Just consider these two examples, where \x9c is an incorrectly encoded unicode codepoint:

/(?<=\x9c)#/

Does it match \xd5\x9c# starting from #? Noticing errors during a backward scan is complicated.

/[\x9c-\x{ffff}]/

What does this range define exactly? What kind of invalid and valid UTF byte sequences are inside (and outside) the bounds?

Caseless matching is another question: does /\xe9/ match \xc3\x89 or the invalid UTF byte sequence \xc9? In general, Unicode defines several character properties. What Unicode properties does an invalid codepoint have?

Believe me, depending on their needs, everybody has different answers to these questions. We don't want to force the view of one group on others, since PCRE is a library. You have much more freedom to define any behavior, since grep is an end-user program.

>I don't see why.  libpcre can continue with its current implementation, 
>for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
>that's not a problem.  The problem is the case where users pass 
>possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
>a slow implementation for this case, and this slow implementation's 
>performance should be improvable without affecting the performance for 
>the PCRE_NO_UTF8_CHECK case.

Regarding performance, this comes from the interpreter:

#define GETUTF8(c, eptr) \
    { \
    if ((c & 0x20) == 0) \
      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
    else if ((c & 0x10) == 0) \
      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
    else if ((c & 0x08) == 0) \
      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
      ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
    else if ((c & 0x04) == 0) \
      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
          (eptr[4] & 0x3f); \
    else \
      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
    }

Imagine if you needed to add buffer-end and other bit checks. Furthermore, Unicode requires every character to be encoded with the fewest possible bytes. More checks. You also need to check the current mode. Of course, we have several macros similar to this one (for performance reasons), and there are code paths where we assume valid UTF strings. This would increase complexity a lot: we would need a lot of extra regression tests, a correct JIT implementation, and so on. This would also kill optimizations. For example, if you define a character range where all characters are two bytes long, the JIT cleverly detects this and uses a fast path to discard any non-two-byte UTF codepoints.
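
To make the extra checks concrete, here is a minimal C sketch of only the two-byte case. It is not PCRE code: decode2_unchecked mirrors the arithmetic of the first GETUTF8 branch, and decode2_checked is a hypothetical variant that adds the buffer-end, continuation-byte, and shortest-form checks mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Unchecked: assumes p[0] is a two-byte lead and p[1] is a valid,
   readable continuation byte (the valid-UTF-8 assumption). */
static uint32_t decode2_unchecked(const unsigned char *p)
{
    return ((uint32_t)(p[0] & 0x1f) << 6) | (p[1] & 0x3f);
}

/* Checked: returns 1 and stores the code point on success,
   0 on any encoding error.  Every added line is an extra branch. */
static int decode2_checked(const unsigned char *p, const unsigned char *end,
                           uint32_t *cp)
{
    if (end - p < 2) return 0;             /* buffer-end check */
    if ((p[0] & 0xe0) != 0xc0) return 0;   /* not a two-byte lead */
    if ((p[1] & 0xc0) != 0x80) return 0;   /* not a continuation byte */
    if (p[0] < 0xc2) return 0;             /* overlong: value below 0x80 */
    *cp = ((uint32_t)(p[0] & 0x1f) << 6) | (p[1] & 0x3f);
    return 1;
}
```

For example, the overlong pair \xc0\xaf decodes to 0x2f ('/') in the unchecked version but is rejected by the checked one, at the price of four additional branches for a single two-byte character.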

The question is, who would be willing to do this work.

>That would chew up CPU resources unnecessarily, by requiring two passes 
>over the input, one for checking UTF-8, the other for doing the actual 
>match.  Granted, it might be faster in real-time than what we have now, 
>but overall it'd probably be more expensive (e.g., more energy 
>consumption) than what we have now, and this doesn't sound promising.

Yeah but you could add a flag to enable this :) I feel this would be much less work than the former.

>That doesn't sound like a win, I'm afraid.  The use case that prompted 
>this bug report is someone using 'grep -r' to search for strings like 
>'foobar' in binary data, and this use case would not work with this 
>suggested solution.

In this case, I would simply disable UTF-8 decoding. You could search for UTF character codes as their encoded byte sequences (e.g., searching for \xe9 as \xc3\xa9).
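
A small C sketch of that idea follows. It is only an illustration, not grep or PCRE code, and the helper names utf8_encode and find_bytes are assumptions. The code point is turned into its UTF-8 bytes once, and the search then runs on raw bytes with no decoding of the input.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode cp as UTF-8 into out; returns the byte count,
   or 0 for surrogates and values above 0x10FFFF. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff) return 0;
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}

/* Plain byte search; the haystack may contain arbitrary binary data.
   Returns the offset of the first match or -1. */
static long find_bytes(const unsigned char *hay, size_t hlen,
                       const unsigned char *needle, size_t nlen)
{
    size_t i;
    if (nlen == 0 || nlen > hlen) return -1;
    for (i = 0; i + nlen <= hlen; i++)
        if (memcmp(hay + i, needle, nlen) == 0)
            return (long)i;
    return -1;
}
```

This works for literal characters, but it does not extend to patterns such as 'Lef.vre', where '.' has to match one whole multibyte character.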

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 06:37:03 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 26 02:37:03 2014
Received: from localhost ([127.0.0.1]:52509 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXP9K-0003ta-Qj
	for submit <at> debbugs.gnu.org; Fri, 26 Sep 2014 02:37:03 -0400
Received: from iwiw01d.mail.t-online.hu ([84.2.42.53]:51868
 helo=fmxout01.freemail.hu) by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XXP9H-0003sn-KV
 for 18454 <at> debbugs.gnu.org; Fri, 26 Sep 2014 02:37:01 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by fmxout01.freemail.hu (Postfix) with SMTP id B16901293E
 for <18454 <at> debbugs.gnu.org>; Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
Received: (qmail 98182 invoked by uid 151); 26 Sep 2014 08:36:40 +0200
Received: from 195.228.245.211 (HELO fmxmldata02.freemail.hu) (160.114.36.201)
 by fmx24.freemail.hu with SMTP; 26 Sep 2014 08:36:40 +0200
Received: from webmail by smtp gw id s8Q6aerH006726;
 Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
Date: Fri, 26 Sep 2014 08:36:40 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <5424BF18.7030809@HIDDEN>
Message-ID: <freemail.20140926083640.6725.3@HIDDEN>
X-Originating-IP: [160.114.36.201]
X-HTTP-User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-Spam-Score: 1.5 (+)
X-Spam-Report: Spam detection software, running on the system "debbugs.gnu.org",
 has
 identified this incoming email as possible spam.  The original message
 has been attached to this so you can view it (if it isn't spam) or label
 similar future email.  If you have any questions, see
 the administrator of that system for details.
 Content preview:  Hi Paul, thank you for the feedback. >I doubt whether users
 would care all that much, so long as the default >is reasonable. We don't
 get complaints about it with 'grep', anyway. >But if it's a real problem
 in the PCRE world, you could provide >compile-time or run-time options to
 satisfy the different opinions. [...] 
 Content analysis details:   (1.5 points, 10.0 required)
 pts rule name              description
 ---- ---------------------- --------------------------------------------------
 0.0 FREEMAIL_FROM Sender email is commonly abused enduser mail provider
 (hzmester[at]freemail.hu)
 -0.0 RCVD_IN_DNSWL_NONE     RBL: Sender listed at http://www.dnswl.org/, no
 trust [84.2.42.53 listed in list.dnswl.org]
 1.5 MALFORMED_FREEMAIL     Bad headers on message from free email service
 0.0 UNPARSEABLE_RELAY Informational: message has unparseable relay lines
X-Debbugs-Envelope-To: 18454
Cc: bug-grep@HIDDEN, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

Hi Paul,

thank you for the feedback.

>I doubt whether users would care all that much, so long as the default 
>is reasonable.  We don't get complaints about it with 'grep', anyway. 
>But if it's a real problem in the PCRE world, you could provide 
>compile-time or run-time options to satisfy the different opinions.

The situation is worse :( Reasonable has a different meaning for everybody.

Just consider these two examples, where \x9c is an incorrectly encoded Unicode code point:

/(?<=\x9c)#/

Does it match \xd5\x9c# starting from #? Noticing errors during a backward scan is complicated.

/[\x9c-\x{ffff}]/

What does this range define exactly? Which invalid and valid UTF byte sequences are inside (and outside) the bounds?

Caseless matching is another question: does /\xe9/ match \xc3\x89, or the invalid UTF byte sequence \xc9? In general, UTF defines several character properties. What Unicode properties does an invalid code point have?

Believe me, depending on their needs, everybody has different answers to these questions. We don't want to force the view of one group on others, since PCRE is a library. You have much more freedom to define any behavior, since grep is an end-user program.

>I don't see why.  libpcre can continue with its current implementation, 
>for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
>that's not a problem.  The problem is the case where users pass 
>possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
>a slow implementation for this case, and this slow implementation's 
>performance should be improvable without affecting the performance for 
>the PCRE_NO_UTF8_CHECK case.

Regarding performance, this comes from the interpreter:

#define GETUTF8(c, eptr) \
    { \
    if ((c & 0x20) == 0) \
      c = ((c & 0x1f) << 6) | (eptr[1] & 0x3f); \
    else if ((c & 0x10) == 0) \
      c = ((c & 0x0f) << 12) | ((eptr[1] & 0x3f) << 6) | (eptr[2] & 0x3f); \
    else if ((c & 0x08) == 0) \
      c = ((c & 0x07) << 18) | ((eptr[1] & 0x3f) << 12) | \
      ((eptr[2] & 0x3f) << 6) | (eptr[3] & 0x3f); \
    else if ((c & 0x04) == 0) \
      c = ((c & 0x03) << 24) | ((eptr[1] & 0x3f) << 18) | \
          ((eptr[2] & 0x3f) << 12) | ((eptr[3] & 0x3f) << 6) | \
          (eptr[4] & 0x3f); \
    else \
      c = ((c & 0x01) << 30) | ((eptr[1] & 0x3f) << 24) | \
          ((eptr[2] & 0x3f) << 18) | ((eptr[3] & 0x3f) << 12) | \
          ((eptr[4] & 0x3f) << 6) | (eptr[5] & 0x3f); \
    }

Imagine having to add buffer-end and other bit checks to this. Furthermore, Unicode expects every character to be encoded with the fewest possible bytes, which means more checks. You also need to check the current mode. Of course we have several macros similar to this one (for performance reasons), and there are code paths that assume valid UTF strings. This would increase complexity a lot: we would need many extra regression tests, a correct JIT implementation, and so on. It would also kill optimizations. For example, if you define a character range in which all characters are two bytes long, JIT cleverly detects this and uses a fast path to discard any UTF code points that are not two bytes long.
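
To make the cost concrete, here is a minimal sketch of what a checking decoder might look like. It is not PCRE code: decode_checked is a hypothetical helper, and it assumes the modern four-byte limit of UTF-8 rather than the six-byte forms the macro above accepts. Each branch adds a buffer-end test, continuation-byte tests, and a shortest-form test that the unchecked macro does not perform.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE): decode one UTF-8 sequence
   starting at p, never reading at or beyond end.  Stores the code
   point in *out and returns the number of bytes consumed, or 0 if
   the sequence is truncated, malformed, overlong, or a surrogate.  */
static size_t
decode_checked (const unsigned char *p, const unsigned char *end,
                uint32_t *out)
{
  if (p >= end)
    return 0;

  uint32_t c = p[0];
  size_t len;
  uint32_t min;                 /* smallest value allowed for this length */

  if (c < 0x80)
    {
      *out = c;
      return 1;
    }
  else if ((c & 0xe0) == 0xc0)
    { len = 2; min = 0x80; c &= 0x1f; }
  else if ((c & 0xf0) == 0xe0)
    { len = 3; min = 0x800; c &= 0x0f; }
  else if ((c & 0xf8) == 0xf0)
    { len = 4; min = 0x10000; c &= 0x07; }
  else
    return 0;                   /* stray continuation or invalid lead byte */

  if ((size_t) (end - p) < len)
    return 0;                   /* buffer-end check */

  for (size_t i = 1; i < len; i++)
    {
      if ((p[i] & 0xc0) != 0x80)
        return 0;               /* continuation bytes must be 10xxxxxx */
      c = (c << 6) | (p[i] & 0x3f);
    }

  if (c < min)
    return 0;                   /* overlong (non-shortest) encoding */
  if (c > 0x10ffff || (c >= 0xd800 && c <= 0xdfff))
    return 0;                   /* out of range, or a UTF-16 surrogate */

  *out = c;
  return len;
}
```

Even in this small form, the two-byte case grows from one shift-and-mask expression to four conditional branches, which is the kind of overhead described above.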

The question is, who would be willing to do this work.

>That would chew up CPU resources unnecessarily, by requiring two passes 
>over the input, one for checking UTF-8, the other for doing the actual 
>match.  Granted, it might be faster in real-time than what we have now, 
>but overall it'd probably be more expensive (e.g., more energy 
>consumption) than what we have now, and this doesn't sound promising.

Yeah, but you could add a flag to enable this :) I feel this would be much less work than the former.

>That doesn't sound like a win, I'm afraid.  The use case that prompted 
>this bug report is someone using 'grep -r' to search for strings like 
>'foobar' in binary data, and this use case would not work with this 
>suggested solution.

In this case, I would simply disable UTF-8 decoding. You could then search for UTF character codes written out as their individual bytes (i.e. searching for \xe9 as \xc3\xa9).
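
As an illustration of the byte-level approach, the sketch below encodes a code point to its UTF-8 bytes and then looks for those bytes in a buffer that is treated as plain binary data. Both utf8_encode and find_bytes are hypothetical helpers written for this example; neither is part of grep or PCRE.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: write the UTF-8 encoding of cp into out and
   return its length in bytes, or 0 if cp is not a Unicode scalar
   value.  */
static size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xc0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3f));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xd800 && cp <= 0xdfff)
        return 0;               /* surrogates are not encodable */
      out[0] = (unsigned char) (0xe0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[2] = (unsigned char) (0x80 | (cp & 0x3f));
      return 3;
    }
  if (cp <= 0x10ffff)
    {
      out[0] = (unsigned char) (0xf0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3f));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3f));
      out[3] = (unsigned char) (0x80 | (cp & 0x3f));
      return 4;
    }
  return 0;
}

/* Hypothetical helper: plain byte search, no decoding, so invalid
   UTF-8 elsewhere in the haystack is harmless.  Returns a pointer to
   the first occurrence, or NULL.  */
static const unsigned char *
find_bytes (const unsigned char *hay, size_t hlen,
            const unsigned char *needle, size_t nlen)
{
  if (nlen == 0 || nlen > hlen)
    return NULL;
  for (size_t i = 0; i + nlen <= hlen; i++)
    if (memcmp (hay + i, needle, nlen) == 0)
      return hay + i;
  return NULL;
}
```

With these, U+00E9 becomes the two bytes \xc3\xa9, and the search works on arbitrary binary input because no byte is ever interpreted as part of a character.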

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 01:19:30 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Sep 25 21:19:30 2014
Received: from localhost ([127.0.0.1]:52442 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXKC1-00017i-LN
	for submit <at> debbugs.gnu.org; Thu, 25 Sep 2014 21:19:30 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:40072)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XXKBz-00017Z-3P
 for 18454 <at> debbugs.gnu.org; Thu, 25 Sep 2014 21:19:28 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 0547FA6000C;
 Thu, 25 Sep 2014 18:19:26 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id STXuGlK7GRi6; Thu, 25 Sep 2014 18:19:21 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 203A4A6000A;
 Thu, 25 Sep 2014 18:19:21 -0700 (PDT)
Message-ID: <5424BF18.7030809@HIDDEN>
Date: Thu, 25 Sep 2014 18:19:20 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: =?UTF-8?B?Wm9sdMOhbiBIZXJjemVn?= <hzmester@HIDDEN>, 
 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <freemail.20140921084639.57328.1@HIDDEN>
In-Reply-To: <freemail.20140921084639.57328.1@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

Zoltán, thanks for your comments on this subject.  Some thoughts and 
suggestions:

> - what should you do if you encounter an invalid UTF-8 opcode

Do whatever plain 'grep' does, which is what the glibc regular 
expression matcher does.  If I recall correctly, an encoding error in 
the pattern matches the same encoding error in the string.  It shouldn't 
be that complicated.

> Everybody has different opinion about handling invalid UTF opcodes

I doubt whether users would care all that much, so long as the default 
is reasonable.  We don't get complaints about it with 'grep', anyway. 
But if it's a real problem in the PCRE world, you could provide 
compile-time or run-time options to satisfy the different opinions.

> everybody would suffer this performance regression, including those, who pass valid UTF strings.

I don't see why.  libpcre can continue with its current implementation, 
for users who pass valid UTF-8 strings and use PCRE_NO_UTF8_CHECK; 
that's not a problem.  The problem is the case where users pass 
possibly-invalid strings and do not use PCRE_NO_UTF8_CHECK.  libpcre has 
a slow implementation for this case, and this slow implementation's 
performance should be improvable without affecting the performance for 
the PCRE_NO_UTF8_CHECK case.

> * The best solution is multi-threaded grepping

That would chew up CPU resources unnecessarily, by requiring two passes 
over the input, one for checking UTF-8, the other for doing the actual 
match.  Granted, it might be faster in real-time than what we have now, 
but overall it'd probably be more expensive (e.g., more energy 
consumption) than what we have now, and this doesn't sound promising.

> * The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed

That doesn't sound like a win, I'm afraid.  The use case that prompted 
this bug report is someone using 'grep -r' to search for strings like 
'foobar' in binary data, and this use case would not work with this 
suggested solution.


I'm hoping that the recent set of changes to 'grep' lessens the urgency 
of improving libpcre.  On my platform (Fedora 20 x86-64) Jim Meyering's 
benchmark <http://bugs.gnu.org/18454#56> says that with grep 2.18, grep 
-P is 6.4x slower than plain grep, and that with the latest experimental 
grep (including the patches I just posted in 
<http://bugs.gnu.org/18454#62>), grep -P is 5.6x slower than plain grep. 
  So it's plausible that the latest set of fixes is good enough, in the 
sense that, sure, PCRE is slower, but it's always been slower and if 
that used to be good enough then it should still be good enough.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 26 Sep 2014 00:23:12 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Sep 25 20:23:12 2014
Received: from localhost ([127.0.0.1]:52412 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XXJJX-00089J-P7
	for submit <at> debbugs.gnu.org; Thu, 25 Sep 2014 20:23:12 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:38256)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XXJJU-000898-6w
 for 18454 <at> debbugs.gnu.org; Thu, 25 Sep 2014 20:23:09 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 4A3EEA60036;
 Thu, 25 Sep 2014 17:23:07 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id MBSAwqAjtzLB; Thu, 25 Sep 2014 17:23:03 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id D807FA6000A;
 Thu, 25 Sep 2014 17:23:02 -0700 (PDT)
Message-ID: <5424B1E6.8090502@HIDDEN>
Date: Thu, 25 Sep 2014 17:23:02 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.2
MIME-Version: 1.0
To: Jim Meyering <jim@HIDDEN>, =?UTF-8?B?U2FudGlhZ28gUnVhbm8gUmluY8Oz?=
 =?UTF-8?B?bg==?= <santiago@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN> <20140918083327.GA16324@nomada>
 <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
 <CA+8g5KHdnaEB=yYF5Kp7XCd7GgvjL73HeoOHNEPCAqy0KPs6+w@HIDDEN>
In-Reply-To: <CA+8g5KHdnaEB=yYF5Kp7XCd7GgvjL73HeoOHNEPCAqy0KPs6+w@HIDDEN>
Content-Type: multipart/mixed; boundary="------------070505040701010404060607"
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

This is a multi-part message in MIME format.
--------------070505040701010404060607
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Thanks for looking into that.  The attached patches solve those 
performance problems for me.

--------------070505040701010404060607
Content-Type: text/plain; charset=UTF-8;
 name="0001-grep-scan-for-valid-multibyte-strings-more-quickly.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename*0="0001-grep-scan-for-valid-multibyte-strings-more-quickly.patc";
 filename*1="h"

RnJvbSA0ZWY2N2EyNzJhZjg1YjQ2ZjQ3NjlkMWM1OTMxNzhhNjNmNjIwNWRhIE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBUaHUsIDI1IFNlcCAyMDE0IDE3OjA0OjQ5IC0wNzAwClN1YmplY3Q6IFtQQVRD
SCAxLzJdIGdyZXA6IHNjYW4gZm9yIHZhbGlkIG11bHRpYnl0ZSBzdHJpbmdzIG1vcmUgcXVp
Y2tseQoKU2NhbiB2YWxpZCBtdWx0aWJ5dGUgc3RyaW5ncyBtb3JlIHF1aWNrbHkgaW4gdGhl
IGNvbW1vbiBjYXNlIG9mCmVuY29kaW5ncyB0aGF0IGFyZSB1cHdhcmQgY29tcGF0aWJsZSB3
aXRoIEFTQ0lJLCBzdWNoIGFzIFVURi04LgpZb3UnZCB0aGluayB0aGVyZSdkIGJlIGEgZmFz
dCBzdGFuZGFyZCB3YXkgdG8gZG8gdGhpcyBub3dhZGF5cywKYnV0IG5vb29vby4uLi4KUHJv
YmxlbSByZXBvcnRlZCBieSBKaW0gTWV5ZXJpbmcgaW46IGh0dHA6Ly9idWdzLmdudS5vcmcv
MTg0NTQjNTYKKiBzcmMvZ3JlcC5jIChISUJZVEUpOiBOZXcgY29uc3RhbnQuCihlYXN5X2Vu
Y29kaW5nKTogTmV3IHN0YXRpYyB2YXIuCihpbml0X2Vhc3lfZW5jb2RpbmcsIHNraXBfZWFz
eV9ieXRlcyk6IE5ldyBmdW5jdGlvbnMuCihidWZmZXJfdGV4dGJpbik6IFNraXAgZWFzeSBi
eXRlcyBxdWlja2x5LgpEb24ndCBib3RoZXIgd2l0aCBtYl9jbGVuIGhlcmUsIHNpbmNlIHNr
aXBfZWFzeV9ieXRlcyB0eXBpY2FsbHkKY2FwdHVyZXMgdGhlIGVhc3kgY2FzZXM7IGp1c3Qg
dXNlIG1icmxlbiBkaXJlY3RseS4KKGJ1ZmZlcl90ZXh0YmluLCBmaWxlX3RleHRiaW4pOiBG
aXJzdCBhcmcgaXMgbm8gbG9uZ2VyIGEgY29uc3QKcG9pbnRlciwgc2luY2UgdGhlIGJ5dGUg
cGFzdCB0aGUgZW5kIGlzIG5vdyBhbiBvdmVyd3JpdHRlbiBzZW50aW5lbC4KKG1haW4pOiBD
YWxsIGluaXRfZWFzeV9lbmNvZGluZy4KLS0tCiBzcmMvZ3JlcC5jIHwgNTcgKysrKysrKysr
KysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKystLS0tCiAxIGZp
bGUgY2hhbmdlZCwgNTMgaW5zZXJ0aW9ucygrKSwgNCBkZWxldGlvbnMoLSkKCmRpZmYgLS1n
aXQgYS9zcmMvZ3JlcC5jIGIvc3JjL2dyZXAuYwppbmRleCAzNWQzMzU4Li45NDhlNDI3IDEw
MDY0NAotLS0gYS9zcmMvZ3JlcC5jCisrKyBiL3NyYy9ncmVwLmMKQEAgLTQ1NCw5ICs0NTQs
NTYgQEAgdGV4dGJpbl9pc19iaW5hcnkgKGVudW0gdGV4dGJpbiB0ZXh0YmluKQogICByZXR1
cm4gdGV4dGJpbiA8IFRFWFRCSU5fVU5LTk9XTjsKIH0KIAorLyogVGhlIGhpZ2gtb3JkZXIg
Yml0IG9mIGEgYnl0ZS4gICovCitlbnVtIHsgSElCWVRFID0gMHg4MCB9OworCisvKiBUcnVl
IGlmIGV2ZXJ5IGJ5dGUgd2l0aCBISUJZVEUgb2ZmIGlzIGEgc2luZ2xlLWJ5dGUgY2hhcmFj
dGVyLgorICAgVVRGLTggaGFzIHRoaXMgcHJvcGVydHkuICAqLworc3RhdGljIGJvb2wgZWFz
eV9lbmNvZGluZzsKKworc3RhdGljIHZvaWQKK2luaXRfZWFzeV9lbmNvZGluZyAodm9pZCkK
K3sKKyAgZWFzeV9lbmNvZGluZyA9IHRydWU7CisgIGZvciAoaW50IGkgPSAwOyBpIDwgSElC
WVRFOyBpKyspCisgICAgZWFzeV9lbmNvZGluZyAmPSBtYmNsZW5fY2FjaGVbaV0gPT0gMTsK
K30KKworLyogU2tpcCB0aGUgZWFzeSBieXRlcyBpbiBhIGJ1ZmZlciB0aGF0IGlzIGd1YXJh
bnRlZWQgdG8gaGF2ZSBhIHNlbnRpbmVsCisgICB0aGF0IGlzIG5vdCBlYXN5LCBhbmQgcmV0
dXJuIGEgcG9pbnRlciB0byB0aGUgZmlyc3Qgbm9uLWVhc3kgYnl0ZS4KKyAgIEluIGVhc3kg
ZW5jb2RpbmdzLCB0aGUgZWFzeSBieXRlcyBhbGwgaGF2ZSBISUJZVEUgb2ZmLgorICAgSW4g
b3RoZXIgZW5jb2RpbmdzLCBubyBieXRlIGlzIGVhc3kuICAqLworc3RhdGljIGNoYXIgY29u
c3QgKiBfR0xfQVRUUklCVVRFX1BVUkUKK3NraXBfZWFzeV9ieXRlcyAoY2hhciBjb25zdCAq
YnVmKQoreworICBpZiAoIWVhc3lfZW5jb2RpbmcpCisgICAgcmV0dXJuIGJ1ZjsKKworICAv
KiBBbiB1bnNpZ25lZCB0eXBlIHN1aXRhYmxlIGZvciBmYXN0IG1hdGNoaW5nLiAgKi8KKyAg
dHlwZWRlZiB1aW50bWF4X3QgdXdvcmQ7CisKKyAgLyogMHg4MDgwLi4uLCBleHRlbmRlZCB0
byBiZSB3aWRlIGVub3VnaCBmb3IgdXdvcmQuICAqLworICB1d29yZCBoaWJ5dGVfbWFzayA9
ICh1d29yZCkgLTEgLyBVQ0hBUl9NQVggKiBISUJZVEU7CisKKyAgLyogU2VhcmNoIGEgYnl0
ZSBhdCBhIHRpbWUgdW50aWwgdGhlIHBvaW50ZXIgaXMgYWxpZ25lZCwgdGhlbiBhCisgICAg
IHV3b3JkIGF0IGEgdGltZSB1bnRpbCBhIG1hdGNoIGlzIGZvdW5kLCB0aGVuIGEgYnl0ZSBh
dCBhIHRpbWUgdG8KKyAgICAgaWRlbnRpZnkgdGhlIGV4YWN0IGJ5dGUuICBUaGUgdXdvcmQg
c2VhcmNoIG1heSBnbyBzbGlnaHRseSBwYXN0CisgICAgIHRoZSBidWZmZXIgZW5kLCBidXQg
dGhhdCdzIGJlbmlnbi4gICovCisgIGNoYXIgY29uc3QgKnA7CisgIHV3b3JkIGNvbnN0ICpz
OworICBmb3IgKHAgPSBidWY7ICh1aW50cHRyX3QpIHAgJSBzaXplb2YgKHV3b3JkKSAhPSAw
OyBwKyspCisgICAgaWYgKCpwICYgSElCWVRFKQorICAgICAgcmV0dXJuIHA7CisgIGZvciAo
cyA9ICh1d29yZCBjb25zdCAqKSBwOyAhICgqcyAmIGhpYnl0ZV9tYXNrKTsgcysrKQorICAg
IGNvbnRpbnVlOworICBmb3IgKHAgPSAoY2hhciBjb25zdCAqKSBzOyAhICgqcCAmIEhJQllU
RSk7IHArKykKKyAgICBjb250aW51ZTsKKyAgcmV0dXJuIHA7Cit9CisKIC8qIFJldHVybiB0
aGUgdGV4dCB0eXBlIG9mIGRhdGEgaW4gQlVGLCBvZiBzaXplIFNJWkUuICAqLwogc3RhdGlj
IGVudW0gdGV4dGJpbgotYnVmZmVyX3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90
IHNpemUpCitidWZmZXJfdGV4dGJpbiAoY2hhciAqYnVmLCBzaXplX3Qgc2l6ZSkKIHsKICAg
aWYgKGVvbGJ5dGUgJiYgbWVtY2hyIChidWYsICdcMCcsIHNpemUpKQogICAgIHJldHVybiBU
RVhUQklOX0JJTkFSWTsKQEAgLTQ2Nyw5ICs1MTQsMTAgQEAgYnVmZmVyX3RleHRiaW4gKGNo
YXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAgICAgICBzaXplX3QgY2xlbjsKICAgICAg
IGNoYXIgY29uc3QgKnA7CiAKLSAgICAgIGZvciAocCA9IGJ1ZjsgcCA8IGJ1ZiArIHNpemU7
IHAgKz0gY2xlbikKKyAgICAgIGJ1ZltzaXplXSA9IC0xOworICAgICAgZm9yIChwID0gYnVm
OyAocCA9IHNraXBfZWFzeV9ieXRlcyAocCkpIDwgYnVmICsgc2l6ZTsgcCArPSBjbGVuKQog
ICAgICAgICB7Ci0gICAgICAgICAgY2xlbiA9IG1iX2NsZW4gKHAsIGJ1ZiArIHNpemUgLSBw
LCAmbWJzKTsKKyAgICAgICAgICBjbGVuID0gbWJybGVuIChwLCBidWYgKyBzaXplIC0gcCwg
Jm1icyk7CiAgICAgICAgICAgaWYgKChzaXplX3QpIC0yIDw9IGNsZW4pCiAgICAgICAgICAg
ICByZXR1cm4gY2xlbiA9PSAoc2l6ZV90KSAtMiA/IFRFWFRCSU5fVU5LTk9XTiA6IFRFWFRC
SU5fQklOQVJZOwogICAgICAgICB9CkBAIC00ODEsNyArNTI5LDcgQEAgYnVmZmVyX3RleHRi
aW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAvKiBSZXR1cm4gdGhlIHRleHQg
dHlwZSBvZiBhIGZpbGUuICBCVUYsIG9mIHNpemUgQlVGU0laRSwgaXMgdGhlIGluaXRpYWwK
ICAgIGJ1ZmZlciByZWFkIGZyb20gdGhlIGZpbGUgd2l0aCBkZXNjcmlwdG9yIEZEIGFuZCBz
dGF0dXMgU1QuICAqLwogc3RhdGljIGVudW0gdGV4dGJpbgotZmlsZV90ZXh0YmluIChjaGFy
IGNvbnN0ICpidWYsIHNpemVfdCBidWZzaXplLCBpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0
ICpzdCkKK2ZpbGVfdGV4dGJpbiAoY2hhciAqYnVmLCBzaXplX3QgYnVmc2l6ZSwgaW50IGZk
LCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiB7CiAgIGVudW0gdGV4dGJpbiB0ZXh0YmluID0g
YnVmZmVyX3RleHRiaW4gKGJ1ZiwgYnVmc2l6ZSk7CiAgIGlmICh0ZXh0YmluX2lzX2JpbmFy
eSAodGV4dGJpbikpCkBAIC0yNDE3LDYgKzI0NjUsNyBAQCBtYWluIChpbnQgYXJnYywgY2hh
ciAqKmFyZ3YpCiAgICAgdXNhZ2UgKEVYSVRfVFJPVUJMRSk7CiAKICAgYnVpbGRfbWJjbGVu
X2NhY2hlICgpOworICBpbml0X2Vhc3lfZW5jb2RpbmcgKCk7CiAKICAgLyogSWYgZmdyZXAg
aW4gYSBtdWx0aWJ5dGUgbG9jYWxlLCB0aGVuIHVzZSBncmVwIGlmIGVpdGhlcgogICAgICAo
MSkgY2FzZSBpcyBpZ25vcmVkICh3aGVyZSBncmVwIGlzIHR5cGljYWxseSBmYXN0ZXIpLCBv
cgotLSAKMS45LjMKCg==
--------------070505040701010404060607
Content-Type: text/plain; charset=UTF-8;
 name="0002-grep-don-t-check-extensively-for-invalid-prefix-byte.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename*0="0002-grep-don-t-check-extensively-for-invalid-prefix-byte.pa";
 filename*1="tch"

RnJvbSA0NjZjYTQ0YjBiNTA5MDdlNDdlZjZmMWI0ZTEyODNlMzJkNjY3ZjM3IE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBUaHUsIDI1IFNlcCAyMDE0IDE3OjE0OjU2IC0wNzAwClN1YmplY3Q6IFtQQVRD
SCAyLzJdIGdyZXA6IGRvbid0IGNoZWNrIGV4dGVuc2l2ZWx5IGZvciBpbnZhbGlkIHByZWZp
eCBieXRlcwogdW5sZXNzIC1QCgpQcm9ibGVtIHJlcG9ydGVkIGJ5IEppbSBNZXllcmluZyBp
bjogaHR0cDovL2J1Z3MuZ251Lm9yZy8xODQ1NCM1NgoqIHNyYy9ncmVwLmMgKGdyZXApOiBB
ZnRlciB0aGUgZmlyc3QgYnVmZmVyIGlzIGNoZWNrZWQsIGxlYXZlIHRoZQpmaWxlLXR5cGUg
Y2hlY2tlciBpbiBURVhUQklOX1VOS05PV04gc3RhdGUgb25seSB3aGVuIC1QIGlzIHVzZWQu
Ck9ubHkgdGhlIC1QIG1hdGNoZXIgaGFzIHBlcmZvcm1hbmNlIHByb2JsZW1zIHdpdGggY2hl
Y2tpbmcgYmluYXJ5CmRhdGEgdGhhdCBtYWtlIGl0IHdvcnRod2hpbGUgdG8gY2hlY2sgZXZl
cnkgcHJlZml4IGlucHV0IGJ5dGUgc28KdGhlIC1QIG1hdGNoZXIncyBURVhUQklOX1VOS05P
V04gb3B0aW1pemF0aW9ucyBjYW4gY29tZSBpbnRvIHBsYXkuCk90aGVyIG1hdGNoZXJzIGNh
biBzaW1wbHkgY2hlY2sgdGhlIGRhdGEgZGlyZWN0bHksIGFuZCB1c2luZwpURVhUQklOX1VO
S05PV04gd2l0aCB0aGVtIHNsb3dzICdncmVwJyBkb3duIGZvciBubyBiZW5lZml0LgotLS0K
IHNyYy9ncmVwLmMgfCAyICsrCiAxIGZpbGUgY2hhbmdlZCwgMiBpbnNlcnRpb25zKCspCgpk
aWZmIC0tZ2l0IGEvc3JjL2dyZXAuYyBiL3NyYy9ncmVwLmMKaW5kZXggOTQ4ZTQyNy4uM2E4
ZDlmNSAxMDA2NDQKLS0tIGEvc3JjL2dyZXAuYworKysgYi9zcmMvZ3JlcC5jCkBAIC0xMjg4
LDYgKzEyODgsOCBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAg
ICAgICAgICBudWxfemFwcGVyID0gZW9sOwogICAgICAgICAgIHNraXBfbnVscyA9IHNraXBf
ZW1wdHlfbGluZXM7CiAgICAgICAgIH0KKyAgICAgIGVsc2UgaWYgKGV4ZWN1dGUgIT0gUGV4
ZWN1dGUpCisgICAgICAgIHRleHRiaW4gPSBURVhUQklOX1RFWFQ7CiAgICAgfQogCiAgIGZv
ciAoOzspCi0tIAoxLjkuMwoK
--------------070505040701010404060607--




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 21 Sep 2014 21:23:29 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Sep 21 17:23:29 2014
Received: from localhost ([127.0.0.1]:47720 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XVobP-0001Lw-IL
	for submit <at> debbugs.gnu.org; Sun, 21 Sep 2014 17:23:28 -0400
Received: from eggs.gnu.org ([208.118.235.92]:50129)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <hzmester@HIDDEN>) id 1XVavU-0004TY-QP
 for submit <at> debbugs.gnu.org; Sun, 21 Sep 2014 02:47:17 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XVavM-0007eg-7Z
 for submit <at> debbugs.gnu.org; Sun, 21 Sep 2014 02:47:16 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 UNPARSEABLE_RELAY autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([208.118.235.17]:56581)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XVavM-0007eB-4G
 for submit <at> debbugs.gnu.org; Sun, 21 Sep 2014 02:47:08 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51806)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XVav9-0000ml-P1
 for bug-grep@HIDDEN; Sun, 21 Sep 2014 02:47:02 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XVav2-0007Vm-Gs
 for bug-grep@HIDDEN; Sun, 21 Sep 2014 02:46:55 -0400
Received: from iwiw03d.mail.t-online.hu ([84.2.42.68]:15254)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <hzmester@HIDDEN>) id 1XVav2-0007VY-9a
 for bug-grep@HIDDEN; Sun, 21 Sep 2014 02:46:48 -0400
Received: from fmx24.freemail.hu (fmx24.freemail.hu [195.228.245.74])
 by iwiw03d.mail.t-online.hu (Postfix) with SMTP id F1AEC4E8B7A
 for <bug-grep@HIDDEN>; Sun, 21 Sep 2014 08:46:35 +0200 (CEST)
Received: (qmail 38945 invoked by uid 151); 21 Sep 2014 08:46:39 +0200
Received: from fm-haproxy01.freemail.hu (HELO fmxmldata04.freemail.hu)
 (195.228.245.211)
 by fmx24.freemail.hu with SMTP; 21 Sep 2014 08:46:39 +0200
Received: from webmail by smtp gw id s8L6kda2057330;
 Sun, 21 Sep 2014 08:46:39 +0200 (CEST)
Date: Sun, 21 Sep 2014 08:46:39 +0200 (CEST)
From: =?UTF-8?Q?Zolt=C3=A1n_Herczeg?= <hzmester@HIDDEN>
Subject: bug#18454: Improve performance when -P (PCRE) is used in UTF-8 locales
To: bug-grep@HIDDEN
Message-ID: <freemail.20140921084639.57328.1@HIDDEN>
X-Originating-IP: [91.83.38.253]
X-HTTP-User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64;
 rv:32.0) Gecko/20100101 Firefox/32.0
X-Original-User: hzmester
MIME-Version: 1.0
Content-Type: TEXT/plain; CHARSET=UTF-8
X-detected-operating-system: by eggs.gnu.org: FreeBSD 9.x
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 208.118.235.17
X-Spam-Score: -5.0 (-----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Sun, 21 Sep 2014 17:23:25 -0400
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

Hi,

I am the developer of the JIT compiler in PCRE. I frequently check discussions about PCRE and found this comment here on bug-grep@HIDDEN:

> There's another way: fix libpcre so that it works on arbitrary binary data, without the need for prescreening
> the data. That's the fundamental problem here. 

This requires too much effort with no benefit. Reasons:

- what should you do if you encounter an invalid UTF-8 opcode: ignore it? decode it to some random value? For example, what should happen if you find a stray 0xe9? Does it match \xe9? Everybody has different opinion about handling invalid UTF opcodes, and this would lead to never-ending arguments on pcre-dev.

- the bigger problem is performance. Handling invalid UTF codes requires a lot of extra checks and kills many optimizations. For example, when we encounter a 0xc5, we know that the input buffer has at least one more byte, so we do not check the input buffer size. We also assume that the highest 2 bits of the second byte are 10, and do not check this when we decode that character. This would also kill other optimizations, such as the Boyer-Moore-like search in JIT. The major problem is, everybody would suffer this performance regression, including those, who pass valid UTF strings.

Therefore, for these reasons, such a change will never happen.

But there are alternatives.

* The best solution is multi-threaded grepping: one thread reads file data and replaces or removes invalid UTF-8 opcodes, leaving something valid. The other thread runs PCRE on the filtered data. Alternatively, you can convert everything to UTF-32 and use pcre32.

* The other solution is improving PCRE survivability: if the buffer passed to PCRE has at least one zero character code before the invalid input buffer, and maximum UTF character length - 1 (6 in UTF8, 1 in UTF 16) zeroes after the buffer, we could guarantee that PCRE does not crash and PCRE does not enter infinite loops. Nothing else is guaranteed, i.e. if you search for /ab/ and the invalid UTF sequence contains ab, it might not be found (or might be found with the interpreter but not with JIT, or vice versa). If you use pcre32, there is no need for any extra padding bytes.
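
The filtering step of the first alternative could look roughly like the sketch below. It is only an illustration: valid_seq_len and sanitize_utf8 are hypothetical names, the choice of '?' as the replacement byte is an assumption, and the threading itself is left out. It assumes modern UTF-8 with sequences of at most four bytes.

```c
#include <stddef.h>

/* Hypothetical helper: return the length (1..4) of the valid UTF-8
   sequence at p, given that avail bytes are readable, or 0 if the
   bytes there are truncated or not valid UTF-8.  */
static size_t
valid_seq_len (const unsigned char *p, size_t avail)
{
  if (avail == 0)
    return 0;

  unsigned long v = p[0];
  unsigned long min;
  size_t n;

  if (v < 0x80)
    return 1;
  else if ((v & 0xe0) == 0xc0)
    { n = 2; min = 0x80; v &= 0x1f; }
  else if ((v & 0xf0) == 0xe0)
    { n = 3; min = 0x800; v &= 0x0f; }
  else if ((v & 0xf8) == 0xf0)
    { n = 4; min = 0x10000; v &= 0x07; }
  else
    return 0;

  if (avail < n)
    return 0;
  for (size_t i = 1; i < n; i++)
    {
      if ((p[i] & 0xc0) != 0x80)
        return 0;
      v = (v << 6) | (p[i] & 0x3f);
    }
  if (v < min || v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff))
    return 0;
  return n;
}

/* Hypothetical filter: overwrite every byte that does not start a
   valid sequence with '?', so the buffer is valid UTF-8 afterwards.
   Returns the number of bytes replaced.  */
static size_t
sanitize_utf8 (unsigned char *buf, size_t len)
{
  size_t replaced = 0;
  for (size_t i = 0; i < len; )
    {
      size_t n = valid_seq_len (buf + i, len - i);
      if (n == 0)
        {
          buf[i++] = '?';
          replaced++;
        }
      else
        i += n;
    }
  return replaced;
}
```

A buffer cleaned this way could then be handed to the matcher with the validity check disabled. Note that replacing bytes changes what a pattern can match in the affected regions, and that this is still a separate pass over the input.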

Regards,
Zoltan





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 19 Sep 2014 16:07:12 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 19 12:07:12 2014
Received: from localhost ([127.0.0.1]:45966 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XV0iF-0001Vo-A8
	for submit <at> debbugs.gnu.org; Fri, 19 Sep 2014 12:07:11 -0400
Received: from mail-wg0-f48.google.com ([74.125.82.48]:61038)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <meyering@HIDDEN>) id 1XV0iC-0001Vf-JO
 for 18454 <at> debbugs.gnu.org; Fri, 19 Sep 2014 12:07:09 -0400
Received: by mail-wg0-f48.google.com with SMTP id m15so2692552wgh.7
 for <18454 <at> debbugs.gnu.org>; Fri, 19 Sep 2014 09:07:07 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type;
 bh=hm2NjWOybrqFQ9m2Hjr65mqjutZgSBiVs8hhzn8ZeDo=;
 b=WsJ4i9MIacK+OvSkNCmdk4TPbXeNgnx/bsybcRGRCYJeG1MIeYt2kd2KS73BwpRAk1
 8T7L6hImPESsO30M/b8RKPEKb7dUzTzwErpM2KwwRZ5A2T8wviYnt4mZHVW9pcPSXJ2d
 zBOkSv5bE0Bvpuc0GsO7xlx9kMlQ5oW5GILjWwMzLK696Rr8tKGXPJDRCQbyCsVAW//q
 Qm1ayd7DJRdQTBTA0ZZ6/uegC595h8ssnUYpuMZ8Ts0EySQqzz5vPEIzkfB/83yxAddZ
 dLuuLhnVXd42qMBXb198b5HC1iNSTDa4mLEZP2eujwCpDe8kCPWM1urQzOvelGJrR3L9
 RqIw==
X-Received: by 10.180.78.226 with SMTP id e2mr7077703wix.68.1411142827829;
 Fri, 19 Sep 2014 09:07:07 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.194.86.131 with HTTP; Fri, 19 Sep 2014 09:06:47 -0700 (PDT)
In-Reply-To: <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN> <20140918083327.GA16324@nomada>
 <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
From: Jim Meyering <jim@HIDDEN>
Date: Fri, 19 Sep 2014 09:06:47 -0700
X-Google-Sender-Auth: gfBoNdOkl51BZat2ai718EeTNoM
Message-ID: <CA+8g5KHdnaEB=yYF5Kp7XCd7GgvjL73HeoOHNEPCAqy0KPs6+w@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: =?ISO-8859-1?Q?Santiago_Ruano_Rinc=F3n?= <santiago@HIDDEN>
Content-Type: text/plain; charset=ISO-8859-1
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 18454
Cc: Paul Eggert <eggert@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.7 (/)

On Thu, Sep 18, 2014 at 12:36 PM, Jim Meyering <jim@HIDDEN> wrote:
> It looks like most of the difference is the result of
> commit cd36abd46c5e0768606979ea75a51732062f5624,
> "grep: treat a file as binary if its prefix contains encoding errors",

Hi Paul,

I found that the above commit induces a large performance hit.
Over 50x in this example:

  seq 99999999 > k
  LC_ALL=C diff -u \
    <(PATH=.bin/2.20-31:$PATH env time -f %e grep asdf k 2>&1) \
    <(PATH=.bin/2.20-32:$PATH env time -f %e grep asdf k 2>&1)
  ...
  -0.21
  +11.47

The problem is that the new function is processing all of
the input, not just a prefix.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 18 Sep 2014 19:37:21 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Sep 18 15:37:20 2014
Received: from localhost ([127.0.0.1]:44785 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XUhW3-0004re-T7
	for submit <at> debbugs.gnu.org; Thu, 18 Sep 2014 15:37:20 -0400
Received: from mail-we0-f180.google.com ([74.125.82.180]:54556)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <meyering@HIDDEN>) id 1XUhW1-0004rW-Ry
 for 18454 <at> debbugs.gnu.org; Thu, 18 Sep 2014 15:37:18 -0400
Received: by mail-we0-f180.google.com with SMTP id q59so1448701wes.25
 for <18454 <at> debbugs.gnu.org>; Thu, 18 Sep 2014 12:37:17 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type:content-transfer-encoding;
 bh=cTyiBHYsccPZ4mHMDRlQHyfGgZlMHthxCLurLkdYxik=;
 b=CAIhLB4nIUKmR1ugY6xAxrs1m0daW2vNEZVyovS8NJ5NJMmqWIEWQlJt8OAat2bwIP
 siyWAQaY5RhbiaUcu33BPRSbpxZZoXWAc0HRdqg1y8Py0m3neVdGJn4lOPvHOByzGSal
 Z98QGZyZCEu1U+LA0LjqDv1PL9gAXJD4hVoviO2PvPjM7rChyIFJZtDFKWBLHepQf2YN
 i+aoU+z4m4ZkJ1srArQcQ6UgsYzyQNfYCJpHdPLPIigeim4Jv3oasjED/NDNeSXlU3sc
 uPdAxSSBhOIEY/CVSMtXojTf9agwgFAFoxKF+t29xINIlkBjb635hNPd73VxrYl6Za49
 2oNg==
X-Received: by 10.194.8.232 with SMTP id u8mr7508156wja.64.1411069037110; Thu,
 18 Sep 2014 12:37:17 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.194.86.131 with HTTP; Thu, 18 Sep 2014 12:36:57 -0700 (PDT)
In-Reply-To: <20140918083327.GA16324@nomada>
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN> <20140918083327.GA16324@nomada>
From: Jim Meyering <jim@HIDDEN>
Date: Thu, 18 Sep 2014 12:36:57 -0700
X-Google-Sender-Auth: cafI-Ke3C0iV4ZGt9nrph0sbAwY
Message-ID: <CA+8g5KH3LY75wVb3WsL8dvTt4FhfiO=cuYCadRD2R=9nrpw_hg@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: =?ISO-8859-1?Q?Santiago_Ruano_Rinc=F3n?= <santiago@HIDDEN>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 18454
Cc: Paul Eggert <eggert@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.7 (/)

On Thu, Sep 18, 2014 at 1:33 AM, Santiago Ruano Rincón
<santiago@HIDDEN> wrote:
> On 17/09/14 at 23:00, Paul Eggert wrote:
>> I've installed all the patches mentioned so far.
>>
>
> I've successfully built the latest commit
> (f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any
> performance boost. Rather the opposite.
>
> Comparing with Debian's grep 2.20-3, which includes your first patch to solve
> this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:
>
> grep -P asdf /usr/bin/*  12,42s user 0,12s system 99% cpu 12,545 total
> src/grep -P asdf /usr/bin/*  14,37s user 0,12s system 99% cpu 14,492 total
>
> Note that basic grep also slows down:
>
> grep asdf /usr/bin/*  0,22s user 0,16s system 99% cpu 0,382 total
> src/grep asdf /usr/bin/*  1,26s user 0,12s system 99% cpu 1,384 total

Thank you for running timing comparisons.

Once I verified that I had no large, sparse files in my grep working directory,
I ran the same test there (du -sh . reports 176M, du --app -sh . reports 139M)

The following shows a performance regression when searching files
like those in my grep working directory.
The new grep (v2.20-46-gf6de00f) takes 2.5x longer than 2.20.14.
This is with a hot cache (best of several runs) on an
Intel(R) Xeon(R) CPU E5-2660, compiled with gcc-5.x

$ diff -u <(env time grep -r asdf . 2>&1) \
    <(PATH=src:$PATH env time grep -r asdf . 2>&1)
--- /proc/self/fd/11    2014-09-18 12:07:43.169721947 -0700
+++ /proc/self/fd/12    2014-09-18 12:07:43.169721947 -0700
@@ -1,3 +1,3 @@
 ./src/grep.c:               printf 'asdfqwerzxcv\rASDF\tZXCV\n'
 -0.08user 0.10system 0:00.18elapsed 100%CPU (0avgtext+0avgdata 6256maxresident)k
 -0inputs+0outputs (0major+670minor)pagefaults 0swaps
 +0.40user 0.11system 0:00.51elapsed 99%CPU (0avgtext+0avgdata 5328maxresident)k
 +0inputs+0outputs (0major+634minor)pagefaults 0swaps

It looks like most of the difference is the result of
commit cd36abd46c5e0768606979ea75a51732062f5624,
"grep: treat a file as binary if its prefix contains encoding errors",
with its new, locale-sensitive "is_binary" test. I saw the above timing
difference even with LC_ALL=C, so one quick fix would be to skip the use
of mbrlen when possible.
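
As a rough sketch of that quick fix (not grep's actual code): in a unibyte
locale every byte is a complete character, so a per-character mbrlen call
cannot find anything that a plain byte scan would miss. The function name
below is hypothetical, and the sketch assumes that in a unibyte locale only a
null byte marks a buffer as binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not grep's code.  Return true if BUF (SIZE
   bytes) contains a null byte or, in a multibyte locale, an invalid
   byte sequence.  */
bool
buffer_looks_binary (char const *buf, size_t size)
{
  assert (buf != NULL || size == 0);
  if (size == 0)
    return false;

  /* Fast path: in a unibyte locale (e.g. LC_ALL=C) no mbrlen calls
     are needed; one memchr over the buffer is enough.  */
  if (MB_CUR_MAX == 1)
    return memchr (buf, '\0', size) != NULL;

  /* Slow path for multibyte locales: decode character by character.  */
  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);
  size_t i = 0;
  while (i < size)
    {
      size_t n = mbrlen (buf + i, size - i, &mbs);
      if (n == 0 || n == (size_t) -1)
        return true;            /* null byte or invalid sequence */
      if (n == (size_t) -2)
        break;                  /* incomplete character at end of buffer */
      i += n;
    }
  return false;
}
```

The MB_CUR_MAX test is evaluated once per buffer, so the unibyte case pays
for a single memchr instead of one library call per character.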




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 18 Sep 2014 08:33:37 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Sep 18 04:33:36 2014
Received: from localhost ([127.0.0.1]:43906 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XUX9k-0000Ta-9l
	for submit <at> debbugs.gnu.org; Thu, 18 Sep 2014 04:33:36 -0400
Received: from mx1.riseup.net ([198.252.153.129]:41158)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <santiago@HIDDEN>) id 1XUX9h-0000TR-8O
 for 18454 <at> debbugs.gnu.org; Thu, 18 Sep 2014 04:33:34 -0400
Received: from berryeater.riseup.net (berryeater-pn.riseup.net [10.0.1.120])
 (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "*.riseup.net", Issuer "Gandi Standard SSL CA" (not verified))
 by mx1.riseup.net (Postfix) with ESMTPS id D9D9C51A50;
 Thu, 18 Sep 2014 01:33:31 -0700 (PDT)
Received: from [127.0.0.1] (localhost [127.0.0.1])
 (Authenticated sender: santiagorr) with ESMTPSA id EC5EA4202F
Received: by nomada (sSMTP sendmail emulation); Thu, 18 Sep 2014 10:33:27 +0200
Date: Thu, 18 Sep 2014 10:33:27 +0200
From: Santiago Ruano =?iso-8859-1?Q?Rinc=F3n?= <santiago@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20140918083327.GA16324@nomada>
References: <20140912012449.GB18162@HIDDEN>
 <541A750E.2050606@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <541A750E.2050606@HIDDEN>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Virus-Scanned: clamav-milter 0.98.4 at mx1
X-Virus-Status: Clean
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.0 (/)

On 17/09/14 at 23:00, Paul Eggert wrote:
> I've installed all the patches mentioned so far.
> 

I've successfully built the latest commit
(f6de00f6cec3831b8f334de7dbd1b59115627457), but I don't see any
performance boost. Rather the opposite.

Comparing with Debian's grep 2.20-3, which includes your first patch to solve
this -P issue, 0001-grep-P-invalid-utf8-non-matching.patch:

grep -P asdf /usr/bin/*  12,42s user 0,12s system 99% cpu 12,545 total
src/grep -P asdf /usr/bin/*  14,37s user 0,12s system 99% cpu 14,492 total

Note that basic grep also slows down:

grep asdf /usr/bin/*  0,22s user 0,16s system 99% cpu 0,382 total
src/grep asdf /usr/bin/*  1,26s user 0,12s system 99% cpu 1,384 total

Cheers, and thanks for your work,

Santiago




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 18 Sep 2014 06:01:12 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Sep 18 02:01:12 2014
Received: from localhost ([127.0.0.1]:43842 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XUUmF-00056C-HI
	for submit <at> debbugs.gnu.org; Thu, 18 Sep 2014 02:01:11 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:36146)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XUUmE-000564-1s
 for 18454 <at> debbugs.gnu.org; Thu, 18 Sep 2014 02:01:10 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id D4A9339E801C
 for <18454 <at> debbugs.gnu.org>; Wed, 17 Sep 2014 23:01:08 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id zYupyWTfHc+2 for <18454 <at> debbugs.gnu.org>;
 Wed, 17 Sep 2014 23:00:56 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 402F4A60003
 for <18454 <at> debbugs.gnu.org>; Wed, 17 Sep 2014 23:00:49 -0700 (PDT)
Message-ID: <541A750E.2050606@HIDDEN>
Date: Wed, 17 Sep 2014 23:00:46 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: 18454 <at> debbugs.gnu.org
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

I've installed all the patches mentioned so far.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 19:56:01 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Sep 17 15:56:01 2014
Received: from localhost ([127.0.0.1]:43624 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XULKa-0006Ys-Vw
	for submit <at> debbugs.gnu.org; Wed, 17 Sep 2014 15:56:01 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:42594)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XULKX-0006Yg-Ur
 for 18454 <at> debbugs.gnu.org; Wed, 17 Sep 2014 15:55:59 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id A9D2DA60001;
 Wed, 17 Sep 2014 12:55:54 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id SZ-Mflp5SCq5; Wed, 17 Sep 2014 12:55:50 -0700 (PDT)
Received: from penguin.cs.ucla.edu (Penguin.CS.UCLA.EDU [131.179.64.200])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 6076D39E8015;
 Wed, 17 Sep 2014 12:55:50 -0700 (PDT)
Message-ID: <5419E746.7070604@HIDDEN>
Date: Wed, 17 Sep 2014 12:55:50 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.0
MIME-Version: 1.0
To: Norihiro Tanaka <noritnk@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <5418E73E.2050002@HIDDEN> <20140917231749.B8AD.27F6AC2D@HIDDEN>
In-Reply-To: <20140917231749.B8AD.27F6AC2D@HIDDEN>
Content-Type: multipart/mixed; boundary="------------060303070004010300080405"
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

This is a multi-part message in MIME format.
--------------060303070004010300080405
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Thanks for reporting that.  I forgot that the code defaulted SEEK_HOLE
but not SEEK_DATA.  The first attached patch should fix it.  The second
one should improve performance further on Solaris for files that end in
holes.

--------------060303070004010300080405
Content-Type: text/x-patch;
 name="0001-grep-port-to-platforms-lacking-SEEK_DATA.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="0001-grep-port-to-platforms-lacking-SEEK_DATA.patch"

From 78497a2aaaaeae439f9546223b45b3b553146f36 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@HIDDEN>
Date: Wed, 17 Sep 2014 12:33:55 -0700
Subject: [PATCH 1/2] grep: port to platforms lacking SEEK_DATA

Reported by Norihiro Tanaka in: http://bugs.gnu.org/18454#38
* src/grep.c (SEEK_DATA): Default to SEEK_SET if not defined.
(SEEK_HOLE): Move to top level, and default it to SEEK_SET.
(file_textbin): Adjust to new default.
(fillbuf): Don't bother with SEEK_DATA if it defaults to SEEK_SET.
---
 src/grep.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/src/grep.c b/src/grep.c
index 3e94804..a08fa41 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -415,6 +415,15 @@ usable_st_size (struct stat const *st)
   return S_ISREG (st->st_mode) || S_TYPEISSHM (st) || S_TYPEISTMO (st);
 }
 
+/* Lame substitutes for SEEK_DATA and SEEK_HOLE on platforms lacking them.
+   Do not rely on these finding data or holes if they equal SEEK_SET.  */
+#ifndef SEEK_DATA
+enum { SEEK_DATA = SEEK_SET };
+#endif
+#ifndef SEEK_HOLE
+enum { SEEK_HOLE = SEEK_SET };
+#endif
+
 /* Functions we'll use to search. */
 typedef void (*compile_fp_t) (char const *, size_t);
 typedef size_t (*execute_fp_t) (char const *, size_t, size_t *, char const *);
@@ -474,10 +483,6 @@ buffer_textbin (char const *buf, size_t size)
 static enum textbin
 file_textbin (char const *buf, size_t bufsize, int fd, struct stat const *st)
 {
-  #ifndef SEEK_HOLE
-  enum { SEEK_HOLE = SEEK_END };
-  #endif
-
   enum textbin textbin = buffer_textbin (buf, bufsize);
   if (textbin_is_binary (textbin))
     return textbin;
@@ -488,7 +493,7 @@ file_textbin (char const *buf, size_t bufsize, int fd, struct stat const *st)
         return textbin == TEXTBIN_UNKNOWN ? TEXTBIN_BINARY : textbin;
 
       /* If the file has holes, it must contain a null byte somewhere.  */
-      if (SEEK_HOLE != SEEK_END && eolbyte)
+      if (SEEK_HOLE != SEEK_SET && eolbyte)
         {
           off_t cur = bufsize;
           if (O_BINARY || fd == STDIN_FILENO)
@@ -713,7 +718,7 @@ fillbuf (size_t save, struct stat const *st)
         break;
       totalnl = add_count (totalnl, fillsize);
 
-      if (!seek_data_failed)
+      if (SEEK_DATA != SEEK_SET && !seek_data_failed)
         {
           off_t data_start = lseek (bufdesc, bufoffset, SEEK_DATA);
           if (data_start < 0)
-- 
1.9.3


--------------060303070004010300080405
Content-Type: text/x-patch;
 name="0002-grep-speed-up-processing-of-holes-before-EOF-on-Sola.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename*0="0002-grep-speed-up-processing-of-holes-before-EOF-on-Sola.pa";
 filename*1="tch"

From 0d6febac38c03391d7eecb5335620a0ec5ba8278 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@HIDDEN>
Date: Wed, 17 Sep 2014 12:53:17 -0700
Subject: [PATCH 2/2] grep: speed up processing of holes before EOF on Solaris

* src/grep.c (fillbuf): If SEEK_DATA fails with errno == ENXIO,
skip over the hole at EOF.
---
 src/grep.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/grep.c b/src/grep.c
index a08fa41..35d3358 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -720,7 +720,12 @@ fillbuf (size_t save, struct stat const *st)
 
       if (SEEK_DATA != SEEK_SET && !seek_data_failed)
         {
+          /* Solaris SEEK_DATA fails with errno == ENXIO in a hole at EOF.  */
           off_t data_start = lseek (bufdesc, bufoffset, SEEK_DATA);
+          if (data_start < 0 && errno == ENXIO
+              && usable_st_size (st) && bufoffset < st->st_size)
+            data_start = lseek (bufdesc, 0, SEEK_END);
+
           if (data_start < 0)
             seek_data_failed = true;
           else
-- 
1.9.3


--------------060303070004010300080405--
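
The logic of the two patches above can be sketched as one standalone
function. This is only an illustration under stated assumptions: the name
`next_data_offset` and its `file_size` parameter are hypothetical, and the
real code works on grep's own buffer state inside fillbuf.

```c
#define _GNU_SOURCE             /* asks glibc to declare SEEK_DATA */
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Same fallback idea as patch 1/2: SEEK_SET means "not available".  */
#ifndef SEEK_DATA
enum { SEEK_DATA = SEEK_SET };
#endif

/* Hypothetical helper, not a function in grep.  Return the offset of
   the next data in FD at or after OFFSET.  If only a hole remains
   before end of file, return the end-of-file offset.  Return -1 if
   the position of the data cannot be determined.  */
off_t
next_data_offset (int fd, off_t offset, off_t file_size)
{
  assert (0 <= offset);

  if (SEEK_DATA == SEEK_SET)
    return -1;                  /* platform lacks SEEK_DATA */

  off_t data_start = lseek (fd, offset, SEEK_DATA);

  /* A hole that runs to EOF can make SEEK_DATA fail with ENXIO
     (the case patch 2/2 handles); skip to the end of the file.  */
  if (data_start < 0 && errno == ENXIO && offset < file_size)
    data_start = lseek (fd, 0, SEEK_END);

  return data_start;
}
```

A caller that gets -1 back would simply keep reading the file sequentially,
which mirrors how the patch sets seek_data_failed and stops trying.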




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 14:25:33 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Sep 17 10:25:33 2014
Received: from localhost ([127.0.0.1]:43544 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XUGAl-0006hs-UZ
	for submit <at> debbugs.gnu.org; Wed, 17 Sep 2014 10:25:32 -0400
Received: from mx1.redhat.com ([209.132.183.28]:13725)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eblake@HIDDEN>) id 1XUGAi-0006hi-Qv
 for 18454 <at> debbugs.gnu.org; Wed, 17 Sep 2014 10:25:29 -0400
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
 (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
 by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s8HEPPEa009147
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL);
 Wed, 17 Sep 2014 10:25:26 -0400
Received: from [10.3.113.56] (ovpn-113-56.phx2.redhat.com [10.3.113.56])
 by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id
 s8HEPOs7019319; Wed, 17 Sep 2014 10:25:25 -0400
Message-ID: <541999D4.6070802@HIDDEN>
Date: Wed, 17 Sep 2014 08:25:24 -0600
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat, Inc.
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.0
MIME-Version: 1.0
To: Norihiro Tanaka <noritnk@HIDDEN>, Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>	<5418E73E.2050002@HIDDEN>
 <20140917231749.B8AD.27F6AC2D@HIDDEN>
In-Reply-To: <20140917231749.B8AD.27F6AC2D@HIDDEN>
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q"
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22
X-Spam-Score: -5.7 (-----)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.7 (-----)

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

On 09/17/2014 08:17 AM, Norihiro Tanaka wrote:
> Thanks for many improvements.  I applied six patches to grep and tried to
> compile it, but after the sixth patch, I received 'SEEK_DATA' undeclared
> error.  I looked for it on CentOS 5.10, but I couldn't find it in standard
> header files (glibc 2.5.1) and gnulib files.
>

It should be fairly easy for gnulib to fake SEEK_DATA/SEEK_HOLE (by
treating all files as non-sparse).  I guess we haven't needed to do that
before now, because other GNU clients (such as coreutils and tar) of
this have been doing conditional compilation.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg

iQEcBAEBCAAGBQJUGZnUAAoJEKeha0olJ0NqegEH+wSkvcyVkCo2ydhzSfO6C+JT
bzzbGHod4eXVibvgc6JKKN7SiuPRa8/xgzcyMUqFOgTo3EMpsq6SW1Ilz+OWBU9P
7dCRSvg7zIR0S21BMvZF+7OGYR2lsRdIcF1dL3Oi+VT/0IIc6Nwi/ksri8eYljDl
yoUpJFf8uhxIT463d1txZ8SRGGJbEnYPlI1qTBPmD446ePPzbprFyV2APJxDFOSI
8wGKtkmp5EC96K6zSBzP7c1u0DAjSE3+l3XCRKoeurT9cea5tXEaUrsi3MAvG5+o
dJcbE+FmAOCc/hGnPs29nfDhdPkvVnurnq+Axg+nlrFG67dEdyHcqlXcEa5T+7k=
=qr+M
-----END PGP SIGNATURE-----

--2BiPBGkRHA8HiGp9wt2m2J7LT5Ij7Dq1q--
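
The "treat all files as non-sparse" idea mentioned above can be sketched as
a pure function. This is not gnulib code; the name and parameters are
hypothetical. It assumes the usual lseek convention that a file has an
implicit hole at end of file and that seeking at or past EOF fails with ENXIO.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical emulation, not gnulib code.  Treat a file of FILE_SIZE
   bytes as one run of data followed by the implicit hole at EOF.
   With WANT_HOLE false, act like SEEK_DATA; otherwise like SEEK_HOLE.
   Return the resulting offset, or -1 with errno set to ENXIO when
   OFFSET is at or past end of file.  */
off_t
seek_nonsparse (off_t offset, off_t file_size, bool want_hole)
{
  assert (0 <= offset && 0 <= file_size);

  if (file_size <= offset)
    {
      errno = ENXIO;
      return -1;
    }

  /* All bytes before EOF count as data; the only hole starts at EOF.  */
  return want_hole ? file_size : offset;
}
```

Under this emulation a caller never skips anything, so results stay correct
on sparse files; only the speedup from skipping holes is lost.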




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 14:18:02 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Sep 17 10:18:02 2014
Received: from localhost ([127.0.0.1]:43537 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XUG3V-0006Wr-UG
	for submit <at> debbugs.gnu.org; Wed, 17 Sep 2014 10:18:02 -0400
Received: from mailgw06.kcn.ne.jp ([61.86.7.213]:50240)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <noritnk@HIDDEN>) id 1XUG3R-0006WP-VI
 for 18454 <at> debbugs.gnu.org; Wed, 17 Sep 2014 10:17:59 -0400
Received: from imp01 (mailgw5.kcn.ne.jp [61.86.15.231])
 by mailgw06.kcn.ne.jp (Postfix) with ESMTP id 82E1CE80019
 for <18454 <at> debbugs.gnu.org>; Wed, 17 Sep 2014 23:17:55 +0900 (JST)
Received: from mail09.kcn.ne.jp ([61.86.6.188]) by imp01 with bizsmtp
 id sEHv1o00B43QJrh01EHvBg; Wed, 17 Sep 2014 23:17:55 +0900
X-OrgRCPT: 18454 <at> debbugs.gnu.org
Received: from [10.120.1.54] (i118-21-128-66.s30.a048.ap.plala.or.jp
 [118.21.128.66])
 by mail09.kcn.ne.jp (Postfix) with ESMTPA id B4CFF1BD00C6;
 Wed, 17 Sep 2014 23:17:54 +0900 (JST)
Date: Wed, 17 Sep 2014 23:17:50 +0900
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
In-Reply-To: <5418E73E.2050002@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <5418E73E.2050002@HIDDEN>
Message-Id: <20140917231749.B8AD.27F6AC2D@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit
X-Mailer: Becky! ver. 2.65.07 [ja]
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.7 (/)

Thanks for many improvements.  I applied six patches to grep and tried to
compile it, but after the sixth patch, I received a 'SEEK_DATA' undeclared
error.  I looked for it on CentOS 5.10, but I couldn't find it in standard
header files (glibc 2.5.1) and gnulib files.

==
gcc -std=gnu99 -DHAVE_CONFIG_H -I. -I..  -I../lib -I../lib    -g -O2 -MT
searchutils.o -MD -MP -MF $depbase.Tpo -c -o searchutils.o  searchutils.c && \
mv -f $depbase.Tpo $depbase.Po
grep.c: In function 'fillbuf':
grep.c:718: error: 'SEEK_DATA' undeclared (first use in this function)
grep.c:718: error: (Each undeclared identifier is reported only once
grep.c:718: error: for each function it appears in.)
Makefile:1309: recipe for target 'grep.o' failed
make[2]: *** [grep.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: Leaving directory '/b/grep-2.20/src'
Makefile:1238: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/b/grep-2.20'
Makefile:1179: recipe for target 'all' failed
make: *** [all] Error 2





Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 05:18:39 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Sep 17 01:18:38 2014
Received: from localhost ([127.0.0.1]:42891 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XU7dW-0008SJ-8y
	for submit <at> debbugs.gnu.org; Wed, 17 Sep 2014 01:18:38 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:35997)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XU7dT-0008S9-FA
 for 18454 <at> debbugs.gnu.org; Wed, 17 Sep 2014 01:18:36 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 9C23739E8015;
 Tue, 16 Sep 2014 22:18:34 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id afEVcAYIBKKK; Tue, 16 Sep 2014 22:18:29 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id C403B39E8012;
 Tue, 16 Sep 2014 22:18:29 -0700 (PDT)
Message-ID: <541919A5.5030604@HIDDEN>
Date: Tue, 16 Sep 2014 22:18:29 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: Jim Meyering <jim@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <5418E73E.2050002@HIDDEN>
 <CA+8g5KGW5+R6rzt3Q8Sy_9x-uD_erp0hOiXf41RtjTgsLm02Vg@HIDDEN>
In-Reply-To: <CA+8g5KGW5+R6rzt3Q8Sy_9x-uD_erp0hOiXf41RtjTgsLm02Vg@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.0 (---)

Jim Meyering wrote:
> Slightly surprised that 4/6 makes a measurable performance
> difference (didn't check),

It's not measurable.  I should have written it up as a cleanup more than 
as a speedup.

I did like breaking the nominal ZB/s barrier, though, in patch 6/6. 
That's waaaay more than the total throughput of the Internet.  It tops 
even the hypothetical throughput if one recruited the entire US freight 
industry to move nothing but MicroSD cards, which xkcd estimates would 
get only 0.06 ZB/s or so.

http://what-if.xkcd.com/31/




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 04:58:19 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Sep 17 00:58:19 2014
Received: from localhost ([127.0.0.1]:42882 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XU7Jq-0007s1-Fy
	for submit <at> debbugs.gnu.org; Wed, 17 Sep 2014 00:58:18 -0400
Received: from mail-wg0-f44.google.com ([74.125.82.44]:46490)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <meyering@HIDDEN>) id 1XU7Jm-0007rq-KM
 for 18454 <at> debbugs.gnu.org; Wed, 17 Sep 2014 00:58:15 -0400
Received: by mail-wg0-f44.google.com with SMTP id y10so781829wgg.27
 for <18454 <at> debbugs.gnu.org>; Tue, 16 Sep 2014 21:58:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=mime-version:sender:in-reply-to:references:from:date:message-id
 :subject:to:cc:content-type;
 bh=GV6E0EpkpTuCf2wPRFIEFWLkKDY70w9p7nLGCyKmYVM=;
 b=oPszO+gLZMllHQoJEPPwFRJdaAbA6tEOyMfQv7K+zfBA7hTQqyGRgoM44v6Ez/WEZN
 QCbc2w9RLpTbkr6FHEE3o8kxyRpAnfb4s9Q0ngzeBb1E04jzLPfeX8Na/0o7SA5fFoxu
 mnIts1AyypOJKtGhvhfzya9xq1lz1rpdBPavXpNVcSZl/vycfTThfMsDqj5snuCTpfYi
 2N4DlwRXQfZnE7neZEeunvi1ZCJ151MyjyVEkU527XbF9AwYjtqzm5EFV1m3F2/Zq3Tb
 Zb8GNHA8MdXeD5HvPDJct5XCy6w0XNTsqynbfxGdkGvzeFdp2Z4suaDW+QM+GXq4CYAO
 ngiw==
X-Received: by 10.194.108.73 with SMTP id hi9mr21906578wjb.88.1410929893327;
 Tue, 16 Sep 2014 21:58:13 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.194.41.202 with HTTP; Tue, 16 Sep 2014 21:57:53 -0700 (PDT)
In-Reply-To: <5418E73E.2050002@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <5418E73E.2050002@HIDDEN>
From: Jim Meyering <jim@HIDDEN>
Date: Tue, 16 Sep 2014 21:57:53 -0700
X-Google-Sender-Auth: 6GSyT1-p4nX5kjMZxDil4qE7M8A
Message-ID: <CA+8g5KGW5+R6rzt3Q8Sy_9x-uD_erp0hOiXf41RtjTgsLm02Vg@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
To: Paul Eggert <eggert@HIDDEN>
Content-Type: text/plain; charset=ISO-8859-1
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 18454
Cc: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org

On Tue, Sep 16, 2014 at 6:43 PM, Paul Eggert <eggert@HIDDEN> wrote:
> I worked on this some more, and came up with the attached patches proposed
> against the current grep Savannah master (commit
> 9ea9254ea58456b84ed2f0c1481ca91cdd325bf7).
>
> For years I've been wanting to write that last patch and I finally got
> around to it.  It improves grep -P's performance by a factor of 1.2 trillion
> on one (admittedly artificial) benchmark.  I hope its 1 ZB/s scan rate is
> some kind of record.  The last patch probably won't help your test cases,
> though I hope the other patches do help somewhat.

Awesome :-)  I found time to look through all but the 5th.
Slightly surprised that 4/6 makes a measurable performance
difference (didn't check), but moving away from a file-scoped
variable is an improvement in any case.
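
The remark about moving away from file scope can be sketched in C. This is only an illustration under stated assumptions: `struct scratch_stack`, `struct matcher`, `setup_before` and `setup_after` are hypothetical stand-ins invented for the example, not grep's or PCRE's actual types or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not grep's or PCRE's real types.  */
struct scratch_stack { size_t start_size; };
struct matcher { struct scratch_stack *stack; };

/* Before: the pointer lives at file scope, visible to every function
   in the file, although only setup_before needs it.  */
static struct scratch_stack *file_scoped_stack;

static bool
setup_before (struct matcher *m, size_t start_size)
{
  assert (m != NULL);
  file_scoped_stack = malloc (sizeof *file_scoped_stack);
  if (!file_scoped_stack)
    return false;
  file_scoped_stack->start_size = start_size;
  m->stack = file_scoped_stack;  /* the matcher keeps the pointer */
  return true;
}

/* After: the pointer is a local; it is needed only long enough to
   hand the allocation to the matcher.  */
static bool
setup_after (struct matcher *m, size_t start_size)
{
  assert (m != NULL);
  struct scratch_stack *stack = malloc (sizeof *stack);
  if (!stack)
    return false;
  stack->start_size = start_size;
  m->stack = stack;
  return true;
}
```

Both functions leave the matcher in the same state. The only difference is how widely the pointer is visible, which is why such a change reads as a cleanup rather than a speedup.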




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.
Severity set to 'normal' from 'wishlist'. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Added tag(s) patch. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 17 Sep 2014 01:43:41 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Sep 16 21:43:41 2014
Received: from localhost ([127.0.0.1]:42807 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XU4HS-0001n1-Rr
	for submit <at> debbugs.gnu.org; Tue, 16 Sep 2014 21:43:40 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:57999)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XU4HN-0001mm-65
 for 18454 <at> debbugs.gnu.org; Tue, 16 Sep 2014 21:43:35 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 3A732A60005;
 Tue, 16 Sep 2014 18:43:32 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id dWD7oQXOpMQu; Tue, 16 Sep 2014 18:43:27 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id DC5B139E8014;
 Tue, 16 Sep 2014 18:43:26 -0700 (PDT)
Message-ID: <5418E73E.2050002@HIDDEN>
Date: Tue, 16 Sep 2014 18:43:26 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: 18454 <at> debbugs.gnu.org, Vincent Lefevre <vincent@HIDDEN>
Subject: Re: Improve performance when -P (PCRE) is used in UTF-8 locales
Content-Type: multipart/mixed; boundary="------------090004040002010007040709"
X-Spam-Score: -3.0 (---)
X-Debbugs-Envelope-To: 18454

This is a multi-part message in MIME format.
--------------090004040002010007040709
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

I worked on this some more, and came up with the attached patches 
proposed against the current grep Savannah master (commit 
9ea9254ea58456b84ed2f0c1481ca91cdd325bf7).

For years I've been wanting to write that last patch and I finally got 
around to it.  It improves grep -P's performance by a factor of 1.2 
trillion on one (admittedly artificial) benchmark.  I hope its 1 ZB/s 
scan rate is some kind of record.  The last patch probably won't help 
your test cases, though I hope the other patches do help somewhat.

--------------090004040002010007040709
Content-Type: text/plain; charset=UTF-8;
 name="0001-grep-refactor-binary-vs-unknown-vs-text-flags-for-cl.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename*0="0001-grep-refactor-binary-vs-unknown-vs-text-flags-for-cl.pa";
 filename*1="tch"

RnJvbSA3OGU3ZGQ1Nzk4YTJiYmYwZWI3MTdkN2U2NTE1MGE1ZTU1YzViYTkxIE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE1OjUwOjQ3IC0wNzAwClN1YmplY3Q6IFtQQVRD
SCAxLzZdIGdyZXA6IHJlZmFjdG9yIGJpbmFyeS12cy11bmtub3duLXZzLXRleHQgZmxhZ3Mg
Zm9yCiBjbGFyaXR5CgoqIHNyYy9ncmVwLmMgKGVudW0gdGV4dGJpbik6IE5ldyBlbnVtLgoo
dGV4dGJpbl9pc19iaW5hcnkpOiBOZXcgZnVuY3Rpb24uCihidWZmZXJfdGV4dGJpbiwgZmls
ZV90ZXh0YmluLCBncmVwKTogVXNlIHRoZW0sIGZvciBjbGFyaXR5LgotLS0KIHNyYy9ncmVw
LmMgfCA4NiArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrLS0tLS0t
LS0tLS0tLS0tLS0tLS0tLQogMSBmaWxlIGNoYW5nZWQsIDU1IGluc2VydGlvbnMoKyksIDMx
IGRlbGV0aW9ucygtKQoKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmlu
ZGV4IGU0Mzc5YmMuLjFhYTY0ZGIgMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3Jj
L2dyZXAuYwpAQCAtNDM3LDE2ICs0MzcsMzggQEAgY2xlYW5fdXBfc3Rkb3V0ICh2b2lkKQog
ICAgIGNsb3NlX3N0ZG91dCAoKTsKIH0KIAotLyogUmV0dXJuIDEgaWYgQlVGIChvZiBzaXpl
IFNJWkUpIGNvbnRhaW5zIHRleHQsIC0xIGlmIGl0IGNvbnRhaW5zCi0gICBiaW5hcnkgZGF0
YSwgYW5kIDAgaWYgdGhlIGFuc3dlciBkZXBlbmRzIG9uIHdoYXQgY29tZXMgaW1tZWRpYXRl
bHkKLSAgIGFmdGVyIEJVRi4gICovCi1zdGF0aWMgaW50CisvKiBBbiBlbnVtIHRleHRiaW4g
ZGVzY3JpYmVzIHRoZSBmaWxlJ3MgdHlwZSwgaW5mZXJyZWQgZnJvbSBkYXRhIHJlYWQKKyAg
IGJlZm9yZSB0aGUgZmlyc3QgbGluZSBpcyBzZWxlY3RlZCBmb3Igb3V0cHV0LiAgKi8KK2Vu
dW0gdGV4dGJpbgorICB7CisgICAgLyogQmluYXJ5LCBhcyBpdCBjb250YWlucyBudWxsIGJ5
dGVzIGFuZCB0aGUgLXogb3B0aW9uIGlzIG5vdCBpbiBlZmZlY3QsCisgICAgICAgb3IgaXQg
Y29udGFpbnMgZW5jb2RpbmcgZXJyb3JzLiAgKi8KKyAgICBURVhUQklOX0JJTkFSWSA9IC0x
LAorCisgICAgLyogTm90IGtub3duIHlldC4gIE9ubHkgdGV4dCBoYXMgYmVlbiBzZWVuIHNv
IGZhci4gICovCisgICAgVEVYVEJJTl9VTktOT1dOID0gMCwKKworICAgIC8qIFRleHQuICAq
LworICAgIFRFWFRCSU5fVEVYVCA9IDEKKyAgfTsKKworc3RhdGljIGJvb2wKK3RleHRiaW5f
aXNfYmluYXJ5IChlbnVtIHRleHRiaW4gdGV4dGJpbikKK3sKKyAgcmV0dXJuIHRleHRiaW4g
PCBURVhUQklOX1VOS05PV047Cit9CisKKy8qIFJldHVybiB0aGUgdGV4dCB0eXBlIG9mIGRh
dGEgaW4gQlVGLCBvZiBzaXplIFNJWkUuICAqLworc3RhdGljIGVudW0gdGV4dGJpbgogYnVm
ZmVyX3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiB7CiAgIGNoYXIg
YmFkYnl0ZSA9IGVvbGJ5dGUgPyAnXDAnIDogJ1wyMDAnOwogCiAgIGlmIChNQl9DVVJfTUFY
IDw9IDEpCi0gICAgcmV0dXJuIG1lbWNociAoYnVmLCBiYWRieXRlLCBzaXplKSA/IC0xIDog
MTsKKyAgICB7CisgICAgICBpZiAobWVtY2hyIChidWYsIGJhZGJ5dGUsIHNpemUpKQorICAg
ICAgICByZXR1cm4gVEVYVEJJTl9CSU5BUlk7CisgICAgfQogICBlbHNlCiAgICAgewogICAg
ICAgbWJzdGF0ZV90IG1icyA9IHsgMCB9OwpAQCAtNDU2LDM1ICs0NzgsMzMgQEAgYnVmZmVy
X3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAgICAgICBmb3IgKHAg
PSBidWY7IHAgPCBidWYgKyBzaXplOyBwICs9IGNsZW4pCiAgICAgICAgIHsKICAgICAgICAg
ICBpZiAoKnAgPT0gYmFkYnl0ZSkKLSAgICAgICAgICAgIHJldHVybiAtMTsKKyAgICAgICAg
ICAgIHJldHVybiBURVhUQklOX0JJTkFSWTsKICAgICAgICAgICBjbGVuID0gbWJfY2xlbiAo
cCwgYnVmICsgc2l6ZSAtIHAsICZtYnMpOwogICAgICAgICAgIGlmICgoc2l6ZV90KSAtMiA8
PSBjbGVuKQotICAgICAgICAgICAgcmV0dXJuIGNsZW4gPT0gKHNpemVfdCkgLTIgPyAwIDog
LTE7CisgICAgICAgICAgICByZXR1cm4gY2xlbiA9PSAoc2l6ZV90KSAtMiA/IFRFWFRCSU5f
VU5LTk9XTiA6IFRFWFRCSU5fQklOQVJZOwogICAgICAgICB9Ci0KLSAgICAgIHJldHVybiAx
OwogICAgIH0KKworICByZXR1cm4gVEVYVEJJTl9URVhUOwogfQogCi0vKiBSZXR1cm4gMSBp
ZiBhIGZpbGUgaXMga25vd24gdG8gYmUgdGV4dCBmb3IgdGhlIHB1cnBvc2Ugb2YgJ2dyZXAn
LgotICAgUmV0dXJuIC0xIGlmIGl0IGlzIGtub3duIHRvIGJlIGJpbmFyeSwgMCBpZiB1bmtu
b3duLgotICAgQlVGLCBvZiBzaXplIEJVRlNJWkUsIGlzIHRoZSBpbml0aWFsIGJ1ZmZlciBy
ZWFkIGZyb20gdGhlIGZpbGUgd2l0aAotICAgZGVzY3JpcHRvciBGRCBhbmQgc3RhdHVzIFNU
LiAgKi8KLXN0YXRpYyBpbnQKKy8qIFJldHVybiB0aGUgdGV4dCB0eXBlIG9mIGEgZmlsZS4g
IEJVRiwgb2Ygc2l6ZSBCVUZTSVpFLCBpcyB0aGUgaW5pdGlhbAorICAgYnVmZmVyIHJlYWQg
ZnJvbSB0aGUgZmlsZSB3aXRoIGRlc2NyaXB0b3IgRkQgYW5kIHN0YXR1cyBTVC4gICovCitz
dGF0aWMgZW51bSB0ZXh0YmluCiBmaWxlX3RleHRiaW4gKGNoYXIgY29uc3QgKmJ1Ziwgc2l6
ZV90IGJ1ZnNpemUsIGludCBmZCwgc3RydWN0IHN0YXQgY29uc3QgKnN0KQogewogICAjaWZu
ZGVmIFNFRUtfSE9MRQogICBlbnVtIHsgU0VFS19IT0xFID0gU0VFS19FTkQgfTsKICAgI2Vu
ZGlmCiAKLSAgaW50IHRleHRiaW4gPSBidWZmZXJfdGV4dGJpbiAoYnVmLCBidWZzaXplKTsK
LSAgaWYgKHRleHRiaW4gPCAwKQorICBlbnVtIHRleHRiaW4gdGV4dGJpbiA9IGJ1ZmZlcl90
ZXh0YmluIChidWYsIGJ1ZnNpemUpOworICBpZiAodGV4dGJpbl9pc19iaW5hcnkgKHRleHRi
aW4pKQogICAgIHJldHVybiB0ZXh0YmluOwogCiAgIGlmICh1c2FibGVfc3Rfc2l6ZSAoc3Qp
KQogICAgIHsKICAgICAgIGlmIChzdC0+c3Rfc2l6ZSA8PSBidWZzaXplKQotICAgICAgICBy
ZXR1cm4gMiAqIHRleHRiaW4gLSAxOworICAgICAgICByZXR1cm4gdGV4dGJpbiA9PSBURVhU
QklOX1VOS05PV04gPyBURVhUQklOX0JJTkFSWSA6IHRleHRiaW47CiAKICAgICAgIC8qIElm
IHRoZSBmaWxlIGhhcyBob2xlcywgaXQgbXVzdCBjb250YWluIGEgbnVsbCBieXRlIHNvbWV3
aGVyZS4gICovCiAgICAgICBpZiAoU0VFS19IT0xFICE9IFNFRUtfRU5EICYmIGVvbGJ5dGUp
CkBAIC00OTQsNyArNTE0LDcgQEAgZmlsZV90ZXh0YmluIChjaGFyIGNvbnN0ICpidWYsIHNp
emVfdCBidWZzaXplLCBpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgICAgICAg
ICAgIHsKICAgICAgICAgICAgICAgY3VyID0gbHNlZWsgKGZkLCAwLCBTRUVLX0NVUik7CiAg
ICAgICAgICAgICAgIGlmIChjdXIgPCAwKQotICAgICAgICAgICAgICAgIHJldHVybiAwOwor
ICAgICAgICAgICAgICAgIHJldHVybiBURVhUQklOX1VOS05PV047CiAgICAgICAgICAgICB9
CiAKICAgICAgICAgICAvKiBMb29rIGZvciBhIGhvbGUgYWZ0ZXIgdGhlIGN1cnJlbnQgbG9j
YXRpb24uICAqLwpAQCAtNTA0LDEyICs1MjQsMTIgQEAgZmlsZV90ZXh0YmluIChjaGFyIGNv
bnN0ICpidWYsIHNpemVfdCBidWZzaXplLCBpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpz
dCkKICAgICAgICAgICAgICAgaWYgKGxzZWVrIChmZCwgY3VyLCBTRUVLX1NFVCkgPCAwKQog
ICAgICAgICAgICAgICAgIHN1cHByZXNzaWJsZV9lcnJvciAoZmlsZW5hbWUsIGVycm5vKTsK
ICAgICAgICAgICAgICAgaWYgKGhvbGVfc3RhcnQgPCBzdC0+c3Rfc2l6ZSkKLSAgICAgICAg
ICAgICAgICByZXR1cm4gLTE7CisgICAgICAgICAgICAgICAgcmV0dXJuIFRFWFRCSU5fQklO
QVJZOwogICAgICAgICAgICAgfQogICAgICAgICB9CiAgICAgfQogCi0gIHJldHVybiAwOwor
ICByZXR1cm4gVEVYVEJJTl9VTktOT1dOOwogfQogCiAvKiBDb252ZXJ0IFNUUiB0byBhIG5v
bm5lZ2F0aXZlIGludGVnZXIsIHN0b3JpbmcgdGhlIHJlc3VsdCBpbiAqT1VULgpAQCAtMTEy
OSw3ICsxMTQ5LDcgQEAgc3RhdGljIGludG1heF90CiBncmVwIChpbnQgZmQsIHN0cnVjdCBz
dGF0IGNvbnN0ICpzdCkKIHsKICAgaW50bWF4X3QgbmxpbmVzLCBpOwotICBpbnQgdGV4dGJp
bjsKKyAgZW51bSB0ZXh0YmluIHRleHRiaW47CiAgIHNpemVfdCByZXNpZHVlLCBzYXZlOwog
ICBjaGFyIG9sZGM7CiAgIGNoYXIgKmJlZzsKQEAgLTExNTksMTEgKzExNzksMTEgQEAgZ3Jl
cCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAgICAgfQogCiAgIGlmIChiaW5h
cnlfZmlsZXMgPT0gVEVYVF9CSU5BUllfRklMRVMpCi0gICAgdGV4dGJpbiA9IDE7CisgICAg
dGV4dGJpbiA9IFRFWFRCSU5fVEVYVDsKICAgZWxzZQogICAgIHsKICAgICAgIHRleHRiaW4g
PSBmaWxlX3RleHRiaW4gKGJ1ZmJlZywgYnVmbGltIC0gYnVmYmVnLCBmZCwgc3QpOwotICAg
ICAgaWYgKHRleHRiaW4gPCAwKQorICAgICAgaWYgKHRleHRiaW5faXNfYmluYXJ5ICh0ZXh0
YmluKSkKICAgICAgICAgewogICAgICAgICAgIGlmIChiaW5hcnlfZmlsZXMgPT0gV0lUSE9V
VF9NQVRDSF9CSU5BUllfRklMRVMpCiAgICAgICAgICAgICByZXR1cm4gMDsKQEAgLTEyMjMs
OCArMTI0Myw4IEBAIGdyZXAgKGludCBmZCwgc3RydWN0IHN0YXQgY29uc3QgKnN0KQogICAg
ICAgLyogRGV0ZWN0IHdoZXRoZXIgbGVhZGluZyBjb250ZXh0IGlzIGFkamFjZW50IHRvIHBy
ZXZpb3VzIG91dHB1dC4gICovCiAgICAgICBpZiAobGFzdG91dCkKICAgICAgICAgewotICAg
ICAgICAgIGlmICghdGV4dGJpbikKLSAgICAgICAgICAgIHRleHRiaW4gPSAxOworICAgICAg
ICAgIGlmICh0ZXh0YmluID09IFRFWFRCSU5fVU5LTk9XTikKKyAgICAgICAgICAgIHRleHRi
aW4gPSBURVhUQklOX1RFWFQ7CiAgICAgICAgICAgaWYgKGJlZyAhPSBsYXN0b3V0KQogICAg
ICAgICAgICAgbGFzdG91dCA9IDA7CiAgICAgICAgIH0KQEAgLTEyNDMsMTIgKzEyNjMsMTYg
QEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAKICAgICAgIC8qIElm
IHRoZSBmaWxlJ3MgdGV4dGJpbiBoYXMgbm90IGJlZW4gZGV0ZXJtaW5lZCB5ZXQsIGFzc3Vt
ZQogICAgICAgICAgaXQncyBiaW5hcnkgaWYgdGhlIG5leHQgaW5wdXQgYnVmZmVyIHN1Z2dl
c3RzIHNvLiAgKi8KLSAgICAgIGlmICghIHRleHRiaW4gJiYgYnVmZmVyX3RleHRiaW4gKGJ1
ZmJlZywgYnVmbGltIC0gYnVmYmVnKSA8IDApCisgICAgICBpZiAodGV4dGJpbiA9PSBURVhU
QklOX1VOS05PV04pCiAgICAgICAgIHsKLSAgICAgICAgICB0ZXh0YmluID0gLTE7Ci0gICAg
ICAgICAgaWYgKGJpbmFyeV9maWxlcyA9PSBXSVRIT1VUX01BVENIX0JJTkFSWV9GSUxFUykK
LSAgICAgICAgICAgIHJldHVybiAwOwotICAgICAgICAgIGRvbmVfb25fbWF0Y2ggPSBvdXRf
cXVpZXQgPSB0cnVlOworICAgICAgICAgIGVudW0gdGV4dGJpbiB0YiA9IGJ1ZmZlcl90ZXh0
YmluIChidWZiZWcsIGJ1ZmxpbSAtIGJ1ZmJlZyk7CisgICAgICAgICAgaWYgKHRleHRiaW5f
aXNfYmluYXJ5ICh0YikpCisgICAgICAgICAgICB7CisgICAgICAgICAgICAgIGlmIChiaW5h
cnlfZmlsZXMgPT0gV0lUSE9VVF9NQVRDSF9CSU5BUllfRklMRVMpCisgICAgICAgICAgICAg
ICAgcmV0dXJuIDA7CisgICAgICAgICAgICAgIHRleHRiaW4gPSB0YjsKKyAgICAgICAgICAg
ICAgZG9uZV9vbl9tYXRjaCA9IG91dF9xdWlldCA9IHRydWU7CisgICAgICAgICAgICB9CiAg
ICAgICAgIH0KICAgICB9CiAgIGlmIChyZXNpZHVlKQpAQCAtMTI2Myw3ICsxMjg3LDcgQEAg
Z3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAgZmluaXNoX2dyZXA6CiAg
IGRvbmVfb25fbWF0Y2ggPSBkb25lX29uX21hdGNoXzA7CiAgIG91dF9xdWlldCA9IG91dF9x
dWlldF8wOwotICBpZiAodGV4dGJpbiA8IDAgJiYgIW91dF9xdWlldCAmJiBubGluZXMgIT0g
MCkKKyAgaWYgKHRleHRiaW5faXNfYmluYXJ5ICh0ZXh0YmluKSAmJiAhb3V0X3F1aWV0ICYm
IG5saW5lcyAhPSAwKQogICAgIHByaW50ZiAoXygiQmluYXJ5IGZpbGUgJXMgbWF0Y2hlc1xu
IiksIGZpbGVuYW1lKTsKICAgcmV0dXJuIG5saW5lczsKIH0KLS0gCjEuOS4zCgo=
--------------090004040002010007040709
Content-Type: text/plain; charset=UTF-8;
 name="0002-grep-z-no-longer-considers-200-to-be-binary-data.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename*0="0002-grep-z-no-longer-considers-200-to-be-binary-data.patch"

RnJvbSAyNDFhNzYyM2NhNTE5YTdmZjcyNjAxZGU4MDlmYjUyMzkzNDk0MzU0IE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE2OjE4OjAwIC0wNzAwClN1YmplY3Q6IFtQQVRD
SCAyLzZdIGdyZXA6IC16IG5vIGxvbmdlciBjb25zaWRlcnMgJ1wyMDAnIHRvIGJlIGJpbmFy
eSBkYXRhCgpUaGlzIGF2b2lkcyBhIHByb2JsZW0gd2hlbiB1c2luZyBncmVwIC16IGluIGEg
V2luZG93cy0xMjUyIGxvY2FsZS4KUGx1cywgaXQgbGV0cyAnZ3JlcCAteicgcnVuIGEgYml0
IGZhc3Rlci4KKiBORVdTOiBEb2N1bWVudCB0aGlzLgoqIHNyYy9ncmVwLmMgKGJ1ZmZlcl90
ZXh0YmluKTogRG9uJ3QgbG9vayBmb3IgJ1wyMDAnIGlmIC16LgoqIHRlc3RzL3BjcmUtejog
VGVzdCBmb3IgbmV3IGJlaGF2aW9yLgotLS0KIE5FV1MgICAgICAgICB8ICAyICsrCiBzcmMv
Z3JlcC5jICAgfCAxMiArKystLS0tLS0tLS0KIHRlc3RzL3BjcmUteiB8ICA0ICsrKysKIDMg
ZmlsZXMgY2hhbmdlZCwgOSBpbnNlcnRpb25zKCspLCA5IGRlbGV0aW9ucygtKQoKZGlmZiAt
LWdpdCBhL05FV1MgYi9ORVdTCmluZGV4IDkzNzdkN2QuLjUxYjYzZmIgMTAwNjQ0Ci0tLSBh
L05FV1MKKysrIGIvTkVXUwpAQCAtMjYsNiArMjYsOCBAQCBHTlUgZ3JlcCBORVdTICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgLSotIG91dGxpbmUgLSotCiAgIEluIGxv
Y2FsZXMgd2l0aCBtdWx0aWJ5dGUgY2hhcmFjdGVyIGVuY29kaW5ncyBvdGhlciB0aGFuIFVU
Ri04LAogICBncmVwIC1QIG5vdyByZXBvcnRzIGFuIGVycm9yIGFuZCBleGl0cyBpbnN0ZWFk
IG9mIG1pc2JlaGF2aW5nLgogCisgIGdyZXAgLXogbm8gbG9uZ2VyIGF1dG9tYXRpY2FsbHkg
dHJlYXRzIHRoZSBieXRlICdcMjAwJyBhcyBiaW5hcnkgZGF0YS4KKwogKiBOb3Rld29ydGh5
IGNoYW5nZXMgaW4gcmVsZWFzZSAyLjIwICgyMDE0LTA2LTAzKSBbc3RhYmxlXQogCiAqKiBC
dWcgZml4ZXMKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IDFh
YTY0ZGIuLjFjNmZlZTggMTAwNjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAu
YwpAQCAtNDYyLDE0ICs0NjIsMTAgQEAgdGV4dGJpbl9pc19iaW5hcnkgKGVudW0gdGV4dGJp
biB0ZXh0YmluKQogc3RhdGljIGVudW0gdGV4dGJpbgogYnVmZmVyX3RleHRiaW4gKGNoYXIg
Y29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiB7Ci0gIGNoYXIgYmFkYnl0ZSA9IGVvbGJ5dGUg
PyAnXDAnIDogJ1wyMDAnOworICBpZiAoZW9sYnl0ZSAmJiBtZW1jaHIgKGJ1ZiwgJ1wwJywg
c2l6ZSkpCisgICAgcmV0dXJuIFRFWFRCSU5fQklOQVJZOwogCi0gIGlmIChNQl9DVVJfTUFY
IDw9IDEpCi0gICAgewotICAgICAgaWYgKG1lbWNociAoYnVmLCBiYWRieXRlLCBzaXplKSkK
LSAgICAgICAgcmV0dXJuIFRFWFRCSU5fQklOQVJZOwotICAgIH0KLSAgZWxzZQorICBpZiAo
MSA8IE1CX0NVUl9NQVgpCiAgICAgewogICAgICAgbWJzdGF0ZV90IG1icyA9IHsgMCB9Owog
ICAgICAgc2l6ZV90IGNsZW47CkBAIC00NzcsOCArNDczLDYgQEAgYnVmZmVyX3RleHRiaW4g
KGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90IHNpemUpCiAKICAgICAgIGZvciAocCA9IGJ1Zjsg
cCA8IGJ1ZiArIHNpemU7IHAgKz0gY2xlbikKICAgICAgICAgewotICAgICAgICAgIGlmICgq
cCA9PSBiYWRieXRlKQotICAgICAgICAgICAgcmV0dXJuIFRFWFRCSU5fQklOQVJZOwogICAg
ICAgICAgIGNsZW4gPSBtYl9jbGVuIChwLCBidWYgKyBzaXplIC0gcCwgJm1icyk7CiAgICAg
ICAgICAgaWYgKChzaXplX3QpIC0yIDw9IGNsZW4pCiAgICAgICAgICAgICByZXR1cm4gY2xl
biA9PSAoc2l6ZV90KSAtMiA/IFRFWFRCSU5fVU5LTk9XTiA6IFRFWFRCSU5fQklOQVJZOwpk
aWZmIC0tZ2l0IGEvdGVzdHMvcGNyZS16IGIvdGVzdHMvcGNyZS16CmluZGV4IDk5ZWJjNDMu
LjZiYmRlOTQgMTAwNzU1Ci0tLSBhL3Rlc3RzL3BjcmUtegorKysgYi90ZXN0cy9wY3JlLXoK
QEAgLTIwLDQgKzIwLDggQEAgZ3JlcCAtUHogIiRSRUdFWCIgaW4gPiBvdXQgMj5lcnIgfHwg
ZmFpbD0xCiBjb21wYXJlIGV4cCBvdXQgfHwgZmFpbD0xCiBjb21wYXJlIC9kZXYvbnVsbCBl
cnIgfHwgZmFpbD0xCiAKK3ByaW50ZiAnXDIwMFwwJyA+aW4wCitMQ19BTEw9QyBncmVwIC16
IC4gaW4wID5vdXQgfHwgZmFpbD0xCitjb21wYXJlIGluMCBvdXQgfHwgZmFpbD0xCisKIEV4
aXQgJGZhaWwKLS0gCjEuOS4zCgo=
--------------090004040002010007040709
Content-Type: text/plain; charset=UTF-8;
 name="0003-grep-non-text-bytes-in-binary-data-may-be-treated-as.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename*0="0003-grep-non-text-bytes-in-binary-data-may-be-treated-as.pa";
 filename*1="tch"

RnJvbSBjMGM2OTBiZTE1MGQyNjA5N2MzNTgxMDM5YzdlODgyYjJhM2UxOWQ4IE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE3OjE1OjA2IC0wNzAwClN1YmplY3Q6IFtQQVRD
SCAzLzZdIGdyZXA6IG5vbi10ZXh0IGJ5dGVzIGluIGJpbmFyeSBkYXRhIG1heSBiZSB0cmVh
dGVkIGFzCiBsaW5lIGVuZHMKCiogTkVXUywgZG9jL2dyZXAudGV4aSAoRmlsZSBhbmQgRGly
ZWN0b3J5IFNlbGVjdGlvbik6CkRvY3VtZW50IHRoaXMgY2hhbmdlLgoqIHNyYy9ncmVwLmMg
KHphcF9udWxzKTogTmV3IGZ1bmN0aW9uLgooZ3JlcCk6IFVzZSBpdC4KKiB0ZXN0cy9udWxs
LWJ5dGU6IFJlbGF4IHRvIGFsbG93IG5ldyBiZWhhdmlvci4KLS0tCiBORVdTICAgICAgICAg
ICAgfCAgMyArKysKIGRvYy9ncmVwLnRleGkgICB8ICAyICsrCiBzcmMvZ3JlcC5jICAgICAg
fCAyOCArKysrKysrKysrKysrKysrKysrKysrKysrKystCiB0ZXN0cy9udWxsLWJ5dGUgfCAg
NCArKy0tCiA0IGZpbGVzIGNoYW5nZWQsIDM0IGluc2VydGlvbnMoKyksIDMgZGVsZXRpb25z
KC0pCgpkaWZmIC0tZ2l0IGEvTkVXUyBiL05FV1MKaW5kZXggNTFiNjNmYi4uNzMzMzE4ZCAx
MDA2NDQKLS0tIGEvTkVXUworKysgYi9ORVdTCkBAIC0yNiw2ICsyNiw5IEBAIEdOVSBncmVw
IE5FV1MgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAtKi0gb3V0bGluZSAt
Ki0KICAgSW4gbG9jYWxlcyB3aXRoIG11bHRpYnl0ZSBjaGFyYWN0ZXIgZW5jb2RpbmdzIG90
aGVyIHRoYW4gVVRGLTgsCiAgIGdyZXAgLVAgbm93IHJlcG9ydHMgYW4gZXJyb3IgYW5kIGV4
aXRzIGluc3RlYWQgb2YgbWlzYmVoYXZpbmcuCiAKKyAgV2hlbiBzZWFyY2hpbmcgYmluYXJ5
IGRhdGEsIGdyZXAgbm93IG1heSB0cmVhdCBub24tdGV4dCBieXRlcyBhcworICBsaW5lIHRl
cm1pbmF0b3JzLiAgVGhpcyBjYW4gYm9vc3QgcGVyZm9ybWFuY2Ugc2lnbmlmaWNhbnRseS4K
KwogICBncmVwIC16IG5vIGxvbmdlciBhdXRvbWF0aWNhbGx5IHRyZWF0cyB0aGUgYnl0ZSAn
XDIwMCcgYXMgYmluYXJ5IGRhdGEuCiAKICogTm90ZXdvcnRoeSBjaGFuZ2VzIGluIHJlbGVh
c2UgMi4yMCAoMjAxNC0wNi0wMykgW3N0YWJsZV0KZGlmZiAtLWdpdCBhL2RvYy9ncmVwLnRl
eGkgYi9kb2MvZ3JlcC50ZXhpCmluZGV4IDE0YmQ2OWUuLmQ3YWRjYWQgMTAwNjQ0Ci0tLSBh
L2RvYy9ncmVwLnRleGkKKysrIGIvZG9jL2dyZXAudGV4aQpAQCAtNjAwLDYgKzYwMCw4IEBA
IEJ5IGRlZmF1bHQsIEB2YXJ7dHlwZX0gaXMgQHNhbXB7YmluYXJ5fSwKIGFuZCBAY29tbWFu
ZHtncmVwfSBub3JtYWxseSBvdXRwdXRzIGVpdGhlcgogYSBvbmUtbGluZSBtZXNzYWdlIHNh
eWluZyB0aGF0IGEgYmluYXJ5IGZpbGUgbWF0Y2hlcywKIG9yIG5vIG1lc3NhZ2UgaWYgdGhl
cmUgaXMgbm8gbWF0Y2guCitXaGVuIG1hdGNoaW5nIGJpbmFyeSBkYXRhLCBAY29tbWFuZHtn
cmVwfSBtYXkgdHJlYXQgbm9uLXRleHQKK2J5dGVzIGFzIGxpbmUgdGVybWluYXRvcnMuCiAK
IElmIEB2YXJ7dHlwZX0gaXMgQHNhbXB7d2l0aG91dC1tYXRjaH0sCiBAY29tbWFuZHtncmVw
fSBhc3N1bWVzIHRoYXQgYSBiaW5hcnkgZmlsZSBkb2VzIG5vdCBtYXRjaDsKZGlmZiAtLWdp
dCBhL3NyYy9ncmVwLmMgYi9zcmMvZ3JlcC5jCmluZGV4IDFjNmZlZTguLjgzNTU5ZTIgMTAw
NjQ0Ci0tLSBhL3NyYy9ncmVwLmMKKysrIGIvc3JjL2dyZXAuYwpAQCAtMTA5Myw5ICsxMDkz
LDMwIEBAIHBydGV4dCAoY2hhciBjb25zdCAqYmVnLCBjaGFyIGNvbnN0ICpsaW0pCiAgIG91
dGxlZnQgLT0gbjsKIH0KIAorLyogUmVwbGFjZSBhbGwgTlVMIGJ5dGVzIGluIGJ1ZmZlciBQ
ICh3aGljaCBlbmRzIGF0IExJTSkgd2l0aCBFT0wuCisgICBUaGlzIGF2b2lkcyBydW5uaW5n
IG91dCBvZiBtZW1vcnkgd2hlbiBiaW5hcnkgaW5wdXQgY29udGFpbnMgYSBsb25nCisgICBz
ZXF1ZW5jZSBvZiB6ZXJvcywgd2hpY2ggd291bGQgb3RoZXJ3aXNlIGJlIGNvbnNpZGVyZWQg
dG8gYmUgcGFydAorICAgb2YgYSBsb25nIGxpbmUuICBQW0xJTV0gc2hvdWxkIGJlIEVPTC4g
ICovCitzdGF0aWMgdm9pZAoremFwX251bHMgKGNoYXIgKnAsIGNoYXIgKmxpbSwgY2hhciBl
b2wpCit7CisgIGlmIChlb2wpCisgICAgd2hpbGUgKHRydWUpCisgICAgICB7CisgICAgICAg
ICpsaW0gPSAnXDAnOworICAgICAgICBwICs9IHN0cmxlbiAocCk7CisgICAgICAgICpsaW0g
PSBlb2w7CisgICAgICAgIGlmIChwID09IGxpbSkKKyAgICAgICAgICBicmVhazsKKyAgICAg
ICAgZG8KKyAgICAgICAgICAqcCsrID0gZW9sOworICAgICAgICB3aGlsZSAoISpwKTsKKyAg
ICAgIH0KK30KKwogLyogU2NhbiB0aGUgc3BlY2lmaWVkIHBvcnRpb24gb2YgdGhlIGJ1ZmZl
ciwgbWF0Y2hpbmcgbGluZXMgKG9yCiAgICBiZXR3ZWVuIG1hdGNoaW5nIGxpbmVzIGlmIE9V
VF9JTlZFUlQgaXMgdHJ1ZSkuICBSZXR1cm4gYSBjb3VudCBvZgotICAgbGluZXMgcHJpbnRl
ZC4gKi8KKyAgIGxpbmVzIHByaW50ZWQuICBSZXBsYWNlIGFsbCBOVUwgYnl0ZXMgd2l0aCBO
VUxfWkFQUEVSIGFzIHdlIGdvLiAgKi8KIHN0YXRpYyBpbnRtYXhfdAogZ3JlcGJ1ZiAoY2hh
ciBjb25zdCAqYmVnLCBjaGFyIGNvbnN0ICpsaW0pCiB7CkBAIC0xMTQ5LDYgKzExNzAsNyBA
QCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgY2hhciAqYmVnOwog
ICBjaGFyICpsaW07CiAgIGNoYXIgZW9sID0gZW9sYnl0ZTsKKyAgY2hhciBudWxfemFwcGVy
ID0gJ1wwJzsKICAgYm9vbCBkb25lX29uX21hdGNoXzAgPSBkb25lX29uX21hdGNoOwogICBi
b29sIG91dF9xdWlldF8wID0gb3V0X3F1aWV0OwogCkBAIC0xMTgyLDYgKzEyMDQsNyBAQCBn
cmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAgICAgICAgICBpZiAoYmlu
YXJ5X2ZpbGVzID09IFdJVEhPVVRfTUFUQ0hfQklOQVJZX0ZJTEVTKQogICAgICAgICAgICAg
cmV0dXJuIDA7CiAgICAgICAgICAgZG9uZV9vbl9tYXRjaCA9IG91dF9xdWlldCA9IHRydWU7
CisgICAgICAgICAgbnVsX3phcHBlciA9IGVvbDsKICAgICAgICAgfQogICAgIH0KIApAQCAt
MTE5Nyw2ICsxMjIwLDggQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3Qp
CiAgICAgICBpZiAoYmVnID09IGJ1ZmxpbSkKICAgICAgICAgYnJlYWs7CiAKKyAgICAgIHph
cF9udWxzIChiZWcsIGJ1ZmxpbSwgbnVsX3phcHBlcik7CisKICAgICAgIC8qIERldGVybWlu
ZSBuZXcgcmVzaWR1ZSAodGhlIGxlbmd0aCBvZiBhbiBpbmNvbXBsZXRlIGxpbmUgYXQgdGhl
IGVuZCBvZgogICAgICAgICAgdGhlIGJ1ZmZlciwgMCBtZWFucyB0aGVyZSBpcyBubyBpbmNv
bXBsZXRlIGxhc3QgbGluZSkuICAqLwogICAgICAgb2xkYyA9IGJlZ1stMV07CkBAIC0xMjY2
LDYgKzEyOTEsNyBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpzdCkKICAg
ICAgICAgICAgICAgICByZXR1cm4gMDsKICAgICAgICAgICAgICAgdGV4dGJpbiA9IHRiOwog
ICAgICAgICAgICAgICBkb25lX29uX21hdGNoID0gb3V0X3F1aWV0ID0gdHJ1ZTsKKyAgICAg
ICAgICAgICAgbnVsX3phcHBlciA9IGVvbDsKICAgICAgICAgICAgIH0KICAgICAgICAgfQog
ICAgIH0KZGlmZiAtLWdpdCBhL3Rlc3RzL251bGwtYnl0ZSBiL3Rlc3RzL251bGwtYnl0ZQpp
bmRleCBjOTY3ZGJjLi4xZDgwYmZlIDEwMDc1NQotLS0gYS90ZXN0cy9udWxsLWJ5dGUKKysr
IGIvdGVzdHMvbnVsbC1ieXRlCkBAIC0zOCw4ICszOCw4IEBAIGZvciBsZWZ0IGluICcnIGEg
JyMnICdcMCc7IGRvCiAgICAgICAgICAgcGF0PSIkaGF0JGZvcmNlX3JlZ2V4JGRhdGEkZG9s
bGFyIgogICAgICAgICAgIHByaW50ZiAiJHBhdFxcbiIgPnBhdCB8fCBmcmFtZXdvcmtfZmFp
bHVyZV8KICAgICAgICAgICBmb3IgbG9jYWxlIGluICRsb2NhbGVzOyBkbwotICAgICAgICAg
ICAgTENfQUxMPSRsb2NhbGUgZ3JlcCAtZiBwYXQgaW4gfHwKLSAgICAgICAgICAgICAgZmFp
bF8gIickcGF0JyBkb2VzIG5vdCBtYXRjaCAnJGRhdGEnIgorICAgICAgICAgICAgTENfQUxM
PSRsb2NhbGUgZ3JlcCAtZiBwYXQgaW4KKyAgICAgICAgICAgIHRlc3QgJD8gLWVxIDAgfHwg
dGVzdCAkPyAtZXEgMSB8fCBmYWlsXyAiJyRwYXQnIGNhdXNlZCBhbiBlcnJvciIKICAgICAg
ICAgICAgIExDX0FMTD0kbG9jYWxlIGdyZXAgLWEgLWYgcGF0IGluIHwgY21wIC1zIC0gaW4g
fHwKICAgICAgICAgICAgICAgZmFpbF8gIi1hICckcGF0JyBkb2VzIG5vdCBtYXRjaCAnJGRh
dGEnIgogICAgICAgICAgIGRvbmUKLS0gCjEuOS4zCgo=
--------------090004040002010007040709
Content-Type: text/plain; charset=UTF-8;
 name="0004-grep-minor-P-speedup-with-jit_stack.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="0004-grep-minor-P-speedup-with-jit_stack.patch"

RnJvbSBmMTYzMTM4NzNjYzg2ODZjMTVkOWI3MWY5YWZlYTdiOWI0YzRkMTRiIE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBNb24sIDE1IFNlcCAyMDE0IDE5OjU2OjU1IC0wNzAwClN1YmplY3Q6IFtQQVRD
SCA0LzZdIGdyZXA6IG1pbm9yIC1QIHNwZWVkdXAgd2l0aCBqaXRfc3RhY2sKCiogc3JjL3Bj
cmVzZWFyY2guYyAoaml0X3N0YWNrKTogTm8gbG9uZ2VyIHN0YXRpYy4KLS0tCiBzcmMvcGNy
ZXNlYXJjaC5jIHwgNiArKy0tLS0KIDEgZmlsZSBjaGFuZ2VkLCAyIGluc2VydGlvbnMoKyks
IDQgZGVsZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvc3JjL3BjcmVzZWFyY2guYyBiL3NyYy9w
Y3Jlc2VhcmNoLmMKaW5kZXggYzQxZjdlZi4uMWIxNWU1MyAxMDA2NDQKLS0tIGEvc3JjL3Bj
cmVzZWFyY2guYworKysgYi9zcmMvcGNyZXNlYXJjaC5jCkBAIC0zMyw5ICszMyw3IEBAIHN0
YXRpYyBwY3JlICpjcmU7CiAvKiBBZGRpdGlvbmFsIGluZm9ybWF0aW9uIGFib3V0IHRoZSBw
YXR0ZXJuLiAgKi8KIHN0YXRpYyBwY3JlX2V4dHJhICpleHRyYTsKIAotIyBpZmRlZiBQQ1JF
X1NUVURZX0pJVF9DT01QSUxFCi1zdGF0aWMgcGNyZV9qaXRfc3RhY2sgKmppdF9zdGFjazsK
LSMgZWxzZQorIyBpZm5kZWYgUENSRV9TVFVEWV9KSVRfQ09NUElMRQogIyAgZGVmaW5lIFBD
UkVfU1RVRFlfSklUX0NPTVBJTEUgMAogIyBlbmRpZgogI2VuZGlmCkBAIC0xMjYsNyArMTI0
LDcgQEAgUGNvbXBpbGUgKGNoYXIgY29uc3QgKnBhdHRlcm4sIHNpemVfdCBzaXplKQogICAg
ICAgLyogQSAzMksgc3RhY2sgaXMgYWxsb2NhdGVkIGZvciB0aGUgbWFjaGluZSBjb2RlIGJ5
IGRlZmF1bHQsIHdoaWNoCiAgICAgICAgICBjYW4gZ3JvdyB0byA1MTJLIGlmIG5lY2Vzc2Fy
eS4gU2luY2UgSklUIHVzZXMgZmFyIGxlc3MgbWVtb3J5CiAgICAgICAgICB0aGFuIHRoZSBp
bnRlcnByZXRlciwgdGhpcyBzaG91bGQgYmUgZW5vdWdoIGluIHByYWN0aWNlLiAgKi8KLSAg
ICAgIGppdF9zdGFjayA9IHBjcmVfaml0X3N0YWNrX2FsbG9jICgzMiAqIDEwMjQsIDUxMiAq
IDEwMjQpOworICAgICAgcGNyZV9qaXRfc3RhY2sgKmppdF9zdGFjayA9IHBjcmVfaml0X3N0
YWNrX2FsbG9jICgzMiAqIDEwMjQsIDUxMiAqIDEwMjQpOwogICAgICAgaWYgKCFqaXRfc3Rh
Y2spCiAgICAgICAgIGVycm9yIChFWElUX1RST1VCTEUsIDAsCiAgICAgICAgICAgICAgICBf
KCJmYWlsZWQgdG8gYWxsb2NhdGUgbWVtb3J5IGZvciB0aGUgUENSRSBKSVQgc3RhY2siKSk7
Ci0tIAoxLjkuMwoK
--------------090004040002010007040709
Content-Type: text/plain; charset=UTF-8;
 name="0005-grep-improve-P-performance-in-typical-cases.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="0005-grep-improve-P-performance-in-typical-cases.patch"

RnJvbSBlN2NhMjUyMjAyYTM0ODUwYjhiODVlODUwN2I0YjllZDc1ZWY4Y2RmIE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBUdWUsIDE2IFNlcCAyMDE0IDE1OjQ4OjQ0IC0wNzAwClN1YmplY3Q6IFtQQVRD
SCA1LzZdIGdyZXA6IGltcHJvdmUgLVAgcGVyZm9ybWFuY2UgaW4gdHlwaWNhbCBjYXNlcwoK
KiBzcmMvZ3JlcC5jLCBzcmMvZ3JlcC5oIChlbnVtIHRleHRiaW4pOiBNb3ZlIHRvIGdyZXAu
aC4KKGlucHV0X3RleHRiaW4sIHZhbGlkYXRlZF9ib3VuZGFyeSk6IE5ldyB2YXJzLgoqIHNy
Yy9ncmVwLmMgKGdyZXBidWYsIGdyZXApOiBJbml0aWFsaXplIHRoZW0uCiogc3JjL3BjcmVz
ZWFyY2guYyAoUGV4ZWN1dGUpOiBEbyBhIG11bHRpbGluZSBzZWFyY2gKd2hlbiB0aGUgaW5w
dXQgaXMga25vd24gdG8gYmUgZnJlZSBvZiBlbmNvZGluZyBlcnJvcnMuClF1aWNrbHkgZGlz
Y2FyZCBieXRlcyB0aGF0IGFyZSBvYnZpb3VzbHkgZW5jb2RpbmcgZXJyb3JzLgpRdWlja2x5
IG1hdGNoIGVtcHR5IHN0cmluZ3MuCi0tLQogc3JjL2dyZXAuYyAgICAgICB8ICAxOSArKy0t
LS0tLS0KIHNyYy9ncmVwLmggICAgICAgfCAgMjIgKysrKysrKysrKwogc3JjL3BjcmVzZWFy
Y2guYyB8IDEyMCArKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysrKysr
KysrKy0tLS0tLS0tCiAzIGZpbGVzIGNoYW5nZWQsIDEzMCBpbnNlcnRpb25zKCspLCAzMSBk
ZWxldGlvbnMoLSkKCmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5jIGIvc3JjL2dyZXAuYwppbmRl
eCA4MzU1OWUyLi5lM2M0OTI1IDEwMDY0NAotLS0gYS9zcmMvZ3JlcC5jCisrKyBiL3NyYy9n
cmVwLmMKQEAgLTM1MSw2ICszNTEsOCBAQCBib29sIG1hdGNoX2ljYXNlOwogYm9vbCBtYXRj
aF93b3JkczsKIGJvb2wgbWF0Y2hfbGluZXM7CiB1bnNpZ25lZCBjaGFyIGVvbGJ5dGU7Citl
bnVtIHRleHRiaW4gaW5wdXRfdGV4dGJpbjsKK2NoYXIgY29uc3QgKnZhbGlkYXRlZF9ib3Vu
ZGFyeTsKIAogc3RhdGljIGNoYXIgY29uc3QgKm1hdGNoZXI7CiAKQEAgLTQzNywyMSArNDM5
LDYgQEAgY2xlYW5fdXBfc3Rkb3V0ICh2b2lkKQogICAgIGNsb3NlX3N0ZG91dCAoKTsKIH0K
IAotLyogQW4gZW51bSB0ZXh0YmluIGRlc2NyaWJlcyB0aGUgZmlsZSdzIHR5cGUsIGluZmVy
cmVkIGZyb20gZGF0YSByZWFkCi0gICBiZWZvcmUgdGhlIGZpcnN0IGxpbmUgaXMgc2VsZWN0
ZWQgZm9yIG91dHB1dC4gICovCi1lbnVtIHRleHRiaW4KLSAgewotICAgIC8qIEJpbmFyeSwg
YXMgaXQgY29udGFpbnMgbnVsbCBieXRlcyBhbmQgdGhlIC16IG9wdGlvbiBpcyBub3QgaW4g
ZWZmZWN0LAotICAgICAgIG9yIGl0IGNvbnRhaW5zIGVuY29kaW5nIGVycm9ycy4gICovCi0g
ICAgVEVYVEJJTl9CSU5BUlkgPSAtMSwKLQotICAgIC8qIE5vdCBrbm93biB5ZXQuICBPbmx5
IHRleHQgaGFzIGJlZW4gc2VlbiBzbyBmYXIuICAqLwotICAgIFRFWFRCSU5fVU5LTk9XTiA9
IDAsCi0KLSAgICAvKiBUZXh0LiAgKi8KLSAgICBURVhUQklOX1RFWFQgPSAxCi0gIH07Ci0K
IHN0YXRpYyBib29sCiB0ZXh0YmluX2lzX2JpbmFyeSAoZW51bSB0ZXh0YmluIHRleHRiaW4p
CiB7CkBAIC0xMTIzLDYgKzExMTAsNyBAQCBncmVwYnVmIChjaGFyIGNvbnN0ICpiZWcsIGNo
YXIgY29uc3QgKmxpbSkKICAgaW50bWF4X3Qgb3V0bGVmdDAgPSBvdXRsZWZ0OwogICBjaGFy
IGNvbnN0ICpwOwogICBjaGFyIGNvbnN0ICplbmRwOworICB2YWxpZGF0ZWRfYm91bmRhcnkg
PSBiZWc7CiAKICAgZm9yIChwID0gYmVnOyBwIDwgbGltOyBwID0gZW5kcCkKICAgICB7CkBA
IC0xMjEwLDYgKzExOTgsNyBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBzdGF0IGNvbnN0ICpz
dCkKIAogICBmb3IgKDs7KQogICAgIHsKKyAgICAgIGlucHV0X3RleHRiaW4gPSB0ZXh0Ymlu
OwogICAgICAgbGFzdG5sID0gYnVmYmVnOwogICAgICAgaWYgKGxhc3RvdXQpCiAgICAgICAg
IGxhc3RvdXQgPSBidWZiZWc7CmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5oIGIvc3JjL2dyZXAu
aAppbmRleCA1NDk2ZWIyLi4yM2Q0ZTk1IDEwMDY0NAotLS0gYS9zcmMvZ3JlcC5oCisrKyBi
L3NyYy9ncmVwLmgKQEAgLTI5LDQgKzI5LDI2IEBAIGV4dGVybiBib29sIG1hdGNoX3dvcmRz
OwkvKiAtdyAqLwogZXh0ZXJuIGJvb2wgbWF0Y2hfbGluZXM7CS8qIC14ICovCiBleHRlcm4g
dW5zaWduZWQgY2hhciBlb2xieXRlOwkvKiAteiAqLwogCisvKiBBbiBlbnVtIHRleHRiaW4g
ZGVzY3JpYmVzIHRoZSBmaWxlJ3MgdHlwZSwgaW5mZXJyZWQgZnJvbSBkYXRhIHJlYWQKKyAg
IGJlZm9yZSB0aGUgZmlyc3QgbGluZSBpcyBzZWxlY3RlZCBmb3Igb3V0cHV0LiAgKi8KK2Vu
dW0gdGV4dGJpbgorICB7CisgICAgLyogQmluYXJ5LCBhcyBpdCBjb250YWlucyBudWxsIGJ5
dGVzIGFuZCB0aGUgLXogb3B0aW9uIGlzIG5vdCBpbiBlZmZlY3QsCisgICAgICAgb3IgaXQg
Y29udGFpbnMgZW5jb2RpbmcgZXJyb3JzLiAgKi8KKyAgICBURVhUQklOX0JJTkFSWSA9IC0x
LAorCisgICAgLyogTm90IGtub3duIHlldC4gIE9ubHkgdGV4dCBoYXMgYmVlbiBzZWVuIHNv
IGZhci4gICovCisgICAgVEVYVEJJTl9VTktOT1dOID0gMCwKKworICAgIC8qIFRleHQuICAq
LworICAgIFRFWFRCSU5fVEVYVCA9IDEKKyAgfTsKKworLyogSW5wdXQgZmlsZSB0eXBlLiAg
Ki8KK2V4dGVybiBlbnVtIHRleHRiaW4gaW5wdXRfdGV4dGJpbjsKKworLyogVmFsaWRhdGlv
biBib3VuZGFyeS4gIEVhcmxpZXIgYnl0ZXMgaGF2ZSBhbHJlYWR5IGJlZW4gdmFsaWRhdGVk
IGJ5CisgICB0aGUgUENSRSBtYXRjaGVyLCB3aGljaCBjYXJlcyBhYm91dCB0aGlzIHNvcnQg
b2YgdGhpbmcuICAqLworZXh0ZXJuIGNoYXIgY29uc3QgKnZhbGlkYXRlZF9ib3VuZGFyeTsK
KwogI2VuZGlmCmRpZmYgLS1naXQgYS9zcmMvcGNyZXNlYXJjaC5jIGIvc3JjL3BjcmVzZWFy
Y2guYwppbmRleCAxYjE1ZTUzLi42ZjAxNmI2IDEwMDY0NAotLS0gYS9zcmMvcGNyZXNlYXJj
aC5jCisrKyBiL3NyYy9wY3Jlc2VhcmNoLmMKQEAgLTE1NiwyOCArMTU2LDkxIEBAIFBleGVj
dXRlIChjaGFyIGNvbnN0ICpidWYsIHNpemVfdCBzaXplLCBzaXplX3QgKm1hdGNoX3NpemUs
CiAgIGNoYXIgY29uc3QgKmxpbmVfc3RhcnQgPSBidWY7CiAgIGludCBlID0gUENSRV9FUlJP
Ul9OT01BVENIOwogICBjaGFyIGNvbnN0ICpsaW5lX2VuZDsKKyAgY2hhciBjb25zdCAqdmFs
aWRhdGVkID0gdmFsaWRhdGVkX2JvdW5kYXJ5OworCisgIC8qIElmIHRoZSBpbnB1dCB0eXBl
IGlzIHVua25vd24sIHRoZSBjYWxsZXIgaXMgc3RpbGwgdGVzdGluZyB0aGUKKyAgICAgaW5w
dXQsIHdoaWNoIG1lYW5zIHRoZSBjdXJyZW50IGJ1ZmZlciBjYW5ub3QgY29udGFpbiBlbmNv
ZGluZworICAgICBlcnJvcnMgYW5kIGEgbXVsdGlsaW5lIHNlYXJjaCBpcyB0eXBpY2FsbHkg
bW9yZSBlZmZpY2llbnQuCisgICAgIE90aGVyd2lzZSwgYSBzaW5nbGUtbGluZSBzZWFyY2gg
aXMgdHlwaWNhbGx5IGZhc3Rlciwgc28gdGhhdAorICAgICBwY3JlX2V4ZWMgZG9lc24ndCB3
YXN0ZSB0aW1lIHZhbGlkYXRpbmcgdGhlIGVudGlyZSBpbnB1dAorICAgICBidWZmZXIuICAq
LworICBib29sIG11bHRpbGluZSA9IGlucHV0X3RleHRiaW4gPT0gVEVYVEJJTl9VTktOT1dO
OwogCi0gIC8qIHBjcmVfZXhlYyBtaXNoYW5kbGVzIG1hdGNoZXMgdGhhdCBjcm9zcyBsaW5l
IGJvdW5kYXJpZXMuCi0gICAgIFBDUkVfTVVMVElMSU5FIGlzbid0IGEgd2luLCBwYXJ0bHkg
YmVjYXVzZSBpdCdzIGluY29tcGF0aWJsZSB3aXRoCi0gICAgIC16LCBhbmQgcGFydGx5IGJl
Y2F1c2UgaXQgY2hlY2tzIHRoZSBlbnRpcmUgaW5wdXQgYnVmZmVyIGFuZCBpcwotICAgICB0
aGVyZWZvcmUgc2xvdyBvbiBhIGxhcmdlIGJ1ZmZlciBjb250YWluaW5nIG1hbnkgbWF0Y2hl
cy4KLSAgICAgQXZvaWQgdGhlc2UgcHJvYmxlbXMgYnkgbWF0Y2hpbmcgbGluZS1ieS1saW5l
LiAgKi8KICAgZm9yICg7IHAgPCBidWYgKyBzaXplOyBwID0gbGluZV9zdGFydCA9IGxpbmVf
ZW5kICsgMSkKICAgICB7Ci0gICAgICBsaW5lX2VuZCA9IG1lbWNociAocCwgZW9sYnl0ZSwg
YnVmICsgc2l6ZSAtIHApOworICAgICAgYm9vbCB0b29fYmlnOwogCi0gICAgICBpZiAoSU5U
X01BWCA8IGxpbmVfZW5kIC0gcCkKKyAgICAgIGlmIChtdWx0aWxpbmUpCisgICAgICAgIHsK
KyAgICAgICAgICBzaXplX3QgcGNyZV9zaXplX21heCA9IE1JTiAoSU5UX01BWCwgU0laRV9N
QVggLSAxKTsKKyAgICAgICAgICBzaXplX3Qgc2Nhbl9zaXplID0gTUlOIChwY3JlX3NpemVf
bWF4ICsgMSwgYnVmICsgc2l6ZSAtIHApOworICAgICAgICAgIGxpbmVfZW5kID0gbWVtcmNo
ciAocCwgZW9sYnl0ZSwgc2Nhbl9zaXplKTsKKyAgICAgICAgICB0b29fYmlnID0gISBsaW5l
X2VuZDsKKyAgICAgICAgfQorICAgICAgZWxzZQorICAgICAgICB7CisgICAgICAgICAgbGlu
ZV9lbmQgPSBtZW1jaHIgKHAsIGVvbGJ5dGUsIGJ1ZiArIHNpemUgLSBwKTsKKyAgICAgICAg
ICB0b29fYmlnID0gSU5UX01BWCA8IGxpbmVfZW5kIC0gcDsKKyAgICAgICAgfQorCisgICAg
ICBpZiAodG9vX2JpZykKICAgICAgICAgZXJyb3IgKEVYSVRfVFJPVUJMRSwgMCwgXygiZXhj
ZWVkZWQgUENSRSdzIGxpbmUgbGVuZ3RoIGxpbWl0IikpOwogCi0gICAgICAvKiBUcmVhdCBl
bmNvZGluZy1lcnJvciBieXRlcyBhcyBkYXRhIHRoYXQgY2Fubm90IG1hdGNoLiAgKi8KICAg
ICAgIGZvciAoOzspCiAgICAgICAgIHsKLSAgICAgICAgICBpbnQgb3B0aW9ucyA9IGJvbCA/
IDAgOiBQQ1JFX05PVEJPTDsKLSAgICAgICAgICBpbnQgdmFsaWRfYnl0ZXM7Ci0gICAgICAg
ICAgZSA9IHBjcmVfZXhlYyAoY3JlLCBleHRyYSwgcCwgbGluZV9lbmQgLSBwLCAwLCBvcHRp
b25zLCBzdWIsIE5TVUIpOwotICAgICAgICAgIGlmIChlICE9IFBDUkVfRVJST1JfQkFEVVRG
OCkKLSAgICAgICAgICAgIGJyZWFrOwotICAgICAgICAgIHZhbGlkX2J5dGVzID0gc3ViWzBd
OworICAgICAgICAgIC8qIFNraXAgcGFzdCBieXRlcyB0aGF0IGFyZSBlYXNpbHkgZGV0ZXJt
aW5lZCB0byBiZSBlbmNvZGluZworICAgICAgICAgICAgIGVycm9ycywgdHJlYXRpbmcgdGhl
bSBhcyBkYXRhIHRoYXQgY2Fubm90IG1hdGNoLiAgVGhpcyBpcworICAgICAgICAgICAgIGZh
c3RlciB0aGFuIGhhdmluZyBwY3JlX2V4ZWMgY2hlY2sgdGhlbS4gICovCisgICAgICAgICAg
d2hpbGUgKG1iY2xlbl9jYWNoZVt0b191Y2hhciAoKnApXSA9PSAoc2l6ZV90KSAtMSkKKyAg
ICAgICAgICAgIHsKKyAgICAgICAgICAgICAgcCsrOworICAgICAgICAgICAgICBib2wgPSBm
YWxzZTsKKyAgICAgICAgICAgIH0KKworICAgICAgICAgIC8qIENoZWNrIGZvciBhbiBlbXB0
eSBtYXRjaDsgdGhpcyBpcyBmYXN0ZXIgdGhhbiBsZXR0aW5nCisgICAgICAgICAgICAgcGNy
ZV9leGVjIGRvIGl0LiAgKi8KKyAgICAgICAgICBpbnQgc2VhcmNoX2J5dGVzID0gbGluZV9l
bmQgLSBwOworICAgICAgICAgIGlmIChzZWFyY2hfYnl0ZXMgPT0gMCkKKyAgICAgICAgICAg
IHsKKyAgICAgICAgICAgICAgc3ViWzBdID0gc3ViWzFdID0gMDsKKyAgICAgICAgICAgICAg
ZSA9IGVtcHR5X21hdGNoW2JvbF07CisgICAgICAgICAgICAgIGJyZWFrOworICAgICAgICAg
ICAgfQorCisgICAgICAgICAgaW50IG9wdGlvbnMgPSAwOworICAgICAgICAgIGlmICghYm9s
KQorICAgICAgICAgICAgb3B0aW9ucyB8PSBQQ1JFX05PVEJPTDsKKyAgICAgICAgICBpZiAo
bXVsdGlsaW5lIHx8IHAgKyBzZWFyY2hfYnl0ZXMgPD0gdmFsaWRhdGVkKQorICAgICAgICAg
ICAgb3B0aW9ucyB8PSBQQ1JFX05PX1VURjhfQ0hFQ0s7CisKKyAgICAgICAgICBpbnQgdmFs
aWRfYnl0ZXMgPSB2YWxpZGF0ZWQgLSBwOworICAgICAgICAgIGlmICh2YWxpZF9ieXRlcyA8
IDApCisgICAgICAgICAgICB7CisgICAgICAgICAgICAgIGUgPSBwY3JlX2V4ZWMgKGNyZSwg
ZXh0cmEsIHAsIHNlYXJjaF9ieXRlcywgMCwKKyAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgb3B0aW9ucywgc3ViLCBOU1VCKTsKKyAgICAgICAgICAgICAgaWYgKGUgIT0gUENSRV9F
UlJPUl9CQURVVEY4KQorICAgICAgICAgICAgICAgIHsKKyAgICAgICAgICAgICAgICAgIHZh
bGlkYXRlZCA9IHAgKyBzZWFyY2hfYnl0ZXM7CisgICAgICAgICAgICAgICAgICBpZiAoMCA8
IGUgJiYgbXVsdGlsaW5lICYmIHN1YlsxXSAtIHN1YlswXSAhPSAwKQorICAgICAgICAgICAg
ICAgICAgICB7CisgICAgICAgICAgICAgICAgICAgICAgY2hhciBjb25zdCAqbmwgPSBtZW1j
aHIgKHAgKyBzdWJbMF0sIGVvbGJ5dGUsCisgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgIHN1YlsxXSAtIHN1YlswXSk7CisgICAgICAgICAgICAgICAg
ICAgICAgaWYgKG5sKQorICAgICAgICAgICAgICAgICAgICAgICAgeworICAgICAgICAgICAg
ICAgICAgICAgICAgICAvKiBUaGlzIG1hdGNoIGNyb3NzZXMgYSBsaW5lIGJvdW5kYXJ5OyBy
ZWplY3QgaXQuICAqLworICAgICAgICAgICAgICAgICAgICAgICAgICBwICs9IHN1YlswXTsK
KyAgICAgICAgICAgICAgICAgICAgICAgICAgbGluZV9lbmQgPSBubDsKKyAgICAgICAgICAg
ICAgICAgICAgICAgICAgY29udGludWU7CisgICAgICAgICAgICAgICAgICAgICAgICB9Cisg
ICAgICAgICAgICAgICAgICAgIH0KKyAgICAgICAgICAgICAgICAgIGJyZWFrOworICAgICAg
ICAgICAgICAgIH0KKyAgICAgICAgICAgICAgdmFsaWRfYnl0ZXMgPSBzdWJbMF07CisgICAg
ICAgICAgICAgIHZhbGlkYXRlZCA9IHAgKyB2YWxpZF9ieXRlczsKKyAgICAgICAgICAgIH0K
KworICAgICAgICAgIC8qIFRyeSB0byBtYXRjaCB0aGUgc3RyaW5nIGJlZm9yZSB0aGUgZW5j
b2RpbmcgZXJyb3IuCisgICAgICAgICAgICAgQWdhaW4sIGhhbmRsZSB0aGUgZW1wdHktbWF0
Y2ggY2FzZSBzcGVjaWFsbHksIGZvciBzcGVlZC4gICovCiAgICAgICAgICAgaWYgKHZhbGlk
X2J5dGVzID09IDApCiAgICAgICAgICAgICB7CiAgICAgICAgICAgICAgIHN1YlsxXSA9IDA7
CkBAIC0xODksNiArMjUyLDggQEAgUGV4ZWN1dGUgKGNoYXIgY29uc3QgKmJ1Ziwgc2l6ZV90
IHNpemUsIHNpemVfdCAqbWF0Y2hfc2l6ZSwKICAgICAgICAgICAgICAgICAgICAgICAgICAg
IHN1YiwgTlNVQik7CiAgICAgICAgICAgaWYgKGUgIT0gUENSRV9FUlJPUl9OT01BVENIKQog
ICAgICAgICAgICAgYnJlYWs7CisKKyAgICAgICAgICAvKiBUcmVhdCB0aGUgZW5jb2Rpbmcg
ZXJyb3IgYXMgZGF0YSB0aGF0IGNhbm5vdCBtYXRjaC4gICovCiAgICAgICAgICAgcCArPSB2
YWxpZF9ieXRlcyArIDE7CiAgICAgICAgICAgYm9sID0gZmFsc2U7CiAgICAgICAgIH0KQEAg
LTE5OCw2ICsyNjMsOCBAQCBQZXhlY3V0ZSAoY2hhciBjb25zdCAqYnVmLCBzaXplX3Qgc2l6
ZSwgc2l6ZV90ICptYXRjaF9zaXplLAogICAgICAgYm9sID0gdHJ1ZTsKICAgICB9CiAKKyAg
dmFsaWRhdGVkX2JvdW5kYXJ5ID0gdmFsaWRhdGVkOworCiAgIGlmIChlIDw9IDApCiAgICAg
ewogICAgICAgc3dpdGNoIChlKQpAQCAtMjI0LDggKzI5MSwyOSBAQCBQZXhlY3V0ZSAoY2hh
ciBjb25zdCAqYnVmLCBzaXplX3Qgc2l6ZSwgc2l6ZV90ICptYXRjaF9zaXplLAogICAgIH0K
ICAgZWxzZQogICAgIHsKLSAgICAgIGNoYXIgY29uc3QgKmJlZyA9IHN0YXJ0X3B0ciA/IHAg
KyBzdWJbMF0gOiBsaW5lX3N0YXJ0OwotICAgICAgY2hhciBjb25zdCAqZW5kID0gc3RhcnRf
cHRyID8gcCArIHN1YlsxXSA6IGxpbmVfZW5kICsgMTsKKyAgICAgIGNoYXIgY29uc3QgKm1h
dGNoYmVnID0gcCArIHN1YlswXTsKKyAgICAgIGNoYXIgY29uc3QgKm1hdGNoZW5kID0gcCAr
IHN1YlsxXTsKKyAgICAgIGNoYXIgY29uc3QgKmJlZzsKKyAgICAgIGNoYXIgY29uc3QgKmVu
ZDsKKyAgICAgIGlmIChzdGFydF9wdHIpCisgICAgICAgIHsKKyAgICAgICAgICBiZWcgPSBt
YXRjaGJlZzsKKyAgICAgICAgICBlbmQgPSBtYXRjaGVuZDsKKyAgICAgICAgfQorICAgICAg
ZWxzZSBpZiAobXVsdGlsaW5lKQorICAgICAgICB7CisgICAgICAgICAgY2hhciBjb25zdCAq
cHJldl9ubCA9IG1lbXJjaHIgKGxpbmVfc3RhcnQgLSAxLCBlb2xieXRlLAorICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBtYXRjaGJlZyAtIChsaW5lX3N0YXJ0
IC0gMSkpOworICAgICAgICAgIGNoYXIgY29uc3QgKm5leHRfbmwgPSBtZW1jaHIgKG1hdGNo
ZW5kLCBlb2xieXRlLAorICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
IGxpbmVfZW5kICsgMSAtIG1hdGNoZW5kKTsKKyAgICAgICAgICBiZWcgPSBwcmV2X25sICsg
MTsKKyAgICAgICAgICBlbmQgPSBuZXh0X25sICsgMTsKKyAgICAgICAgfQorICAgICAgZWxz
ZQorICAgICAgICB7CisgICAgICAgICAgYmVnID0gbGluZV9zdGFydDsKKyAgICAgICAgICBl
bmQgPSBsaW5lX2VuZCArIDE7CisgICAgICAgIH0KICAgICAgICptYXRjaF9zaXplID0gZW5k
IC0gYmVnOwogICAgICAgcmV0dXJuIGJlZyAtIGJ1ZjsKICAgICB9Ci0tIAoxLjkuMwoK
--------------090004040002010007040709
Content-Type: text/plain; charset=UTF-8;
 name="0006-grep-skip-past-holes-efficiently.patch"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="0006-grep-skip-past-holes-efficiently.patch"

RnJvbSAxZjFhNzRlMTQyMzk1NGE2ODdiZmY0Yjc5NDUxMDE0NDVkZDU3MDYxIE1vbiBTZXAg
MTcgMDA6MDA6MDAgMjAwMQpGcm9tOiBQYXVsIEVnZ2VydCA8ZWdnZXJ0QGNzLnVjbGEuZWR1
PgpEYXRlOiBUdWUsIDE2IFNlcCAyMDE0IDE4OjE5OjUzIC0wNzAwClN1YmplY3Q6IFtQQVRD
SCA2LzZdIGdyZXA6IHNraXAgcGFzdCBob2xlcyBlZmZpY2llbnRseQoKVGFrZSBhZHZhbnRh
Z2Ugb2YgdGhlIHJlbGF4ZWQgcnVsZXMgZm9yIHRyZWF0aW5nIG5vbi10ZXh0IGJ5dGVzIGlu
CmJpbmFyeSBkYXRhLCBieSBlZmZpY2llbnRseSBza2lwcGluZyBwYXN0IGhvbGVzIG9uIHBs
YXRmb3JtcwpzdXBwb3J0aW5nIGxzZWVrJ3MgU0VFS19EQVRBIGZsYWcuCk9uIG9uZSB0ZXN0
IG9uIGEgY2lyY2EtMjAwOCBTdW4gRmlyZSBWNDB6IHJ1bm5pbmcgU29sYXJpcyAxMS4yLAon
Z3JlcCB4JyB0b29rIDAuMDA5IHJlYWwtdGltZSBzZWNvbmRzIHRvIHNjYW4gYSBob2xleSBm
aWxlIG9mIHNpemUKOSwyMjMsMzcyLDAzNiw4NTQsNzc1LDgwMiBieXRlcywgZm9yIGEgbm9t
aW5hbCBzY2FuIHJhdGUgb2YgMSBaQi9zLgpncmVwIDIuMjAncyBzY2FuIHJhdGUgb24gdGhp
cyBwbGF0Zm9ybSB3YXMgODQzIE1CL3MsIHNvIHRoaXMgaXMgYQpzcGVlZHVwIGJ5IGEgZmFj
dG9yIG9mIDEuMiB0cmlsbGlvbi4gIFRoZSBzcGVlZHVwIGZhY3RvciBpcyBub3QKYXMgZ3Jl
YXQgb24gR05VL0xpbnV4IGhvc3RzLCBkdWUgdG8gd2hhdCBhcHBlYXIgdG8gYmUgU0VFS19E
QVRBCmluZWZmaWNpZW5jaWVzLCBidXQgcHJlc3VtYWJseSB0aGlzIHdpbGwgYmUgY2xlYXJl
ZCB1cCBpbiB0aW1lLgoqIE5FV1M6IERvY3VtZW50IHRoaXMuCiogc3JjL2dyZXAuYywgc3Jj
L2dyZXAuaCAoZW9sYnl0ZSk6IE5vdyBjaGFyLCBub3QgdW5zaWduZWQgY2hhci4KVGhpcyBp
cyBmb3IgY29tcGF0aWJpbGl0eSB3aXRoIHRoZSByZXN0IG9mIHRoZSBjb2RlLgpUaGUgb2xk
IChwZXJmb3JtYW5jZT8pIHJlYXNvbnMgZm9yICd1bnNpZ25lZCBjaGFyJyBhcmUgbm90IG1v
b3QuCiogc3JjL2dyZXAuYyAoc2tpcF9udWxzLCBza2lwX2VtcHR5X2xpbmVzLCBzZWVrX2Rh
dGFfZmFpbGVkKToKTmV3IHN0YXRpYyB2YXJzLgoodG90YWxubCk6IE1vdmUgdXAsIHNpbmNl
IGl0J3MgYWJvdXQgaW5wdXQsIG5vdCBvdXRwdXQsIGFuZApmaWxsYnVmIG5vdyB1c2VzIGl0
LgooYWRkX2NvdW50KTogTW92ZSB1cCwgc2luY2UgZmlsbGJ1ZiBub3cgdXNlcyBpdC4KKGFs
bF96ZXJvcyk6IE5ldyBmdW5jdGlvbi4KKGZpbGxidWYpOiBVc2UgU0VFS19EQVRBIHRvIHNr
aXAgcGFzdCBob2xlcyBlZmZpY2llbnRseSwKb24gc3lzdGVtcyB0aGF0IHN1cHBvcnQgdGhp
cy4KKGdyZXAsIG1haW4pOiBTZXQgdGhlIG5ldyBzdGF0aWMgdmFycy4KLS0tCiBORVdTICAg
ICAgIHwgIDMgKysrCiBzcmMvZ3JlcC5jIHwgNzYgKysrKysrKysrKysrKysrKysrKysrKysr
KysrKysrKysrKysrKysrKysrKysrKystLS0tLS0tLS0tLS0tLS0KIHNyYy9ncmVwLmggfCAg
MiArLQogMyBmaWxlcyBjaGFuZ2VkLCA2MiBpbnNlcnRpb25zKCspLCAxOSBkZWxldGlvbnMo
LSkKCmRpZmYgLS1naXQgYS9ORVdTIGIvTkVXUwppbmRleCA3MzMzMThkLi40ZTAxOTVjIDEw
MDY0NAotLS0gYS9ORVdTCisrKyBiL05FV1MKQEAgLTQsNiArNCw5IEBAIEdOVSBncmVwIE5F
V1MgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAtKi0gb3V0bGluZSAtKi0K
IAogKiogSW1wcm92ZW1lbnRzCiAKKyAgUGVyZm9ybWFuY2UgaGFzIGJlZW4gZ3JlYXRseSBp
bXByb3ZlZCBmb3Igc2VhcmNoaW5nIGZpbGVzIGNvbnRhaW5pbmcKKyAgaG9sZXMsIG9uIHBs
YXRmb3JtcyB3aGVyZSBsc2VlaydzIFNFRUtfSE9MRSBmbGFnIHdvcmtzIGVmZmljaWVudGx5
LgorCiAgIFBlcmZvcm1hbmNlIGhhcyBpbXByb3ZlZCBmb3IgdmVyeSBsb25nIHN0cmluZ3Mg
aW4gcGF0dGVybnMuCiAKICAgSWYgYSBmaWxlIGNvbnRhaW5zIGRhdGEgaW1wcm9wZXJseSBl
bmNvZGVkIGZvciB0aGUgY3VycmVudCBsb2NhbGUsCmRpZmYgLS1naXQgYS9zcmMvZ3JlcC5j
IGIvc3JjL2dyZXAuYwppbmRleCBlM2M0OTI1Li4zZTk0ODA0IDEwMDY0NAotLS0gYS9zcmMv
Z3JlcC5jCisrKyBiL3NyYy9ncmVwLmMKQEAgLTM1MCw3ICszNTAsNyBAQCBzdGF0aWMgc3Ry
dWN0IG9wdGlvbiBjb25zdCBsb25nX29wdGlvbnNbXSA9CiBib29sIG1hdGNoX2ljYXNlOwog
Ym9vbCBtYXRjaF93b3JkczsKIGJvb2wgbWF0Y2hfbGluZXM7Ci11bnNpZ25lZCBjaGFyIGVv
bGJ5dGU7CitjaGFyIGVvbGJ5dGU7CiBlbnVtIHRleHRiaW4gaW5wdXRfdGV4dGJpbjsKIGNo
YXIgY29uc3QgKnZhbGlkYXRlZF9ib3VuZGFyeTsKIApAQCAtNTYzLDYgKzU2MywxMCBAQCBz
dGF0aWMgb2ZmX3QgYnVmb2Zmc2V0OwkJLyogUmVhZCBvZmZzZXQ7IGRlZmluZWQgb24gcmVn
dWxhciBmaWxlcy4gICovCiBzdGF0aWMgb2ZmX3QgYWZ0ZXJfbGFzdF9tYXRjaDsJLyogUG9p
bnRlciBhZnRlciBsYXN0IG1hdGNoaW5nIGxpbmUgdGhhdAogICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICB3b3VsZCBoYXZlIGJlZW4gb3V0cHV0IGlmIHdlIHdlcmUKICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgb3V0cHV0dGluZyBjaGFyYWN0ZXJz
LiAqLworc3RhdGljIGJvb2wgc2tpcF9udWxzOwkJLyogU2tpcCAnXDAnIGluIGRhdGEuICAq
Lworc3RhdGljIGJvb2wgc2tpcF9lbXB0eV9saW5lczsJLyogU2tpcCBlbXB0eSBsaW5lcyBp
biBkYXRhLiAgKi8KK3N0YXRpYyBib29sIHNlZWtfZGF0YV9mYWlsZWQ7CS8qIGxzZWVrIHdp
dGggU0VFS19EQVRBIGZhaWxlZC4gICovCitzdGF0aWMgdWludG1heF90IHRvdGFsbmw7CS8q
IFRvdGFsIG5ld2xpbmUgY291bnQgYmVmb3JlIGxhc3RubC4gKi8KIAogLyogUmV0dXJuIFZB
TCBhbGlnbmVkIHRvIHRoZSBuZXh0IG11bHRpcGxlIG9mIEFMSUdOTUVOVC4gIFZBTCBjYW4g
YmUKICAgIGFuIGludGVnZXIgb3IgYSBwb2ludGVyLiAgQm90aCBhcmdzIG11c3QgYmUgZnJl
ZSBvZiBzaWRlIGVmZmVjdHMuICAqLwpAQCAtNTcxLDYgKzU3NSwyNyBAQCBzdGF0aWMgb2Zm
X3QgYWZ0ZXJfbGFzdF9tYXRjaDsJLyogUG9pbnRlciBhZnRlciBsYXN0IG1hdGNoaW5nIGxp
bmUgdGhhdAogICAgPyAodmFsKSBcCiAgICA6ICh2YWwpICsgKChhbGlnbm1lbnQpIC0gKHNp
emVfdCkgKHZhbCkgJSAoYWxpZ25tZW50KSkpCiAKKy8qIEFkZCB0d28gbnVtYmVycyB0aGF0
IGNvdW50IGlucHV0IGJ5dGVzIG9yIGxpbmVzLCBhbmQgcmVwb3J0IGFuCisgICBlcnJvciBp
ZiB0aGUgYWRkaXRpb24gb3ZlcmZsb3dzLiAgKi8KK3N0YXRpYyB1aW50bWF4X3QKK2FkZF9j
b3VudCAodWludG1heF90IGEsIHVpbnRtYXhfdCBiKQoreworICB1aW50bWF4X3Qgc3VtID0g
YSArIGI7CisgIGlmIChzdW0gPCBhKQorICAgIGVycm9yIChFWElUX1RST1VCTEUsIDAsIF8o
ImlucHV0IGlzIHRvbyBsYXJnZSB0byBjb3VudCIpKTsKKyAgcmV0dXJuIHN1bTsKK30KKwor
LyogUmV0dXJuIHRydWUgaWYgQlVGIChvZiBzaXplIFNJWkUpIGlzIGFsbCB6ZXJvcy4gICov
CitzdGF0aWMgYm9vbAorYWxsX3plcm9zIChjaGFyIGNvbnN0ICpidWYsIHNpemVfdCBzaXpl
KQoreworICBmb3IgKGNoYXIgY29uc3QgKnAgPSBidWY7IHAgPCBidWYgKyBzaXplOyBwKysp
CisgICAgaWYgKCpwKQorICAgICAgcmV0dXJuIGZhbHNlOworICByZXR1cm4gdHJ1ZTsKK30K
KwogLyogUmVzZXQgdGhlIGJ1ZmZlciBmb3IgYSBuZXcgZmlsZSwgcmV0dXJuaW5nIGZhbHNl
IGlmIHdlIHNob3VsZCBza2lwIGl0LgogICAgSW5pdGlhbGl6ZSBvbiB0aGUgZmlyc3QgdGlt
ZSB0aHJvdWdoLiAqLwogc3RhdGljIGJvb2wKQEAgLTY3NCwxMyArNjk5LDMzIEBAIGZpbGxi
dWYgKHNpemVfdCBzYXZlLCBzdHJ1Y3Qgc3RhdCBjb25zdCAqc3QpCiAgIHJlYWRzaXplID0g
YnVmZmVyICsgYnVmYWxsb2MgLSByZWFkYnVmOwogICByZWFkc2l6ZSAtPSByZWFkc2l6ZSAl
IHBhZ2VzaXplOwogCi0gIGZpbGxzaXplID0gc2FmZV9yZWFkIChidWZkZXNjLCByZWFkYnVm
LCByZWFkc2l6ZSk7Ci0gIGlmIChmaWxsc2l6ZSA9PSBTQUZFX1JFQURfRVJST1IpCisgIHdo
aWxlICh0cnVlKQogICAgIHsKLSAgICAgIGZpbGxzaXplID0gMDsKLSAgICAgIGNjID0gZmFs
c2U7CisgICAgICBmaWxsc2l6ZSA9IHNhZmVfcmVhZCAoYnVmZGVzYywgcmVhZGJ1ZiwgcmVh
ZHNpemUpOworICAgICAgaWYgKGZpbGxzaXplID09IFNBRkVfUkVBRF9FUlJPUikKKyAgICAg
ICAgeworICAgICAgICAgIGZpbGxzaXplID0gMDsKKyAgICAgICAgICBjYyA9IGZhbHNlOwor
ICAgICAgICB9CisgICAgICBidWZvZmZzZXQgKz0gZmlsbHNpemU7CisKKyAgICAgIGlmIChm
aWxsc2l6ZSA9PSAwIHx8ICFza2lwX251bHMgfHwgIWFsbF96ZXJvcyAocmVhZGJ1ZiwgZmls
bHNpemUpKQorICAgICAgICBicmVhazsKKyAgICAgIHRvdGFsbmwgPSBhZGRfY291bnQgKHRv
dGFsbmwsIGZpbGxzaXplKTsKKworICAgICAgaWYgKCFzZWVrX2RhdGFfZmFpbGVkKQorICAg
ICAgICB7CisgICAgICAgICAgb2ZmX3QgZGF0YV9zdGFydCA9IGxzZWVrIChidWZkZXNjLCBi
dWZvZmZzZXQsIFNFRUtfREFUQSk7CisgICAgICAgICAgaWYgKGRhdGFfc3RhcnQgPCAwKQor
ICAgICAgICAgICAgc2Vla19kYXRhX2ZhaWxlZCA9IHRydWU7CisgICAgICAgICAgZWxzZQor
ICAgICAgICAgICAgeworICAgICAgICAgICAgICB0b3RhbG5sID0gYWRkX2NvdW50ICh0b3Rh
bG5sLCBkYXRhX3N0YXJ0IC0gYnVmb2Zmc2V0KTsKKyAgICAgICAgICAgICAgYnVmb2Zmc2V0
ID0gZGF0YV9zdGFydDsKKyAgICAgICAgICAgIH0KKyAgICAgICAgfQogICAgIH0KLSAgYnVm
b2Zmc2V0ICs9IGZpbGxzaXplOworCiAgIGZpbGxzaXplID0gdW5kb3NzaWZ5X2lucHV0IChy
ZWFkYnVmLCBmaWxsc2l6ZSk7CiAgIGJ1ZmxpbSA9IHJlYWRidWYgKyBmaWxsc2l6ZTsKICAg
cmV0dXJuIGNjOwpAQCAtNzE3LDcgKzc2Miw2IEBAIHN0YXRpYyBjaGFyIGNvbnN0ICpsYXN0
bmw7CS8qIFBvaW50ZXIgYWZ0ZXIgbGFzdCBuZXdsaW5lIGNvdW50ZWQuICovCiBzdGF0aWMg
Y2hhciBjb25zdCAqbGFzdG91dDsJLyogUG9pbnRlciBhZnRlciBsYXN0IGNoYXJhY3RlciBv
dXRwdXQ7CiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIE5VTEwgaWYgbm8g
Y2hhcmFjdGVyIGhhcyBiZWVuIG91dHB1dAogICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICBvciBpZiBpdCdzIGNvbmNlcHR1YWxseSBiZWZvcmUgYnVmYmVnLiAqLwotc3Rh
dGljIHVpbnRtYXhfdCB0b3RhbG5sOwkvKiBUb3RhbCBuZXdsaW5lIGNvdW50IGJlZm9yZSBs
YXN0bmwuICovCiBzdGF0aWMgaW50bWF4X3Qgb3V0bGVmdDsJLyogTWF4aW11bSBudW1iZXIg
b2YgbGluZXMgdG8gYmUgb3V0cHV0LiAgKi8KIHN0YXRpYyBpbnRtYXhfdCBwZW5kaW5nOwkv
KiBQZW5kaW5nIGxpbmVzIG9mIG91dHB1dC4KICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgQWx3YXlzIGtlcHQgMCBpZiBvdXRfcXVpZXQgaXMgdHJ1ZS4gICovCkBAIC03
MjYsMTcgKzc3MCw2IEBAIHN0YXRpYyBib29sIGV4aXRfb25fbWF0Y2g7CS8qIEV4aXQgb24g
Zmlyc3QgbWF0Y2guICAqLwogCiAjaW5jbHVkZSAiZG9zYnVmLmMiCiAKLS8qIEFkZCB0d28g
bnVtYmVycyB0aGF0IGNvdW50IGlucHV0IGJ5dGVzIG9yIGxpbmVzLCBhbmQgcmVwb3J0IGFu
Ci0gICBlcnJvciBpZiB0aGUgYWRkaXRpb24gb3ZlcmZsb3dzLiAgKi8KLXN0YXRpYyB1aW50
bWF4X3QKLWFkZF9jb3VudCAodWludG1heF90IGEsIHVpbnRtYXhfdCBiKQotewotICB1aW50
bWF4X3Qgc3VtID0gYSArIGI7Ci0gIGlmIChzdW0gPCBhKQotICAgIGVycm9yIChFWElUX1RS
T1VCTEUsIDAsIF8oImlucHV0IGlzIHRvbyBsYXJnZSB0byBjb3VudCIpKTsKLSAgcmV0dXJu
IHN1bTsKLX0KLQogc3RhdGljIHZvaWQKIG5sc2NhbiAoY2hhciBjb25zdCAqbGltKQogewpA
QCAtMTE3MSw2ICsxMjA0LDggQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qgc3RhdCBjb25zdCAq
c3QpCiAgIG91dGxlZnQgPSBtYXhfY291bnQ7CiAgIGFmdGVyX2xhc3RfbWF0Y2ggPSAwOwog
ICBwZW5kaW5nID0gMDsKKyAgc2tpcF9udWxzID0gc2tpcF9lbXB0eV9saW5lcyAmJiAhZW9s
OworICBzZWVrX2RhdGFfZmFpbGVkID0gZmFsc2U7CiAKICAgbmxpbmVzID0gMDsKICAgcmVz
aWR1ZSA9IDA7CkBAIC0xMTkzLDYgKzEyMjgsNyBAQCBncmVwIChpbnQgZmQsIHN0cnVjdCBz
dGF0IGNvbnN0ICpzdCkKICAgICAgICAgICAgIHJldHVybiAwOwogICAgICAgICAgIGRvbmVf
b25fbWF0Y2ggPSBvdXRfcXVpZXQgPSB0cnVlOwogICAgICAgICAgIG51bF96YXBwZXIgPSBl
b2w7CisgICAgICAgICAgc2tpcF9udWxzID0gc2tpcF9lbXB0eV9saW5lczsKICAgICAgICAg
fQogICAgIH0KIApAQCAtMTI4MSw2ICsxMzE3LDcgQEAgZ3JlcCAoaW50IGZkLCBzdHJ1Y3Qg
c3RhdCBjb25zdCAqc3QpCiAgICAgICAgICAgICAgIHRleHRiaW4gPSB0YjsKICAgICAgICAg
ICAgICAgZG9uZV9vbl9tYXRjaCA9IG91dF9xdWlldCA9IHRydWU7CiAgICAgICAgICAgICAg
IG51bF96YXBwZXIgPSBlb2w7CisgICAgICAgICAgICAgIHNraXBfbnVscyA9IHNraXBfZW1w
dHlfbGluZXM7CiAgICAgICAgICAgICB9CiAgICAgICAgIH0KICAgICB9CkBAIC0yMzkwLDYg
KzI0MjcsOSBAQCBtYWluIChpbnQgYXJnYywgY2hhciAqKmFyZ3YpCiAKICAgY29tcGlsZSAo
a2V5cywga2V5Y2MpOwogICBmcmVlIChrZXlzKTsKKyAgc2l6ZV90IG1hdGNoX3NpemU7Cisg
IHNraXBfZW1wdHlfbGluZXMgPSAoKGV4ZWN1dGUgKCZlb2xieXRlLCAxLCAmbWF0Y2hfc2l6
ZSwgTlVMTCkgPT0gMCkKKyAgICAgICAgICAgICAgICAgICAgICA9PSBvdXRfaW52ZXJ0KTsK
IAogICBpZiAoKGFyZ2MgLSBvcHRpbmQgPiAxICYmICFub19maWxlbmFtZXMpIHx8IHdpdGhf
ZmlsZW5hbWVzKQogICAgIG91dF9maWxlID0gMTsKZGlmZiAtLWdpdCBhL3NyYy9ncmVwLmgg
Yi9zcmMvZ3JlcC5oCmluZGV4IDIzZDRlOTUuLjg2MjU5ZmIgMTAwNjQ0Ci0tLSBhL3NyYy9n
cmVwLmgKKysrIGIvc3JjL2dyZXAuaApAQCAtMjcsNyArMjcsNyBAQAogZXh0ZXJuIGJvb2wg
bWF0Y2hfaWNhc2U7CS8qIC1pICovCiBleHRlcm4gYm9vbCBtYXRjaF93b3JkczsJLyogLXcg
Ki8KIGV4dGVybiBib29sIG1hdGNoX2xpbmVzOwkvKiAteCAqLwotZXh0ZXJuIHVuc2lnbmVk
IGNoYXIgZW9sYnl0ZTsJLyogLXogKi8KK2V4dGVybiBjaGFyIGVvbGJ5dGU7CQkvKiAteiAq
LwogCiAvKiBBbiBlbnVtIHRleHRiaW4gZGVzY3JpYmVzIHRoZSBmaWxlJ3MgdHlwZSwgaW5m
ZXJyZWQgZnJvbSBkYXRhIHJlYWQKICAgIGJlZm9yZSB0aGUgZmlyc3QgbGluZSBpcyBzZWxl
Y3RlZCBmb3Igb3V0cHV0LiAgKi8KLS0gCjEuOS4zCgo=
--------------090004040002010007040709--




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 13 Sep 2014 00:59:55 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 12 20:59:55 2014
Received: from localhost ([127.0.0.1]:39742 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XSbgv-0000Sa-TY
	for submit <at> debbugs.gnu.org; Fri, 12 Sep 2014 20:59:54 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:38204)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XSbgt-0000SR-Bl
 for 18454 <at> debbugs.gnu.org; Fri, 12 Sep 2014 20:59:52 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id D674FA60010;
 Fri, 12 Sep 2014 17:59:50 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id KQPXwW1SssCC; Fri, 12 Sep 2014 17:59:42 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id 1FF9DA60006;
 Fri, 12 Sep 2014 17:59:42 -0700 (PDT)
Message-ID: <541396FD.3080001@HIDDEN>
Date: Fri, 12 Sep 2014 17:59:41 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: Vincent Lefevre <vincent@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <54126023.8020005@HIDDEN> <20140912082008.GC4404@HIDDEN>
 <541323C8.10500@HIDDEN> <20140912220855.GK4404@HIDDEN>
In-Reply-To: <20140912220855.GK4404@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -4.5 (----)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>

Vincent Lefevre wrote:

> This is still better than no optimization at all.

We'd have to see; not every optimization is worth the trouble.

> if the behavior is chosen by an option, the user would be aware
> of the meaning of the output, so that this won't really matter.

It'd be better not to add a new grep option simply to work around a
libpcre performance bug.

> Could you give some reference?

The pcreunicode man page mentions some of this issue under "Validity of 
UTF-8 string".  My impression is that the actual history of behavior 
changes is more complicated than what that simple summary would suggest.

> This doesn't introduce undefined behavior, just a different
> behavior

Again, it'd be better if grep Just Worked.

> I suppose that this is due
> to the many retries from the pcresearch.c code on binary files (the
> line is split into many sublines, often consisting of a single
> byte), i.e. the problem is on the grep side.

libpcre is not giving 'grep' an efficient way to search data that can 
contain encoding errors.  This does not mean "the problem is on the grep 
side".

> I don't see how this
> could be solved except by doing the UTF-8 check on the grep side.

There's another way: fix libpcre so that it works on arbitrary binary 
data, without the need for prescreening the data.  That's the 
fundamental problem here.

>>> I often want to take binary files into account
>>
>> In those cases I suggest using a unibyte C locale.
>
> I still want "." to match a single (valid) UTF-8 character.

How about this idea instead?  Use a unibyte C locale, and write a 
unibyte regular expression C that matches a single valid UTF-8 character 
(using whatever definition you like for UTF-8).  Then, you can use . to 
match single bytes and C to match characters.  This gives you all the 
power you need, without the slowdown due to UTF-8 processing, a slowdown 
that will be inevitable no matter how we change grep or libpcre.
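
The suggestion above can be sketched as follows. This is an illustration
only: Python's bytes-mode `re` stands in for grep running in a unibyte C
locale, the byte ranges follow the well-formed UTF-8 table in RFC 3629
(one possible definition of "valid"), and the names UTF8_CHAR,
char_pattern and byte_pattern are made up for the example.

```python
import re

# One well-formed UTF-8 character (RFC 3629), written as a byte-level
# pattern.  UTF8_CHAR is a hypothetical name playing the role of the
# unibyte expression "C" in the text.
UTF8_CHAR = (
    rb"(?:[\x00-\x7F]"                      # 1 byte: ASCII
    rb"|[\xC2-\xDF][\x80-\xBF]"             # 2 bytes (C0/C1 would be overlong)
    rb"|\xE0[\xA0-\xBF][\x80-\xBF]"         # 3 bytes, excluding overlong forms
    rb"|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}"  # 3 bytes, general case
    rb"|\xED[\x80-\x9F][\x80-\xBF]"         # 3 bytes, excluding surrogates
    rb"|\xF0[\x90-\xBF][\x80-\xBF]{2}"      # 4 bytes, excluding overlong forms
    rb"|[\xF1-\xF3][\x80-\xBF]{3}"          # 4 bytes, general case
    rb"|\xF4[\x80-\x8F][\x80-\xBF]{2})"     # 4 bytes, up to U+10FFFF
)

# "a", one valid character, "b": the character-level search.
char_pattern = re.compile(rb"a" + UTF8_CHAR + rb"b")

# "a", any single byte, "b": what "." means in a unibyte search.
byte_pattern = re.compile(rb"a.b")

if __name__ == "__main__":
    for line in ("a\u00e9b".encode("utf-8"), b"a\x91b"):
        print(line,
              "char match" if char_pattern.search(line) else "no char match",
              "byte match" if byte_pattern.search(line) else "no byte match")
```

Neither search needs the input to be validated first; the stray 0x91 byte
simply fails to match UTF8_CHAR.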




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 12 Sep 2014 22:09:00 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 12 18:09:00 2014
Received: from localhost ([127.0.0.1]:39707 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XSZ1X-0004rh-LR
	for submit <at> debbugs.gnu.org; Fri, 12 Sep 2014 18:09:00 -0400
Received: from ioooi.vinc17.net ([92.243.22.117]:47303)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <vincent@HIDDEN>) id 1XSZ1V-0004rY-5w
 for 18454 <at> debbugs.gnu.org; Fri, 12 Sep 2014 18:08:57 -0400
Received: from smtp-xvii.vinc17.net (128.119.75.86.rev.sfr.net [86.75.119.128])
 by ioooi.vinc17.net (Postfix) with ESMTPSA id A1796131;
 Sat, 13 Sep 2014 00:08:55 +0200 (CEST)
Received: by xvii.vinc17.org (Postfix, from userid 1000)
 id 6017A21A079; Sat, 13 Sep 2014 00:08:55 +0200 (CEST)
Date: Sat, 13 Sep 2014 00:08:55 +0200
From: Vincent Lefevre <vincent@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20140912220855.GK4404@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <54126023.8020005@HIDDEN>
 <20140912082008.GC4404@HIDDEN>
 <541323C8.10500@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <541323C8.10500@HIDDEN>
X-Mailer-Info: http://www.vinc17.net/mutt/
User-Agent: Mutt/1.5.23-6361-vl-r59709 (2014-07-25)
X-Spam-Score: -2.2 (--)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

On 2014-09-12 09:48:08 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >I think that (1) is rather simple
> 
> You may think it simple for the REs you're interested in, but someone else
> might say "hey! that doesn't cover the REs *I'm* interested in!". Solving
> the problem in general is nontrivial.

This is still better than no optimization at all.

> >But this is already the case:
> 
> I was assuming the case where the input data contains an encoding error (not
> a null byte) that is transformed to a null byte before the user sees it.
> 
> Really, this null-byte-replacement business would be just too weird.  I
> don't see it as a viable general-purpose solution.

Anyway, since the problem can exist with null bytes, it needs to be
solved for null bytes. But this is also already the case:

$ printf "a\0b\n" | grep -a 'a..*b'
a^@b

(where the "^@" is in reverse video). So, the only "issue" would
be that

$ printf "a\x91b\n" | grep -a 'a..*b'

would output "a^@b" instead of... possibly something worse. Indeed,
outputting invalid UTF-8 sequences to the terminal is bad. Ideally
you would output "a<91>b" with "<91>" in reverse video, at some
price (this would be slower).
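
As a rough illustration of the "a<91>b" rendering (not anything grep
does), here is a small Python sketch. The handler name "anglehex" and
the function names are invented for the example, and the reverse-video
part is left out.

```python
import codecs


def _angle_hex(err):
    """Codec error handler: write each undecodable byte as <hh>."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the offset at which to resume.
    return "".join("<%02x>" % byte for byte in bad), err.end


# "anglehex" is an arbitrary handler name chosen for this sketch.
codecs.register_error("anglehex", _angle_hex)


def printable(line: bytes) -> str:
    """Decode LINE as UTF-8, showing invalid bytes instead of emitting them."""
    return line.decode("utf-8", errors="anglehex")


if __name__ == "__main__":
    print(printable(b"a\x91b"))
```

Valid input is decoded unchanged; only the bytes that the decoder
rejects go through the handler, which is where the extra cost would
come from.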

Now, if the behavior is chosen by an option, the user would be aware
of the meaning of the output, so that this won't really matter.

> >Parsing UTF-8 is standard.
> 
> It's a standard that keeps evolving, different releases of libpcre
> have done it differently, and I expect things to continue to evolve.

Could you give some reference? IMHO, this looks more like a bug.

Anyway, UTF-8 sequences that are valid today will still be valid in
the future. The only possible change is that new sequences become
valid in the future. So, the only possible problem is that such new
sequences would be converted to null bytes while this shouldn't be
done. This doesn't introduce undefined behavior, just a different
behavior (note that this difference would also exist between two
libpcre versions, thus not a big problem, and this will be fixable).

> Have you investigated why libpcre is so *slow* when doing UTF-8 checking?

AFAIK, this is not due to libpcre UTF-8 checking; otherwise it would
be very slow on valid text files too. I suppose that this is due
to the many retries from the pcresearch.c code on binary files (the
line is split into many sublines, often consisting of a single
byte), i.e. the problem is on the grep side. I don't see how this
could be solved except by doing the UTF-8 check on the grep side.
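
To make the retry behaviour concrete, here is a simplified Python model
of it. It is not the pcresearch.c code: the function name and the
"attempts" counter are invented, Python's UTF-8 decoder stands in for
the validity check, and each counted attempt corresponds loosely to one
call into the matcher.

```python
import re


def search_with_retries(regex, line: bytes):
    """Search LINE, restarting after every encoding error.

    Returns (found, attempts), where attempts counts how many times the
    matcher was entered, including entries that stopped at bad UTF-8.
    """
    attempts = 0
    pos = 0
    while True:
        attempts += 1
        try:
            rest = line[pos:].decode("utf-8")
            # The remainder is valid: one ordinary search finishes the job.
            return regex.search(rest) is not None, attempts
        except UnicodeDecodeError as err:
            bad_at = err.start          # offset of the first bad byte
        if bad_at > 0:
            # Search the valid bytes that precede the encoding error.
            attempts += 1
            prefix = line[pos:pos + bad_at].decode("utf-8")
            if regex.search(prefix):
                return True, attempts
        # Treat the bad byte as data that cannot match, and try again.
        pos += bad_at + 1


if __name__ == "__main__":
    rx = re.compile("a.b")
    for sample in (b"axb", b"a\x91b", b"\x91" * 1000 + b"axb"):
        print(len(sample), search_with_retries(rx, sample))
```

On valid text the loop body runs once; on a line full of invalid bytes
the number of attempts grows with the number of bad bytes, which is the
effect described above.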

> >I often want to take binary files into account
> 
> In those cases I suggest using a unibyte C locale.

But I still want "." to match a single (valid) UTF-8 character.
Well, using the C locale on binary files and UTF-8 on text files
might be acceptable. But how can one do that with a recursive
grep?
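
One way to picture the per-file choice, outside grep itself, is the
Python sketch below. It is only a model of the idea: matching_lines and
grep_tree are hypothetical names, a file counts as "text" when it
decodes as UTF-8 in full, and each file is read into memory whole.

```python
import os
import re


def matching_lines(data: bytes, pattern: str):
    """Return the lines of DATA that match PATTERN.

    Valid UTF-8 is searched as text, so "." matches one character;
    anything else is searched bytewise, so "." matches one byte.
    """
    lines = data.split(b"\n")
    if lines[-1] == b"":
        lines.pop()                 # no empty "line" after a final newline
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        byte_regex = re.compile(pattern.encode("utf-8"))
        return [line for line in lines if byte_regex.search(line)]
    text_regex = re.compile(pattern)
    return [line for line in lines if text_regex.search(line.decode("utf-8"))]


def grep_tree(root: str, pattern: str):
    """Yield (path, line) for every matching line in files under ROOT."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    data = handle.read()
            except OSError:
                continue            # unreadable file: skip it
            for line in matching_lines(data, pattern):
                yield path, line


if __name__ == "__main__":
    for path, line in grep_tree(".", "a.b"):
        print(path, line)
```

The drawback is visible in the model: a single invalid byte anywhere in
a file switches the whole file to byte semantics, so a two-byte
character between "a" and "b" no longer matches "a.b" there.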

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Received: (at 18454) by debbugs.gnu.org; 12 Sep 2014 16:48:22 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Fri Sep 12 12:48:22 2014
Received: from localhost ([127.0.0.1]:39610 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1XSU1F-0005V2-9t
	for submit <at> debbugs.gnu.org; Fri, 12 Sep 2014 12:48:21 -0400
Received: from smtp.cs.ucla.edu ([131.179.128.62]:44656)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1XSU1C-0005Us-Fr
 for 18454 <at> debbugs.gnu.org; Fri, 12 Sep 2014 12:48:19 -0400
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 850C8A6001D;
 Fri, 12 Sep 2014 09:48:17 -0700 (PDT)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id 6fYlKLl7e-Ok; Fri, 12 Sep 2014 09:48:08 -0700 (PDT)
Received: from [192.168.1.9] (pool-71-177-17-123.lsanca.dsl-w.verizon.net
 [71.177.17.123])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id C64F1A60010;
 Fri, 12 Sep 2014 09:48:08 -0700 (PDT)
Message-ID: <541323C8.10500@HIDDEN>
Date: Fri, 12 Sep 2014 09:48:08 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.1.1
MIME-Version: 1.0
To: Vincent Lefevre <vincent@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
 <54126023.8020005@HIDDEN> <20140912082008.GC4404@HIDDEN>
In-Reply-To: <20140912082008.GC4404@HIDDEN>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -4.5 (----)
X-Debbugs-Envelope-To: 18454
Cc: 18454 <at> debbugs.gnu.org

Vincent Lefevre wrote:

> I think that (1) is rather simple

You may think it simple for the REs you're interested in, but someone 
else might say "hey! that doesn't cover the REs *I'm* interested in!". 
Solving the problem in general is nontrivial.

> But this is already the case:

I was assuming the case where the input data contains an encoding error 
(not a null byte) that is transformed to a null byte before the user 
sees it.

Really, this null-byte-replacement business would be just too weird.  I 
don't see it as a viable general-purpose solution.

> Parsing UTF-8 is standard.

It's a standard that keeps evolving, different releases of libpcre have 
done it differently, and I expect things to continue to evolve.  It's 
not something I would want to maintain separately from libpcre itself.

Have you investigated why libpcre is so *slow* when doing UTF-8 
checking?  Why would libpcre be 10x slower than grep's checking by 
hand?!?  I don't get it.  Surely there's a simple fix on the libpcre side.

> I often want to take binary files into account

In those cases I suggest using a unibyte C locale.  This should solve 
the performance problem.  Really, unibyte is the way to go here; it's 
gonna be faster for large binary scanning no matter what is done about 
this UTF-8 business.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Date: Fri, 12 Sep 2014 10:20:08 +0200
From: Vincent Lefevre <vincent@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
Message-ID: <20140912082008.GC4404@HIDDEN>
References: <20140912012449.GB18162@HIDDEN>
 <54126023.8020005@HIDDEN>
In-Reply-To: <54126023.8020005@HIDDEN>
Cc: 18454 <at> debbugs.gnu.org

On 2014-09-11 19:53:23 -0700, Paul Eggert wrote:
> Vincent Lefevre wrote:
> >Things could be done in grep:
> >
> >1. Ignore -P when the pattern would have the same meaning without -P
> >    (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
> >    at least for the simplest cases).
> >
> >2. Call PCRE in the C locale when this is equivalent.
> 
> I had already considered these ideas along with several others, but they
> would require grep to parse and analyze the Perl regular expression.  I
> don't know the PCRE syntax and it would take some time to write a parser.
> And even if I wrote one, the next PCRE release would likely change the
> syntax.  It sounds very painful to maintain.

I think that (1) is rather simple, even though optimization could
be missed on some patterns: ERE and PCRE have a large equivalent
subclass. The pattern could be examined left to right and would
consist of:
  - Normal characters.
  - ".", "^" at the beginning, "$" (alone) at the end.
  - [] with normal characters inside.
  - "*", "+", "?", "{...}" form not followed by one of "*+?{".
  - "|" and "(" not followed by one of these 4 characters.
  - "\" followed by one of ".^$[*+?{".
  - Some "\" + letter sequences could be recognised as well.

Something like that (I haven't checked carefully). There could be
another option to allow such an optimization or not.
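
A minimal sketch of the left-to-right check described above, in Python. The
function name, the set of "normal" characters, and the handling of ")" are
assumptions made for illustration; nothing here is taken from grep's source.
It is deliberately conservative: anything it does not recognise (including
"\" + letter sequences such as "\d") is reported as outside the subclass.

```python
import re

# Assumption: "normal" means ASCII letters, digits, space and a few
# punctuation marks that are not special in either ERE or PCRE.
NORMAL = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
             "0123456789 _,:;=@#%&~/-")
ESCAPABLE = set(".^$[*+?{")          # "\" followed by one of these
QUANT_START = set("*+?{")            # the "4 characters" from the list
INTERVAL = re.compile(r"\{[0-9]+(,[0-9]*)?\}")
BRACKET = re.compile(r"\[[A-Za-z0-9]+\]")   # [] with normal characters only


def in_simple_subclass(pattern: str) -> bool:
    """Return True if pattern uses only constructs from the list above."""
    i, n = 0, len(pattern)
    depth = 0               # number of currently open "(" groups
    can_quantify = False    # True right after an atom or a closing ")"
    while i < n:
        c = pattern[i]
        nxt = pattern[i + 1] if i + 1 < n else ""
        if c in NORMAL or c == ".":
            i += 1
            can_quantify = True
        elif c == "^" and i == 0:
            i += 1
            can_quantify = False
        elif c == "$" and i == n - 1:
            i += 1
            can_quantify = False
        elif c == "\\" and nxt in ESCAPABLE:
            i += 2
            can_quantify = True
        elif c == "[":
            m = BRACKET.match(pattern, i)
            if not m:
                return False
            i = m.end()
            can_quantify = True
        elif c in QUANT_START:
            if not can_quantify:
                return False
            if c == "{":
                m = INTERVAL.match(pattern, i)
                if not m:
                    return False
                i = m.end()
            else:
                i += 1
            # Reject stacked quantifiers such as "a+?" or "a*+".
            if i < n and pattern[i] in QUANT_START:
                return False
            can_quantify = False
        elif c in ("|", "("):
            # Rejects PCRE-only forms such as "(?i)" or "(*UTF8)".
            if nxt in QUANT_START:
                return False
            if c == "(":
                depth += 1
            i += 1
            can_quantify = False
        elif c == ")":
            if depth == 0:
                return False
            depth -= 1
            i += 1
            can_quantify = True
        else:
            return False
    return depth == 0
```

With this sketch, "^a.*b$" and "(ab|cd){2,3}" are accepted, while "a+?b",
"(?i)abc" and "a\d+b" are rejected and would still go through PCRE.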

> >3. Transform invalid bytes to null bytes in-place before the PCRE
> >    call. This changes the current semantics, but:
> >    * the semantics on invalid bytes has never been specified, AFAIK;
> >    * the best *practical* behavior may not be the current one
> 
> As we've already discussed, this would be incompatible with how invalid
> bytes are treated by other matchers.

The same thing could be done with other matchers (in an optional way).

> And would have undesirable practical effects, e.g., the pattern
> 'a..*b' would match data that would look like "ab" on many screens
> (because the null byte would vanish). It's a real kludge that will
> bite users.

But this is already the case:

$ printf "a\0b\n"
ab
$ printf "a\0b\n" | grep 'a..*b'
Binary file (standard input) matches

The transformation won't touch null bytes. It would just interpret
invalid bytes as null bytes, so that they get matched by ".".

> Even if we went along with the kludge, grep does not know what bytes
> PCRE considers to be invalid without invoking PCRE, which is what
> it's doing now. (Yes, PCRE says it's parsing UTF-8, but there are
> different ways to do that and they don't all agree.)

It would be a bug in PCRE. Parsing UTF-8 is standard. This is
summarized by:

       0x00000000 - 0x0000007F:
           0xxxxxxx

       0x00000080 - 0x000007FF:
           110xxxxx 10xxxxxx

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx

       0x00010000 - 0x001FFFFF:
           11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x00200000 - 0x03FFFFFF:
           111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

       0x04000000 - 0x7FFFFFFF:
           1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

(from the Linux utf-8(7) man page), everything else being invalid.
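
A minimal sketch of a checker that follows the table above literally, in
Python. The function name is hypothetical, and the check is purely
structural (lead byte plus the right number of 10xxxxxx continuation
bytes); it is not a description of what libpcre does.

```python
def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first byte that does not fit the table,
    or -1 if every byte belongs to a well-formed sequence."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx
            need = 0
        elif 0xC0 <= b <= 0xDF:      # 110xxxxx
            need = 1
        elif 0xE0 <= b <= 0xEF:      # 1110xxxx
            need = 2
        elif 0xF0 <= b <= 0xF7:      # 11110xxx
            need = 3
        elif 0xF8 <= b <= 0xFB:      # 111110xx
            need = 4
        elif 0xFC <= b <= 0xFD:      # 1111110x
            need = 5
        else:                        # stray 10xxxxxx, or 0xFE / 0xFF
            return i
        # Every continuation byte must be present and look like 10xxxxxx.
        for k in range(1, need + 1):
            if i + k >= n or (data[i + k] & 0xC0) != 0x80:
                return i
        i += need + 1
    return -1
```

Under this purely structural reading, the overlong sequence C0 80 and the
surrogate encoding ED A0 80 both pass, whereas validators that follow
RFC 3629 (Python's built-in decoder, for one) reject them; RFC 3629 also
drops the 5- and 6-byte forms. So two checkers that both "parse UTF-8" can
still disagree on which bytes are invalid.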

Note that the pcre_exec(3) man page even says:

    PCRE_NO_UTF8_CHECK  Do not check the subject for UTF-8 validity

assuming that the check can be done on the user's side, i.e. in a
standard way.

> Here's a different idea. How about invoking grep with the
> --binary-files=without-match option? This should avoid much of the
> libpcre performance problem, without having to change 'grep'.

I often want to take binary files into account, for instance because
executables can contain text I search for (error messages...). There
may be other examples.

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.
Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 18454 <at> debbugs.gnu.org:


Message-ID: <54126023.8020005@HIDDEN>
Date: Thu, 11 Sep 2014 19:53:23 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
To: Vincent Lefevre <vincent@HIDDEN>, 18454 <at> debbugs.gnu.org
Subject: Re: bug#18454: Improve performance when -P (PCRE) is used in UTF-8
 locales
References: <20140912012449.GB18162@HIDDEN>
In-Reply-To: <20140912012449.GB18162@HIDDEN>

Vincent Lefevre wrote:
> Things could be done in grep:
>
> 1. Ignore -P when the pattern would have the same meaning without -P
>     (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
>     at least for the simplest cases).
>
> 2. Call PCRE in the C locale when this is equivalent.

I had already considered these ideas along with several others, but they 
would require grep to parse and analyze the Perl regular expression.  I 
don't know the PCRE syntax and it would take some time to write a 
parser.  And even if I wrote one, the next PCRE release would likely 
change the syntax.  It sounds very painful to maintain.

> 3. Transform invalid bytes to null bytes in-place before the PCRE
>     call. This changes the current semantics, but:
>     * the semantics on invalid bytes has never been specified, AFAIK;
>     * the best *practical* behavior may not be the current one

As we've already discussed, this would be incompatible with how invalid 
bytes are treated by other matchers.  And would have undesirable 
practical effects, e.g., the pattern 'a..*b' would match data that would 
look like "ab" on many screens (because the null byte would vanish). 
It's a real kludge that will bite users.

Even if we went along with the kludge, grep does not know what bytes 
PCRE considers to be invalid without invoking PCRE, which is what it's 
doing now.  (Yes, PCRE says it's parsing UTF-8, but there are different 
ways to do that and they don't all agree.)  I suppose grep could 
reengineer libpcre's internals, to exactly duplicate the algorithm that 
libpcre uses to decide when bytes are invalid (except to do it 10X 
faster :-), but then that'd be another thing to maintain in parallel 
with libpcre.

All of these changes sound like a lot of work, which nobody is willing 
to do.

Here's a different idea.  How about invoking grep with the 
--binary-files=without-match option?  This should avoid much of the 
libpcre performance problem, without having to change 'grep'.




Information forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Date: Fri, 12 Sep 2014 03:24:49 +0200
From: Vincent Lefevre <vincent@HIDDEN>
To: bug-grep@HIDDEN
Subject: Improve performance when -P (PCRE) is used in UTF-8 locales
Message-ID: <20140912012449.GB18162@HIDDEN>

With the patch that fixes bug 18266, grep -P works again on binary
files (with invalid UTF-8 sequences), but it is now significantly
slower than old versions (which could yield undefined behavior).

Timings with the Debian packages on my personal svn working copy
(binary + text files):

2.18-2   0.9s with -P, 0.4s without -P
2.20-3  11.6s with -P, 0.4s without -P

In this example, that's a 13x slowdown! Although the performance issue
would be better fixed in libpcre3, I suppose that it is not so simple
and won't happen any time soon. Things could be done in grep:

1. Ignore -P when the pattern would have the same meaning without -P
   (patterns could also be transformed, e.g. "a\d+b" -> "a[0-9]\+b",
   at least for the simplest cases).

2. Call PCRE in the C locale when this is equivalent.

3. Transform invalid bytes to null bytes in-place before the PCRE
   call. This changes the current semantics, but:
   * the semantics on invalid bytes has never been specified, AFAIK;
   * the best *practical* behavior may not be the current one
     (I personally prefer to be able to match invalid bytes, just
     like one can match top-bit-set characters in the C locale, and
     seeing such invalid bytes as equivalent to null bytes would
     not be a problem for most users, IMHO -- things can also be
     configurable).
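
A minimal sketch of idea 3, in Python. The function name is hypothetical,
and Python's strict UTF-8 decoder stands in as the validity oracle; that is
an assumption for illustration, and it need not agree with what libpcre
considers invalid. Each byte of an invalid sequence is overwritten with a
null byte, so the buffer length and all byte offsets are preserved, as an
in-place transformation would require.

```python
def nul_out_invalid_utf8(data: bytes) -> bytes:
    """Replace every byte that is part of an invalid UTF-8 sequence
    with a null byte, leaving valid text (and existing nulls) alone."""
    buf = bytearray(data)
    pos = 0
    while pos < len(buf):
        try:
            # Strict decoding raises at the first invalid sequence.
            bytes(buf[pos:]).decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start / err.end are relative to the slice being decoded.
            start, end = pos + err.start, pos + err.end
            for i in range(start, end):
                buf[i] = 0
            pos = end
    return bytes(buf)
```

For instance, the input bytes 61 FF 62 become 61 00 62, so a pattern such
as 'a.b' would then match where the invalid byte used to be. The repeated
slicing makes this sketch quadratic in the worst case; a real
implementation would scan the buffer once.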

-- 
Vincent Lefèvre <vincent@HIDDEN> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)




Acknowledgement sent to Vincent Lefevre <vincent@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#18454; Package grep. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.