GNU bug report logs - #56079
Missing performance optimisation: Word start/end tests

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: grep; Severity: wishlist; Reported by: sur-behoffski <sur_behoffski@HIDDEN>; dated Sun, 19 Jun 2022 05:40:02 UTC; Maintainer for grep is bug-grep@HIDDEN.
Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 56079 <at> debbugs.gnu.org:


Received: (at 56079) by debbugs.gnu.org; 19 Jun 2022 15:26:53 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Jun 19 11:26:53 2022
Received: from localhost ([127.0.0.1]:52869 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1o2wpB-00015x-JO
	for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 11:26:53 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:57754)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1o2wp9-00015h-8y
 for 56079 <at> debbugs.gnu.org; Sun, 19 Jun 2022 11:26:51 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id B067A16012F;
 Sun, 19 Jun 2022 08:26:44 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id FvfqwimWPR6Z; Sun, 19 Jun 2022 08:26:44 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 11500160130;
 Sun, 19 Jun 2022 08:26:44 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id O8SdL2n_R9pV; Sun, 19 Jun 2022 08:26:43 -0700 (PDT)
Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id BBEEA16012F;
 Sun, 19 Jun 2022 08:26:43 -0700 (PDT)
Message-ID: <d9b2c584-1d15-4275-c4cd-a04efbab5d29@HIDDEN>
Date: Sun, 19 Jun 2022 10:26:43 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.9.1
Subject: Re: bug#56079: Missing performance optimisation: Word start/end tests
Content-Language: en-US
To: sur-behoffski <sur_behoffski@HIDDEN>
References: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 56079
Cc: 56079 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

On 6/19/22 00:38, sur-behoffski wrote:
>          a\<b
>          [abc]\>[def]
>
Yes, and this comes up even in strict POSIX without GNU extensions, as 
"a^" is a valid extended regular expression that cannot match anything.

I don't know whether dfa.c, regex.c, etc. optimize for this, but if not 
it would be nice if they did. For example, 'grep -E "a^" FOO' can 
silently exit with status 1 without even opening FOO.





Information forwarded to bug-grep@HIDDEN:
bug#56079; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 19 Jun 2022 05:39:23 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sun Jun 19 01:39:23 2022
Received: from localhost ([127.0.0.1]:50443 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1o2ned-0007hp-8S
	for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 01:39:23 -0400
Received: from lists.gnu.org ([209.51.188.17]:34950)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sur_behoffski@HIDDEN>) id 1o2nea-0007hg-FY
 for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 01:39:22 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:47608)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <sur_behoffski@HIDDEN>)
 id 1o2neY-0004RX-Cd
 for bug-grep@HIDDEN; Sun, 19 Jun 2022 01:39:20 -0400
Received: from ipmail06.adl3.internode.on.net ([150.101.137.16]:29271)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <sur_behoffski@HIDDEN>) id 1o2neV-0000WC-Ls
 for bug-grep@HIDDEN; Sun, 19 Jun 2022 01:39:18 -0400
X-SMTP-MATCH: 0
Received: from 114-30-111-149.tpgi.com.au (HELO [192.168.178.210])
 ([114.30.111.149])
 by ipmail06.adl3.internode.on.net with ESMTP; 19 Jun 2022 15:08:46 +0930
Message-ID: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
Date: Sun, 19 Jun 2022 15:08:46 +0930
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.10.0
To: bug-grep@HIDDEN
Content-Language: en-AU
From: sur-behoffski <sur_behoffski@HIDDEN>
Subject: Missing performance optimisation: Word start/end tests
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Received-SPF: none client-ip=150.101.137.16;
 envelope-from=sur_behoffski@HIDDEN; helo=ipmail06.adl3.internode.on.net
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001,
 SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

G'day,

I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements.  The only release is at savannah.non-gnu.org:

      https://savannah.nongnu.org/projects/hstbm

It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.

--

I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST.  [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]

One thing that is captured is predicates e.g. relating to IS_WORD:
	 "IS_WORD_YES"   (0x01),
         "IS_WORD_NO"    (0x10), and
         "IS_WORD_MAYBE" (0x11).

I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:

        a\<b
        [abc]\>[def]

Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur.  My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.

While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software

["sur-" means "meta-", it's a commentary on a peculiar Australian
event:  See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]




Acknowledgement sent to sur-behoffski <sur_behoffski@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#56079; Package grep. Full text available.
Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.
Last modified: Sat, 2 Jul 2022 22:30:03 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.