GNU logs - #56079, boring messages


Message sent to bug-grep@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#56079: Missing performance optimisation: Word start/end tests
Resent-From: sur-behoffski <sur_behoffski@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-grep@HIDDEN
Resent-Date: Sun, 19 Jun 2022 05:40:02 +0000
Resent-Message-ID: <handler.56079.B.165561716329629 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: report 56079
X-GNU-PR-Package: grep
X-GNU-PR-Keywords: 
To: 56079 <at> debbugs.gnu.org
X-Debbugs-Original-To: bug-grep@HIDDEN
Received: via spool by submit <at> debbugs.gnu.org id=B.165561716329629
          (code B ref -1); Sun, 19 Jun 2022 05:40:02 +0000
Received: (at submit) by debbugs.gnu.org; 19 Jun 2022 05:39:23 +0000
Received: from localhost ([127.0.0.1]:50443 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1o2ned-0007hp-8S
	for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 01:39:23 -0400
Received: from lists.gnu.org ([209.51.188.17]:34950)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <sur_behoffski@HIDDEN>) id 1o2nea-0007hg-FY
 for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 01:39:22 -0400
Received: from eggs.gnu.org ([2001:470:142:3::10]:47608)
 by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256)
 (Exim 4.90_1) (envelope-from <sur_behoffski@HIDDEN>)
 id 1o2neY-0004RX-Cd
 for bug-grep@HIDDEN; Sun, 19 Jun 2022 01:39:20 -0400
Received: from ipmail06.adl3.internode.on.net ([150.101.137.16]:29271)
 by eggs.gnu.org with esmtp (Exim 4.90_1)
 (envelope-from <sur_behoffski@HIDDEN>) id 1o2neV-0000WC-Ls
 for bug-grep@HIDDEN; Sun, 19 Jun 2022 01:39:18 -0400
X-SMTP-MATCH: 0
Received: from 114-30-111-149.tpgi.com.au (HELO [192.168.178.210])
 ([114.30.111.149])
 by ipmail06.adl3.internode.on.net with ESMTP; 19 Jun 2022 15:08:46 +0930
Message-ID: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
Date: Sun, 19 Jun 2022 15:08:46 +0930
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.10.0
Content-Language: en-AU
From: sur-behoffski <sur_behoffski@HIDDEN>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Received-SPF: none client-ip=150.101.137.16;
 envelope-from=sur_behoffski@HIDDEN; helo=ipmail06.adl3.internode.on.net
X-Spam_score_int: -18
X-Spam_score: -1.9
X-Spam_bar: -
X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001,
 SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no
X-Spam_action: no action
X-Spam-Score: -2.3 (--)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

G'day,

I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements.  The only release is at savannah.non-gnu.org:

      https://savannah.nongnu.org/projects/hstbm

It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.

--

I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST.  [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]

One thing that is captured is predicates e.g. relating to IS_WORD:
	 "IS_WORD_YES"   (0x01),
         "IS_WORD_NO"    (0x10), and
         "IS_WORD_MAYBE" (0x11).

I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:

        a\<b
        [abc]\>[def]

Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur.  My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.

While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software

["sur-" means "meta-", it's a commentary on a peculiar Australian
event:  See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]




Message sent:


Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-Mailer: MIME-tools 5.505 (Entity 5.505)
Content-Type: text/plain; charset=utf-8
X-Loop: help-debbugs@HIDDEN
From: help-debbugs@HIDDEN (GNU bug Tracking System)
To: sur-behoffski <sur_behoffski@HIDDEN>
Subject: bug#56079: Acknowledgement (Missing performance optimisation:
 Word start/end tests)
Message-ID: <handler.56079.B.165561716329629.ack <at> debbugs.gnu.org>
References: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
X-Gnu-PR-Message: ack 56079
X-Gnu-PR-Package: grep
Reply-To: 56079 <at> debbugs.gnu.org
Date: Sun, 19 Jun 2022 05:40:02 +0000

Thank you for filing a new bug report with debbugs.gnu.org.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 bug-grep@HIDDEN

If you wish to submit further information on this problem, please
send it to 56079 <at> debbugs.gnu.org.

Please do not send mail to help-debbugs@HIDDEN unless you wish
to report a problem with the Bug-tracking system.

--=20
56079: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D56079
GNU Bug Tracking System
Contact help-debbugs@HIDDEN with problems


Message sent to bug-grep@HIDDEN:


X-Loop: help-debbugs@HIDDEN
Subject: bug#56079: Missing performance optimisation: Word start/end tests
Resent-From: Paul Eggert <eggert@HIDDEN>
Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
Resent-CC: bug-grep@HIDDEN
Resent-Date: Sun, 19 Jun 2022 15:27:02 +0000
Resent-Message-ID: <handler.56079.B56079.16556524134217 <at> debbugs.gnu.org>
Resent-Sender: help-debbugs@HIDDEN
X-GNU-PR-Message: followup 56079
X-GNU-PR-Package: grep
X-GNU-PR-Keywords: 
To: sur-behoffski <sur_behoffski@HIDDEN>
Cc: 56079 <at> debbugs.gnu.org
Received: via spool by 56079-submit <at> debbugs.gnu.org id=B56079.16556524134217
          (code B ref 56079); Sun, 19 Jun 2022 15:27:02 +0000
Received: (at 56079) by debbugs.gnu.org; 19 Jun 2022 15:26:53 +0000
Received: from localhost ([127.0.0.1]:52869 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1o2wpB-00015x-JO
	for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 11:26:53 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:57754)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1o2wp9-00015h-8y
 for 56079 <at> debbugs.gnu.org; Sun, 19 Jun 2022 11:26:51 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id B067A16012F;
 Sun, 19 Jun 2022 08:26:44 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id FvfqwimWPR6Z; Sun, 19 Jun 2022 08:26:44 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 11500160130;
 Sun, 19 Jun 2022 08:26:44 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id O8SdL2n_R9pV; Sun, 19 Jun 2022 08:26:43 -0700 (PDT)
Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id BBEEA16012F;
 Sun, 19 Jun 2022 08:26:43 -0700 (PDT)
Message-ID: <d9b2c584-1d15-4275-c4cd-a04efbab5d29@HIDDEN>
Date: Sun, 19 Jun 2022 10:26:43 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.9.1
Content-Language: en-US
References: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
In-Reply-To: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

On 6/19/22 00:38, sur-behoffski wrote:
>          a\<b
>          [abc]\>[def]
>
Yes, and this comes up even in strict POSIX without GNU extensions, as 
"a^" is a valid extended regular expression that cannot match anything.

I don't know whether dfa.c, regex.c, etc. optimize for this, but if not 
it would be nice if they did. For example, 'grep -E "a^" FOO' can 
silently exit with status 1 without even opening FOO.





Message received at control <at> debbugs.gnu.org:


Received: (at control) by debbugs.gnu.org; 2 Jul 2022 22:26:34 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Jul 02 18:26:34 2022
Received: from localhost ([127.0.0.1]:43016 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1o7lZR-0002bO-R4
	for submit <at> debbugs.gnu.org; Sat, 02 Jul 2022 18:26:33 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35928)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1o7lZP-0002bA-Es
 for control <at> debbugs.gnu.org; Sat, 02 Jul 2022 18:26:31 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 12DBB160143
 for <control <at> debbugs.gnu.org>; Sat,  2 Jul 2022 15:26:26 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id 9dzAa2cUdWHo for <control <at> debbugs.gnu.org>;
 Sat,  2 Jul 2022 15:26:25 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 567EE160145
 for <control <at> debbugs.gnu.org>; Sat,  2 Jul 2022 15:26:25 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id wzbBght_QXna for <control <at> debbugs.gnu.org>;
 Sat,  2 Jul 2022 15:26:25 -0700 (PDT)
Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id F1E18160143
 for <control <at> debbugs.gnu.org>; Sat,  2 Jul 2022 15:26:24 -0700 (PDT)
Message-ID: <32055ca4-b721-1d5b-bc48-01d001b6283e@HIDDEN>
Date: Sat, 2 Jul 2022 17:26:24 -0500
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.9.1
Content-Language: en-US
To: control <at> debbugs.gnu.org
From: Paul Eggert <eggert@HIDDEN>
Subject: grep bug report maintenance
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: control
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

severity 56079 wishlist

close 26205

close 20990







Last modified: Sat, 2 Jul 2022 22:30:03 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.