X-Loop: help-debbugs@HIDDEN Subject: bug#56079: Missing performance optimisation: Word start/end tests Resent-From: sur-behoffski <sur_behoffski@HIDDEN> Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> Resent-CC: bug-grep@HIDDEN Resent-Date: Sun, 19 Jun 2022 05:40:02 +0000 Resent-Message-ID: <handler.56079.B.165561716329629 <at> debbugs.gnu.org> Resent-Sender: help-debbugs@HIDDEN X-GNU-PR-Message: report 56079 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: 56079 <at> debbugs.gnu.org X-Debbugs-Original-To: bug-grep@HIDDEN Received: via spool by submit <at> debbugs.gnu.org id=B.165561716329629 (code B ref -1); Sun, 19 Jun 2022 05:40:02 +0000 Received: (at submit) by debbugs.gnu.org; 19 Jun 2022 05:39:23 +0000 Received: from localhost ([127.0.0.1]:50443 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1o2ned-0007hp-8S for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 01:39:23 -0400 Received: from lists.gnu.org ([209.51.188.17]:34950) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <sur_behoffski@HIDDEN>) id 1o2nea-0007hg-FY for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 01:39:22 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:47608) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from <sur_behoffski@HIDDEN>) id 1o2neY-0004RX-Cd for bug-grep@HIDDEN; Sun, 19 Jun 2022 01:39:20 -0400 Received: from ipmail06.adl3.internode.on.net ([150.101.137.16]:29271) by eggs.gnu.org with esmtp (Exim 4.90_1) (envelope-from <sur_behoffski@HIDDEN>) id 1o2neV-0000WC-Ls for bug-grep@HIDDEN; Sun, 19 Jun 2022 01:39:18 -0400 X-SMTP-MATCH: 0 Received: from 114-30-111-149.tpgi.com.au (HELO [192.168.178.210]) ([114.30.111.149]) by ipmail06.adl3.internode.on.net with ESMTP; 19 Jun 2022 15:08:46 +0930 Message-ID: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN> Date: Sun, 19 Jun 2022 15:08:46 +0930 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.10.0 Content-Language: en-AU From: sur-behoffski <sur_behoffski@HIDDEN> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Received-SPF: none client-ip=150.101.137.16; envelope-from=sur_behoffski@HIDDEN; helo=ipmail06.adl3.internode.on.net X-Spam_score_int: -18 X-Spam_score: -1.9 X-Spam_bar: - X-Spam_report: (-1.9 / 5.0 requ) BAYES_00=-1.9, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -3.3 (---) G'day, I'm in the throes of a massive rewrite of "hstbm", which combined a very-quick-and-dirty lexer, no parser, and an optimised Boyer-Moore-style search where I had made some incremental improvements. The only release is at savannah.non-gnu.org: https://savannah.nongnu.org/projects/hstbm It's been over six years since the first and only release; Lua fans will note that I have had another project that has been active over that time, intended to help people use a scientific/technical toolkit on a range of GNU/Linux machines. -- I'm now in the process of trialling an all-singing, all-dancing lexer, with the philosophy that it tries to capture the pattern syntax and semantics, without resorting to parser constructs such as an AST. [I'm currently at a hairy point where the meaning of characters such as "^" can vary, based on constructs such as "(" (start-of-group) and/or "|" (alternation)... where does lexing stop and parsing start?!] One thing that is captured is predicates e.g. relating to IS_WORD: "IS_WORD_YES" (0x01), "IS_WORD_NO" (0x10), and "IS_WORD_MAYBE" (0x11). I've found some patterns containing word start/end boundary checks that are impossible to match in practice, e.g.: a\<b [abc]\>[def] Grep does not recognise these cases, and so spends time ploughing through the text for a match that can never occur. My lexing code, in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or, equivalently, pairs of "IS_WORD_NO"), and arranges the lexical token stream such that the very first token is (effectively) MATCH_FAILED -- without any effort to inspect the haystack buffer. This can reduce runtimes for large haystack input from seconds to milliseconds. While this is not a terribly common case, it's an easy item to check for; it's possible that, in the future, patterns may become less hand-crafted and more machine-crafted, and so this case may become more relevant. cheers, sur-behoffski (Brenton Hoff) programmer, Grouse Software ["sur-" means "meta-", it's a commentary on a peculiar Australian event: See "Tony Abbott" + "Captain's Pick" + "Prince Philip". Absolutely no disrespect is intended to Honour-receivers at any level; I am grateful for your service, and how you have enriched society.]
Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-Mailer: MIME-tools 5.505 (Entity 5.505) Content-Type: text/plain; charset=utf-8 X-Loop: help-debbugs@HIDDEN From: help-debbugs@HIDDEN (GNU bug Tracking System) To: sur-behoffski <sur_behoffski@HIDDEN> Subject: bug#56079: Acknowledgement (Missing performance optimisation: Word start/end tests) Message-ID: <handler.56079.B.165561716329629.ack <at> debbugs.gnu.org> References: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN> X-Gnu-PR-Message: ack 56079 X-Gnu-PR-Package: grep Reply-To: 56079 <at> debbugs.gnu.org Date: Sun, 19 Jun 2022 05:40:02 +0000 Thank you for filing a new bug report with debbugs.gnu.org. This is an automatically generated reply to let you know your message has been received. Your message is being forwarded to the package maintainers and other interested parties for their attention; they will reply in due course. Your message has been sent to the package maintainer(s): bug-grep@HIDDEN If you wish to submit further information on this problem, please send it to 56079 <at> debbugs.gnu.org. Please do not send mail to help-debbugs@HIDDEN unless you wish to report a problem with the Bug-tracking system. --=20 56079: https://debbugs.gnu.org/cgi/bugreport.cgi?bug=3D56079 GNU Bug Tracking System Contact help-debbugs@HIDDEN with problems
X-Loop: help-debbugs@HIDDEN Subject: bug#56079: Missing performance optimisation: Word start/end tests Resent-From: Paul Eggert <eggert@HIDDEN> Original-Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> Resent-CC: bug-grep@HIDDEN Resent-Date: Sun, 19 Jun 2022 15:27:02 +0000 Resent-Message-ID: <handler.56079.B56079.16556524134217 <at> debbugs.gnu.org> Resent-Sender: help-debbugs@HIDDEN X-GNU-PR-Message: followup 56079 X-GNU-PR-Package: grep X-GNU-PR-Keywords: To: sur-behoffski <sur_behoffski@HIDDEN> Cc: 56079 <at> debbugs.gnu.org Received: via spool by 56079-submit <at> debbugs.gnu.org id=B56079.16556524134217 (code B ref 56079); Sun, 19 Jun 2022 15:27:02 +0000 Received: (at 56079) by debbugs.gnu.org; 19 Jun 2022 15:26:53 +0000 Received: from localhost ([127.0.0.1]:52869 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1o2wpB-00015x-JO for submit <at> debbugs.gnu.org; Sun, 19 Jun 2022 11:26:53 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:57754) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <eggert@HIDDEN>) id 1o2wp9-00015h-8y for 56079 <at> debbugs.gnu.org; Sun, 19 Jun 2022 11:26:51 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id B067A16012F; Sun, 19 Jun 2022 08:26:44 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id FvfqwimWPR6Z; Sun, 19 Jun 2022 08:26:44 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 11500160130; Sun, 19 Jun 2022 08:26:44 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id O8SdL2n_R9pV; Sun, 19 Jun 2022 08:26:43 -0700 (PDT) Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id BBEEA16012F; Sun, 19 Jun 2022 08:26:43 -0700 (PDT) Message-ID: <d9b2c584-1d15-4275-c4cd-a04efbab5d29@HIDDEN> Date: Sun, 19 Jun 2022 10:26:43 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Content-Language: en-US References: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN> From: Paul Eggert <eggert@HIDDEN> In-Reply-To: <d9caf7c8-ccb2-b8a2-7158-e5071bb82f91@HIDDEN> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -3.3 (---) On 6/19/22 00:38, sur-behoffski wrote: > a\<b > [abc]\>[def] > Yes, and this comes up even in strict POSIX without GNU extensions, as "a^" is a valid extended regular expression that cannot match anything. I don't know whether dfa.c, regex.c, etc. optimize for this, but if not it would be nice if they did. For example, 'grep -E "a^" FOO' can silently exit with status 1 without even opening FOO.
Received: (at control) by debbugs.gnu.org; 2 Jul 2022 22:26:34 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Sat Jul 02 18:26:34 2022 Received: from localhost ([127.0.0.1]:43016 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1o7lZR-0002bO-R4 for submit <at> debbugs.gnu.org; Sat, 02 Jul 2022 18:26:33 -0400 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:35928) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <eggert@HIDDEN>) id 1o7lZP-0002bA-Es for control <at> debbugs.gnu.org; Sat, 02 Jul 2022 18:26:31 -0400 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 12DBB160143 for <control <at> debbugs.gnu.org>; Sat, 2 Jul 2022 15:26:26 -0700 (PDT) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 9dzAa2cUdWHo for <control <at> debbugs.gnu.org>; Sat, 2 Jul 2022 15:26:25 -0700 (PDT) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 567EE160145 for <control <at> debbugs.gnu.org>; Sat, 2 Jul 2022 15:26:25 -0700 (PDT) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id wzbBght_QXna for <control <at> debbugs.gnu.org>; Sat, 2 Jul 2022 15:26:25 -0700 (PDT) Received: from [192.168.0.205] (ip72-206-2-24.fv.ks.cox.net [72.206.2.24]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id F1E18160143 for <control <at> debbugs.gnu.org>; Sat, 2 Jul 2022 15:26:24 -0700 (PDT) Message-ID: <32055ca4-b721-1d5b-bc48-01d001b6283e@HIDDEN> Date: Sat, 2 Jul 2022 17:26:24 -0500 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.9.1 Content-Language: en-US To: control <at> debbugs.gnu.org From: Paul Eggert <eggert@HIDDEN> Subject: grep bug report maintenance Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: control X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -3.3 (---) severity 56079 wishlist close 26205 close 20990
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.