Paul Eggert <eggert@HIDDEN>
to control <at> debbugs.gnu.org
.
Full text available.Received: (at 26864) by debbugs.gnu.org; 30 Dec 2019 18:51:24 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Mon Dec 30 13:51:24 2019 Received: from localhost ([127.0.0.1]:33399 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1im08S-0008DL-II for submit <at> debbugs.gnu.org; Mon, 30 Dec 2019 13:51:24 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:53068) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <eggert@HIDDEN>) id 1im08R-0008D6-Om for 26864 <at> debbugs.gnu.org; Mon, 30 Dec 2019 13:51:24 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id D51361600CC; Mon, 30 Dec 2019 10:51:17 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id 8z5x63RC6NZ5; Mon, 30 Dec 2019 10:51:17 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 082181604F5; Mon, 30 Dec 2019 10:51:17 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id bxoPK1XLedma; Mon, 30 Dec 2019 10:51:16 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D19871600CC; Mon, 30 Dec 2019 10:51:16 -0800 (PST) Subject: Re: bug#26864: Clarification on obscure regular expressions mentioned in known bugs From: Paul Eggert <eggert@HIDDEN> To: Sundeep Agarwal <learnbyexample.net@HIDDEN> References: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN> <d5ec9605-2ec3-59dd-008b-0fa488108848@HIDDEN> Organization: UCLA Computer Science Department Message-ID: <00dd8490-be41-6337-299a-8d8edd37287f@HIDDEN> Date: Mon, 30 Dec 2019 10:51:16 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: <d5ec9605-2ec3-59dd-008b-0fa488108848@HIDDEN> Content-Type: multipart/mixed; boundary="------------FF71BBABC368369DBBDCFF4A" Content-Language: en-US X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 26864 Cc: 26864 <at> debbugs.gnu.org X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -3.3 (---) This is a multi-part message in MIME format. --------------FF71BBABC368369DBBDCFF4A Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit On further thought, we shouldn't be encouraging palindromish REs in the manual, so that naive users aren't drawn into this mess. So I installed the attached further patch to the documentation. Of course it would be better to fix the back-reference bugs but this is low priority and I doubt whether anybody has the time. --------------FF71BBABC368369DBBDCFF4A Content-Type: text/x-patch; charset=UTF-8; name="0001-doc-don-t-encourage-back-references.patch" Content-Disposition: attachment; filename="0001-doc-don-t-encourage-back-references.patch" Content-Transfer-Encoding: quoted-printable From 3e22ab3552fc2ebde6b99ceb2e46700bdbc55072 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@HIDDEN> Date: Mon, 30 Dec 2019 10:41:43 -0800 Subject: [PATCH] =3D?UTF-8?q?doc:=3D20don=3DE2=3D80=3D99t=3D20encourage=3D= 20back-referen?=3D =3D?UTF-8?q?ces?=3D MIME-Version: 1.0 Content-Type: text/plain; charset=3DUTF-8 Content-Transfer-Encoding: 8bit * doc/grep.texi (Usage): Remove palindrome question. Bondioni=E2=80=99s RE makes grep issue a =E2=80=98grep: stack overflow=E2=80=99 diagnostic, = and we shouldn=E2=80=99t be encouraging fancy back-references anyway, due to all the bugs in this area (Bug#26864). Plus, the allusion to =E2=80=9CGNU extensions=E2=80=9D doesn't seem to be correct here. --- doc/grep.texi | 25 ------------------------- 1 file changed, 25 deletions(-) diff --git a/doc/grep.texi b/doc/grep.texi index aceaf33..7c2b865 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -1800,31 +1800,6 @@ Use the special file name @samp{-}: cat /etc/passwd | grep 'alain' - /etc/motd @end example =20 -@item -@cindex palindromes -How to express palindromes in a regular expression? - -It can be done by using back-references; -for example, -a palindrome of 4 characters can be written with a BRE: - -@example -grep -w -e '\(.\)\(.\).\2\1' file -@end example - -It matches the word ``radar'' or ``civic.'' - -Guglielmo Bondioni proposed a single RE -that finds all palindromes up to 19 characters long -using @w{9 subexpressions} and @w{9 back-references}: - -@smallexample -grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' = file -@end smallexample - -Note this is done by using GNU ERE extensions; -it might not be portable to other implementations of @command{grep}. - @item Why is this back-reference failing? =20 --=20 2.17.1 --------------FF71BBABC368369DBBDCFF4A--
bug-grep@HIDDEN
:bug#26864
; Package grep
.
Full text available.Received: (at 26864) by debbugs.gnu.org; 30 Dec 2019 09:01:33 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Mon Dec 30 04:01:33 2019 Received: from localhost ([127.0.0.1]:60485 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1ilqvc-0006Pt-Jf for submit <at> debbugs.gnu.org; Mon, 30 Dec 2019 04:01:33 -0500 Received: from zimbra.cs.ucla.edu ([131.179.128.68]:50806) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <eggert@HIDDEN>) id 1ilqva-0006Pd-2P for 26864 <at> debbugs.gnu.org; Mon, 30 Dec 2019 04:01:31 -0500 Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id DD3491600CC; Mon, 30 Dec 2019 01:01:23 -0800 (PST) Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032) with ESMTP id GmP9jkzU3Imd; Mon, 30 Dec 2019 01:01:22 -0800 (PST) Received: from localhost (localhost [127.0.0.1]) by zimbra.cs.ucla.edu (Postfix) with ESMTP id 482D3160263; Mon, 30 Dec 2019 01:01:22 -0800 (PST) X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu Received: from zimbra.cs.ucla.edu ([127.0.0.1]) by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026) with ESMTP id 35T0GaF-tsSi; Mon, 30 Dec 2019 01:01:22 -0800 (PST) Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com [23.242.74.103]) by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 1074B1600CC; Mon, 30 Dec 2019 01:01:22 -0800 (PST) Subject: Re: bug#26864: Clarification on obscure regular expressions mentioned in known bugs To: Sundeep Agarwal <learnbyexample.net@HIDDEN> References: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN> From: Paul Eggert <eggert@HIDDEN> Organization: UCLA Computer Science Department Message-ID: <d5ec9605-2ec3-59dd-008b-0fa488108848@HIDDEN> Date: Mon, 30 Dec 2019 01:01:21 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2 MIME-Version: 1.0 In-Reply-To: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN> Content-Type: multipart/mixed; boundary="------------036640D1E10207FDC54E5662" Content-Language: en-US X-Spam-Score: -2.3 (--) X-Debbugs-Envelope-To: 26864 Cc: 26864 <at> debbugs.gnu.org X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -3.3 (---) This is a multi-part message in MIME format. --------------036640D1E10207FDC54E5662 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit On 5/10/17 1:48 AM, Sundeep Agarwal wrote: > my question is whether these regular expression examples come under > 'obscure regular expressions' mentioned in the man page. No, they're bugs in grep's regular expression implementation, which is taken from glibc. I filed a bug report against glibc here: https://sourceware.org/bugzilla/show_bug.cgi?id=25322 and installed the attached patches to GNU grep to try to document this mess better. The first patch is the important one; the other two merely standardize spelling and fix a typo in the first patch. Thanks for reporting the problem. --------------036640D1E10207FDC54E5662 Content-Type: text/x-patch; charset=UTF-8; name="0001-doc-mention-back-reference-bugs.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0001-doc-mention-back-reference-bugs.patch" From eac1e4d50a73bb33c35e5f9f95a201e22d295827 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@HIDDEN> Date: Mon, 30 Dec 2019 00:48:20 -0800 Subject: [PATCH 1/3] doc: mention back-reference bugs Inspired by Bug#26864. * doc/grep.texi (Known Bugs): New section. Mention back-reference issues. --- doc/grep.texi | 19 ++++++++++++++++++- 1 file changed, 18 insertions(+), 1 deletion(-) diff --git a/doc/grep.texi b/doc/grep.texi index 42832ab..8866ec4 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -1536,6 +1536,8 @@ When multiple regular expressions are given with @option{-e} or from a file (@samp{-f @var{file}}), back-references are local to each expression. +@xref{Known Bugs}, for some known problems with back-references. + @node Basic vs Extended @section Basic vs Extended Regular Expressions @cindex basic regular expressions @@ -1965,6 +1967,11 @@ GNU bug report logs for @command{grep}}. If you find a bug not listed there, please email it to @email{bug-grep@@gnu.org} to create a new bug report. +@menu +* Known Bugs:: +@end menu + +@node Known Bugs @section Known Bugs @cindex Bugs, known @@ -1974,7 +1981,17 @@ In addition, certain other obscure regular expressions require exponential time and space, and may cause @command{grep} to run out of memory. -Back-references are very slow, and may require exponential time. +Back-references can greatly slow down matching, as they can generate +exponentially many matching possibilities that can consume both time +and memory to explore. Also, the POSIX specification for +back-references is at times unclear. Furthermore, many regular +expression implementations have back-reference bugs that can cause +programs to return incorrect answers or even crash, and fixing these +bugs has often been low-priority---for example, as of 2019 the GNU C +library bug database contained back-reference bugs 52, 10844, 11053, +and 23522, with little sign of forthcoming fixes. Luckily, +back-references are rarely useful and it should be little trouble to +avoid them in practical applications. @node Copying -- 2.17.1 --------------036640D1E10207FDC54E5662 Content-Type: text/x-patch; charset=UTF-8; name="0002-doc-spell-back-reference-more-consistently.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0002-doc-spell-back-reference-more-consistently.patch" From bc5ac38040eb49c751e39700651f2b98cd866228 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@HIDDEN> Date: Mon, 30 Dec 2019 00:52:10 -0800 Subject: [PATCH 2/3] doc: spell "back-reference" more consistently --- NEWS | 12 ++++++------ README | 2 +- doc/grep.in.1 | 2 +- src/dfasearch.c | 12 ++++++------ tests/backref | 2 +- tests/pcre-wx-backref | 2 +- tests/tests | 4 ++-- 7 files changed, 18 insertions(+), 18 deletions(-) diff --git a/NEWS b/NEWS index 4f04ec3..e4083a1 100644 --- a/NEWS +++ b/NEWS @@ -22,7 +22,7 @@ GNU grep NEWS -*- outline -*- [Bug#37716 introduced in grep 3.2] A performance bug has been fixed when grep is given many patterns, - each with no backreference. + each with no back-reference. [Bug#33249 introduced in grep 2.5] A performance bug has been fixed for patterns like '01.2' that @@ -47,7 +47,7 @@ GNU grep NEWS -*- outline -*- the following would print nothing (it should print the input line): echo 123-x|LC_ALL=C grep '.\bx' Using a multibyte locale, using certain regexp constructs (some ranges, - backreferences), or forcing use of the PCRE matcher via --perl-regexp (-P) + back-references), or forcing use of the PCRE matcher via --perl-regexp (-P) would avoid the bug. [bug introduced in grep 3.2] @@ -240,7 +240,7 @@ GNU grep NEWS -*- outline -*- grep -z would match strings it should not. To trigger the bug, you'd have to use a regular expression including an anchor (^ or $) and a - feature like a range or a backreference, causing grep to forego its DFA + feature like a range or a back-reference, causing grep to forego its DFA matcher and resort to using re_search. With a multibyte locale, that matcher could mistakenly match a string containing a newline. For example, this command: @@ -473,7 +473,7 @@ GNU grep NEWS -*- outline -*- Previously it was unreliable, and sometimes crashed or looped. [bug introduced in grep-2.16] - grep -P now works with -w and -x and backreferences. Before, + grep -P now works with -w and -x and back-references. Before, echo aa|grep -Pw '(.)\1' would fail to match, yet echo aa|grep -Pw '(.)\2' would match. @@ -809,7 +809,7 @@ GNU grep NEWS -*- outline -*- X{0,0} is implemented correctly. It used to be a synonym of X{0,1}. [bug present since "the beginning"] - In multibyte locales, regular expressions including backreferences + In multibyte locales, regular expressions including back-references no longer exhibit quadratic complexity (i.e., they are orders of magnitude faster). [bug present since multi-byte character set support was introduced in 2.5.2] @@ -967,7 +967,7 @@ Version 2.5 - The new option --line-buffered fflush on everyline. There is a noticeable slow down when forcing line buffering. - - Back references are now local to the regex. + - Back-references are now local to the regex. grep -e '\(a\)\1' -e '\(b\)\1' The last backref \1 in the second expression refer to \(b\) diff --git a/README b/README index 0120973..2c15d9b 100644 --- a/README +++ b/README @@ -17,7 +17,7 @@ twice as fast as stock Unix egrep) hybridized with a Boyer-Moore-Gosper search for a fixed string that eliminates impossible text from being considered by the full regexp matcher without necessarily having to look at every character. The result is typically many times faster -than Unix grep or egrep. (Regular expressions containing backreferencing +than Unix grep or egrep. (Regular expressions containing back-references will run more slowly, however.) See the files AUTHORS and THANKS for a list of authors and other contributors. diff --git a/doc/grep.in.1 b/doc/grep.in.1 index a91b2a6..e095d5c 100644 --- a/doc/grep.in.1 +++ b/doc/grep.in.1 @@ -920,7 +920,7 @@ Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole expression may be enclosed in parentheses to override these precedence rules and form a subexpression. -.SS "Back References and Subexpressions" +.SS "Back-references and Subexpressions" The back-reference .BI \e n\c \&, where diff --git a/src/dfasearch.c b/src/dfasearch.c index b1a242a..0160d71 100644 --- a/src/dfasearch.c +++ b/src/dfasearch.c @@ -103,8 +103,8 @@ kwsmusts (struct dfa_comp *dc) dfamustfree (dm); } -/* Return true if KEYS, of length LEN, might contain a backreference. - Return false if KEYS cannot contain a backreference. +/* Return true if KEYS, of length LEN, might contain a back-reference. + Return false if KEYS cannot contain a back-reference. BS_SAFE is true of encodings where a backslash cannot appear as the last byte of a multibyte character. */ static bool _GL_ATTRIBUTE_PURE @@ -116,13 +116,13 @@ possible_backrefs_in_pattern (char const *keys, ptrdiff_t len, bool bs_safe) multibyte character. */ int second_backslash = bs_safe ? '\\' : CHAR_MAX + 1; - /* This code can return true even if KEYS lacks a backreference, for + /* This code can return true even if KEYS lacks a back-reference, for patterns like [\2], or for encodings where '\' appears as the last byte of a multibyte character. However, false alarms should be rare and do not affect correctness. */ /* Do not look for a backslash in the pattern's last byte, since it - can't be part of a backreference and this streamlines the code. */ + can't be part of a back-reference and this streamlines the code. */ len--; if (0 <= len) @@ -204,7 +204,7 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits) char const *prev = pattern; - /* Buffer containing backreference-free patterns. */ + /* Buffer containing back-reference-free patterns. */ char *buf = NULL; ptrdiff_t buflen = 0; size_t bufalloc = 0; @@ -449,7 +449,7 @@ EGexecute (void *vdc, char const *buf, size_t size, size_t *match_size, end = memchr (next_beg, eol, buflim - next_beg); end = end ? end + 1 : buflim; - /* Successful, no backreferences encountered! */ + /* Successful, no back-references encountered! */ if (!backref) goto success; ptr = beg; diff --git a/tests/backref b/tests/backref index eaa16cf..384b7ca 100755 --- a/tests/backref +++ b/tests/backref @@ -1,5 +1,5 @@ #! /bin/sh -# Test for backreferences and other things. +# Test for back-references and other things. # # Copyright (C) 2001, 2006, 2009-2019 Free Software Foundation, Inc. # diff --git a/tests/pcre-wx-backref b/tests/pcre-wx-backref index 63536f9..b63009a 100755 --- a/tests/pcre-wx-backref +++ b/tests/pcre-wx-backref @@ -1,5 +1,5 @@ #! /bin/sh -# Before grep-2.19, grep -P and -w/-x would not with a backreference. +# Before grep-2.19, grep -P and -w/-x would not work with a back-reference. # # Copyright (C) 2014-2019 Free Software Foundation, Inc. # diff --git a/tests/tests b/tests/tests index af9ae8a..f95c28a 100644 --- a/tests/tests +++ b/tests/tests @@ -152,7 +152,7 @@ a\\$ & a$ a\\$ & a\$ a\\$ & a\ a\ -# back references, ugh +# back-references, ugh ##a\(b\)\2c bC ESUBREG ##a\(b\1\)c bC ESUBREG a\(b*\)c\1d b abbcbbd abbcbbd bb @@ -166,7 +166,7 @@ a\(\([bc]\)\2\)*d b abbcbd a\(\(b\)*\2\)*d b abbbd abbbd # here is a case that no NFA implementation does right \(ab*\)[ab]*\1 b ababaaa ababaaa a -# check out normal matching in the presence of back refs +# check out normal matching in the presence of back-references \(a\)\1bcd b aabcd aabcd \(a\)\1bc*d b aabcd aabcd \(a\)\1bc*d b aabd aabd -- 2.17.1 --------------036640D1E10207FDC54E5662 Content-Type: text/x-patch; charset=UTF-8; name="0003-doc-fix-bug-typo.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="0003-doc-fix-bug-typo.patch" From 71635837d14c842ceb8a0c096b52656936ac4965 Mon Sep 17 00:00:00 2001 From: Paul Eggert <eggert@HIDDEN> Date: Mon, 30 Dec 2019 00:58:03 -0800 Subject: [PATCH 3/3] doc: fix bug# typo --- doc/grep.texi | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/grep.texi b/doc/grep.texi index 8866ec4..13b8df0 100644 --- a/doc/grep.texi +++ b/doc/grep.texi @@ -1989,7 +1989,7 @@ expression implementations have back-reference bugs that can cause programs to return incorrect answers or even crash, and fixing these bugs has often been low-priority---for example, as of 2019 the GNU C library bug database contained back-reference bugs 52, 10844, 11053, -and 23522, with little sign of forthcoming fixes. Luckily, +and 25322, with little sign of forthcoming fixes. Luckily, back-references are rarely useful and it should be little trouble to avoid them in practical applications. -- 2.17.1 --------------036640D1E10207FDC54E5662--
bug-grep@HIDDEN
:bug#26864
; Package grep
.
Full text available.Received: (at submit) by debbugs.gnu.org; 10 May 2017 15:30:46 +0000 From debbugs-submit-bounces <at> debbugs.gnu.org Wed May 10 11:30:46 2017 Received: from localhost ([127.0.0.1]:36898 helo=debbugs.gnu.org) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>) id 1d8TZd-0004UH-Ku for submit <at> debbugs.gnu.org; Wed, 10 May 2017 11:30:46 -0400 Received: from eggs.gnu.org ([208.118.235.92]:57237) by debbugs.gnu.org with esmtp (Exim 4.84_2) (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIe-0008Bh-Gp for submit <at> debbugs.gnu.org; Wed, 10 May 2017 04:48:49 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIX-0000yo-SO for submit <at> debbugs.gnu.org; Wed, 10 May 2017 04:48:43 -0400 X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org X-Spam-Level: X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM, HTML_MESSAGE,T_DKIM_INVALID autolearn=disabled version=3.3.2 Received: from lists.gnu.org ([2001:4830:134:3::11]:48355) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32) (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIX-0000yk-Ph for submit <at> debbugs.gnu.org; Wed, 10 May 2017 04:48:41 -0400 Received: from eggs.gnu.org ([2001:4830:134:3::10]:48202) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIW-0000fS-9e for bug-grep@HIDDEN; Wed, 10 May 2017 04:48:41 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIU-0000xh-Uk for bug-grep@HIDDEN; Wed, 10 May 2017 04:48:40 -0400 Received: from mail-io0-x236.google.com ([2607:f8b0:4001:c06::236]:34900) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIU-0000xM-OD for bug-grep@HIDDEN; Wed, 10 May 2017 04:48:38 -0400 Received: by mail-io0-x236.google.com with SMTP id f102so9032854ioi.2 for <bug-grep@HIDDEN>; Wed, 10 May 2017 01:48:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:from:date:message-id:subject:to; bh=g8rFM6AJFsFrND1KqT3vfY9hVySitjYb7SOXrO8QIMw=; b=edzuTLexr3/Sya9BujBdOH3uB+7fnHJgBMDvUlVWLUrwKEBIZK0EPdDlzApI7I1xdu FipFOtKG/HzXJBH6FVQvwYuf3vYUaTgPZnSMcRFJsKVPVkPdDOpihp0apbTQ8sJDzWod /MpP52LUFaE23AAIREthtPXdhO2nBZTDGVpDYSQ6eG9e0g4ye2QeKoc7c6uVFVEOIsg5 v980gIrqGgtirz3H/5QM1yRPgMA9DcnNI4M9dqvlV8LYSW0gxW+wu841ERiSln4+G/6v O15Vf0eIIht1Zhzu/9nysZyJS5QFLLfB2svPbpkRab27HeQ/OpWdxtUpYmZ4CvltxUne aSmw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:from:date:message-id:subject:to; bh=g8rFM6AJFsFrND1KqT3vfY9hVySitjYb7SOXrO8QIMw=; b=FhBqfj9LKsJWVbt0YOSDcAyRSu8hxyTlcwaxS97vfW9uWZZygOGk6kmdvBWjmYPjGg d/kczJyWfgNMfZGoKmYPNyc9IvFZVMyiTmUj1T/HaVe3BlnHWpdO+r4IU5P1MqOUAViI 440JdKt1NwGa9SiFIPOqNyEtlwvdJowzahK6Jt2+twKgjoP9Mev9pC7PCIAbZ2SkqeVA u9PEtoJP00n34CbQen6gqjEUF5zO9kxPpm10Zl8WjYBZNYyp8ofjYRGM+4bT3bVKB3jL +HFhwtsdb/CvjNJYB/UHBKHCKJs+2BHuNvxnUE6qghwZyXaaK7GXHznP3iBD8FBQnG7p EuGg== X-Gm-Message-State: AODbwcANAwM3CpvwWbD5Nptu0hWQcXHXdMNXoliC8HLJq8b6lwYg3aEW ZKPgIAmsfg/kPXwKrrBxxEjF5En+IapK X-Received: by 10.107.10.223 with SMTP id 92mr2677407iok.108.1494406117424; Wed, 10 May 2017 01:48:37 -0700 (PDT) MIME-Version: 1.0 Received: by 10.107.140.205 with HTTP; Wed, 10 May 2017 01:48:37 -0700 (PDT) From: Sundeep Agarwal <learnbyexample.net@HIDDEN> Date: Wed, 10 May 2017 14:18:37 +0530 Message-ID: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN> Subject: Clarification on obscure regular expressions mentioned in known bugs To: bug-grep@HIDDEN Content-Type: multipart/alternative; boundary=001a113f7bb0c7831b054f2789ad X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x X-Received-From: 2001:4830:134:3::11 X-Spam-Score: -4.0 (----) X-Debbugs-Envelope-To: submit X-Mailman-Approved-At: Wed, 10 May 2017 11:30:44 -0400 X-BeenThere: debbugs-submit <at> debbugs.gnu.org X-Mailman-Version: 2.1.18 Precedence: list List-Id: <debbugs-submit.debbugs.gnu.org> List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe> List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/> List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org> List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help> List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe> Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org> X-Spam-Score: -4.0 (----) --001a113f7bb0c7831b054f2789ad Content-Type: text/plain; charset=UTF-8 Hello, From the man page, version 'grep (GNU grep) 2.25' -------------------------------- Known Bugs Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory. Back-references are very slow, and may require exponential time. -------------------------------- I was trying a regular expression to find words from dictionary that have two different instances of repeated letters, for example the words: misspellings, chilliness, woodcutter etc $ # gives no output $ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words $ # works as expected with PCRE $ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words Abbott Annabelle Annette Appaloosa Appleseed I asked regarding this on https://stackoverflow.com/questions/43572924/ere-adding-quantifier-to-group-with-inner-group-and-back-reference and other forums. It helped to identify some more cases like $ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}' aazbbycc $ # no output $ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}' and $ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,3}){3}' aazbbycc $ # no output $ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,4}){3}' $ # seems dependent on character class clashing with back reference characters and quantifier count $ # a, b and c are the characters matching back reference $ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){2}' aazbbycc $ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){3}' $ echo 'aazbbycc' | grep -E '(([a-z])\2[abyz]{0,4}){3}' $ echo 'aazbbycc' | grep -E '(([a-z])\2[byz]{0,4}){3}' $ echo 'aazbbycc' | grep -E '(([a-z])\2[acyz]{0,4}){3}' aazbbycc $ echo 'aazbbycc' | grep -E '(([a-z])\2[ayz]{0,4}){3}' aazbbycc $ echo 'aazbbycc' | grep -E '(([a-z])\2[cyz]{0,4}){3}' aazbbycc $ echo 'aazbbycc' | grep -E '(([a-z])\2[yz]{0,4}){3}' aazbbycc The same behavior is seen with 'sed (GNU sed) 4.2.2' as well. For ex: $ echo 'aazbbycc' | sed -nE '/(([a-z])\2[a-z]{0,3}){3}/p' aazbbycc $ # no output $ echo 'aazbbycc' | sed -nE '/(([a-z])\2[a-z]{0,4}){3}/p' So, my question is whether these regular expression examples come under 'obscure regular expressions' mentioned in the man page. If so, I feel there should be an error message displayed instead of no output Regards, Sundeep --001a113f7bb0c7831b054f2789ad Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable <div dir=3D"ltr">Hello,<div><br></div><div>From the man page, version '= grep (GNU grep) 2.25'</div><div><br></div><div>------------------------= --------</div><div><div>=C2=A0 =C2=A0Known Bugs</div><div>=C2=A0 =C2=A0 =C2= =A0 =C2=A0Large repetition counts in the {n,m} construct may cause grep to = use lots of =C2=A0memory. =C2=A0 In =C2=A0addition, =C2=A0certain other obs= cure regular expressions require exponential time and space, and may cause = grep to run out of memory.</div><div>=C2=A0 =C2=A0 =C2=A0 =C2=A0Back-refere= nces are very slow, and may require exponential time.</div></div><div><div>= --------------------------------</div></div><div><br></div><div>I was tryin= g a regular expression to find words from dictionary that have two differen= t instances of repeated letters, for example the words: misspellings, chill= iness,=C2=A0woodcutter etc</div><div><br></div><div>$ # gives no output</di= v><div><div>$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share= /dict/words</div></div><div><br></div><div>$ # works as expected with PCRE<= /div><div><div>$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/sh= are/dict/words</div><div>Abbott</div><div>Annabelle</div><div>Annette</div>= <div>Appaloosa</div><div>Appleseed</div></div><div><br></div><div><br></div= ><div>I asked regarding this on <a href=3D"https://stackoverflow.com/questi= ons/43572924/ere-adding-quantifier-to-group-with-inner-group-and-back-refer= ence">https://stackoverflow.com/questions/43572924/ere-adding-quantifier-to= -group-with-inner-group-and-back-reference</a> and other forums. It helped = to identify some more cases like</div><div><br></div><div>$ echo 'aazbb= ycc' | grep -E '(([a-z])\2[a-z]*){2}'<br></div><div><div>aazbby= cc</div><div>$ # no output</div><div>$ echo 'aazbbycc' | grep -E &#= 39;(([a-z])\2[a-z]*){3}'</div></div><div><br></div><div>and</div><div><= br></div><div><div>$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]= {0,3}){3}'</div><div>aazbbycc</div><div>$ # no output</div><div>$ echo = 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,4}){3}'</div></div>= <div><br></div><div>$ # seems dependent on character class clashing with ba= ck reference characters and quantifier count</div><div>$ # a, b and c are t= he characters matching back reference</div><div><div>$ echo 'aazbbycc&#= 39; | grep -E '(([a-z])\2[abcyz]{0,4}){2}'</div><div>aazbbycc</div>= <div>$ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){3}= 9;</div><div>$ echo 'aazbbycc' | grep -E '(([a-z])\2[abyz]{0,4}= ){3}'</div><div>$ echo 'aazbbycc' | grep -E '(([a-z])\2[byz= ]{0,4}){3}'</div><div>$ echo 'aazbbycc' | grep -E '(([a-z])= \2[acyz]{0,4}){3}'</div><div>aazbbycc</div><div>$ echo 'aazbbycc= 9; | grep -E '(([a-z])\2[ayz]{0,4}){3}'</div><div>aazbbycc</div><di= v>$ echo 'aazbbycc' | grep -E '(([a-z])\2[cyz]{0,4}){3}'</d= iv><div>aazbbycc</div><div>$ echo 'aazbbycc' | grep -E '(([a-z]= )\2[yz]{0,4}){3}'</div><div>aazbbycc</div></div><div><br></div><div><di= v>The same behavior is seen with 'sed (GNU sed) 4.2.2'=C2=A0as well= . For ex:<br></div></div><div><br></div><div>$ echo 'aazbbycc' | se= d -nE '/(([a-z])\2[a-z]{0,3}){3}/p'</div><div>aazbbycc</div><div>$ = # no output</div><div><div>$ echo 'aazbbycc' | sed -nE '/(([a-z= ])\2[a-z]{0,4}){3}/p'</div></div><div><br></div><div><br></div><div>So,= my question is whether these regular expression examples come under 'o= bscure regular expressions' mentioned in the man page. If so, I feel th= ere should be an error message displayed instead of no output</div><div><br= ></div><div>Regards,</div><div>Sundeep</div><div><br></div><div><br></div><= /div> --001a113f7bb0c7831b054f2789ad--
Sundeep Agarwal <learnbyexample.net@HIDDEN>
:bug-grep@HIDDEN
.
Full text available.bug-grep@HIDDEN
:bug#26864
; Package grep
.
Full text available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.