GNU bug report logs - #26864
Clarification on obscure regular expressions mentioned in known bugs

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: grep; Reported by: Sundeep Agarwal <learnbyexample.net@HIDDEN>; merged with #36148; dated Wed, 10 May 2017 15:31:02 UTC; Maintainer for grep is bug-grep@HIDDEN.
Merged 26864 36148. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 26864 <at> debbugs.gnu.org:


Received: (at 26864) by debbugs.gnu.org; 30 Dec 2019 18:51:24 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Dec 30 13:51:24 2019
Received: from localhost ([127.0.0.1]:33399 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1im08S-0008DL-II
	for submit <at> debbugs.gnu.org; Mon, 30 Dec 2019 13:51:24 -0500
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:53068)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1im08R-0008D6-Om
 for 26864 <at> debbugs.gnu.org; Mon, 30 Dec 2019 13:51:24 -0500
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id D51361600CC;
 Mon, 30 Dec 2019 10:51:17 -0800 (PST)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id 8z5x63RC6NZ5; Mon, 30 Dec 2019 10:51:17 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 082181604F5;
 Mon, 30 Dec 2019 10:51:17 -0800 (PST)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id bxoPK1XLedma; Mon, 30 Dec 2019 10:51:16 -0800 (PST)
Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
 [23.242.74.103])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id D19871600CC;
 Mon, 30 Dec 2019 10:51:16 -0800 (PST)
Subject: Re: bug#26864: Clarification on obscure regular expressions mentioned
 in known bugs
From: Paul Eggert <eggert@HIDDEN>
To: Sundeep Agarwal <learnbyexample.net@HIDDEN>
References: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN>
 <d5ec9605-2ec3-59dd-008b-0fa488108848@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <00dd8490-be41-6337-299a-8d8edd37287f@HIDDEN>
Date: Mon, 30 Dec 2019 10:51:16 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
In-Reply-To: <d5ec9605-2ec3-59dd-008b-0fa488108848@HIDDEN>
Content-Type: multipart/mixed; boundary="------------FF71BBABC368369DBBDCFF4A"
Content-Language: en-US
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 26864
Cc: 26864 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

This is a multi-part message in MIME format.
--------------FF71BBABC368369DBBDCFF4A
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

On further thought, we shouldn't be encouraging palindromish REs in the manual,
so that naive users aren't drawn into this mess. So I installed the attached
further patch to the documentation.

Of course it would be better to fix the back-reference bugs but this is low
priority and I doubt whether anybody has the time.

--------------FF71BBABC368369DBBDCFF4A
Content-Type: text/x-patch; charset=UTF-8;
 name="0001-doc-don-t-encourage-back-references.patch"
Content-Disposition: attachment;
 filename="0001-doc-don-t-encourage-back-references.patch"
Content-Transfer-Encoding: quoted-printable

From 3e22ab3552fc2ebde6b99ceb2e46700bdbc55072 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 30 Dec 2019 10:41:43 -0800
Subject: [PATCH] =3D?UTF-8?q?doc:=3D20don=3DE2=3D80=3D99t=3D20encourage=3D=
20back-referen?=3D
 =3D?UTF-8?q?ces?=3D
MIME-Version: 1.0
Content-Type: text/plain; charset=3DUTF-8
Content-Transfer-Encoding: 8bit

* doc/grep.texi (Usage): Remove palindrome question.  Bondioni=E2=80=99s
RE makes grep issue a =E2=80=98grep: stack overflow=E2=80=99 diagnostic, =
and we
shouldn=E2=80=99t be encouraging fancy back-references anyway, due to all
the bugs in this area (Bug#26864).  Plus, the allusion to
=E2=80=9CGNU extensions=E2=80=9D doesn't seem to be correct here.
---
 doc/grep.texi | 25 -------------------------
 1 file changed, 25 deletions(-)

diff --git a/doc/grep.texi b/doc/grep.texi
index aceaf33..7c2b865 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1800,31 +1800,6 @@ Use the special file name @samp{-}:
 cat /etc/passwd | grep 'alain' - /etc/motd
 @end example
=20
-@item
-@cindex palindromes
-How to express palindromes in a regular expression?
-
-It can be done by using back-references;
-for example,
-a palindrome of 4 characters can be written with a BRE:
-
-@example
-grep -w -e '\(.\)\(.\).\2\1' file
-@end example
-
-It matches the word ``radar'' or ``civic.''
-
-Guglielmo Bondioni proposed a single RE
-that finds all palindromes up to 19 characters long
-using @w{9 subexpressions} and @w{9 back-references}:
-
-@smallexample
-grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' =
file
-@end smallexample
-
-Note this is done by using GNU ERE extensions;
-it might not be portable to other implementations of @command{grep}.
-
 @item
 Why is this back-reference failing?
=20
--=20
2.17.1


--------------FF71BBABC368369DBBDCFF4A--




Information forwarded to bug-grep@HIDDEN:
bug#26864; Package grep. Full text available.

Message received at 26864 <at> debbugs.gnu.org:


Received: (at 26864) by debbugs.gnu.org; 30 Dec 2019 09:01:33 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Dec 30 04:01:33 2019
Received: from localhost ([127.0.0.1]:60485 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ilqvc-0006Pt-Jf
	for submit <at> debbugs.gnu.org; Mon, 30 Dec 2019 04:01:33 -0500
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:50806)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1ilqva-0006Pd-2P
 for 26864 <at> debbugs.gnu.org; Mon, 30 Dec 2019 04:01:31 -0500
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id DD3491600CC;
 Mon, 30 Dec 2019 01:01:23 -0800 (PST)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id GmP9jkzU3Imd; Mon, 30 Dec 2019 01:01:22 -0800 (PST)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 482D3160263;
 Mon, 30 Dec 2019 01:01:22 -0800 (PST)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id 35T0GaF-tsSi; Mon, 30 Dec 2019 01:01:22 -0800 (PST)
Received: from [192.168.1.9] (cpe-23-242-74-103.socal.res.rr.com
 [23.242.74.103])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 1074B1600CC;
 Mon, 30 Dec 2019 01:01:22 -0800 (PST)
Subject: Re: bug#26864: Clarification on obscure regular expressions mentioned
 in known bugs
To: Sundeep Agarwal <learnbyexample.net@HIDDEN>
References: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
Message-ID: <d5ec9605-2ec3-59dd-008b-0fa488108848@HIDDEN>
Date: Mon, 30 Dec 2019 01:01:21 -0800
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
 Thunderbird/68.2.2
MIME-Version: 1.0
In-Reply-To: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN>
Content-Type: multipart/mixed; boundary="------------036640D1E10207FDC54E5662"
Content-Language: en-US
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 26864
Cc: 26864 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)

This is a multi-part message in MIME format.
--------------036640D1E10207FDC54E5662
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

On 5/10/17 1:48 AM, Sundeep Agarwal wrote:

> my question is whether these regular expression examples come under
> 'obscure regular expressions' mentioned in the man page.

No, they're bugs in grep's regular expression implementation, which is taken
from glibc. I filed a bug report against glibc here:

https://sourceware.org/bugzilla/show_bug.cgi?id=25322

and installed the attached patches to GNU grep to try to document this mess
better. The first patch is the important one; the other two merely standardize
spelling and fix a typo in the first patch. Thanks for reporting the problem.

--------------036640D1E10207FDC54E5662
Content-Type: text/x-patch; charset=UTF-8;
 name="0001-doc-mention-back-reference-bugs.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="0001-doc-mention-back-reference-bugs.patch"

From eac1e4d50a73bb33c35e5f9f95a201e22d295827 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 30 Dec 2019 00:48:20 -0800
Subject: [PATCH 1/3] doc: mention back-reference bugs

Inspired by Bug#26864.
* doc/grep.texi (Known Bugs): New section.
Mention back-reference issues.
---
 doc/grep.texi | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/doc/grep.texi b/doc/grep.texi
index 42832ab..8866ec4 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1536,6 +1536,8 @@ When multiple regular expressions are given with
 @option{-e} or from a file (@samp{-f @var{file}}),
 back-references are local to each expression.
 
+@xref{Known Bugs}, for some known problems with back-references.
+
 @node Basic vs Extended
 @section Basic vs Extended Regular Expressions
 @cindex basic regular expressions
@@ -1965,6 +1967,11 @@ GNU bug report logs for @command{grep}}.
 If you find a bug not listed there, please email it to
 @email{bug-grep@@gnu.org} to create a new bug report.
 
+@menu
+* Known Bugs::
+@end menu
+
+@node Known Bugs
 @section Known Bugs
 @cindex Bugs, known
 
@@ -1974,7 +1981,17 @@ In addition, certain other
 obscure regular expressions require exponential time and
 space, and may cause @command{grep} to run out of memory.
 
-Back-references are very slow, and may require exponential time.
+Back-references can greatly slow down matching, as they can generate
+exponentially many matching possibilities that can consume both time
+and memory to explore.  Also, the POSIX specification for
+back-references is at times unclear.  Furthermore, many regular
+expression implementations have back-reference bugs that can cause
+programs to return incorrect answers or even crash, and fixing these
+bugs has often been low-priority---for example, as of 2019 the GNU C
+library bug database contained back-reference bugs 52, 10844, 11053,
+and 23522, with little sign of forthcoming fixes.  Luckily,
+back-references are rarely useful and it should be little trouble to
+avoid them in practical applications.
 
 
 @node Copying
-- 
2.17.1


--------------036640D1E10207FDC54E5662
Content-Type: text/x-patch; charset=UTF-8;
 name="0002-doc-spell-back-reference-more-consistently.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="0002-doc-spell-back-reference-more-consistently.patch"

From bc5ac38040eb49c751e39700651f2b98cd866228 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 30 Dec 2019 00:52:10 -0800
Subject: [PATCH 2/3] doc: spell "back-reference" more consistently

---
 NEWS                  | 12 ++++++------
 README                |  2 +-
 doc/grep.in.1         |  2 +-
 src/dfasearch.c       | 12 ++++++------
 tests/backref         |  2 +-
 tests/pcre-wx-backref |  2 +-
 tests/tests           |  4 ++--
 7 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/NEWS b/NEWS
index 4f04ec3..e4083a1 100644
--- a/NEWS
+++ b/NEWS
@@ -22,7 +22,7 @@ GNU grep NEWS                                    -*- outline -*-
   [Bug#37716 introduced in grep 3.2]
 
   A performance bug has been fixed when grep is given many patterns,
-  each with no backreference.
+  each with no back-reference.
   [Bug#33249 introduced in grep 2.5]
 
   A performance bug has been fixed for patterns like '01.2' that
@@ -47,7 +47,7 @@ GNU grep NEWS                                    -*- outline -*-
   the following would print nothing (it should print the input line):
     echo 123-x|LC_ALL=C grep '.\bx'
   Using a multibyte locale, using certain regexp constructs (some ranges,
-  backreferences), or forcing use of the PCRE matcher via --perl-regexp (-P)
+  back-references), or forcing use of the PCRE matcher via --perl-regexp (-P)
   would avoid the bug.
   [bug introduced in grep 3.2]
 
@@ -240,7 +240,7 @@ GNU grep NEWS                                    -*- outline -*-
 
   grep -z would match strings it should not.  To trigger the bug, you'd
   have to use a regular expression including an anchor (^ or $) and a
-  feature like a range or a backreference, causing grep to forego its DFA
+  feature like a range or a back-reference, causing grep to forego its DFA
   matcher and resort to using re_search.  With a multibyte locale, that
   matcher could mistakenly match a string containing a newline.
   For example, this command:
@@ -473,7 +473,7 @@ GNU grep NEWS                                    -*- outline -*-
   Previously it was unreliable, and sometimes crashed or looped.
   [bug introduced in grep-2.16]
 
-  grep -P now works with -w and -x and backreferences. Before,
+  grep -P now works with -w and -x and back-references. Before,
   echo aa|grep -Pw '(.)\1' would fail to match, yet
   echo aa|grep -Pw '(.)\2' would match.
 
@@ -809,7 +809,7 @@ GNU grep NEWS                                    -*- outline -*-
   X{0,0} is implemented correctly.  It used to be a synonym of X{0,1}.
   [bug present since "the beginning"]
 
-  In multibyte locales, regular expressions including backreferences
+  In multibyte locales, regular expressions including back-references
   no longer exhibit quadratic complexity (i.e., they are orders
   of magnitude faster). [bug present since multi-byte character set
   support was introduced in 2.5.2]
@@ -967,7 +967,7 @@ Version 2.5
   - The new option --line-buffered fflush on everyline.  There is a noticeable
     slow down when forcing line buffering.
 
-  - Back references  are now local to the regex.
+  - Back-references are now local to the regex.
     grep -e '\(a\)\1' -e '\(b\)\1'
     The last backref \1 in the second expression refer to \(b\)
 
diff --git a/README b/README
index 0120973..2c15d9b 100644
--- a/README
+++ b/README
@@ -17,7 +17,7 @@ twice as fast as stock Unix egrep) hybridized with a Boyer-Moore-Gosper
 search for a fixed string that eliminates impossible text from being
 considered by the full regexp matcher without necessarily having to
 look at every character.  The result is typically many times faster
-than Unix grep or egrep.  (Regular expressions containing backreferencing
+than Unix grep or egrep.  (Regular expressions containing back-references
 will run more slowly, however.)
 
 See the files AUTHORS and THANKS for a list of authors and other contributors.
diff --git a/doc/grep.in.1 b/doc/grep.in.1
index a91b2a6..e095d5c 100644
--- a/doc/grep.in.1
+++ b/doc/grep.in.1
@@ -920,7 +920,7 @@ Repetition takes precedence over concatenation, which in turn
 takes precedence over alternation.
 A whole expression may be enclosed in parentheses
 to override these precedence rules and form a subexpression.
-.SS "Back References and Subexpressions"
+.SS "Back-references and Subexpressions"
 The back-reference
 .BI \e n\c
 \&, where
diff --git a/src/dfasearch.c b/src/dfasearch.c
index b1a242a..0160d71 100644
--- a/src/dfasearch.c
+++ b/src/dfasearch.c
@@ -103,8 +103,8 @@ kwsmusts (struct dfa_comp *dc)
   dfamustfree (dm);
 }
 
-/* Return true if KEYS, of length LEN, might contain a backreference.
-   Return false if KEYS cannot contain a backreference.
+/* Return true if KEYS, of length LEN, might contain a back-reference.
+   Return false if KEYS cannot contain a back-reference.
    BS_SAFE is true of encodings where a backslash cannot appear as the
    last byte of a multibyte character.  */
 static bool _GL_ATTRIBUTE_PURE
@@ -116,13 +116,13 @@ possible_backrefs_in_pattern (char const *keys, ptrdiff_t len, bool bs_safe)
      multibyte character.  */
   int second_backslash = bs_safe ? '\\' : CHAR_MAX + 1;
 
-  /* This code can return true even if KEYS lacks a backreference, for
+  /* This code can return true even if KEYS lacks a back-reference, for
      patterns like [\2], or for encodings where '\' appears as the last
      byte of a multibyte character.  However, false alarms should be
      rare and do not affect correctness.  */
 
   /* Do not look for a backslash in the pattern's last byte, since it
-     can't be part of a backreference and this streamlines the code.  */
+     can't be part of a back-reference and this streamlines the code.  */
   len--;
 
   if (0 <= len)
@@ -204,7 +204,7 @@ GEAcompile (char *pattern, size_t size, reg_syntax_t syntax_bits)
 
   char const *prev = pattern;
 
-  /* Buffer containing backreference-free patterns.  */
+  /* Buffer containing back-reference-free patterns.  */
   char *buf = NULL;
   ptrdiff_t buflen = 0;
   size_t bufalloc = 0;
@@ -449,7 +449,7 @@ EGexecute (void *vdc, char const *buf, size_t size, size_t *match_size,
           end = memchr (next_beg, eol, buflim - next_beg);
           end = end ? end + 1 : buflim;
 
-          /* Successful, no backreferences encountered! */
+          /* Successful, no back-references encountered! */
           if (!backref)
             goto success;
           ptr = beg;
diff --git a/tests/backref b/tests/backref
index eaa16cf..384b7ca 100755
--- a/tests/backref
+++ b/tests/backref
@@ -1,5 +1,5 @@
 #! /bin/sh
-# Test for backreferences and other things.
+# Test for back-references and other things.
 #
 # Copyright (C) 2001, 2006, 2009-2019 Free Software Foundation, Inc.
 #
diff --git a/tests/pcre-wx-backref b/tests/pcre-wx-backref
index 63536f9..b63009a 100755
--- a/tests/pcre-wx-backref
+++ b/tests/pcre-wx-backref
@@ -1,5 +1,5 @@
 #! /bin/sh
-# Before grep-2.19, grep -P and -w/-x would not with a backreference.
+# Before grep-2.19, grep -P and -w/-x would not work with a back-reference.
 #
 # Copyright (C) 2014-2019 Free Software Foundation, Inc.
 #
diff --git a/tests/tests b/tests/tests
index af9ae8a..f95c28a 100644
--- a/tests/tests
+++ b/tests/tests
@@ -152,7 +152,7 @@ a\\$		&	a$
 a\\$		&	a\$
 a\\$		&	a\	a\
 
-# back references, ugh
+# back-references, ugh
 ##a\(b\)\2c	bC	ESUBREG
 ##a\(b\1\)c	bC	ESUBREG
 a\(b*\)c\1d	b	abbcbbd	abbcbbd	bb
@@ -166,7 +166,7 @@ a\(\([bc]\)\2\)*d	b	abbcbd
 a\(\(b\)*\2\)*d		b	abbbd	abbbd
 # here is a case that no NFA implementation does right
 \(ab*\)[ab]*\1	b	ababaaa	ababaaa	a
-# check out normal matching in the presence of back refs
+# check out normal matching in the presence of back-references
 \(a\)\1bcd	b	aabcd	aabcd
 \(a\)\1bc*d	b	aabcd	aabcd
 \(a\)\1bc*d	b	aabd	aabd
-- 
2.17.1


--------------036640D1E10207FDC54E5662
Content-Type: text/x-patch; charset=UTF-8;
 name="0003-doc-fix-bug-typo.patch"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="0003-doc-fix-bug-typo.patch"

From 71635837d14c842ceb8a0c096b52656936ac4965 Mon Sep 17 00:00:00 2001
From: Paul Eggert <eggert@HIDDEN>
Date: Mon, 30 Dec 2019 00:58:03 -0800
Subject: [PATCH 3/3] doc: fix bug# typo

---
 doc/grep.texi | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/grep.texi b/doc/grep.texi
index 8866ec4..13b8df0 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -1989,7 +1989,7 @@ expression implementations have back-reference bugs that can cause
 programs to return incorrect answers or even crash, and fixing these
 bugs has often been low-priority---for example, as of 2019 the GNU C
 library bug database contained back-reference bugs 52, 10844, 11053,
-and 23522, with little sign of forthcoming fixes.  Luckily,
+and 25322, with little sign of forthcoming fixes.  Luckily,
 back-references are rarely useful and it should be little trouble to
 avoid them in practical applications.
 
-- 
2.17.1


--------------036640D1E10207FDC54E5662--




Information forwarded to bug-grep@HIDDEN:
bug#26864; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 10 May 2017 15:30:46 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed May 10 11:30:46 2017
Received: from localhost ([127.0.0.1]:36898 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1d8TZd-0004UH-Ku
	for submit <at> debbugs.gnu.org; Wed, 10 May 2017 11:30:46 -0400
Received: from eggs.gnu.org ([208.118.235.92]:57237)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIe-0008Bh-Gp
 for submit <at> debbugs.gnu.org; Wed, 10 May 2017 04:48:49 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIX-0000yo-SO
 for submit <at> debbugs.gnu.org; Wed, 10 May 2017 04:48:43 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 HTML_MESSAGE,T_DKIM_INVALID autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:48355)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>)
 id 1d8NIX-0000yk-Ph
 for submit <at> debbugs.gnu.org; Wed, 10 May 2017 04:48:41 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:48202)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIW-0000fS-9e
 for bug-grep@HIDDEN; Wed, 10 May 2017 04:48:41 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <learnbyexample.net@HIDDEN>) id 1d8NIU-0000xh-Uk
 for bug-grep@HIDDEN; Wed, 10 May 2017 04:48:40 -0400
Received: from mail-io0-x236.google.com ([2607:f8b0:4001:c06::236]:34900)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16)
 (Exim 4.71) (envelope-from <learnbyexample.net@HIDDEN>)
 id 1d8NIU-0000xM-OD
 for bug-grep@HIDDEN; Wed, 10 May 2017 04:48:38 -0400
Received: by mail-io0-x236.google.com with SMTP id f102so9032854ioi.2
 for <bug-grep@HIDDEN>; Wed, 10 May 2017 01:48:38 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025;
 h=mime-version:from:date:message-id:subject:to;
 bh=g8rFM6AJFsFrND1KqT3vfY9hVySitjYb7SOXrO8QIMw=;
 b=edzuTLexr3/Sya9BujBdOH3uB+7fnHJgBMDvUlVWLUrwKEBIZK0EPdDlzApI7I1xdu
 FipFOtKG/HzXJBH6FVQvwYuf3vYUaTgPZnSMcRFJsKVPVkPdDOpihp0apbTQ8sJDzWod
 /MpP52LUFaE23AAIREthtPXdhO2nBZTDGVpDYSQ6eG9e0g4ye2QeKoc7c6uVFVEOIsg5
 v980gIrqGgtirz3H/5QM1yRPgMA9DcnNI4M9dqvlV8LYSW0gxW+wu841ERiSln4+G/6v
 O15Vf0eIIht1Zhzu/9nysZyJS5QFLLfB2svPbpkRab27HeQ/OpWdxtUpYmZ4CvltxUne
 aSmw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20161025;
 h=x-gm-message-state:mime-version:from:date:message-id:subject:to;
 bh=g8rFM6AJFsFrND1KqT3vfY9hVySitjYb7SOXrO8QIMw=;
 b=FhBqfj9LKsJWVbt0YOSDcAyRSu8hxyTlcwaxS97vfW9uWZZygOGk6kmdvBWjmYPjGg
 d/kczJyWfgNMfZGoKmYPNyc9IvFZVMyiTmUj1T/HaVe3BlnHWpdO+r4IU5P1MqOUAViI
 440JdKt1NwGa9SiFIPOqNyEtlwvdJowzahK6Jt2+twKgjoP9Mev9pC7PCIAbZ2SkqeVA
 u9PEtoJP00n34CbQen6gqjEUF5zO9kxPpm10Zl8WjYBZNYyp8ofjYRGM+4bT3bVKB3jL
 +HFhwtsdb/CvjNJYB/UHBKHCKJs+2BHuNvxnUE6qghwZyXaaK7GXHznP3iBD8FBQnG7p
 EuGg==
X-Gm-Message-State: AODbwcANAwM3CpvwWbD5Nptu0hWQcXHXdMNXoliC8HLJq8b6lwYg3aEW
 ZKPgIAmsfg/kPXwKrrBxxEjF5En+IapK
X-Received: by 10.107.10.223 with SMTP id 92mr2677407iok.108.1494406117424;
 Wed, 10 May 2017 01:48:37 -0700 (PDT)
MIME-Version: 1.0
Received: by 10.107.140.205 with HTTP; Wed, 10 May 2017 01:48:37 -0700 (PDT)
From: Sundeep Agarwal <learnbyexample.net@HIDDEN>
Date: Wed, 10 May 2017 14:18:37 +0530
Message-ID: <CANVHbZoPcSC7yUYtXzzOrp62EyPXyqG6fqEqXeswCmJ5ONuVbg@HIDDEN>
Subject: Clarification on obscure regular expressions mentioned in known bugs
To: bug-grep@HIDDEN
Content-Type: multipart/alternative; boundary=001a113f7bb0c7831b054f2789ad
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Wed, 10 May 2017 11:30:44 -0400
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.0 (----)

--001a113f7bb0c7831b054f2789ad
Content-Type: text/plain; charset=UTF-8

Hello,

From the man page, version 'grep (GNU grep) 2.25'

--------------------------------
   Known Bugs
       Large repetition counts in the {n,m} construct may cause grep to use
lots of  memory.   In  addition,  certain other obscure regular expressions
require exponential time and space, and may cause grep to run out of memory.
       Back-references are very slow, and may require exponential time.
--------------------------------

I was trying a regular expression to find words from dictionary that have
two different instances of repeated letters, for example the words:
misspellings, chilliness, woodcutter etc

$ # gives no output
$ grep -m5 -xiE '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words

$ # works as expected with PCRE
$ grep -m5 -xiP '([a-z]*([a-z])\2[a-z]*){2}' /usr/share/dict/words
Abbott
Annabelle
Annette
Appaloosa
Appleseed


I asked regarding this on
https://stackoverflow.com/questions/43572924/ere-adding-quantifier-to-group-with-inner-group-and-back-reference
and other forums. It helped to identify some more cases like

$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){2}'
aazbbycc
$ # no output
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]*){3}'

and

$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,3}){3}'
aazbbycc
$ # no output
$ echo 'aazbbycc' | grep -E '(([a-z])\2[a-z]{0,4}){3}'

$ # seems dependent on character class clashing with back reference
characters and quantifier count
$ # a, b and c are the characters matching back reference
$ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){2}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[abcyz]{0,4}){3}'
$ echo 'aazbbycc' | grep -E '(([a-z])\2[abyz]{0,4}){3}'
$ echo 'aazbbycc' | grep -E '(([a-z])\2[byz]{0,4}){3}'
$ echo 'aazbbycc' | grep -E '(([a-z])\2[acyz]{0,4}){3}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[ayz]{0,4}){3}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[cyz]{0,4}){3}'
aazbbycc
$ echo 'aazbbycc' | grep -E '(([a-z])\2[yz]{0,4}){3}'
aazbbycc

The same behavior is seen with 'sed (GNU sed) 4.2.2' as well. For ex:

$ echo 'aazbbycc' | sed -nE '/(([a-z])\2[a-z]{0,3}){3}/p'
aazbbycc
$ # no output
$ echo 'aazbbycc' | sed -nE '/(([a-z])\2[a-z]{0,4}){3}/p'


So, my question is whether these regular expression examples come under
'obscure regular expressions' mentioned in the man page. If so, I feel
there should be an error message displayed instead of no output

Regards,
Sundeep

--001a113f7bb0c7831b054f2789ad
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hello,<div><br></div><div>From the man page, version &#39;=
grep (GNU grep) 2.25&#39;</div><div><br></div><div>------------------------=
--------</div><div><div>=C2=A0 =C2=A0Known Bugs</div><div>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0Large repetition counts in the {n,m} construct may cause grep to =
use lots of =C2=A0memory. =C2=A0 In =C2=A0addition, =C2=A0certain other obs=
cure regular expressions require exponential time and space, and may cause =
grep to run out of memory.</div><div>=C2=A0 =C2=A0 =C2=A0 =C2=A0Back-refere=
nces are very slow, and may require exponential time.</div></div><div><div>=
--------------------------------</div></div><div><br></div><div>I was tryin=
g a regular expression to find words from dictionary that have two differen=
t instances of repeated letters, for example the words: misspellings, chill=
iness,=C2=A0woodcutter etc</div><div><br></div><div>$ # gives no output</di=
v><div><div>$ grep -m5 -xiE &#39;([a-z]*([a-z])\2[a-z]*){2}&#39; /usr/share=
/dict/words</div></div><div><br></div><div>$ # works as expected with PCRE<=
/div><div><div>$ grep -m5 -xiP &#39;([a-z]*([a-z])\2[a-z]*){2}&#39; /usr/sh=
are/dict/words</div><div>Abbott</div><div>Annabelle</div><div>Annette</div>=
<div>Appaloosa</div><div>Appleseed</div></div><div><br></div><div><br></div=
><div>I asked regarding this on <a href=3D"https://stackoverflow.com/questi=
ons/43572924/ere-adding-quantifier-to-group-with-inner-group-and-back-refer=
ence">https://stackoverflow.com/questions/43572924/ere-adding-quantifier-to=
-group-with-inner-group-and-back-reference</a> and other forums. It helped =
to identify some more cases like</div><div><br></div><div>$ echo &#39;aazbb=
ycc&#39; | grep -E &#39;(([a-z])\2[a-z]*){2}&#39;<br></div><div><div>aazbby=
cc</div><div>$ # no output</div><div>$ echo &#39;aazbbycc&#39; | grep -E &#=
39;(([a-z])\2[a-z]*){3}&#39;</div></div><div><br></div><div>and</div><div><=
br></div><div><div>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z])\2[a-z]=
{0,3}){3}&#39;</div><div>aazbbycc</div><div>$ # no output</div><div>$ echo =
&#39;aazbbycc&#39; | grep -E &#39;(([a-z])\2[a-z]{0,4}){3}&#39;</div></div>=
<div><br></div><div>$ # seems dependent on character class clashing with ba=
ck reference characters and quantifier count</div><div>$ # a, b and c are t=
he characters matching back reference</div><div><div>$ echo &#39;aazbbycc&#=
39; | grep -E &#39;(([a-z])\2[abcyz]{0,4}){2}&#39;</div><div>aazbbycc</div>=
<div>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z])\2[abcyz]{0,4}){3}&#3=
9;</div><div>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z])\2[abyz]{0,4}=
){3}&#39;</div><div>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z])\2[byz=
]{0,4}){3}&#39;</div><div>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z])=
\2[acyz]{0,4}){3}&#39;</div><div>aazbbycc</div><div>$ echo &#39;aazbbycc&#3=
9; | grep -E &#39;(([a-z])\2[ayz]{0,4}){3}&#39;</div><div>aazbbycc</div><di=
v>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z])\2[cyz]{0,4}){3}&#39;</d=
iv><div>aazbbycc</div><div>$ echo &#39;aazbbycc&#39; | grep -E &#39;(([a-z]=
)\2[yz]{0,4}){3}&#39;</div><div>aazbbycc</div></div><div><br></div><div><di=
v>The same behavior is seen with &#39;sed (GNU sed) 4.2.2&#39;=C2=A0as well=
. For ex:<br></div></div><div><br></div><div>$ echo &#39;aazbbycc&#39; | se=
d -nE &#39;/(([a-z])\2[a-z]{0,3}){3}/p&#39;</div><div>aazbbycc</div><div>$ =
# no output</div><div><div>$ echo &#39;aazbbycc&#39; | sed -nE &#39;/(([a-z=
])\2[a-z]{0,4}){3}/p&#39;</div></div><div><br></div><div><br></div><div>So,=
 my question is whether these regular expression examples come under &#39;o=
bscure regular expressions&#39; mentioned in the man page. If so, I feel th=
ere should be an error message displayed instead of no output</div><div><br=
></div><div>Regards,</div><div>Sundeep</div><div><br></div><div><br></div><=
/div>

--001a113f7bb0c7831b054f2789ad--




Acknowledgement sent to Sundeep Agarwal <learnbyexample.net@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#26864; Package grep. Full text available.
Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.
Last modified: Thu, 2 Jan 2020 09:45:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.