GNU bug report logs - #22028
grep -Pc / grep -P | wc -l inconsistent results
Report forwarded to bug-grep <at> gnu.org: bug#22028; Package grep. (Fri, 27 Nov 2015 11:30:02 GMT)
Acknowledgement sent to Jaroslav Skarvada <jskarvad <at> redhat.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Fri, 27 Nov 2015 11:30:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
Hi,
It seems that for long files which start with non-binary data, when the PCRE
matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary data
and then switches to TEXTBIN_BINARY. But in -Pc mode, once in TEXTBIN_BINARY,
it exits on the next match, causing bogus -Pc results.
Reproducer:
$ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
1
$ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
2
./filtered.txt is a sufficiently long text file that contains some NUL bytes
after the first 32 kB of text, e.g. https://bugzilla.redhat.com/attachment.cgi?id=1080646
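
For readers without access to the attachment, a file of the shape described above can be approximated with a short script. This is only a sketch: `make_sample` and the padding text are hypothetical, the output is a synthetic stand-in rather than the attached file, and whether it triggers the count discrepancy depends on the grep version in use.

```shell
#!/bin/sh
# make_sample is a hypothetical helper. It writes a synthetic stand-in for
# filtered.txt: more than 32768 bytes of plain text, then a line with NUL
# bytes, with one matching line before the NULs and one after them.
make_sample() {
  out=$1
  {
    # First matching line, inside the leading text region.
    echo 'Blocked by SpamAssassin'
    # 1000 padding lines of 49 bytes each, i.e. 49000 bytes of plain ASCII.
    i=0
    while [ "$i" -lt 1000 ]; do
      echo 'ordinary log line with nothing of interest in it'
      i=$((i + 1))
    done
    # A line containing three NUL bytes.
    printf 'binary \0\0\0 junk\n'
    # Second matching line, after the NUL bytes.
    echo 'Blocked by Spamfilter'
  } > "$out"
}

# Write the file only when a path is given on the command line.
if [ "$#" -ge 1 ]; then
  make_sample "$1"
fi
```

Saved under an assumed name such as make_sample.sh, it could be run as `sh make_sample.sh sample.txt`, and the reproducer commands above then pointed at sample.txt.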
Original downstream bugzilla:
https://bugzilla.redhat.com/attachment.cgi?id=1080646
Attached is my attempt to fix it, but it may not be the right way
to fix it. In particular, the question is whether grep should stop when
it finds binary data or not. But at least grep -Pc and grep -P | wc -l
should behave the same.
thanks & regards
Jaroslav
[0001-grep-do-not-stop-on-binary-data-if-counting-in-PCRE.patch (text/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#22028; Package grep. (Sat, 28 Nov 2015 06:17:02 GMT)
Message #8 received at 22028 <at> debbugs.gnu.org:
On Fri, 27 Nov 2015 06:29:31 -0500 (EST)
Jaroslav Skarvada <jskarvad <at> redhat.com> wrote:
> Hi,
>
> It seems that for long files which start with non-binary data, when the PCRE
> matcher is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary data
> and then switches to TEXTBIN_BINARY. But in -Pc mode, once in TEXTBIN_BINARY,
> it exits on the next match, causing bogus -Pc results.
>
> Reproducer:
> $ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
> 1
> $ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
> 2
>
> ./filtered.txt is a sufficiently long text file that contains some NUL bytes
> after the first 32 kB of text, e.g. https://bugzilla.redhat.com/attachment.cgi?id=1080646
>
> Original downstream bugzilla:
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
>
> Attached is my attempt to fix it, but it may not be the right way
> to fix it. In particular, the question is whether grep should stop when
> it finds binary data or not. But at least grep -Pc and grep -P | wc -l
> should behave the same.
>
> thanks & regards
>
> Jaroslav
I see that filtered.txt is a binary file, since it contains NULs at line 647.
However, its first 32768 bytes are correctly encoded.
If the first 32768 bytes of a file are correctly encoded, grep -P marks the
file not as TEXTBIN_TEXT but as TEXTBIN_UNKNOWN. Once grep finds the first
match, it marks the file as TEXTBIN_TEXT and treats it as a text file.
However, grep -P -c does not take that last step.
So you can get the number of matching lines with grep -a -P -c.
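
To check a given file against the workaround above, the three counts discussed in this thread can be printed side by side. The following is a minimal sketch: `compare_counts` is a hypothetical helper name, the default pattern is the one from the original report, and it assumes a grep built with -P support.

```shell
#!/bin/sh
# compare_counts is a hypothetical helper. For one file it prints, in order:
#   the count from grep -P -c,
#   the number of lines output by grep -P (via wc -l),
#   the count from grep -a -P -c (the workaround from this thread).
compare_counts() {
  file=$1
  pattern='Blocked by (SpamAssassin|Spamfilter)'
  if [ "$#" -ge 2 ]; then
    pattern=$2
  fi
  # Count reported by grep itself.
  with_c=$(grep -P -c "$pattern" "$file")
  # Number of lines grep writes to standard output (tr strips wc padding).
  piped=$(grep -P "$pattern" "$file" | wc -l | tr -d ' ')
  # -a makes grep process the file as if it were text.
  with_a=$(grep -a -P -c "$pattern" "$file")
  printf '%s %s %s\n' "$with_c" "$piped" "$with_a"
}

# Run only when a file is given on the command line.
if [ "$#" -ge 1 ]; then
  compare_counts "$@"
fi
```

On a plain text file the three numbers should agree. On a file like the one in this report, a first number lower than the third would correspond to the behavior described above.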
Thanks,
Norihiro
[0001-grep-P-grep-Pc-consistent-results.patch (text/plain, attachment)]
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Thu, 31 Dec 2015 07:28:01 GMT)
Notification sent to Jaroslav Skarvada <jskarvad <at> redhat.com>: bug acknowledged by developer. (Thu, 31 Dec 2015 07:28:02 GMT)
Message #13 received at 22028-done <at> debbugs.gnu.org:
Thanks for the bug report and fix, Jaroslav. And thanks, Norihiro, for the test
case; I think I independently came up with something similar to your grep.c fix
in my earlier patches today, so I expect that part of your changes is no
longer needed. I installed the attached combined patch for this bug and am
marking it as done.
[0001-grep-c-should-keep-counting-after-binary-data.patch (text/x-diff, attachment)]
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 28 Jan 2016 12:24:04 GMT)
This bug report was last modified 9 years and 86 days ago.