GNU bug report logs - #29668
grep: Fatal problem with (big) file
Reported by: pg <pasi.vitsa <at> yahoo.com>
Date: Mon, 11 Dec 2017 22:03:02 UTC
Severity: normal
Done: Paul Eggert <eggert <at> cs.ucla.edu>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 29668 in the body.
You can then email your comments to 29668 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Mon, 11 Dec 2017 22:03:02 GMT)
Acknowledgement sent to pg <pasi.vitsa <at> yahoo.com>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 11 Dec 2017 22:03:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello!
$ awk '/Volvo/' Tieliikenne5.0.csv | wc -l
266175
$ grep Volvo Tieliikenne5.0.csv | wc -l
1638
$ echo $? (also after running "grep Volvo Tieliikenne5.0.csv" alone)
0
$ ack Volvo Tieliikenne5.0.csv | wc -l
266175
The file contains 5 million lines. It is a dump of Finland's vehicle database:
http://trafiopendata.97.fi/opendata/171009_Tieliikenne_5_0.zip
$ uname -a
Linux pg-desktop 4.10.0-40-generic #44~16.04.1-Ubuntu SMP Thu Nov 9
15:37:44 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Fatal error with ”small” file too:
$ awk '/Volvo/' Tieliikenne5.0.csv > volvot.csv
$ awk '/N3/' volvot.csv | wc -l
17822
$ grep N3 volvot.csv | wc -l
1701
$ wc -l volvot.csv
266175 volvot.csv
BR
pg
PS: Ubuntu webmaster - please put a bug-report address into your system and
forward this message?
PPS: to the editors - I certainly used to know how to grep ;-)
PPPS: pointer error again? use perl or die!
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Mon, 11 Dec 2017 23:37:02 GMT)
Message #8 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On Mon, 11 Dec 2017 23:45:25 +0200
pg <pasi.vitsa <at> yahoo.com> wrote:
> $ awk '/Volvo/' Tieliikenne5.0.csv | wc -l
> 266175
> $ grep Volvo Tieliikenne5.0.csv | wc -l
> 1638
> $ awk '/N3/' volvot.csv | wc -l
> 17822
> $ grep N3 volvot.csv | wc -l
> 1701
Perhaps Tieliikenne 5.0.csv and volvot.csv contain characters that cannot
be recognized in your locale. Try the following.
--
$ env LC_ALL=C grep 'Volvo' Tieliikenne\ 5.0.csv | wc -l
266175
or
$ grep -a 'Volvo' Tieliikenne\ 5.0.csv | wc -l
266175
--
$ env LC_ALL=C grep N3 volvot.csv | wc -l
17822
or
$ grep -a N3 volvot.csv | wc -l
17822
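The two workarounds suggested above can be tried on a small synthetic file. The following is a minimal sketch, assuming GNU grep and a POSIX shell; the sample file, its contents, and the variable names are hypothetical stand-ins for the real CSV dump. The byte written as octal \377 (0xFF) is never valid in UTF-8, so it plays the role of the unrecognized characters.

```shell
# Build a tiny sample in a temporary directory (hypothetical data, not the real dump).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"

# Three matching lines; the second one carries byte 0xFF (octal \377),
# which is an encoding error in a UTF-8 locale.
printf 'Volvo;N3;1\nVolvo;N3;\377\nVolvo;N3;3\n' > "$sample"

# Workaround 1: run grep in the C locale, where every byte is a valid character.
count_c_locale=$(LC_ALL=C grep 'Volvo' "$sample" | wc -l | tr -d ' ')

# Workaround 2: keep the current locale but force text mode with -a.
count_text_mode=$(grep -a 'Volvo' "$sample" | wc -l | tr -d ' ')

echo "LC_ALL=C: $count_c_locale  -a: $count_text_mode"

rm -rf "$tmpdir"
```

Both counts should agree, since each approach makes grep treat the whole file as text.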
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Wed, 13 Dec 2017 00:29:02 GMT)
Message #11 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On 12/11/2017 03:36 PM, Norihiro Tanaka wrote:
> Perhaps Tieliikenne 5.0.csv and volvot.csv contain characters that cannot
> be recognized in your locale.
Yes, that's the problem. The original 'grep' output ended in "Binary
file Tieliikenne5.0.csv matches" but the user didn't see that. Perhaps
we should send that diagnostic to stderr as well.
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Wed, 13 Dec 2017 23:26:01 GMT)
Message #14 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On Tue, 12 Dec 2017 16:28:09 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 12/11/2017 03:36 PM, Norihiro Tanaka wrote:
> > Perhaps Tieliikenne 5.0.csv and volvot.csv contain characters that cannot
> > be recognized in your locale.
>
> Yes, that's the problem. The original 'grep' output ended in "Binary file Tieliikenne5.0.csv matches" but the user didn't see that. Perhaps we should send that diagnostic to stderr as well.
I don't think that's the problem. The user passes the output of grep to wc -l,
so the `Binary file ... matches' line is also counted by `wc' as one line.
$ env LC_ALL=C grep 'Volvo' Tieliikenne\ 5.0.csv | wc -l
266175
$ env LC_ALL=en_US.utf8 grep 'Volvo' Tieliikenne\ 5.0.csv | wc -l
241264
$ env LC_ALL=en_US.utf8 grep 'Volvo' Tieliikenne\ 5.0.csv | tail -1
Binary file Tieliikenne 5.0.csv matches
$ env LC_ALL=C grep N3 volvot.csv | wc -l
17822
$ env LC_ALL=en_US.utf8 grep N3 volvot.csv | wc -l
11741
$ env LC_ALL=en_US.utf8 grep N3 volvot.csv | tail -1
Binary file volvot.csv matches
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Thu, 14 Dec 2017 00:05:02 GMT)
Message #17 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On 12/13/2017 03:25 PM, Norihiro Tanaka wrote:
> I don't think that's the problem. The user passes the output of grep to wc -l,
> so the `Binary file ... matches' line is also counted by `wc' as one line.
The intent of 'grep PATTERN | wc -l' is to count the number of matches,
like 'grep -c PATTERN' would. But it doesn't work that way here. E.g.,
on Fedora 27 with LANG=en_US.UTF-8:
$ grep -c Volvo Tieliikenne5.0.csv
266175
$ grep Volvo Tieliikenne5.0.csv | wc -l
241264
$ grep Volvo Tieliikenne5.0.csv | tail -n 1
Binary file Tieliikenne5.0.csv matches
If the "Binary file ... matches" line were sent to stderr instead of to
stdout, the problem would be more obvious to the user:
$ grep -c Volvo Tieliikenne5.0.csv
266175
$ grep Volvo Tieliikenne5.0.csv | wc -l
Binary file Tieliikenne5.0.csv matches
241264
$ grep Volvo Tieliikenne5.0.csv | tail -n 1
Binary file Tieliikenne5.0.csv matches
T;2017-09-29;75;01;;;19550000;;;;;1;1570;;3000;2595;1670;;01;2200;20.6;4;false;false;Volvo;;;;;01;;01;977;;;841;;5092946
I believe that in the past I've thought that the "Binary file" message
should be sent to stdout, but these examples are a reasonably compelling
reason to send them to stderr instead.
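The gap between 'grep -c PATTERN' and 'grep PATTERN | wc -l' can be reproduced on a small file. This is a minimal sketch, assuming GNU grep and a POSIX shell; the file contents and variable names are hypothetical. Whether the two counts actually differ depends on the locale and the grep version in use, so the sketch only reports what it observes.

```shell
# Hypothetical three-line sample; the middle line contains byte 0xFF (octal \377).
tmpdir=$(mktemp -d)
sample="$tmpdir/sample.csv"
printf 'Volvo;1\nVolvo;\377\nVolvo;3\n' > "$sample"

# grep -c counts matching lines itself and prints only the number.
count_with_c=$(grep -c 'Volvo' "$sample")

# Piping to wc -l counts whatever grep writes to stdout, which may stop
# early and may include a "Binary file ... matches" line.
count_with_wc=$(grep 'Volvo' "$sample" | wc -l | tr -d ' ')

if [ "$count_with_c" -ne "$count_with_wc" ]; then
    echo "counts differ: grep -c says $count_with_c, wc -l says $count_with_wc"
else
    echo "counts agree: $count_with_c"
fi

rm -rf "$tmpdir"
```

When the counts differ, inspecting the last line of grep's output with 'tail -n 1', as in the transcripts above, shows whether a binary-file notice was mixed into the data.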
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Sat, 16 Dec 2017 00:27:02 GMT)
Message #20 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On Wed, 13 Dec 2017 16:03:57 -0800
Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 12/13/2017 03:25 PM, Norihiro Tanaka wrote:
> > I don't think that's the problem. The user passes the output of grep to wc -l,
> > so the `Binary file ... matches' line is also counted by `wc' as one line.
>
> The intent of 'grep PATTERN | wc -l' is to count the number of matches, like 'grep -c PATTERN' would. But it doesn't work that way here. E.g., on Fedora 27 with LANG=en_US.UTF-8:
>
> $ grep -c Volvo Tieliikenne5.0.csv
> 266175
> $ grep Volvo Tieliikenne5.0.csv | wc -l
> 241264
> $ grep Volvo Tieliikenne5.0.csv | tail -n 1
> Binary file Tieliikenne5.0.csv matches
>
> If the "Binary file ... matches" line were sent to stderr instead of to stdout, the problem would be more obvious to the user:
>
> $ grep -c Volvo Tieliikenne5.0.csv
> 266175
> $ grep Volvo Tieliikenne5.0.csv | wc -l
> Binary file Tieliikenne5.0.csv matches
> 241264
> $ grep Volvo Tieliikenne5.0.csv | tail -n 1
> Binary file Tieliikenne5.0.csv matches
> T;2017-09-29;75;01;;;19550000;;;;;1;1570;;3000;2595;1670;;01;2200;20.6;4;false;false;Volvo;;;;;01;;01;977;;;841;;5092946
>
> I believe that in the past I've thought that the "Binary file" message should be sent to stdout, but these examples are a reasonably compelling reason to send them to stderr instead.
In addition, the following problem can also occur.
$ printf 'Binary file a.txt matches\n' >a.txt
$ env LC_ALL=en_US.utf8 grep B a.txt
Binary file a.txt matches
$ printf '\xFFB\n' >a.txt
$ env LC_ALL=en_US.utf8 grep B a.txt
Binary file a.txt matches
Both produce the same output. However, the former displays the contents of
the matched line, whereas the latter does not. If "Binary file" is sent to
stdout, a user cannot tell whether a.txt is a text file or a binary file
without opening the file.
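One way to see which stream carries the notice is to capture stdout and stderr separately. This is a minimal sketch, assuming GNU grep and a POSIX shell; the file names and variable names are hypothetical, and a NUL byte is used so that the second file counts as binary even in the C locale.

```shell
tmpdir=$(mktemp -d)
text_file="$tmpdir/a.txt"
binary_file="$tmpdir/b.bin"

# A text file whose only line happens to look like grep's notice.
printf 'Binary file a.txt matches\n' > "$text_file"
# A file containing a NUL byte (octal \000), which grep treats as binary.
printf 'B\000\n' > "$binary_file"

# Capture stdout in a variable and stderr in a file, for each case.
text_out=$(LC_ALL=C grep B "$text_file" 2> "$tmpdir/text.err")
text_err=$(cat "$tmpdir/text.err")

binary_out=$(LC_ALL=C grep B "$binary_file" 2> "$tmpdir/binary.err")
binary_err=$(cat "$tmpdir/binary.err")

echo "text file   stdout: [$text_out] stderr: [$text_err]"
echo "binary file stdout: [$binary_out] stderr: [$binary_err]"

rm -rf "$tmpdir"
```

For the text file the matched line arrives on stdout and stderr stays empty. For the binary file, where the notice lands depends on whether the installed grep sends it to stdout or to stderr.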
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Thu, 02 Jan 2020 08:55:02 GMT)
Message #23 received at 29668 <at> debbugs.gnu.org (full text, mbox):
Jason, thanks for reporting this grep bug <https://bugs.gnu.org/33552>. It
strikes me that this is related to another grep bug <https://bugs.gnu.org/29668>
concerning the "Binary files ..." message. Although they're not the same bug,
it's likely that fixing one will also entail fixing the other. So I'll add a
message to both bug reports to this effect.
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Thu, 17 Sep 2020 18:47:01 GMT)
Message #26 received at 29668 <at> debbugs.gnu.org (full text, mbox):
Attached are two related 'grep' patches, one prompted by Bug#33552 "Possible bug
with handling -I option" and the other by Bug#29668 "grep: Fatal problem with
(big) file". Although I'd normally install these on grep master, Jim has started
the ball rolling on the next grep release so I'll cc this to him to see whether
these patches can be squeezed in before the next release.
[0001-Suppress-Binary-file-FOO-matches-if-I.patch (text/x-patch, attachment)]
[0002-Send-Binary-file-FOO-matches-to-stderr.patch (text/x-patch, attachment)]
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Thu, 17 Sep 2020 19:06:02 GMT)
Message #29 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On Thu, Sep 17, 2020 at 11:46 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> Attached are two related 'grep' patches, one prompted by Bug#33552 "Possible bug
> with handling -I option" and the other by Bug#29668 "grep: Fatal problem with
> (big) file". Although I'd normally install these on grep master, Jim has started
> the ball rolling on the next grep release so I'll cc this to him to see whether
> these patches can be squeezed in before the next release.
Nice! Thank you for resolving those.
The first one did indeed simplify numerous tests.
Both look fine and seem uncontroversial, so please go ahead and push them.
I'll probably update to latest gnulib this evening and then make a new snapshot.
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Fri, 18 Sep 2020 03:00:02 GMT)
Message #32 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On 9/17/20 3:03 PM, Jim Meyering wrote:
> The alternative is to change that "B" to a "b", which should be fine,
> now that it's only emitted to stderr.
Makes sense.
NEWS should be updated accordingly - but when I looked into doing that I came up
with the attached more-elaborate patch, which changes this new diagnostic and
two other unusual-format diagnostics, so that they use the same "grep: FILENAME:
MESSAGE" form that grep uses everywhere else. Whaddya think?
[0001-grep-be-more-consistent-about-diagnostic-format.patch (text/x-patch, attachment)]
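A uniform "grep: FILENAME: MESSAGE" form makes diagnostics easy to pick apart in a script. This is a minimal sketch in POSIX shell; the function name parse_diag, the variable names, and the sample line are hypothetical and are not taken from the patch itself.

```shell
# Split a diagnostic of the form "grep: FILENAME: MESSAGE" into its parts.
# Results are left in the variables diag_file and diag_msg.
parse_diag() {
    line=$1
    # Drop the leading program name.
    rest=${line#grep: }
    # Everything before the first ": " is taken as the file name.
    diag_file=${rest%%: *}
    # Everything after the first ": " is taken as the message.
    diag_msg=${rest#*: }
}

# Hypothetical sample line in the format described above.
parse_diag 'grep: a.txt: binary file matches'
echo "file=$diag_file message=$diag_msg"
```

The split happens at the first ": " after the program name, so a file name that itself contains ": " would be cut short; a script that must handle such names would need a different approach.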
Information forwarded to bug-grep <at> gnu.org: bug#29668; Package grep. (Fri, 18 Sep 2020 14:07:01 GMT)
Message #35 received at 29668 <at> debbugs.gnu.org (full text, mbox):
On Thu, Sep 17, 2020 at 7:59 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> On 9/17/20 3:03 PM, Jim Meyering wrote:
> > The alternative is to change that "B" to a "b", which should be fine,
> > now that it's only emitted to stderr.
>
> Makes sense.
>
> NEWS should be updated accordingly - but when I looked into doing that I came up
> with the attached more-elaborate patch, which changes this new diagnostic and
> two other unusual-format diagnostics, so that they use the same "grep: FILENAME:
> MESSAGE" form that grep uses everywhere else. Whaddya think?
Nice. Dropping the quote module (even if negligible size delta) is a
fine side effect. You're welcome to push that.
Thanks!
Reply sent to Paul Eggert <eggert <at> cs.ucla.edu>: You have taken responsibility. (Mon, 21 Sep 2020 17:56:02 GMT)
Notification sent to pg <pasi.vitsa <at> yahoo.com>: bug acknowledged by developer. (Mon, 21 Sep 2020 17:56:03 GMT)
Message #40 received at 29668-done <at> debbugs.gnu.org (full text, mbox):
On 9/17/20 12:04 PM, Jim Meyering wrote:
> please go ahead and push them.
As that's been done and the bug fixes are now installed, I'm closing both bug
reports.
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 20 Oct 2020 11:24:12 GMT)
This bug report was last modified 3 years and 189 days ago.