GNU bug report logs - #70540
grep -c -r | grep -v ':0$'


Package: grep;

Reported by: "Dale R. Worley" <worley <at> alum.mit.edu>

Date: Tue, 23 Apr 2024 18:34:13 UTC

Severity: normal

To reply to this bug, email your comments to 70540 AT debbugs.gnu.org.



Report forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Tue, 23 Apr 2024 18:34:16 GMT)

Acknowledgement sent to "Dale R. Worley" <worley <at> alum.mit.edu>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 23 Apr 2024 18:34:17 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: "Dale R. Worley" <worley <at> alum.mit.edu>
To: bug-grep <at> gnu.org
Subject: grep -c -r | grep -v ':0$'
Date: Tue, 23 Apr 2024 14:32:38 -0400
At least once a week, and often several times a day, I want to search a
tree of files to list the files in a directory containing a pattern,
along with the *numbers* of patterns in the files.  Usually this is
because I'm looking for a file that contains a number of instances of
the pattern, from among which I will choose to copy something.  But
often the total number of files to be examined is large, and the total
number of matches in any file might also be large.

So "grep -r" is inconvenient, because it may return many more matches
than I want to examine, and it can be hard to see what all the
alternative files are among the large number of matches that can be
returned from any one file.

And "grep -c -r" is inconvenient, because it lists every file, even the
large number containing no match.

The idiom I usually use is "grep -c -r [pattern] [directory] | grep -v
':0$'", which lists the match counts, but only for files with non-zero
counts.
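
For reference, a minimal sketch of that idiom on a throwaway directory
tree.  The file names and the pattern "foo" are made up for
illustration, and the final sort is only there to make the listing
order stable:

```shell
#!/bin/sh
# Build a small throwaway tree: two files contain "foo", one does not.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'foo\nbar\nfoo\n' > "$demo/a.txt"
printf 'bar\n'           > "$demo/b.txt"
printf 'foo\n'           > "$demo/sub/c.txt"

# Count matching lines per file, then drop the files whose count is 0.
# ':0$' only matches a count of exactly 0, so counts such as 10 survive.
nonzero=$(grep -c -r foo "$demo" | grep -v ':0$' | sort)
printf '%s\n' "$nonzero"

rm -rf "$demo"
```

Only a.txt (count 2) and sub/c.txt (count 1) are listed; b.txt is
filtered out by the second grep.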

However, it seems "natural" to me that "grep -c -l", that is, "grep
--count --files-with-matches", should give me this result.  In the current
version (3.6) of grep, that combination acts like --files-with-matches
alone.

Looking at the comments in the grep code, it seems that Posix specifies
that --count and --files-with-matches are incompatible, and thus this
change would be upward-compatible with Posix.

I've written a draft code revision, and it's simple: it doesn't require
changes to the overall code structure.

What do people think?

Dale




Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Wed, 24 Apr 2024 12:57:09 GMT)

Message #8 received at submit <at> debbugs.gnu.org:

From: Dennis Clarke <dclarke <at> blastwave.org>
To: bug-grep <at> gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Wed, 24 Apr 2024 08:55:49 -0400
On 4/23/24 14:32, Dale R. Worley wrote:
> At least once a week, and often several times a day ...

Dear Sir :

    This is a task I can certainly relate to.  Dragging through massive
storage servers with find and grep is a terrible way to get things done.

> I want to search a tree of files to list the files in a directory
> containing a pattern ...

    That is usually the easy part of the problem.

> along with the *numbers* of patterns in the files.

    That is not the easy part.

> Usually this is because I'm looking for a file that contains a number
> of instances of the pattern, from among which I will choose to copy
> something.

    Perhaps a specific example would be helpful. Do you mean to say that
you run "find" on a directory "./foo" and you search for all filenames
that have a case sensitive pattern "BaR" in the filename? Then within
the result set of filenames you count the instances of the string "BaR"
inside the files that match? Are you only searching text files or will
there be multi-lingual UTF-8 char encoded files? What about binary bit
pattern match?

> But often the total number of files to be examined is large, and the
> total number of matches in any file might also be large.

    Here the word "large" can be tens of millions of files or perhaps
even billions or trillions. Not sure what large means but certainly we
are in the region of something possible with a decent modern server.

> So "grep -r" is inconvenient, because it may return many more matches
> than I want to examine, and it can be hard to see what all the
> alternative files are among the large number of matches that can be
> returned from any one file.
>

    Without really understanding the problem you are trying to solve I
have the sudden feeling what you really want is a custom written bit of
code that walks down the directory structure and then does the read and
inspection of each filename that matches some pattern. Making changes to
grep for that purpose feels like making changes to a good working hammer
in order to produce a chainsaw.  However I am not sure what you mean by
counting "instances of the pattern". I have to guess that you want any
filename with a pattern match AND twelve or fifty thousand instances of
that pattern within the contents of the file.


>
> What do people think?
>
> Dale

    I think I want to set up an experiment and test this problem.


--
Dennis Clarke
RISC-V/SPARC/PPC/ARM/CISC
UNIX and Linux spoken




Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Wed, 24 Apr 2024 20:28:10 GMT)

Message #11 received at 70540 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Dale R. Worley" <worley <at> alum.mit.edu>
Cc: 70540 <at> debbugs.gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Wed, 24 Apr 2024 13:27:05 -0700
On 4/23/24 11:32 AM, Dale R. Worley wrote:
> However, it seems "natural" to me that "grep -c -l", that is, "grep
> --count --files-with-matches", should give me this result.

Yes, that sounds reasonable. Is your patch a trivial one (10 lines or 
less)? If so, please send it in. If not, please send in copyright 
paperwork for grep (I can send you the form for that). Thanks.




Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Thu, 25 Apr 2024 20:11:07 GMT)

Message #14 received at 70540 <at> debbugs.gnu.org:

From: "Dale R. Worley" <worley <at> alum.mit.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 70540 <at> debbugs.gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Thu, 25 Apr 2024 16:10:23 -0400
Paul Eggert <eggert <at> cs.ucla.edu> writes:
> On 4/23/24 11:32 AM, Dale R. Worley wrote:
>> However, it seems "natural" to me that "grep -c -l", that is, "grep
>> --count --files-with-matches", should give me this result.
>
> Yes, that sounds reasonable. Is your patch a trivial one (10 lines or 
> less)? If so, please send it in. If not, please send in copyright 
> paperwork for grep (I can send you the form for that). Thanks.

The functional code is a bit less than 10 lines, but with the comments
and the test updates added it's significantly longer.  So please send
me the form.

One further thing, I haven't written any updates to the manual page or
.texi.  Does anyone have suggestions for a good way to do that?

Also, the code change does *not* implement --count
--files-without-match.  In a sense, that ought to become defined also,
but the output would be the same as for --files-without-match with each
file name getting ":0" appended, which seems not worth the trouble of
implementing.
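
To illustrate that equivalence, here is a minimal sketch that produces
the same output with existing options.  The file names and the pattern
"foo" are made up for illustration, and sed supplies the ":0" suffix:

```shell
#!/bin/sh
# Throwaway tree: one file contains "foo", one does not.
demo=$(mktemp -d)
printf 'foo\n' > "$demo/has-foo.txt"
printf 'bar\n' > "$demo/no-foo.txt"

# -L (--files-without-match) lists the files with no matching line;
# sed appends the ":0" that a count column would contribute.
zero_counts=$(grep -L -r foo "$demo" | sed 's/$/:0/')
printf '%s\n' "$zero_counts"

rm -rf "$demo"
```

Only no-foo.txt is listed, with ":0" appended.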

Dale




Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Thu, 25 Apr 2024 20:31:10 GMT)

Message #17 received at 70540 <at> debbugs.gnu.org:

From: "Dale R. Worley" <worley <at> alum.mit.edu>
To: Dennis Clarke <dclarke <at> blastwave.org>
Cc: 70540 <at> debbugs.gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Thu, 25 Apr 2024 16:30:01 -0400
Dennis Clarke via Bug reports for GNU grep <bug-grep <at> gnu.org> writes:
> Perhaps a specific example would be helpful.

My most common case is something like

    grep -c --files-with-match 'some fragment of a command' ~/temp/shell.10??

which is searching the log files of my old shell sessions to find the
most recent session that has "many" uses of "some fragment of a
command", often because I want to copy-modify-paste that command to
use in a current shell session.  Conceptually, the search process is:
find all files that mention the command, sort them by number of uses of
the command, then look at the contents of the one or two files with the
*most* uses (because there often are accidental matches in session logs
that don't "really" use the command).  Given that my desired command is
unlikely to output more than 5 or so lines, this patch makes that
process straightforward.

Another case is

    grep -c -r --files-with-match variable_name ~/bash-5.5.17

where I want to first look into the files that most often mention
variable_name to see exactly how it is used.

If these weren't ad-hoc activities, I would construct a careful pipeline
like

    grep -c -r pattern directory | sort -t: -k2,2nr | head -n3

but for ad-hoc use, it seems to me that it's sensible and convenient to
make the combination -c -l do what it intuitively "ought" to do, given
that that change would be upward-compatible with Posix.
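
A self-contained sketch of such a pipeline, with the zero counts
filtered out and a numeric sort key (the "n" in -k2,2nr) so that a count
of 10 ranks above a count of 9.  The helper make_log, the file names and
the pattern "foo" are made up for illustration, and the sketch assumes
file names that contain no ":":

```shell
#!/bin/sh
# make_log FILE N: write N lines containing "foo" to FILE (hypothetical helper).
make_log() {
    n=0
    while [ "$n" -lt "$2" ]; do
        echo 'uses foo here'
        n=$((n + 1))
    done > "$1"
}

demo=$(mktemp -d)
make_log "$demo/ten.log"  10
make_log "$demo/nine.log" 9
make_log "$demo/two.log"  2
make_log "$demo/one.log"  1
make_log "$demo/none.log" 0

# Count per file, drop zero counts, sort by the count field as a number
# in descending order, and keep the three files with the most matches.
top=$(grep -c -r foo "$demo" | grep -v ':0$' | sort -t: -k2,2nr | head -n 3)
printf '%s\n' "$top"

rm -rf "$demo"
```

The listing is ten.log, nine.log and two.log, in that order.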

Dale




Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Thu, 25 Apr 2024 20:55:06 GMT)

Message #20 received at 70540 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: "Dale R. Worley" <worley <at> alum.mit.edu>
Cc: 70540 <at> debbugs.gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Thu, 25 Apr 2024 13:54:27 -0700
On 4/25/24 13:10, Dale R. Worley wrote:

> One further thing, I haven't written any updates to the manual page or
> .texi.  Does anyone have suggestions for a good way to do that?

If you have time, just edit those two files and include the edits as 
part of your patch. If not, I can write that part.

> Also, the code change does *not* implement --count
> --files-without-match.

Makes sense to me.





Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Thu, 25 Apr 2024 21:50:04 GMT)

Message #23 received at 70540 <at> debbugs.gnu.org:

From: jackson <at> fastmail.com
To: "Paul Eggert" <eggert <at> cs.ucla.edu>, "Dale R. Worley" <worley <at> alum.mit.edu>
Cc: 70540 <at> debbugs.gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Thu, 25 Apr 2024 16:48:20 -0500
Dale wrote:
> If these weren't ad-hoc activities, I would construct a careful pipeline
> like
> 
>     grep -c -r pattern directory | sort -t: -k2,2nr | head -n3
> 
> but for ad-hoc use, it seems to me that it's sensible and convenient to
> make the combination -c -l do what it intuitively "ought" to do, given
> that that change would be upward-compatible with Posix.

When I have frequently used cases like that, and get tired of typing it all into a shell prompt over and over, I write a little wrapper command (perhaps in C, shell, Python or some combination) that encapsulates the repetitively useful details.  My personal src and bin directories have hundreds of such commands, some dating as far back as the early 1980s when I was at Bell Labs.  (Though this habit can become annoying when I am on someone else's computer, and half the command lines I type at a shell prompt fail with "command not found".)
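
As a rough illustration of such a wrapper for the case discussed here,
the following sketch defines a shell function.  The name grepc is made
up (it is not part of grep), and the sketch assumes GNU grep's -H option
and file names that contain no ":":

```shell
#!/bin/sh
# grepc PATTERN PATH...
# Hypothetical helper: prints "file:count" for every file under PATH
# with at least one matching line, highest count first.
grepc() {
    grepc_pattern=$1
    shift
    # -H forces the file name even when only one file is searched,
    # so the ':0$' filter always has a "file:count" line to work on.
    grep -c -r -H -e "$grepc_pattern" -- "$@" |
        grep -v ':0$' |
        sort -t: -k2,2nr
}

# Example run on a throwaway directory.
demo=$(mktemp -d)
printf 'foo\nfoo\nfoo\n' > "$demo/three.txt"
printf 'foo\n'           > "$demo/one.txt"
printf 'bar\n'           > "$demo/zero.txt"
ranked=$(grepc foo "$demo")
printf '%s\n' "$ranked"
rm -rf "$demo"
```

The example lists three.txt before one.txt and omits zero.txt.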

-- 
Paul Jackson
  jackson <at> fastmail.fm

Information forwarded to bug-grep <at> gnu.org:
bug#70540; Package grep. (Sun, 28 Apr 2024 22:46:02 GMT)

Message #26 received at 70540 <at> debbugs.gnu.org:

From: "Dale R. Worley" <worley <at> alum.mit.edu>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 70540 <at> debbugs.gnu.org
Subject: Re: bug#70540: grep -c -r | grep -v ':0$'
Date: Sun, 28 Apr 2024 18:44:45 -0400
For everyone's critique, here are my changes (to grep 3.11):

diff -u doc/grep.in.1.orig doc/grep.in.1
--- doc/grep.in.1.orig	2024-04-28 18:04:37.494096472 -0400
+++ doc/grep.in.1	2024-04-28 18:24:15.187984393 -0400
@@ -2,7 +2,7 @@
 .de dT
 .ds Dt \\$2
 ..
-.dT Time-stamp: "2019-12-29"
+.dT Time-stamp: "2024-04-28"
 .\" Update the above date whenever a change to either this file or
 .\" grep.c's 'usage' function results in a nontrivial change to the man page.
 .\" In Emacs, you can update the date by running 'M-x time-stamp'
@@ -292,6 +292,9 @@
 With the
 .BR \-v ", " \-\^\-invert\-match
 option (see above), count non-matching lines.
+With the
+.BR \-l ", " \-\^\-files\-with\-matches
+option (see below), only files with non-zero counts are listed.
 .TP
 .BR \-\^\-color [ =\fIWHEN\fP "], " \-\^\-colour [ =\fIWHEN\fP ]
 Surround the matched (non-empty) strings, matching lines, context lines,
diff -u doc/grep.texi.orig doc/grep.texi
--- doc/grep.texi.orig	2024-04-28 18:04:40.302091195 -0400
+++ doc/grep.texi	2024-04-28 18:17:22.567712036 -0400
@@ -301,6 +301,10 @@
 With the @option{-v} (@option{--invert-match}) option,
 count non-matching lines.
 (@option{-c} is specified by POSIX.)
+With the @option{-l} (@option{--files-with-matches}) option,
+only files with non-zero counts are listed.
+(The combination of @option{-c} and @option{-l} is not specified by
+POSIX.)
 
 @item --color[=@var{WHEN}]
 @itemx --colour[=@var{WHEN}]
diff -u src/grep.c.orig src/grep.c
--- src/grep.c.orig	2023-04-10 20:20:47.000000000 -0400
+++ src/grep.c	2024-04-28 17:57:22.527913936 -0400
@@ -1084,6 +1084,8 @@
 static intmax_t out_before;	/* Lines of leading context. */
 static intmax_t out_after;	/* Lines of trailing context. */
 static bool count_matches;	/* Count matching lines.  */
+static bool count_matches_nonzero; /* Count matching lines; only
+				      report files with matches.  */
 static intmax_t max_count;	/* Max number of selected
                                    lines from an input file.  */
 static bool line_buffered;	/* Use line buffering.  */
@@ -1914,17 +1916,20 @@
   count = grep (desc, &st, &ineof);
   if (count_matches)
     {
-      if (out_file)
-        {
-          print_filename ();
-          if (filename_mask)
-            print_sep (SEP_CHAR_SELECTED);
-          else
-            putchar_errno (0);
-        }
-      printf_errno ("%" PRIdMAX "\n", count);
-      if (line_buffered)
-        fflush_errno ();
+      if (!(count_matches_nonzero && count == 0))
+	{
+	  if (out_file)
+	    {
+	      print_filename ();
+	      if (filename_mask)
+		print_sep (SEP_CHAR_SELECTED);
+	      else
+		putchar_errno (0);
+	    }
+	  printf_errno ("%" PRIdMAX "\n", count);
+	  if (line_buffered)
+	    fflush_errno ();
+	}
     }
 
   status = !count;
@@ -2891,9 +2896,16 @@
     }
 
   /* POSIX says -c, -l and -q are mutually exclusive.  In this
-     implementation, -q overrides -l and -L, which in turn override -c.  */
+     implementation, -q overrides -l and -L.  -L in turn overrides -c,
+     but -l is compatible with -c because this implementation uses
+     that combination to specify listing only non-zero counts.  */
   if (exit_on_match | dev_null_output)
     list_files = LISTFILES_NONE;
+  if (count_matches && list_files == LISTFILES_MATCHING)
+    {
+      count_matches_nonzero = true;
+      list_files = LISTFILES_NONE;
+    }
   if ((exit_on_match | dev_null_output) || list_files != LISTFILES_NONE)
     {
       count_matches = false;
diff -u tests/in-eq-out-infloop.orig tests/in-eq-out-infloop
--- tests/in-eq-out-infloop.orig	2024-04-28 18:32:29.435077799 -0400
+++ tests/in-eq-out-infloop	2024-04-28 17:56:59.254957675 -0400
@@ -29,7 +29,7 @@
   compare err.exp err || fail=1
 
   # But with each of the following options it must not exit-2.
-  for i in -q -m1 -l -L; do
+  for i in -q -m1 -l -L -c; do
     timeout 10 grep $i 0 $arg < out >> out 2> err; st=$?
     test $st = 2 && fail=1
   done
diff -u tests/options.orig tests/options
--- tests/options.orig	2024-04-28 18:33:47.645931404 -0400
+++ tests/options	2024-04-28 18:00:23.701573443 -0400
@@ -12,6 +12,9 @@
 # grep [ -E| -F][ -c| -l| -q ][-insvx][-e pattern_list]
 #      -f pattern_file ... [file ...]
 # grep [ -E| -F][ -c| -l| -q ][-insvx] pattern_list [file...]
+#
+# Also checks that the option combination "-c -l" only reports files
+# with non-zero counts.
 
 . "${srcdir=.}/init.sh"; path_prepend_ ../src
 
@@ -46,4 +49,63 @@
         fail=1
 fi
 
+# check the option combination -c -l
+echo 'This file contains foo.' > options.in.foo
+echo 'This file contains bar.' > options.in.bar
+
+# check without options
+output=$( grep foo options.in.* 2> /dev/null )
+if test $? -ne 0 ; then
+        echo "Options: Wrong status code, test #5a failed"
+        fail=1
+fi
+if test "$output" != "options.in.foo:This file contains foo." ; then
+        echo "Options: Wrong output, test #5a failed: $output"
+        fail=1
+fi
+
+# check with -c
+output=$( grep -c foo options.in.* 2> /dev/null )
+if test $? -ne 0 ; then
+        echo "Options: Wrong status code, test #5b failed"
+        fail=1
+fi
+if test "$output" != "$(printf 'options.in.bar:0\noptions.in.foo:1')" ; then
+        echo "Options: Wrong output, test #5b failed: $output"
+        fail=1
+fi
+
+# check with -l
+output=$( grep -l foo options.in.* 2> /dev/null )
+if test $? -ne 0 ; then
+        echo "Options: Wrong status code, test #5c failed"
+        fail=1
+fi
+if test "$output" != "options.in.foo" ; then
+        echo "Options: Wrong output, test #5c failed: $output"
+        fail=1
+fi
+
+# check with -c -l
+output=$( grep -c -l foo options.in.* 2> /dev/null )
+if test $? -ne 0 ; then
+        echo "Options: Wrong status code, test #5d failed"
+        fail=1
+fi
+if test "$output" != "options.in.foo:1" ; then
+        echo "Options: Wrong output, test #5d failed: $output"
+        fail=1
+fi
+
+# check with -v -c -l
+output=$( grep -v -c -l foo options.in.* 2> /dev/null )
+if test $? -ne 0 ; then
+        echo "Options: Wrong status code, test #5e failed"
+        fail=1
+fi
+if test "$output" != "options.in.bar:1" ; then
+        echo "Options: Wrong output, test #5e failed: $output"
+        fail=1
+fi
+
 Exit $fail




This bug report was last modified 5 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.