GNU bug report logs - #19835
RFC: diff: skip initial columns before comparing

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: diffutils; Reported by: Dave Gordon <david.s.gordon@HIDDEN>; dated Wed, 11 Feb 2015 16:51:01 UTC; Maintainer for diffutils is bug-diffutils@HIDDEN.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 11 Feb 2015 16:50:08 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Feb 11 11:50:08 2015
Received: from localhost ([127.0.0.1]:40294 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1YLaUI-00024P-Je
	for submit <at> debbugs.gnu.org; Wed, 11 Feb 2015 11:50:08 -0500
Received: from eggs.gnu.org ([208.118.235.92]:57927)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <david.s.gordon@HIDDEN>) id 1YLYME-00075g-9r
 for submit <at> debbugs.gnu.org; Wed, 11 Feb 2015 09:33:40 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <david.s.gordon@HIDDEN>) id 1YLYM7-0002YZ-Jx
 for submit <at> debbugs.gnu.org; Wed, 11 Feb 2015 09:33:33 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:36182)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <david.s.gordon@HIDDEN>) id 1YLYM7-0002YH-0l
 for submit <at> debbugs.gnu.org; Wed, 11 Feb 2015 09:33:31 -0500
Received: from eggs.gnu.org ([2001:4830:134:3::10]:59644)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <david.s.gordon@HIDDEN>) id 1YLYM5-0002OK-KT
 for bug-diffutils@HIDDEN; Wed, 11 Feb 2015 09:33:30 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <david.s.gordon@HIDDEN>) id 1YLYM2-0002XL-DB
 for bug-diffutils@HIDDEN; Wed, 11 Feb 2015 09:33:29 -0500
Received: from mga11.intel.com ([192.55.52.93]:40945)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <david.s.gordon@HIDDEN>) id 1YLYM2-0002Wu-69
 for bug-diffutils@HIDDEN; Wed, 11 Feb 2015 09:33:26 -0500
Received: from fmsmga003.fm.intel.com ([10.253.24.29])
 by fmsmga102.fm.intel.com with ESMTP; 11 Feb 2015 06:33:22 -0800
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.09,558,1418112000"; d="scan'208";a="453142417"
Received: from dsgordon-linux.isw.intel.com (HELO [10.102.226.149])
 ([10.102.226.149])
 by FMSMGA003.fm.intel.com with ESMTP; 11 Feb 2015 06:18:44 -0800
Message-ID: <54DB6831.7090307@HIDDEN>
Date: Wed, 11 Feb 2015 14:33:21 +0000
From: Dave Gordon <david.s.gordon@HIDDEN>
Organization: Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way,
 Swindon SN3 1RJ
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:31.0) Gecko/20100101 Thunderbird/31.4.0
MIME-Version: 1.0
To: bug-diffutils@HIDDEN
Subject: RFC: diff: skip initial columns before comparing
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
 recognized.
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Wed, 11 Feb 2015 11:50:05 -0500
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.0 (----)

When comparing certain types of files, notably timestamped logfiles
such as the output of dmesg(1), it's necessary to ignore the initial
characters on each line, otherwise every line is different. In the
simplest case, this can be done by applying 'cut(1)' to each input;
but then, important information about when the difference(s) occurred is
lost, and it can be difficult to find the relevant lines in the original
files, especially if they are highly repetitive (as logfiles often are).
When is needed in this situation is to ignore the timestamps for
purposes of comparison, but then include them in any lines copied to
the output.

So this patch adds a new option (long form only) "--ignore-initial=N" to
ignore the first N characters of each line. This is done by skipping the
first N characters of each line in find_and_hash_each_line(), and
likewise lines_differ(). The hashing or comparison of the remaining part
of the line then proceeds as usual.

One subtle point: if both of the lines have less than N characters, the
lines are considered equal iff they have the same length. Usually, the
type of file you would use this option with will have a fixed-format
prefix (which is the part to be ignored), and a line missing this prefix
is generally an indication of a formatting error. So a line with the
prefix but no further content should NOT match an empty line or a line
with a truncated prefix; but we still want two empty lines to match each
other.

For example, with --ignore-initial=10:

These lines match:
[22:47:25] hello
[23:17:24] hello

These lines don't match:
[22:47:25] hello
[23:17:24]

Nor do these:
[22:47:]
[23:17:24]

But these do:
[NOCLOCK]
[CLKFAIL]

Hope this looks useful!
.Dave.

-----------------------
diff --git a/src/diff.c b/src/diff.c
index 50d0365..eccce21 100644
--- a/src/diff.c
+++ b/src/diff.c
@@ -121,6 +121,7 @@ enum
   NO_IGNORE_FILE_NAME_CASE_OPTION,
   NORMAL_OPTION,
   SDIFF_MERGE_ASSIST_OPTION,
+  SKIP_INITIAL_OPTION,
   STRIP_TRAILING_CR_OPTION,
   SUPPRESS_BLANK_EMPTY_OPTION,
   SUPPRESS_COMMON_LINES_OPTION,
@@ -173,6 +174,7 @@ static struct option const longopts[] =
   {"ignore-blank-lines", 0, 0, 'B'},
   {"ignore-case", 0, 0, 'i'},
   {"ignore-file-name-case", 0, 0, IGNORE_FILE_NAME_CASE_OPTION},
+  {"ignore-initial", 1, 0, SKIP_INITIAL_OPTION},
   {"ignore-matching-lines", 1, 0, 'I'},
   {"ignore-space-change", 0, 0, 'b'},
   {"ignore-tab-expansion", 0, 0, 'E'},
@@ -580,6 +582,18 @@ main (int argc, char **argv)
          sdiff_merge_assist = true;
          break;

+       case SKIP_INITIAL_OPTION:
+         numval = strtoumax (optarg, &numend, 10);
+         if (! (0 < numval && numval <= SIZE_MAX) || *numend)
+           try_help ("invalid initial skip '%s'", optarg);
+         if (initial_skip != numval)
+           {
+             if (initial_skip)
+               fatal ("conflicting initial skip options");
+             initial_skip = numval;
+           }
+         break;
+
        case STRIP_TRAILING_CR_OPTION:
          strip_trailing_cr = true;
          break;
@@ -724,7 +738,8 @@ main (int argc, char **argv)
   files_can_be_treated_as_binary =
     (brief & binary
      & ~ (ignore_blank_lines | ignore_case | strip_trailing_cr
-         | (ignore_regexp_list.regexps || ignore_white_space)));
+         | (ignore_regexp_list.regexps || ignore_white_space
+                 || initial_skip)));

   switch_string = option_list (argv + 1, optind - 1);

@@ -895,6 +910,7 @@ static char const * const option_help_msgid[] = {
   N_("-w, --ignore-all-space          ignore all white space"),
   N_("-B, --ignore-blank-lines        ignore changes where lines are
all blank"),
   N_("-I, --ignore-matching-lines=RE  ignore changes where all lines
match RE"),
+  N_("    --ignore-initial=SKIP       ignore the initial SKIP
characters of each line"),
   "",
   N_("-a, --text                      treat all files as text"),
   N_("    --strip-trailing-cr         strip trailing carriage return on
input"),
diff --git a/src/diff.h b/src/diff.h
index e9f0471..b638a3f 100644
--- a/src/diff.h
+++ b/src/diff.h
@@ -125,6 +125,9 @@ XTERN enum DIFF_white_space ignore_white_space;
 /* Ignore changes that affect only blank lines (-B).  */

+/* Skip this many initial characters on each line */
+XTERN size_t initial_skip;
+
 /* Files can be compared byte-by-byte, as if they were binary.
    This depends on various options.  */
 XTERN bool files_can_be_treated_as_binary;
diff --git a/src/io.c b/src/io.c
index 463ee35..7e15996 100644
--- a/src/io.c
+++ b/src/io.c
@@ -232,13 +232,18 @@ find_and_hash_each_line (struct file_data *current)
   bool diff_length_compare_anyway =
     ig_white_space != IGNORE_NO_WHITE_SPACE;
   bool same_length_diff_contents_compare_anyway =
-    diff_length_compare_anyway | ig_case;
+    diff_length_compare_anyway | ig_case || initial_skip != 0;

   while (p < suffix_begin)
     {
       char const *ip = p;
       hash_value h = 0;
       unsigned char c;
+      size_t skip = initial_skip;
+
+      while (skip--)
+       if ((c = *p++) == '\n')
+         goto hashing_done;

       /* Hash this line until we find a newline.  */
       switch (ig_white_space)
diff --git a/src/util.c b/src/util.c
index 016057d..0acba06 100644
--- a/src/util.c
+++ b/src/util.c
@@ -413,6 +413,16 @@ lines_differ (char const *s1, char const *s2)
   register char const *t1 = s1;
   register char const *t2 = s2;
   size_t column = 0;
+  size_t skip = initial_skip;
+
+  while (skip--)
+    {
+      register unsigned char c1 = *t1++;
+      register unsigned char c2 = *t2++;
+
+      if (c1 == '\n' || c2 == '\n')
+       return c1 != c2;
+    }

   while (1)
     {





Acknowledgement sent to Dave Gordon <david.s.gordon@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-diffutils@HIDDEN. Full text available.
Report forwarded to bug-diffutils@HIDDEN:
bug#19835; Package diffutils. Full text available.
Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.