GNU bug report logs - #6903
join: support numeric keys

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: coreutils; Severity: wishlist; Reported by: Bernhard Schiffner <bernhard@HIDDEN>; dated Tue, 24 Aug 2010 19:57:01 UTC; Maintainer for coreutils is bug-coreutils@HIDDEN.
Changed bug title to 'join: support numeric keys' from 'join: improve paralleles to sort?' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.
Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 6903 <at> debbugs.gnu.org:


Received: (at 6903) by debbugs.gnu.org; 26 Aug 2010 19:07:18 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Aug 26 15:07:18 2010
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1OohnC-0005jY-BI
	for submit <at> debbugs.gnu.org; Thu, 26 Aug 2010 15:07:18 -0400
Received: from moutng.kundenserver.de ([212.227.17.9])
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <bernhard@HIDDEN>) id 1Oohn9-0005jT-BJ
	for 6903 <at> debbugs.gnu.org; Thu, 26 Aug 2010 15:07:17 -0400
Received: from bs7.localnet (dialin-212-144-019-040.pools.arcor-ip.net
	[212.144.19.40])
	by mrelayeu.kundenserver.de (node=mreu1) with ESMTP (Nemesis)
	id 0Mgrky-1OSl6o0Afw-00MaIT; Thu, 26 Aug 2010 21:08:36 +0200
From: Bernhard Schiffner <bernhard@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#6903: join: improve paralleles to sort?
Date: Thu, 26 Aug 2010 21:08:29 +0200
User-Agent: KMail/1.13.5 (Linux/2.6.33.4-0.1-desktop; KDE/4.4.4; i686; ; )
References: <201008242139.21283.bernhard@HIDDEN>
	<201008250857.22805.bernhard@HIDDEN>
	<4C754335.8030700@HIDDEN>
In-Reply-To: <4C754335.8030700@HIDDEN>
MIME-Version: 1.0
Content-Type: Multipart/Mixed;
  boundary="Boundary-00=_uurdMV41pQL7euK"
Message-Id: <201008262108.30695.bernhard@HIDDEN>
X-Provags-ID: V02:K0:v7+P/bsSIyh81N3/dlgkaB+73dx0dz9pIcRl6ZNJRka
	FN9B2R+W6NBDy+2az6Yhq7auDXemoMw12WOWSR1t3Bi1kxLtYi
	9a9CKHh034CnzapINjC3hYyIh1WlTczz0NzVlUBdEJFW120HLt
	g3JYD4v7MADpqUz0yo/SnrBHhr7GGdpZF5pHyNiKm3WsI4+E0r
	YdQH3zxfdHG3Wa95dpN/g==
X-Spam-Score: 0.0 (/)
X-Debbugs-Envelope-To: 6903
Cc: 6903 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.11
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/pipermail/debbugs-submit>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
	<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Sender: debbugs-submit-bounces <at> debbugs.gnu.org
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
X-Spam-Score: -1.3 (-)

--Boundary-00=_uurdMV41pQL7euK
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit

On Wednesday, 25 August 2010, 18:22:13, Paul Eggert wrote:
> On 08/24/2010 11:57 PM, Bernhard Schiffner wrote:
> > 2146427	/LBAtoJM/ROOT/WEB-INF/lib/hibernate-3.2.0.cr3.jar
> > 214618118	/temp/marketing_ms/emails.dat
> 
> That won't work, because the two lines are not sorted correctly.
> Recall that join uses lexicographic comparison, not numeric.
> Its input must be sorted lexicographically.

Ok.
I solved my problem using the attached patch.

The patch shows that it is possible to use different sort orders for the
key (the join field) in join.

I copied most of the relevant code from sort.c verbatim in order to see
what is needed to compile it successfully in join.c.
I did no tests besides my special use case mentioned earlier.

It's clear that user-friendly key selection needs a lot more work. The same
goes for a unified version of join and sort.

Thanks to Paul and Christian Perle for their valuable help so far.

The FSF can make any use of the code here.
It was theirs already before ;-)


Bernhard



--Boundary-00=_uurdMV41pQL7euK
Content-Type: text/x-patch;
  charset="UTF-8";
  name="join_proposal_2.diff"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
	filename="join_proposal_2.diff"

diff --git a/src/join.c b/src/join.c
index fa18c9d..b02dc08 100644
--- a/src/join.c
+++ b/src/join.c
@@ -47,6 +47,308 @@
   b = tmp; \
 } while (0);
 
+/*
+ * The code here is as close to a verbatim copy from sort.c as possible.
+ * It's here to test the unification of key handling between sort and join.
+ *
+ * Test it in join.c first and decide about move into a shared library
+ * later on.
+ */
+
+/* TODO : check includes carefully */
+#include "strnumcmp.h"
+/* #include "hard-locale.h" */
+#include "langinfo.h"
+#include "strnumcmp.h"
+/* #include "unistr.h" */
+
+#define UCHAR_LIM (UCHAR_MAX + 1)
+
+#if HAVE_C99_STRTOLD
+# define long_double long double
+#else
+# define long_double double
+# undef strtold
+# define strtold strtod
+#endif
+
+/* The representation of the decimal point in the current locale.  */
+static int decimal_point;
+
+/* Thousands separator; if -1, then there isn't one.  */
+static int thousands_sep;
+
+/* Nonzero if the corresponding locales are hard.  */
+static bool hard_LC_COLLATE;
+#if HAVE_NL_LANGINFO
+static bool hard_LC_TIME;
+#endif
+
+/* The kind of blanks for '-b' to skip in various options. */
+enum blanktype { bl_start, bl_end, bl_both };
+
+/* Table of blanks.  */
+static bool blanks[UCHAR_LIM];
+
+/* Table of non-printing characters. */
+static bool nonprinting[UCHAR_LIM];
+
+/* Table of non-dictionary characters (not letters, digits, or blanks). */
+static bool nondictionary[UCHAR_LIM];
+
+/* Translation table folding lower case to upper.  */
+static unsigned char fold_toupper[UCHAR_LIM];
+
+#define MONTHS_PER_YEAR 12
+
+struct month
+{
+  char const *name;
+  int val;
+};
+
+/* Table mapping month names to integers.
+   Alphabetic order allows binary search. */
+static struct month monthtab[] =
+{
+  {"APR", 4},
+  {"AUG", 8},
+  {"DEC", 12},
+  {"FEB", 2},
+  {"JAN", 1},
+  {"JUL", 7},
+  {"JUN", 6},
+  {"MAR", 3},
+  {"MAY", 5},
+  {"NOV", 11},
+  {"OCT", 10},
+  {"SEP", 9}
+};
+
+#if HAVE_NL_LANGINFO
+
+static int
+struct_month_cmp (void const *m1, void const *m2)
+{
+  struct month const *month1 = m1;
+  struct month const *month2 = m2;
+  return strcmp (month1->name, month2->name);
+}
+
+#endif
+
+/* Initialize the character class tables. */
+
+static void
+inittables (void)
+{
+  size_t i;
+
+  for (i = 0; i < UCHAR_LIM; ++i)
+    {
+      blanks[i] = !! isblank (i);
+      nonprinting[i] = ! isprint (i);
+      nondictionary[i] = ! isalnum (i) && ! isblank (i);
+      fold_toupper[i] = toupper (i);
+    }
+
+#if HAVE_NL_LANGINFO
+  /* If we're not in the "C" locale, read different names for months.  */
+  if (hard_LC_TIME)
+    {
+      for (i = 0; i < MONTHS_PER_YEAR; i++)
+        {
+          char const *s;
+          size_t s_len;
+          size_t j, k;
+          char *name;
+
+          s = nl_langinfo (ABMON_1 + i);
+          s_len = strlen (s);
+          monthtab[i].name = name = xmalloc (s_len + 1);
+          monthtab[i].val = i + 1;
+
+          for (j = k = 0; j < s_len; j++)
+            if (! isblank (to_uchar (s[j])))
+              name[k++] = fold_toupper[to_uchar (s[j])];
+          name[k] = '\0';
+        }
+      qsort (monthtab, MONTHS_PER_YEAR, sizeof *monthtab, struct_month_cmp);
+    }
+#endif
+}
+
+/* Table that maps characters to order-of-magnitude values.  */
+static char const unit_order[UCHAR_LIM] =
+  {
+#if ! ('K' == 75 && 'M' == 77 && 'G' == 71 && 'T' == 84 && 'P' == 80 \
+     && 'E' == 69 && 'Z' == 90 && 'Y' == 89 && 'k' == 107)
+    /* This initializer syntax works on all C99 hosts.  For now, use
+       it only on non-ASCII hosts, to ease the pain of porting to
+       pre-C99 ASCII hosts.  */
+    ['K']=1, ['M']=2, ['G']=3, ['T']=4, ['P']=5, ['E']=6, ['Z']=7, ['Y']=8,
+    ['k']=1,
+#else
+    /* Generate the following table with this command:
+       perl -e 'my %a=(k=>1, K=>1, M=>2, G=>3, T=>4, P=>5, E=>6, Z=>7, Y=>8);
+       foreach my $i (0..255) {my $c=chr($i); $a{$c} ||= 0;print "$a{$c}, "}'\
+       |fmt  */
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 3,
+    0, 0, 0, 1, 0, 2, 0, 0, 5, 0, 0, 0, 4, 0, 0, 0, 0, 8, 7, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+#endif
+  };
+
+/* Return an integer that represents the order of magnitude of the
+   unit following the number.  The number may contain thousands
+   separators and a decimal point, but it may not contain leading blanks.
+   Negative numbers get negative orders; zero numbers have a zero order.  */
+
+static int
+find_unit_order (char const *number)
+{
+  bool minus_sign = (*number == '-');
+  char const *p = number + minus_sign;
+  int nonzero = 0;
+  unsigned char ch;
+
+  /* Scan to end of number.
+     Decimals or separators not followed by digits stop the scan.
+     Numbers ending in decimals or separators are thus considered
+     to be lacking in units.
+     FIXME: add support for multibyte thousands_sep and decimal_point.  */
+
+  do
+    {
+      while (ISDIGIT (ch = *p++))
+        nonzero |= ch - '0';
+    }
+  while (ch == thousands_sep);
+
+  if (ch == decimal_point)
+    while (ISDIGIT (ch = *p++))
+      nonzero |= ch - '0';
+
+  if (nonzero)
+    {
+      int order = unit_order[ch];
+      return (minus_sign ? -order : order);
+    }
+  else
+    return 0;
+}
+
+/* Compare numbers A and B ending in units with SI or IEC prefixes
+       <none/unknown> < K/k < M < G < T < P < E < Z < Y  */
+
+static int
+human_numcompare (char const *a, char const *b)
+{
+  while (blanks[to_uchar (*a)])
+    a++;
+  while (blanks[to_uchar (*b)])
+    b++;
+
+  int diff = find_unit_order (a) - find_unit_order (b);
+  return (diff ? diff : strnumcmp (a, b, decimal_point, thousands_sep));
+}
+
+/* Compare strings A and B as numbers without explicitly converting them to
+   machine numbers.  Comparatively slow for short strings, but asymptotically
+   hideously fast. */
+
+static int
+numcompare (char const *a, char const *b)
+{
+  while (blanks[to_uchar (*a)])
+    a++;
+  while (blanks[to_uchar (*b)])
+    b++;
+
+  return strnumcmp (a, b, decimal_point, thousands_sep);
+}
+
+static int
+general_numcompare (char const *sa, char const *sb)
+{
+  /* FIXME: maybe add option to try expensive FP conversion
+     only if A and B can't be compared more cheaply/accurately.  */
+
+  char *ea;
+  char *eb;
+  long_double a = strtold (sa, &ea);
+  long_double b = strtold (sb, &eb);
+
+  /* Put conversion errors at the start of the collating sequence.  */
+  if (sa == ea)
+    return sb == eb ? 0 : -1;
+  if (sb == eb)
+    return 1;
+
+  /* Sort numbers in the usual way, where -0 == +0.  Put NaNs after
+     conversion errors but before numbers; sort them by internal
+     bit-pattern, for lack of a more portable alternative.  */
+  return (a < b ? -1
+          : a > b ? 1
+          : a == b ? 0
+          : b == b ? -1
+          : a == a ? 1
+          : memcmp (&a, &b, sizeof a));
+}
+
+/* Return an integer in 1..12 of the month name MONTH.
+   Return 0 if the name in MONTH is not recognized.  */
+
+static int
+getmonth (char const *month, char **ea)
+{
+  size_t lo = 0;
+  size_t hi = MONTHS_PER_YEAR;
+
+  while (blanks[to_uchar (*month)])
+    month++;
+
+  do
+    {
+      size_t ix = (lo + hi) / 2;
+      char const *m = month;
+      char const *n = monthtab[ix].name;
+
+      for (;; m++, n++)
+        {
+          if (!*n)
+            {
+              if (ea)
+                *ea = (char *) m;
+              return monthtab[ix].val;
+            }
+          if (fold_toupper[to_uchar (*m)] < to_uchar (*n))
+            {
+              hi = ix;
+              break;
+            }
+          else if (fold_toupper[to_uchar (*m)] > to_uchar (*n))
+            {
+              lo = ix + 1;
+              break;
+            }
+        }
+    }
+  while (lo < hi);
+
+  return 0;
+}
+
+/* Import from sort.c ends here */
+
 /* An element of the list identifying which fields to print for each
    output line.  */
 struct outlist
@@ -201,6 +503,24 @@ by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.\n\
   --header          treat the first line in each file as field headers,\n\
                       print them without trying to pair them\n\
 "), stdout);
+      fputs (_("\
+  -b, --ignore-leading-blanks  ignore leading blanks\n\
+  -d, --dictionary-order      consider only blanks and alphanumeric characters\n\
+  -f, --ignore-case           fold lower case to upper case characters\n\
+"), stdout);
+      fputs (_("\
+  -g, --general-numeric-sort  compare according to general numerical value\n\
+  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)\n\
+  -n, --numeric-sort          compare according to string numerical value\n\
+  -M, --month-sort            compare (unknown) < `JAN' < ... < `DEC'\n\
+  -V, --version-sort          natural sort of (version) numbers within text\n\
+"), stdout);
+      fputs (_("\
+      --sort=WORD             sort according to WORD:\n\
+                                general-numeric -g, human-numeric -h, month -M,\n\
+                                numeric -n, version -V\n\
+\n\
+"), stdout);
       fputs (HELP_OPTION_DESCRIPTION, stdout);
       fputs (VERSION_OPTION_DESCRIPTION, stdout);
       fputs (_("\
@@ -329,6 +649,16 @@ keycmp (struct line const *line1, struct line const *line2,
       len2 = 0;
     }
 
+/* The following ifdef'ed part made it possible to solve the problem that
+ * http://debbugs.gnu.org/cgi/bugreport.cgi?bug=6903
+ * describes.
+ * A real solution will need lots of improvements here.
+ */
+
+#undef KEY_NUMCMP
+#ifdef KEY_NUMCMP
+  diff = numcompare (beg1, beg2);
+#else
   if (len1 == 0)
     return len2 == 0 ? 0 : -1;
   if (len2 == 0)
@@ -343,13 +673,15 @@ keycmp (struct line const *line1, struct line const *line2,
   else
     {
       if (hard_LC_COLLATE)
-        return xmemcoll (beg1, len1, beg2, len2);
-      diff = memcmp (beg1, beg2, MIN (len1, len2));
+        diff = xmemcoll (beg1, len1, beg2, len2);
+      else
+        diff = memcmp (beg1, beg2, MIN (len1, len2));
     }
 
-  if (diff)
-    return diff;
-  return len1 < len2 ? -1 : len1 != len2;
+  if (! diff)
+    diff = len1 < len2 ? -1 : len1 != len2;
+#endif
+  return diff;
 }
 
 /* Check that successive input lines PREV and CURRENT from input file
@@ -975,6 +1307,8 @@ main (int argc, char **argv)
   issued_disorder_warning[0] = issued_disorder_warning[1] = false;
   check_input_order = CHECK_ORDER_DEFAULT;
 
+  inittables();
+
   while ((optc = getopt_long (argc, argv, "-a:e:i1:2:j:o:t:v:",
                               longopts, NULL))
          != -1)

--Boundary-00=_uurdMV41pQL7euK--
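
The numcompare comment in the patch above speaks of comparing strings as numbers without converting them to machine numbers. A minimal sketch of that idea, restricted to plain non-negative decimal integers, might look as follows. It is not the strnumcmp routine the patch calls (that one also deals with signs, decimal points and thousands separators); the name digits_cmp is hypothetical and exists only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of coreutils): compare two strings of
   ASCII digits as non-negative integers without converting them.
   Returns a negative, zero, or positive value.  */
int
digits_cmp (char const *a, char const *b)
{
  assert (a && b);

  /* Leading zeros do not change the value.  */
  while (*a == '0')
    a++;
  while (*b == '0')
    b++;

  /* Count the significant digits of each number.  */
  size_t la = strspn (a, "0123456789");
  size_t lb = strspn (b, "0123456789");

  /* More significant digits means a larger number.  */
  if (la != lb)
    return la < lb ? -1 : 1;

  /* Same length: the first differing digit decides.  */
  return memcmp (a, b, la);
}
```

Because no conversion takes place, the digit strings can be arbitrarily long; only the length check and one byte-wise pass are needed.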




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#6903; Package coreutils. Full text available.

Message received at 6903 <at> debbugs.gnu.org:


Received: (at 6903) by debbugs.gnu.org; 25 Aug 2010 16:21:10 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Aug 25 12:21:10 2010
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1OoIir-0000mV-Jf
	for submit <at> debbugs.gnu.org; Wed, 25 Aug 2010 12:21:10 -0400
Received: from vms173005pub.verizon.net ([206.46.173.5])
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <eggert@HIDDEN>) id 1OoIip-0000mC-HQ
	for 6903 <at> debbugs.gnu.org; Wed, 25 Aug 2010 12:21:08 -0400
Received: from [192.168.1.10] ([unknown] [71.189.109.235])
	by vms173005.mailsrvcs.net
	(Sun Java(tm) System Messaging Server 7u2-7.02 32bit (built Apr 16
	2009)) with ESMTPA id <0L7P00KN5UT18X63@HIDDEN> for
	6903 <at> debbugs.gnu.org; Wed, 25 Aug 2010 11:22:19 -0500 (CDT)
Message-id: <4C754335.8030700@HIDDEN>
Date: Wed, 25 Aug 2010 09:22:13 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.11) Gecko/20100713
	Thunderbird/3.0.6
MIME-version: 1.0
To: Bernhard Schiffner <bernhard@HIDDEN>
Subject: Re: bug#6903: join: improve paralleles to sort?
References: <201008242139.21283.bernhard@HIDDEN>
	<4C74386B.30004@HIDDEN>
	<201008250857.22805.bernhard@HIDDEN>
In-reply-to: <201008250857.22805.bernhard@HIDDEN>
Content-type: text/plain; charset=UTF-8
Content-transfer-encoding: 7bit
X-Spam-Score: -2.2 (--)
X-Debbugs-Envelope-To: 6903
Cc: 6903 <at> debbugs.gnu.org
X-Spam-Score: -2.1 (--)

On 08/24/2010 11:57 PM, Bernhard Schiffner wrote:
> 2146427	/LBAtoJM/ROOT/WEB-INF/lib/hibernate-3.2.0.cr3.jar
> 214618118	/temp/marketing_ms/emails.dat

That won't work, because the two lines are not sorted correctly.
Recall that join uses lexicographic comparison, not numeric.
Its input must be sorted lexicographically.

You can sort its output numerically later, if you prefer numeric
order.
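
To make the ordering problem concrete, the two keys quoted above can be compared both ways. The sketch below uses strcmp as a stand-in for join's string comparison in the C locale; the helper names key_cmp_lexical and key_cmp_numeric are hypothetical, and the numeric variant assumes non-negative decimal keys that fit in an unsigned long long.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Byte-wise (lexicographic) order, as in the C locale.  */
int
key_cmp_lexical (char const *a, char const *b)
{
  assert (a && b);
  return strcmp (a, b);
}

/* Numeric order for non-negative decimal keys (illustration only).  */
int
key_cmp_numeric (char const *a, char const *b)
{
  assert (a && b);
  unsigned long long x = strtoull (a, NULL, 10);
  unsigned long long y = strtoull (b, NULL, 10);
  return (x > y) - (x < y);
}
```

For "2146427" and "214618118" the numeric comparison puts 2146427 first, while the lexical one puts it last: the strings agree on "2146" and differ at the fifth character, where '4' sorts after '1'.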




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#6903; Package coreutils. Full text available.

Message received at 6903 <at> debbugs.gnu.org:


Received: (at 6903) by debbugs.gnu.org; 25 Aug 2010 06:55:58 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Aug 25 02:55:58 2010
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Oo9tu-0004hB-GR
	for submit <at> debbugs.gnu.org; Wed, 25 Aug 2010 02:55:58 -0400
Received: from moutng.kundenserver.de ([212.227.17.8])
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <bernhard@HIDDEN>) id 1Oo9ts-0004h6-An
	for 6903 <at> debbugs.gnu.org; Wed, 25 Aug 2010 02:55:57 -0400
Received: from bs7.localnet (smtp.transinsight.com [87.139.58.41])
	by mrelayeu.kundenserver.de (node=mreu0) with ESMTP (Nemesis)
	id 0Mefpk-1OUP362bzE-00OJXD; Wed, 25 Aug 2010 08:57:17 +0200
From: Bernhard Schiffner <bernhard@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#6903: join: improve paralleles to sort?
Date: Wed, 25 Aug 2010 08:57:21 +0200
User-Agent: KMail/1.13.5 (Linux/2.6.33.4-0.1-desktop; KDE/4.4.4; i686; ; )
References: <201008242139.21283.bernhard@HIDDEN>
	<4C74386B.30004@HIDDEN>
In-Reply-To: <4C74386B.30004@HIDDEN>
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="utf-8"
Content-Transfer-Encoding: 7bit
Message-Id: <201008250857.22805.bernhard@HIDDEN>
X-Provags-ID: V02:K0:KwGgrh/ZPJu+HNGsA2jO0CjKYv8SG0JrBIjGK95iJih
	qUuNukhe7Z1llio7gAEdJcD5XEj98VxkxmSaVRoFt4ADQYYb2g
	VNS7BydVQN44x86pIFEpTbWKtLjhEnTV6A0ZecPRrwn9YbfY3G
	7J79jdT7kHeF3qDq9vbzc8mUvLu/XTbQrC/7f0trdKr99TtXz4
	hIzvi4TV3ILQgHS79QkpA==
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: 6903
Cc: 6903 <at> debbugs.gnu.org
X-Spam-Score: -3.5 (---)

On Tuesday, 24 August 2010, 23:23:55, Paul Eggert wrote:
> On 08/24/2010 12:39 PM, Bernhard Schiffner wrote:
> > Because join uses strtoul() before doing the comparison, this is
> > understandable. ("unpairable" is the result.)
> 
> No, join doesn't use strtoul. 
I was wrong. (It is used for the number of the field to join on.)

> It compares the numbers as strings.
> So if you use plain "sort" on the numbers, join will work, unless the
> numbers are numerically equal but textually different (e.g., 0 versus -0).
Not a problem for me.
> You can then sort the output of join with "sort -n", if you wish.

A small test case is included here.
Run
join a  b
and try to understand why the lines with
214618118	/temp/marketing_ms/emails.dat
214618118	/temp/bs/marketing_ms/emails.dat
are not in the result.
Do you see any reason?

Perhaps I'm misusing join here a little bit, but so far I don't
understand why it should be wrong.
Before I blame someone else, I'll try to dig a little deeper myself.

TIA!

Bernhard

File a:
21460	/ElsevierDocuments/EWX0886A/09218181/00220001/99000417/main.raw
21460	/ElsevierDocuments/EWX0889A/00319201/01200001/00001461/main.raw
21464	/apache/xerces/dom/DeferredAttrNSImpl.html
21466	/spam/1206882672_000701c89267_03453ee8_21fcd5a0@jlsvsf
21467	/MINING/MIN0002A/03605442/00230009/98000218/main.raw
21468	/___MRA/___sophos_autoupdate1.dir/1207625107/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1208238697/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1208834890/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209153877/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209404409/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209710971/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1209737271/encloa-b.ide
21468	/___MRA/___sophos_autoupdate1.dir/1214978929/encloa-b.ide
21469	/ElsevierDocuments/EWX0886A/09218181/00370003/02001996/main.raw
21469	/ElsevierDocuments/EWX0890A/00335894/00660002/06000846/main.xml
21469	/ElsevierDocuments/MINING/MIN0001A/01968904/00420007/00000911/main.raw
214602	/ElsevierDocuments/EWX0876A/00370738/01710001/04002477/main.xml
214604	/ElsevierDocuments/EWX0881A/00128252/00700001/04001333/main.xml
214614	/ElsevierDocuments/EWX0887A/02773791/00240020/05000223/main.xml
214666	/ElsevierDocuments/EWX0886A/09218181/00600003/07000240/main.xml
214682	/ElsevierDocuments/EWX0879A/0012821X/02430003/06000367/main.xml
2146369	/marketing/diffferent_Berichtsband_Online_Crossmedia_Kampagnen.pdf
2146427	/LBAtoJM/ROOT/WEB-INF/lib/hibernate-3.2.0.cr3.jar
214618118	/temp/marketing_ms/emails.dat
214618118	/temp/bs/marketing_ms/emails.dat
214618120	/temp/marketing_js/emails.dat

File b:
21460
21468
21469
214618118
215777777
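
The missing lines can be reproduced with a simplified model of a merge join. The sketch below is an assumption-laden illustration, not join.c: it pairs keys in one forward pass, ignores duplicate keys, and uses strcmp for the C locale. The names count_paired, cmp_lexical and cmp_numeric are hypothetical; the key arrays hold the distinct keys of file a and the keys of file b in the order posted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef int (*key_cmp_fn) (char const *, char const *);

int
cmp_lexical (char const *a, char const *b)
{
  return strcmp (a, b);
}

int
cmp_numeric (char const *a, char const *b)
{
  unsigned long long x = strtoull (a, NULL, 10);
  unsigned long long y = strtoull (b, NULL, 10);
  return (x > y) - (x < y);
}

/* Distinct keys of file a and the keys of file b, in the order posted.  */
char const *const keys_a[] = {
  "21460", "21464", "21466", "21467", "21468", "21469", "214602",
  "214604", "214614", "214666", "214682", "2146369", "2146427",
  "214618118", "214618120"
};
char const *const keys_b[] = {
  "21460", "21468", "21469", "214618118", "215777777"
};

/* Count the keys paired by a single forward merge pass that trusts
   CMP to describe the order of both inputs.  */
size_t
count_paired (char const *const *a, size_t na,
              char const *const *b, size_t nb, key_cmp_fn cmp)
{
  size_t i = 0, j = 0, paired = 0;

  assert (cmp);
  while (i < na && j < nb)
    {
      int d = cmp (a[i], b[j]);
      if (d < 0)
        i++;                    /* key only in a so far: advance a */
      else if (d > 0)
        j++;                    /* key only in b so far: advance b */
      else
        {
          paired++;
          i++;
          j++;
        }
    }
  return paired;
}
```

With cmp_lexical the pass pairs only 21460, 21468 and 21469. Once it reaches 214666 in file a, which sorts after 214618118 as a string, it moves past 214618118 in file b and never pairs it with the matching lines further down in file a. With cmp_numeric, which matches the order the files are actually in, all four common keys are paired.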





Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#6903; Package coreutils. Full text available.

Message received at 6903 <at> debbugs.gnu.org:


Received: (at 6903) by debbugs.gnu.org; 24 Aug 2010 21:22:55 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Aug 24 17:22:55 2010
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Oo0xL-0000tc-K2
	for submit <at> debbugs.gnu.org; Tue, 24 Aug 2010 17:22:55 -0400
Received: from vms173001pub.verizon.net ([206.46.173.1])
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <eggert@HIDDEN>) id 1Oo0xJ-0000tX-Mi
	for 6903 <at> debbugs.gnu.org; Tue, 24 Aug 2010 17:22:54 -0400
Received: from [192.168.1.10] ([unknown] [71.189.109.235])
	by vms173001.mailsrvcs.net
	(Sun Java(tm) System Messaging Server 7u2-7.02 32bit (built Apr 16
	2009)) with ESMTPA id <0L7O00L1JE3VSOA4@HIDDEN> for
	6903 <at> debbugs.gnu.org; Tue, 24 Aug 2010 16:23:58 -0500 (CDT)
Message-id: <4C74386B.30004@HIDDEN>
Date: Tue, 24 Aug 2010 14:23:55 -0700
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.11) Gecko/20100713
	Thunderbird/3.0.6
MIME-version: 1.0
To: Bernhard Schiffner <bernhard@HIDDEN>
Subject: Re: bug#6903: join: improve paralleles to sort?
References: <201008242139.21283.bernhard@HIDDEN>
In-reply-to: <201008242139.21283.bernhard@HIDDEN>
Content-type: text/plain; charset=UTF-8
Content-transfer-encoding: 7bit
X-Spam-Score: -2.0 (--)
X-Debbugs-Envelope-To: 6903
Cc: 6903 <at> debbugs.gnu.org
X-Spam-Score: -2.0 (--)

On 08/24/2010 12:39 PM, Bernhard Schiffner wrote:
> Because join uses strtoul() before doing the comparison, this is understandable. 
> ("unpairable" is the result.)

No, join doesn't use strtoul.  It compares the numbers as strings.
So if you use plain "sort" on the numbers, join will work, unless the
numbers are numerically equal but textually different (e.g., 0 versus -0).
You can then sort the output of join with "sort -n", if you wish.

> Do you see a chance to extend join with a -n parameter for numeric 
> comparison, as sort already has?

That would be a nice thing to add, if someone had the time to do it.
Generally speaking, any comparison that "sort" can do, "join" should
do too (except for random comparison I suppose).

The comparison code between sort and join should be shared, of course.
Can you write that?
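
The workaround described above (sort the input as plain strings before joining) can be sketched in C. This is an illustration under stated assumptions rather than coreutils code: strcmp models the C locale only, and the names first_disorder and sort_keys_lexical are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort callback: order an array of C strings byte-wise.  */
static int
cmp_str_ptr (void const *pa, void const *pb)
{
  char const *const *a = pa;
  char const *const *b = pb;
  return strcmp (*a, *b);
}

/* Return the index of the first key that sorts before its predecessor
   as a string, or N if the keys are in lexicographic order.  */
size_t
first_disorder (char const *const *keys, size_t n)
{
  assert (keys || n == 0);
  for (size_t i = 1; i < n; i++)
    if (strcmp (keys[i - 1], keys[i]) > 0)
      return i;
  return n;
}

/* Re-sort the keys byte-wise, as plain "sort" would in the C locale.  */
void
sort_keys_lexical (char const **keys, size_t n)
{
  qsort (keys, n, sizeof *keys, cmp_str_ptr);
}
```

Given the keys 21469, 214602, 2146427 and 214618118 in numeric order, first_disorder reports index 1, since "214602" sorts before "21469" as a string. After sort_keys_lexical the order becomes 214602, 214618118, 2146427, 21469, which is the kind of order a string-comparing join expects.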




Information forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#6903; Package coreutils. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 24 Aug 2010 19:56:41 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Tue Aug 24 15:56:41 2010
Received: from localhost ([127.0.0.1] helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1Onzbs-0006jw-1w
	for submit <at> debbugs.gnu.org; Tue, 24 Aug 2010 15:56:40 -0400
Received: from mail.gnu.org ([199.232.76.166] helo=mx10.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <bernhard@HIDDEN>) id 1OnzW3-0006h9-M4
	for submit <at> debbugs.gnu.org; Tue, 24 Aug 2010 15:50:40 -0400
Received: from lists.gnu.org ([199.232.76.165]:36077)
	by monty-python.gnu.org with esmtps
	(TLS-1.0:DHE_RSA_AES_256_CBC_SHA1:32) (Exim 4.60)
	(envelope-from <bernhard@HIDDEN>) id 1OnzXL-0005zO-7E
	for submit <at> debbugs.gnu.org; Tue, 24 Aug 2010 15:51:59 -0400
Received: from [140.186.70.92] (port=47285 helo=eggs.gnu.org)
	by lists.gnu.org with esmtp (Exim 4.43) id 1OnzXH-0004Mt-6D
	for bug-coreutils@HIDDEN; Tue, 24 Aug 2010 15:51:58 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_NONE
	autolearn=unavailable version=3.3.1
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.69)
	(envelope-from <bernhard@HIDDEN>) id 1OnzL5-0005h4-M7
	for bug-coreutils@HIDDEN; Tue, 24 Aug 2010 15:39:20 -0400
Received: from moutng.kundenserver.de ([212.227.126.187]:52638)
	by eggs.gnu.org with esmtp (Exim 4.69)
	(envelope-from <bernhard@HIDDEN>) id 1OnzL5-0005gU-Ap
	for bug-coreutils@HIDDEN; Tue, 24 Aug 2010 15:39:19 -0400
Received: from bs7.localnet (smtp.transinsight.com [87.139.58.41])
	by mrelayeu.kundenserver.de (node=mreu2) with ESMTP (Nemesis)
	id 0MXkkl-1OIzeU0xV9-00WH0W; Tue, 24 Aug 2010 21:39:17 +0200
To: bug-coreutils@HIDDEN
Subject: join: improve paralleles to sort?
From: Bernhard Schiffner <bernhard@HIDDEN>
Date: Tue, 24 Aug 2010 21:39:20 +0200
MIME-Version: 1.0
Content-Type: Text/Plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Message-Id: <201008242139.21283.bernhard@HIDDEN>
X-Provags-ID: V02:K0:TEkFMDhVgPovFA8UTHBJzVlb77uOgOO7GkSCjJFgmsV
	pnhJwGRF0SjIbqTv8Rkdw+V7ygEKc3f/dvI8WZ7L0TIEczOjH+
	iiVzR8woqYKWW7u28sSLxtVEabANnw7Vc+xhwUXGbWX+IrTwz2
	n2jtm2nHRX1oEJAspuUaAGbakek3gSx3AuD6awAHk8SCsbBZ/Q
	tqkXCWwlDxkfxIdZ4/duA==
X-detected-operating-system: by eggs.gnu.org: Genre and OS details not
	recognized.
X-detected-operating-system: by monty-python.gnu.org: GNU/Linux 2.6,
	seldom 2.4 (older, 4)
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Tue, 24 Aug 2010 15:56:38 -0400
X-Spam-Score: -5.3 (-----)

Hi,

Having to work with lists containing large numbers (i.e. file sizes greater
than 2 GB), I have run into problems:
sort -n works,
join doesn't do what a "newcomer" expects.

Because join uses strtoul() before doing the comparison, this is understandable.
("unpairable" is the result.)

Do you see a chance to extend join with a -n parameter for numeric
comparison, as sort already has?

The join source and man page mention combining it with sort. Why not
expand on this? :-)

TIA!

Bernhard




Acknowledgement sent to Bernhard Schiffner <bernhard@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-coreutils@HIDDEN. Full text available.
Report forwarded to owner <at> debbugs.gnu.org, bug-coreutils@HIDDEN:
bug#6903; Package coreutils. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.