GNU bug report logs - #72086
scm_i_utf8_string_hash overruns buffer when len is zero

Previous Next

Package: guile;

Reported by: Rob Browning <rlb <at> defaultvalue.org>

Date: Sat, 13 Jul 2024 00:44:01 UTC

Severity: normal

To reply to this bug, email your comments to 72086 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-guile <at> gnu.org:
bug#72086; Package guile. (Sat, 13 Jul 2024 00:44:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Rob Browning <rlb <at> defaultvalue.org>:
New bug report received and forwarded. Copy sent to bug-guile <at> gnu.org. (Sat, 13 Jul 2024 00:44:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Rob Browning <rlb <at> defaultvalue.org>
To: bug-guile <at> gnu.org
Subject: scm_i_utf8_string_hash overruns buffer when len is zero
Date: Fri, 12 Jul 2024 19:43:18 -0500
[Message part 1 (text/plain, inline)]
The first patch attempts to fix that, and the second is an optimization
when then input is ASCII (since we already have the information we need
to detect that):

[0001-scm_i_utf8_string_hash-don-t-overrun-when-len-is-zer.patch (text/x-diff, inline)]
From 619e3d3afec2c116007d9cb2ad32a500fb32a7dd Mon Sep 17 00:00:00 2001
From: Rob Browning <rlb <at> defaultvalue.org>
Date: Sun, 30 Jun 2024 22:41:40 -0500
Subject: [PATCH 1/2] scm_i_utf8_string_hash: don't overrun when len is zero

When the length is zero, the previous code would include the byte after
the end of the string in the hash.  Fix that (the wide and narrow
hashers also guard against it via "case 0"), and while we're there,
switch to u8_mbtouc since the unsafe variant is now the same (see the
info pages), and don't bother mutating length for the trailing bytes,
since we don't need to.

libguile/hash.c (scm_i_utf8_string_hash): switch to u8_mbtouc, and avoid
overrun when len == 0.
---
 libguile/hash.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index a038a11bf..d92f60df8 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -195,32 +195,34 @@ scm_i_utf8_string_hash (const char *str, size_t len)
   /* Handle most of the key.  */
   while (length > 3)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       a += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       b += u32;
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
       c += u32;
       mix (a, b, c);
       length -= 3;
     }
 
   /* Handle the last 3 elements's.  */
-  ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-  a += u32;
-  if (--length)
+  if (length)
     {
-      ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-      b += u32;
-      if (--length)
+      ustr += u8_mbtouc (&u32, ustr, end - ustr);
+      a += u32;
+      if (length > 1)
         {
-          ustr += u8_mbtouc_unsafe (&u32, ustr, end - ustr);
-          c += u32;
+          ustr += u8_mbtouc (&u32, ustr, end - ustr);
+          b += u32;
+          if (length > 2)
+            {
+              ustr += u8_mbtouc (&u32, ustr, end - ustr);
+              c += u32;
+            }
         }
+      final (a, b, c);
     }
 
-  final (a, b, c);
-
   if (sizeof (unsigned long) == 8)
     ret = (((unsigned long) c) << 32) | b;
   else
-- 
2.43.0

[0002-scm_i_utf8_string_hash-optimize-ASCII.patch (text/x-diff, inline)]
From c6a888888101a893820f38561898e7c0390dd9d2 Mon Sep 17 00:00:00 2001
From: Rob Browning <rlb <at> defaultvalue.org>
Date: Mon, 1 Jul 2024 20:56:57 -0500
Subject: [PATCH 2/2] scm_i_utf8_string_hash: optimize ASCII

Since we already compute the char length, use that to detect all ASCII
strings and handle those the same way we handle latin-1.

libguile/hash.c (scm_i_utf8_string_hash): when byte_len == char_len,
(i.e. fixed-width ASCII) optimize hashing via existing narrow path.
---
 libguile/hash.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/libguile/hash.c b/libguile/hash.c
index d92f60df8..fe950e358 100644
--- a/libguile/hash.c
+++ b/libguile/hash.c
@@ -169,25 +169,29 @@ scm_i_latin1_string_hash (const char *str, size_t len)
 unsigned long 
 scm_i_utf8_string_hash (const char *str, size_t len)
 {
-  const uint8_t *end, *ustr = (const uint8_t *) str;
-  unsigned long ret;
-
-  /* The length of the string in characters.  This name corresponds to
-     Jenkins' original name.  */
-  size_t length;
-
-  uint32_t a, b, c, u32;
-
   if (len == (size_t) -1)
     len = strlen (str);
 
-  end = ustr + len;
-
+  const uint8_t *ustr = (const uint8_t *) str;
   if (u8_check (ustr, len) != NULL)
     /* Invalid UTF-8; punt.  */
     return scm_i_string_hash (scm_from_utf8_stringn (str, len));
 
-  length = u8_mbsnlen (ustr, len);
+  /* The length of the string in characters.  This name corresponds to
+     Jenkins' original name.  */
+  size_t length = u8_mbsnlen (ustr, len);
+
+  if (len == length) // ascii, same as narrow_string_hash above
+    {
+      unsigned long ret;
+      JENKINS_LOOKUP3_HASHWORD2 (str, len, ret);
+      ret >>= 2; /* Ensure that it fits in a fixnum.  */
+      return ret;
+    }
+
+  const uint8_t * const end = ustr + len;
+  uint32_t a, b, c, u32;
+  unsigned long ret;
 
   /* Set up the internal state.  */
   a = b = c = 0xdeadbeef + ((uint32_t)(length<<2)) + 47;
-- 
2.43.0

[Message part 4 (text/plain, inline)]
Thanks
-- 
Rob Browning
rlb @defaultvalue.org and @debian.org
GPG as of 2011-07-10 E6A9 DA3C C9FD 1FF8 C676 D2C4 C0F0 39E9 ED1B 597A
GPG as of 2002-11-03 14DD 432F AE39 534D B592 F9A0 25C8 D377 8C7E 73A4

This bug report was last modified 133 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.