GNU bug report logs - #65996
29.1; UCS normalization is wrong

Package: emacs;

Reported by: awrhygty <at> outlook.com

Date: Fri, 15 Sep 2023 12:51:02 UTC

Severity: normal

Found in version 29.1

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#65996; Package emacs. (Fri, 15 Sep 2023 12:51:02 GMT)

Acknowledgement sent to awrhygty <at> outlook.com:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 15 Sep 2023 12:51:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: awrhygty <at> outlook.com
To: bug-gnu-emacs <at> gnu.org
Subject: 29.1; UCS normalization is wrong
Date: Fri, 15 Sep 2023 21:49:38 +0900
UCS normalization is wrong for some characters.

(1) NFD/NFKD decomposition is not done
    U+1112E 𑄮 CHAKMA VOWEL SIGN O
    U+1112F 𑄯 CHAKMA VOWEL SIGN AU
    U+1134B 𑍋 GRANTHA VOWEL SIGN OO
    U+1134C 𑍌 GRANTHA VOWEL SIGN AU
    U+114BB 𑒻 TIRHUTA VOWEL SIGN AI
    U+114BC 𑒼 TIRHUTA VOWEL SIGN O
    U+114BE 𑒾 TIRHUTA VOWEL SIGN AU
    U+115BA 𑖺 SIDDHAM VOWEL SIGN O
    U+115BB 𑖻 SIDDHAM VOWEL SIGN AU
    U+11938 𑤸 DIVES AKURU VOWEL SIGN O

    (let ((s "\U0001112E\U0001112F\U0001134B\U0001134C\
    \U000114BB\U000114BC\U000114BE\U000115BA\U000115BB\U00011938"))
      (require 'ucs-normalize)
      (list (equal s (ucs-normalize-NFD-string s))
            (equal s (ucs-normalize-NFKD-string s))))
    =>(t t)
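
The result above can be cross-checked against Emacs's own Unicode data. The following is a minimal sketch, not part of ucs-normalize.el: the `my/` helper names are hypothetical, and it assumes that the `decomposition` char-code property holds the UnicodeData.txt mapping as a list of code points (prefixed by a tag symbol for compatibility mappings, and consisting of just the character itself when there is no mapping).

```elisp
(require 'seq)
(require 'ucs-normalize)

;; Hypothetical helper: not part of ucs-normalize.el.
(defun my/has-canonical-decomposition-p (char)
  "Return non-nil if CHAR has a canonical decomposition in Emacs's Unicode data."
  (let ((d (get-char-code-property char 'decomposition)))
    (and d
         ;; A leading symbol (e.g. `compat' or `font') marks a
         ;; compatibility mapping, which NFD must not apply.
         (integerp (car d))
         ;; Assumption: a character without a mapping yields (CHAR).
         (not (equal d (list char))))))

;; Hypothetical helper: not part of ucs-normalize.el.
(defun my/nfd-unchanged-chars (chars)
  "Return the members of CHARS that have a canonical decomposition
but are returned unchanged by `ucs-normalize-NFD-string'."
  (seq-filter (lambda (c)
                (and (my/has-canonical-decomposition-p c)
                     (equal (string c)
                            (ucs-normalize-NFD-string (string c)))))
              chars))
```

Calling `(my/nfd-unchanged-chars '(#x1112E #x1112F #x1134B #x1134C #x114BB #x114BC #x114BE #x115BA #x115BB #x11938))` should return nil once the normalization tables cover these characters; on an affected build it would be expected to return the characters listed in (1).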

(2) NFKC/NFKD replacement is not done
    U+1E030..U+1E06D Cyrillic MODIFIER LETTER or SUBSCRIPT
    U+1EE00..U+1EEBB ARABIC MATHEMATICAL *
    U+1FBF0..U+1FBF9 SEGMENTED DIGIT *

    (let* ((f (lambda (cell)
                (apply #'string (number-sequence (car cell) (cdr cell)))))
           (s (mapconcat f '((#x1E030 . #x1E06D)
                             (#x1EE00 . #x1EEBB)
                             (#x1FBF0 . #x1FBF9)))))
      (require 'ucs-normalize)
      (list (equal s (ucs-normalize-NFKC-string s))
            (equal s (ucs-normalize-NFKD-string s))))
    =>(t t)
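
The same kind of cross-check can be sketched for compatibility mappings. Again the `my/` names are hypothetical, and the code assumes that a compatibility mapping shows up in the `decomposition` char-code property as a list whose first element is a tag symbol such as `compat`, `font` or `super`.

```elisp
(require 'ucs-normalize)

;; Hypothetical helper: not part of ucs-normalize.el.
(defun my/compat-decomposition-p (char)
  "Return non-nil if CHAR's decomposition carries a compatibility tag."
  (let ((tag (car (get-char-code-property char 'decomposition))))
    ;; Guard against nil, which is also a symbol.
    (and tag (symbolp tag))))

;; Hypothetical helper: not part of ucs-normalize.el.
(defun my/nfkd-unchanged-in-range (from to)
  "Return the characters in FROM..TO that have a compatibility
decomposition but are returned unchanged by `ucs-normalize-NFKD-string'."
  (let (missed)
    (dolist (c (number-sequence from to) (nreverse missed))
      (when (and (my/compat-decomposition-p c)
                 (equal (string c)
                        (ucs-normalize-NFKD-string (string c))))
        (push c missed)))))
```

For example, `(my/nfkd-unchanged-in-range #x1FBF0 #x1FBF9)` should return nil when the SEGMENTED DIGIT characters are handled, and a non-empty list of code points when they are not. Unassigned code points inside a range are skipped, since they carry no compatibility tag.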


In GNU Emacs 29.1 (build 2, x86_64-w64-mingw32) of 2023-08-02 built on
 AVALON
Windowing system distributor 'Microsoft Corp.', version 10.0.19045
System Description: Microsoft Windows 10 Pro (v10.0.2009.19045.3448)

Configured using:
 'configure --with-modules --without-dbus --with-native-compilation=aot
 --without-compress-install --with-tree-sitter CFLAGS=-O2'

Configured features:
ACL GIF GMP GNUTLS HARFBUZZ JPEG JSON LCMS2 LIBXML2 MODULES NATIVE_COMP
NOTIFY W32NOTIFY PDUMPER PNG RSVG SOUND SQLITE3 THREADS TIFF
TOOLKIT_SCROLL_BARS TREE_SITTER WEBP XPM ZLIB

(NATIVE_COMP present but libgccjit not available)

Important settings:
  value of $LANG: JPN
  locale-coding-system: cp932

Major mode: Lisp Interaction

Minor modes in effect:
  highlight-changes-visible-mode: t
  tooltip-mode: t
  global-eldoc-mode: t
  eldoc-mode: t
  show-paren-mode: t
  electric-indent-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  line-number-mode: t
  indent-tabs-mode: t
  transient-mark-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t

Load-path shadows:
None found.

Features:
(misearch multi-isearch comp comp-cstr warnings icons rx emoji-labels
emoji multisession sqlite transient format-spec edmacro kmacro cl-extra
gnutls network-stream nsm mailalias smtpmail textsec uni-scripts url
url-proxy url-privacy url-expand url-methods url-history url-cookie
generate-lisp-file url-domsuf url-util url-parse auth-source cl-seq
eieio eieio-core cl-macs json map url-vars idna-mapping ucs-normalize
uni-confusable textsec-check cl-print byte-opt gv bytecomp byte-compile
debug backtrace find-func hilit-chg wid-edit thingatpt help-fns
radix-tree help-mode pp shadow sort mail-extr emacsbug message mailcap
yank-media puny dired dired-loaddefs rfc822 mml mml-sec password-cache
epa derived epg rfc6068 epg-config gnus-util text-property-search
time-date subr-x mm-decode mm-bodies mm-encode mail-parse rfc2231
mailabbrev gmm-utils mailheader cl-loaddefs cl-lib sendmail rfc2047
rfc2045 ietf-drums mm-util mail-prsvr mail-utils term/bobcat japan-util
rmc iso-transl tooltip cconv eldoc paren electric uniquify ediff-hook
vc-hooks lisp-float-type elisp-mode mwheel dos-w32 ls-lisp disp-table
term/w32-win w32-win w32-vars term/common-win tool-bar dnd fontset image
regexp-opt fringe tabulated-list replace newcomment text-mode lisp-mode
prog-mode register page tab-bar menu-bar rfn-eshadow isearch easymenu
timer select scroll-bar mouse jit-lock font-lock syntax font-core
term/tty-colors frame minibuffer nadvice seq simple cl-generic
indonesian philippine cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese eucjp-ms cp51932 hebrew greek
romanian slovak czech european ethiopic indian cyrillic chinese
composite emoji-zwj charscript charprop case-table epa-hook
jka-cmpr-hook help abbrev obarray oclosure cl-preloaded button loaddefs
theme-loaddefs faces cus-face macroexp files window text-properties
overlay sha1 md5 base64 format env code-pages mule custom widget keymap
hashtable-print-readable backquote threads w32notify w32 lcms2 multi-tty
make-network-process native-compile emacs)

Memory information:
((conses 16 331760 49630)
 (symbols 48 14840 3)
 (strings 32 66748 8954)
 (string-bytes 1 1357518)
 (vectors 16 55924)
 (vector-slots 8 1637738 128446)
 (floats 8 68 385)
 (intervals 56 7100 2925)
 (buffers 984 18))




Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Sat, 16 Sep 2023 09:23:02 GMT)

Notification sent to awrhygty <at> outlook.com:
bug acknowledged by developer. (Sat, 16 Sep 2023 09:23:02 GMT)

Message #10 received at 65996-done <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: awrhygty <at> outlook.com
Cc: 65996-done <at> debbugs.gnu.org
Subject: Re: bug#65996: 29.1; UCS normalization is wrong
Date: Sat, 16 Sep 2023 12:21:42 +0300
> From: awrhygty <at> outlook.com
> Date: Fri, 15 Sep 2023 21:49:38 +0900
> 
> 
> UCS normalization is wrong for some characters.
> 
> (1) NFD/NFKD decomposition is not done
>     U+1112E 𑄮 CHAKMA VOWEL SIGN O
>     U+1112F 𑄯 CHAKMA VOWEL SIGN AU
>     U+1134B 𑍋 GRANTHA VOWEL SIGN OO
>     U+1134C 𑍌 GRANTHA VOWEL SIGN AU
>     U+114BB 𑒻 TIRHUTA VOWEL SIGN AI
>     U+114BC 𑒼 TIRHUTA VOWEL SIGN O
>     U+114BE 𑒾 TIRHUTA VOWEL SIGN AU
>     U+115BA 𑖺 SIDDHAM VOWEL SIGN O
>     U+115BB 𑖻 SIDDHAM VOWEL SIGN AU
>     U+11938 𑤸 DIVES AKURU VOWEL SIGN O
> 
>     (let ((s "\U0001112E\U0001112F\U0001134B\U0001134C\
>     \U000114BB\U000114BC\U000114BE\U000115BA\U000115BB\U00011938"))
>       (require 'ucs-normalize)
>       (list (equal s (ucs-normalize-NFD-string s))
>             (equal s (ucs-normalize-NFKD-string s))))
>     =>(t t)
> 
> (2) NFKC/NFKD replacement is not done
>     U+1E030..U+1E06D Cyrillic MODIFIER LETTER or SUBSCRIPT
>     U+1EE00..U+1EEBB ARABIC MATHEMATICAL *
>     U+1FBF0..U+1FBF9 SEGMENTED DIGIT *
> 
>     (let* ((f (lambda (cell)
>                 (apply #'string (number-sequence (car cell) (cdr cell)))))
>            (s (mapconcat f '((#x1E030 . #x1E06D)
>                              (#x1EE00 . #x1EEBB)
>                              (#x1FBF0 . #x1FBF9)))))
>       (require 'ucs-normalize)
>       (list (equal s (ucs-normalize-NFKC-string s))
>             (equal s (ucs-normalize-NFKD-string s))))
>     =>(t t)

Thanks, fixed on the emacs-29 branch.

Once again, if (as I'm guessing) you found these problems by examining
the data in ucs-normalize.el, it would have greatly helped if you'd
pointed to the problematic data in your report.  Reverse-engineering
the sources of the problem from the behavior takes time, especially
when the relevant code is not trivial and was written by someone else.




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 14 Oct 2023 11:24:13 GMT)
