GNU bug report logs - #11073
24.0.94; BIDI-related crash in redisplay with certain byte sequences

Package: emacs;

Reported by: Eli Zaretskii <eliz <at> gnu.org>

Date: Fri, 23 Mar 2012 11:27:02 UTC

Severity: normal

Found in version 24.0.94

Done: Glenn Morris <rgm <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 11073 in the body.
You can then email your comments to 11073 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 23 Mar 2012 11:27:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Eli Zaretskii <eliz <at> gnu.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 23 Mar 2012 11:27:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.0.94; BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 12:55:19 +0200

The person who reported this to me in private email won't go public,
for whatever reasons, so I'm reporting this for them.

The recipe:

 emacs -Q
 C-x C-f bidicrash.txt RET

where the file bidicrash.txt was created with this shell command:

 echo -e "\0365\0205\0264\0225"

(On Windows, use the port of GNU `echo' rather than the built-in shell
command.)
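
Where GNU `echo' is not available, the same bytes can be written from a
small C program.  This is only a sketch: the helper name
write_bidicrash_file is made up for illustration, and it always appends
a Unix-style LF (the newline that `echo' would add), because the file
is opened in binary mode.

```c
#include <assert.h>
#include <stdio.h>

/* The four bytes from the recipe (octal 0365 0205 0264 0225),
   followed by the newline that `echo' appends.  */
static const unsigned char bidicrash_bytes[] = {
  0365, 0205, 0264, 0225, '\n'
};

/* Hypothetical helper: write the test file in binary mode, so that
   no EOL conversion takes place.  Returns 0 on success, -1 on
   failure.  */
int
write_bidicrash_file (const char *name)
{
  assert (name != NULL);

  FILE *fp = fopen (name, "wb");
  if (fp == NULL)
    return -1;

  size_t written = fwrite (bidicrash_bytes, 1, sizeof bidicrash_bytes, fp);
  if (fclose (fp) != 0)
    return -1;

  return written == sizeof bidicrash_bytes ? 0 : -1;
}
```

Calling write_bidicrash_file ("bidicrash.txt") and then visiting the
file as in the recipe should exercise the same code path.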

Emacs crashes; the backtrace is below.

I'm working on fixing this.

Breakpoint 1, w32_abort () at w32fns.c:7196
7196      button = MessageBox (NULL,
(gdb) bt
#0  w32_abort () at w32fns.c:7196
#1  0x012f2e49 in bidi_get_type (ch=4195533, override=NEUTRAL_DIR)
    at bidi.c:108
#2  0x012f4120 in bidi_resolve_explicit_1 (bidi_it=0x82cff8) at bidi.c:1400
#3  0x012f44a8 in bidi_resolve_explicit (bidi_it=0x82cff8) at bidi.c:1529
#4  0x012f4a2f in bidi_resolve_weak (bidi_it=0x82cff8) at bidi.c:1614
#5  0x012f5110 in bidi_resolve_neutral (bidi_it=0x82cff8) at bidi.c:1850
#6  0x012f5a49 in bidi_type_of_next_char (bidi_it=0x82cff8) at bidi.c:2020
#7  0x012f5d6f in bidi_level_of_next_char (bidi_it=0x82cff8) at bidi.c:2133
#8  0x012f630e in bidi_move_to_visually_next (bidi_it=0x82cff8) at bidi.c:2342
#9  0x0116aded in set_iterator_to_next (it=0x82ca40, reseat_p=1)
    at xdisp.c:6898
#10 0x011941c1 in display_line (it=0x82ca40) at xdisp.c:19341
#11 0x0118917a in try_window (window=55991301, pos=..., flags=1)
    at xdisp.c:15977
#12 0x01186a32 in redisplay_window (window=55991301, just_this_one_p=0)
    at xdisp.c:15502
#13 0x011800b8 in redisplay_window_0 (window=55991301) at xdisp.c:13625
#14 0x01033d1b in internal_condition_case_1 (
    bfun=0x1180086 <redisplay_window_0>, arg=55991301, handlers=53234414,
    hfun=0x1180065 <redisplay_window_error>) at eval.c:1553
#15 0x01180055 in redisplay_windows (window=55991301) at xdisp.c:13605
#16 0x0117dff8 in redisplay_internal () at xdisp.c:13182
#17 0x0117b2ea in redisplay () at xdisp.c:12405
#18 0x010087fb in read_char (commandflag=1, nmaps=2, maps=0x82fa30,
    prev_event=53250074, used_mouse_menu=0x82fb5c, end_time=0x0)
    at keyboard.c:2446
#19 0x0101c246 in read_key_sequence (keybuf=0x82fc60, bufsize=30,
    prompt=53250074, dont_downcase_last=0, can_return_switch_frame=1,
    fix_current_buffer=1) at keyboard.c:9326
#20 0x01005aa8 in command_loop_1 () at keyboard.c:1448
#21 0x01033c0b in internal_condition_case (bfun=0x10054b6 <command_loop_1>,
    handlers=53307802, hfun=0x1004ce0 <cmd_error>) at eval.c:1515
#22 0x0100511c in command_loop_2 (ignore=53250074) at keyboard.c:1159
#23 0x010335cb in internal_catch (tag=53305826,
    func=0x10050f9 <command_loop_2>, arg=53250074) at eval.c:1272
#24 0x010050d4 in command_loop () at keyboard.c:1138
#25 0x0100469e in recursive_edit_1 () at keyboard.c:758
#26 0x010049c0 in Frecursive_edit () at keyboard.c:822
#27 0x010027c8 in main (argc=2, argv=0xa32880) at emacs.c:1715
(gdb) up
#1  0x012f2e49 in bidi_get_type (ch=4195533, override=NEUTRAL_DIR)
    at bidi.c:108
108         abort ();
(gdb) up
#2  0x012f4120 in bidi_resolve_explicit_1 (bidi_it=0x82cff8) at bidi.c:1400
1400      type = bidi_get_type (curchar, NEUTRAL_DIR);
(gdb) p bidi_it->charpos
$1 = 2
(gdb) p bidi_it->bytepos
$2 = 4
(gdb) p bidi_it->ch_len
$3 = 2
(gdb) p bidi_it->ch
$4 = 4195533
(gdb) p/x bidi_it->ch
$5 = 0x4004cd
(gdb)

This is on Windows.  On GNU/Linux, or if you change the EOL format of
the file to be Unix-style LF, the last command prints 0x4004ca
instead.  Evidently, Emacs is trying to produce a Unicode codepoint
from bytes that include the newline sequence.
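
Both values can be reproduced by hand.  The sketch below assumes that
the two-byte branch of the character decoder combines the low bits of
the two bytes and adds 0x3FFF80 when the lead byte is below 0xC2; that
formula is an assumption based on a reading of character.h, not
something shown in this report, and the function name is invented.
Applied to the last byte of the sequence (0x95) followed by CR or LF,
it yields exactly the two values above, both larger than 0x3FFFFF,
which as far as I know is the largest valid character code.

```c
#include <assert.h>

/* Hypothetical stand-in for the two-byte branch of the decoder.
   Assumption: lead bytes below 0xC2 are taken to encode raw bytes
   and are shifted up by 0x3FFF80.  */
int
decode_two_bytes (unsigned char b0, unsigned char b1)
{
  /* Only meaningful for first bytes of the form 10xxxxxx or
     110xxxxx.  */
  assert ((b0 & 0x80) && !(b0 & 0x20));

  int c = ((b0 & 0x1F) << 6) | (b1 & 0x3F);
  if (b0 < 0xC2)
    c += 0x3FFF80;
  return c;
}
```

With this, decode_two_bytes (0x95, 0x0D) gives 0x4004cd (the CR of a
DOS-style EOL) and decode_two_bytes (0x95, 0x0A) gives 0x4004ca (LF).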


In GNU Emacs 24.0.94.1 (i386-mingw-nt5.1.2600)
 of 2012-02-27 on HOME-C4E4A596F7
Windowing system distributor `Microsoft Corp.', version 5.1.2600
Configured using:
 `configure --with-gcc (3.4)'

Important settings:
  value of $LC_ALL: nil
  value of $LC_COLLATE: nil
  value of $LC_CTYPE: nil
  value of $LC_MESSAGES: nil
  value of $LC_MONETARY: nil
  value of $LC_NUMERIC: nil
  value of $LC_TIME: nil
  value of $LANG: ENU
  value of $XMODIFIERS: nil
  locale-coding-system: cp1255
  default enable-multibyte-characters: t

Major mode: Mail

Minor modes in effect:
  diff-auto-refine-mode: t
  flyspell-mode: t
  desktop-save-mode: t
  show-paren-mode: t
  display-time-mode: t
  tooltip-mode: t
  mouse-wheel-mode: t
  tool-bar-mode: t
  menu-bar-mode: t
  file-name-shadow-mode: t
  global-font-lock-mode: t
  font-lock-mode: t
  blink-cursor-mode: t
  auto-composition-mode: t
  auto-encryption-mode: t
  auto-compression-mode: t
  temp-buffer-resize-mode: t
  line-number-mode: t
  abbrev-mode: t

Recent input:
a l SPC D E F A U L T _ F A C E _ I D S-SPC i n t a 
c t . ) <down> <return> <return> S o SPC t h i s SPC 
b u g SPC h a s SPC r a h e <backspace> <backspace> 
t h e r SPC l o w SPC p r i o r i t y SPC a t SPC t 
h i s SPC t i m e , SPC a s SPC i t ' s SPC n o t SPC 
a SPC r e g r e s s i o n SPC w r t SPC E m a c s SPC 
2 3 . SPC SPC N e v e r t h e l e s s , <M-backspace> 
I S-SPC w i l l SPC a t SPC t h e SPC v e r y SPC l 
e a s t SPC t r y SPC t o SPC f i g u r e SPC o u t 
SPC w h a t SPC c h a n g e s SPC a r e SPC n e e d 
e d SPC t o SPC m a k e SPC t h i s SPC w o r k SPC 
a s SPC e x p e c t e d . <return> <up> <up> <C-right> 
<C-right> <C-right> <C-right> <C-right> <C-left> T 
i m e SPC p e r m i t t i n g , SPC M-q <down> <down> 
<down> <up> <up> <up> <up> <M-left> <C-home> C-c C-s 
<switch-frame> n n n n p p <switch-frame> M-x e m a 
c s - r e <M-backspace> <M-backspace> r e p o r t <tab> 
<return>

Recent messages:
Mark set [4 times]
Auto-saving...done
byte-code: End of buffer
Auto-saving...done
Mark set
Sending...
Added to d:/usr/eli/rmail/SENT.MAIL
Sending email 
Sending email done
Sending...done

Load-path shadows:
None found.

Features:
(shadow emacsbug etags cc-awk network-stream starttls tls smtpmail
auth-source eieio assoc gnus-util password-cache mailalias sendmail
multi-isearch find-func help-mode view rmailout dabbrev ld-script
dired-x dired tcl nxml-uchnm rng-xsd xsd-regexp rng-cmpct rng-nxml
rng-valid rng-loc rng-uri rng-parse nxml-parse rng-match rng-dt
rng-util rng-pttrn nxml-ns nxml-mode nxml-outln nxml-rap nxml-util
nxml-glyph nxml-enc xmltok sgml-mode org-wl org-w3m org-vm org-rmail
org-mhe org-mew org-irc org-jsinfo org-infojs org-html org-exp ob-exp
org-exp-blocks org-agenda org-info org-gnus org-docview org-bibtex
bibtex org-bbdb org byte-opt warnings bytecomp byte-compile cconv
macroexp advice help-fns advice-preload ob-emacs-lisp ob-tangle ob-ref
ob-lob ob-table org-footnote org-src ob-comint ob-keys ob ob-eval
org-pcomplete pcomplete org-list org-faces org-compat org-entities
org-macs cal-menu calendar cal-loaddefs noutline outline arc-mode
archive-mode diff-mode conf-mode newcomment parse-time sh-script
executable gud easy-mmode comint ansi-color ring generic jka-compr
make-mode flyspell ispell vc-cvs autorevert info vc-bzr cc-mode
cc-fonts cc-guess cc-menus cc-cmds cc-styles cc-align cc-engine
cc-vars cc-defs regexp-opt qp rmailsum rmailmm message format-spec
rfc822 mml mml-sec mm-decode mm-bodies mm-encode mailabbrev gmm-utils
mailheader mail-parse rfc2231 rmail rfc2047 rfc2045 ietf-drums mm-util
mail-prsvr mail-utils desktop server filecache mairix cus-edit
easymenu cus-start cus-load wid-edit saveplace midnight generic-x
paren battery time time-date tooltip ediff-hook vc-hooks
lisp-float-type mwheel dos-w32 disp-table ls-lisp w32-win w32-vars
tool-bar dnd fontset image fringe lisp-mode register page menu-bar
rfn-eshadow timer select scroll-bar mouse jit-lock font-lock syntax
facemenu font-core frame cham georgian utf-8-lang misc-lang vietnamese
tibetan thai tai-viet lao korean japanese hebrew greek romanian slovak
czech european ethiopic indian cyrillic chinese case-table epa-hook
jka-cmpr-hook help simple abbrev minibuffer loaddefs button faces
cus-face files text-properties overlay sha1 md5 base64 format env
code-pages mule custom widget hashtable-print-readable backquote
make-network-process multi-tty emacs)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 23 Mar 2012 13:07:02 GMT) Full text and rfc822 format available.

Message #8 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 14:35:28 +0200

> Date: Fri, 23 Mar 2012 12:55:19 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> 
>  emacs -Q
>  C-x C-f bidicrash.txt RET
> 
> where the file bidicrash.txt was created with this shell command:
> 
>  echo -e "\0365\0205\0264\0225"
> 
> (On Windows, use the port of GNU `echo' rather than the built-in shell
> command.)
> 
> Emacs crashes; the backtrace is below.
> 
> I'm working on fixing this.

Fixed in revision 107665 on the trunk.  It was a pretty basic blunder.

(Repeat after me: FETCH_MULTIBYTE_CHAR followed by CHAR_BYTES is not
always equivalent to STRING_CHAR_AND_LENGTH.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 23 Mar 2012 14:59:01 GMT) Full text and rfc822 format available.

Message #11 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 10:27:39 -0400

> (Repeat after me: FETCH_MULTIBYTE_CHAR followed by CHAR_BYTES is not
> always equivalent to STRING_CHAR_AND_LENGTH.)

Do we really absolutely have to have such a trap?
I mean: is there a good reason why they're not always equivalent?


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 23 Mar 2012 16:30:02 GMT) Full text and rfc822 format available.

Message #14 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 17:58:25 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 11073 <at> debbugs.gnu.org
> Date: Fri, 23 Mar 2012 10:27:39 -0400
> 
> > (Repeat after me: FETCH_MULTIBYTE_CHAR followed by CHAR_BYTES is not
> > always equivalent to STRING_CHAR_AND_LENGTH.)
> 
> Do we really absolutely have to have such a trap?
> I mean: is there a good reason why they're not always equivalent?

They are not equivalent when conversion of the multibyte form into a
character unifies a CJK character that is represented by a codepoint
from one of the private use areas.  This unification is done in
char_string, via a call to MAYBE_UNIFY_CHAR, which converts the
private codepoint into the equivalent codepoint in one of the "normal"
planes.  The UTF-8 encoding of the unified character can be shorter or
longer than the original multibyte sequence.  The problem with the
code I had in bidi.c, viz.:

   character = FETCH_MULTIBYTE_CHAR (bytepos);
   char_len = CHAR_BYTES (character);

is that the value in `character' is not guaranteed to correspond to
the multibyte sequence consumed by FETCH_MULTIBYTE_CHAR, and therefore
that character's length as returned by CHAR_BYTES is not the right
instrument to advance to the next character.

So, I'd say that FETCH_MULTIBYTE_CHAR should only be used for fetching
a single character; if one wants to advance, one should either use
FETCH_CHAR_ADVANCE or (if they are paranoid about speed, like I am)
use 

   character = STRING_CHAR_AND_LENGTH (BYTE_POS_ADDR (bytepos), length);

which returns the length of the consumed sequence, and use that to
advance to the next character position.

And note the other gotcha: that the length returned by
STRING_CHAR_AND_LENGTH is not necessarily the length of the UTF-8
encoding of the character it returns, but rather the length of the
multibyte sequence which was converted to the character.
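
To make the difference concrete, here is a toy model in C.  None of
these functions are Emacs APIs: toy_decode is a plain UTF-8-style
decoder, toy_unify is a stand-in for the unification step, and the
target codepoint 0x4E00 is an arbitrary choice (the real unified
character is not relevant here).  The point is only that the two ways
of stepping disagree once unification changes the character.

```c
#include <assert.h>

/* Toy model only; these are not Emacs functions.  */

/* Length of the UTF-8-style encoding of character C.  */
int
toy_char_bytes (int c)
{
  assert (c >= 0 && c < 0x200000);
  return c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
}

/* Decode the sequence starting at P and store the number of bytes
   consumed in *LEN.  P must point at the first byte of a sequence.  */
int
toy_decode (const unsigned char *p, int *len)
{
  if (p[0] < 0x80)
    {
      *len = 1;
      return p[0];
    }
  if ((p[0] & 0xE0) == 0xC0)
    {
      *len = 2;
      return ((p[0] & 0x1F) << 6) | (p[1] & 0x3F);
    }
  if ((p[0] & 0xF0) == 0xE0)
    {
      *len = 3;
      return ((p[0] & 0x0F) << 12) | ((p[1] & 0x3F) << 6) | (p[2] & 0x3F);
    }
  assert ((p[0] & 0xF8) == 0xF0);
  *len = 4;
  return (((p[0] & 0x07) << 18) | ((p[1] & 0x3F) << 12)
          | ((p[2] & 0x3F) << 6) | (p[3] & 0x3F));
}

/* Hypothetical unification: map one character above 0x110000 to a
   made-up codepoint whose encoding is shorter.  */
int
toy_unify (int c)
{
  return c == 0x145D15 ? 0x4E00 : c;
}

/* The mistake: fetch the (unified) character, then advance by the
   length of that character's own encoding.  */
int
step_by_char_bytes (const unsigned char *p)
{
  int ignored;
  int c = toy_unify (toy_decode (p, &ignored));
  return toy_char_bytes (c);
}

/* The fix: advance by the length of the sequence that was consumed.  */
int
step_by_consumed_length (const unsigned char *p)
{
  int len;
  (void) toy_unify (toy_decode (p, &len));
  return len;
}
```

For the bytes F5 85 B4 95 from the recipe, step_by_char_bytes says 3
and lands on the trailing 0x95, while step_by_consumed_length says 4
and lands on the newline.  The same toy also shows the second gotcha:
the consumed length (4) is not the encoded length of the character
that was returned (3).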

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 23 Mar 2012 18:06:02 GMT) Full text and rfc822 format available.

Message #17 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 13:34:45 -0400

> They are not equivalent when conversion of the multibyte form into a
> character unifies a CJK character that is represented by a codepoint
> from one of the private use areas.

Why do we need this unification?  Or rather, why do we need multiple
codepoints, which then forces us to unify them?


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 23 Mar 2012 19:18:02 GMT) Full text and rfc822 format available.

Message #20 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>, Kenichi Handa <handa <at> m17n.org>
Cc: 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 23 Mar 2012 20:46:36 +0200

> From: Stefan Monnier <monnier <at> iro.umontreal.ca>
> Cc: 11073 <at> debbugs.gnu.org
> Date: Fri, 23 Mar 2012 13:34:45 -0400
> 
> > They are not equivalent when conversion of the multibyte form into a
> > character unifies a CJK character that is represented by a codepoint
> > from one of the private use areas.
> 
> Why do we need this unification?  Or rather, why do we need multiple
> codepoints, which then forces us to unify them?

That's something Handa-san (CC'ed) will be able to explain much better
than I ever could.  AFAIU, there are good reasons to have some CJK
characters on separate codepoints, because they need to be treated
differently from their Unicode codepoints (perhaps a different choice
of font to display them?)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Mon, 26 Mar 2012 08:18:02 GMT) Full text and rfc822 format available.

Message #23 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 11073 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Mon, 26 Mar 2012 16:45:56 +0900

In article <837gybupdf.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> > Why do we need this unification?  Or rather, why do we need multiple
> > codepoints, which then forces us to unify them?

> That's something Handa-san (CC'ed) will be able to explain much better
> than I ever could.

It's a long story.  When I designed emacs-unicode (the
version before it was merged into the trunk, more than 10
years ago), the unification maps of CJK charsets to Unicode
were not stable.  In addition, there were various conflicting
policies on which character to unify with which.  One reason
for this confusion was that Unicode itself didn't define
mappings to/from such CJK charsets (JIS, GB, KSC).

The unification problem is not limited to ideographic
characters.  Many CJK charsets contain, for instance,
full-width versions of Greek characters, but Unicode doesn't
distinguish them from single-width versions (though Unicode
has full-width versions of 'A'..'Z', etc).  There were people
who wanted to distinguish full-width Greek chars from
single-width chars.

There were also people who had iso-2022-7bit files that
distinguish characters of the GB charset from those of the
JIS charset.  To edit such a file and write it back
unchanged, one has to disable unification of one of GB
and JIS (or both of them).

So, I decided at that time to give each CJK charset unique
code space (above #x110000) in Emacs, and allow users to
freely unify/disunify them to Unicode code space (below
#x110000) by giving the function unify-charset.

FYI, http://www.unicode.org/reports/tr38/ describes some of
the difficulties of such mappings.

> AFAIU, there are good reasons to have some CJK
> characters on separate codepoints, because they need to be treated
> differently from their Unicode codepoints (perhaps a different choice
> of font to display them?)

That was one reason, but the current code pays attention to
the `charset' text property of each character to select a
proper font.

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Mon, 26 Mar 2012 12:56:02 GMT) Full text and rfc822 format available.

Message #26 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Kenichi Handa <handa <at> m17n.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Mon, 26 Mar 2012 08:23:58 -0400

> So, I decided at that time to give each CJK charset unique
> code space (above #x110000) in Emacs, and allow users to
> freely unify/disunify them to Unicode code space (below
> #x110000) by giving the function unify-charset.

I understand this part.  The part I don't understand is why we do
unification when reading a char from the buffer's text.  That is: why
unify chars in `int' (or Lisp_Object) form but not in the
internal-utf-8 representation?

I would expect the unification to happen during encoding/decoding
only, and not during internal conversions from byte-sequence to int.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Thu, 29 Mar 2012 05:52:03 GMT) Full text and rfc822 format available.

Message #29 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Thu, 29 Mar 2012 14:19:50 +0900

In article <jwviphrft9z.fsf-monnier+INBOX <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> I understand this part.  The part I don't understand is why we do
> unification when reading a char from the buffer's text.  That is: why
> unify chars in `int' (or Lisp_Object) form but not in the
> internal-utf-8 representation?

> I would expect the unification to happen during encoding/decoding

Usually, yes.  But as far as there is a code space in high
area for a CJK charset, it is unavoidable to have a
buffer/string that contains a character represented by a
byte sequence in that high area as the test case of
Bug#11073.  And, as "unification" means to treat such a
character the same way as the unified character, I thought
they both have the same character code.

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Thu, 29 Mar 2012 16:36:02 GMT) Full text and rfc822 format available.

Message #32 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> IRO.UMontreal.CA>
To: Kenichi Handa <handa <at> m17n.org>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Thu, 29 Mar 2012 12:04:22 -0400

>> I understand this part.  The part I don't understand is why we do
>> unification when reading a char from the buffer's text.  That is: why
>> unify chars in `int' (or Lisp_Object) form but not in the
>> internal-utf-8 representation?

>> I would expect the unification to happen during encoding/decoding

> Usually, yes.  But as far as there is a code space in high
> area for a CJK charset, it is unavoidable to have a
> buffer/string that contains a character represented by a
> byte sequence in that high area as the test case of
> Bug#11073.  And, as "unification" means to treat such a
> character the same way as the unified character, I thought
> they both have the same character code.

Since there are two internal byte-sequence representations, I don't see
any good reason why we shouldn't have 2 internal int representations.
I.e. if unification failed for the byte-sequence (which might be the
result of a bug, for all I know), we may as well keep them non-unified
in the int representation.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Tue, 03 Apr 2012 02:23:02 GMT) Full text and rfc822 format available.

Message #35 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Tue, 03 Apr 2012 11:22:23 +0900

In article <jwvvcln5ra4.fsf-monnier+INBOX <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> > Usually, yes.  But as far as there is a code space in high
> > area for a CJK charset, it is unavoidable to have a
> > buffer/string that contains a character represented by a
> > byte sequence in that high area as the test case of
> > Bug#11073.  And, as "unification" means to treat such a
> > character the same way as the unified character, I thought
> > they both have the same character code.

> Since there are two internal byte-sequence representations, I don't see
> any good reason why we shouldn't have 2 internal int representations.
> I.e. if unification failed for the byte-sequence (which might be the
> result of a bug, for all I know), we may as well keep them non-unified
> in the int representation.

Please note that not all characters in the code-space of a
CJK charset are unified.  For instance, Big5 has its own
PUA (private use area), and characters in PUA are not
unified by default.  So, if Emacs reads a Big5 file that
contains PUA chars, those chars stay in high-area.   Then,
one can provide his own unification map that also maps PUA
chars to some Unicode chars as this:
  (unify-charset 'big5 "MyBig5.map")
After this, I thought that previously read PUA chars staying
in the high-area should be treated as the corresponding
Unicode chars (in displaying, search, etc).

One may find some bug in his map or find another map is
better.  Then he can do this again:
  (unify-charset 'big5 "MyNewBig5.map")

The current design was to enable such a scenario.
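
The scenario can be pictured with a toy model in C.  This is not Emacs
code: toy_unify_charset and toy_read_char are invented names, the map
is a plain array instead of a map file, and the codepoints are
arbitrary.  The stored character never changes; only the map consulted
when the character is read does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only; these are not Emacs functions.  */

/* Characters at or above this value live in a charset's own code
   space; characters below it are in the Unicode code space.  */
#define TOY_UNICODE_LIMIT 0x110000

struct toy_unify_entry
{
  int from;   /* character in the high area */
  int to;     /* character it is unified with */
};

static const struct toy_unify_entry *toy_map;
static size_t toy_map_len;

/* Install a unification map, replacing any earlier one.  */
void
toy_unify_charset (const struct toy_unify_entry *map, size_t len)
{
  assert (map != NULL || len == 0);
  toy_map = map;
  toy_map_len = len;
}

/* Convert a stored character to the value the rest of the program
   sees, consulting whatever map is installed right now.  */
int
toy_read_char (int stored)
{
  if (stored < TOY_UNICODE_LIMIT)
    return stored;
  for (size_t i = 0; i < toy_map_len; i++)
    if (toy_map[i].from == stored)
      return toy_map[i].to;
  return stored;   /* not unified: stays in the high area */
}
```

A character stored before any map is installed reads back as itself;
after toy_unify_charset is called with one map it reads back as that
map's target, and after a second call with a corrected map it reads
back as the new target, all without re-reading the file.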

Of course, there will be an opinion that such a
functionality is too much for Emacs, and when one changes
any unification map, he must re-read a file, process-output,
mail etc.

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Tue, 03 Apr 2012 04:23:02 GMT) Full text and rfc822 format available.

Message #38 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Kenichi Handa <handa <at> m17n.org>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Tue, 03 Apr 2012 00:22:32 -0400

>> > Usually, yes.  But as far as there is a code space in high
>> > area for a CJK charset, it is unavoidable to have a
>> > buffer/string that contains a character represented by a
>> > byte sequence in that high area as the test case of
>> > Bug#11073.  And, as "unification" means to treat such a
>> > character the same way as the unified character, I thought
>> > they both have the same character code.

>> Since there are two internal byte-sequence representations, I don't see
>> any good reason why we shouldn't have 2 internal int representations.
>> I.e. if unification failed for the byte-sequence (which might be the
>> result of a bug, for all I know), we may as well keep them non-unified
>> in the int representation.

> Please note that not all characters in the code-space of a
> CJK charset are unified.  For instance, Big5 has its own
> PUA (private use area), and characters in PUA are not
> unified by default.  So, if Emacs reads a Big5 file that
> contains PUA chars, those chars stay in high-area.   Then,
> one can provide his own unification map that also maps PUA
> chars to some Unicode chars as this:
>   (unify-charset 'big5 "MyBig5.map")
> After this, I thought that previously read PUA chars staying
> in the high-area should be treated as the corresponding
> Unicode chars (in displaying, search, etc).

But again, this unification takes place during decoding.  Whereas what
I'm talking about takes place when reading the internal utf-8
representation, which should be already unified.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Tue, 03 Apr 2012 05:56:02 GMT) Full text and rfc822 format available.

Message #41 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Tue, 03 Apr 2012 14:55:11 +0900

In article <jwvwr5xwimc.fsf-monnier+INBOX <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:
> > Please note that not all characters in the code-space of a
> > CJK charset are unified.  For instance, Big5 has its own
> > PUA (private use area), and characters in PUA are not
> > unified by default.  So, if Emacs reads a Big5 file that
> > contains PUA chars, those chars stay in high-area.   Then,
> > one can provide his own unification map that also maps PUA
> > chars to some Unicode chars as this:
> >   (unify-charset 'big5 "MyBig5.map")
> > After this, I thought that previously read PUA chars staying
> > in the high-area should be treated as the corresponding
> > Unicode chars (in displaying, search, etc).

> But again, this unification takes place during decoding.

No.  In the above scenario, PUA chars read before the call
of unify-charset are not unified.  The unification should
take place after the call of unify-charset.

> Whereas what
> I'm talking about takes place when reading the internal utf-8
> representation, which should be already unified.

I'm talking about exactly that case.

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Tue, 03 Apr 2012 13:04:02 GMT) Full text and rfc822 format available.

Message #44 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Kenichi Handa <handa <at> m17n.org>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Tue, 03 Apr 2012 09:02:52 -0400

>> > Please note that not all characters in the code-space of a
>> > CJK charset are unified.  For instance, Big5 has its own
>> > PUA (private use area), and characters in PUA are not
>> > unified by default.  So, if Emacs reads a Big5 file that
>> > contains PUA chars, those chars stay in high-area.   Then,
>> > one can provide his own unification map that also maps PUA
>> > chars to some Unicode chars as this:
>> >   (unify-charset 'big5 "MyBig5.map")
>> > After this, I thought that previously read PUA chars staying
>> > in the high-area should be treated as the corresponding
>> > Unicode chars (in displaying, search, etc).
> No.  In the above scenario, PUA chars read before the call
> of unify-charset are not unified.  The unification should
> take place after the call of unify-charset.

But isn't this (unify-charset 'big5 "MyBig5.map") performed in the
.emacs?  Is it really important to support adding unification rules
after decoding took place?  If so, why?  And also, what about
removing unification rules after decoding?


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Wed, 04 Apr 2012 00:08:01 GMT) Full text and rfc822 format available.

Message #47 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Wed, 04 Apr 2012 09:07:02 +0900

In article <jwvr4w5vuka.fsf-monnier+INBOX <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> But isn't this (unify-charset 'big5 "MyBig5.map") performed in the
> .emacs?

Usually yes.  But, in that case, if .emacs is encoded in
Big5 and it contains some Big5 PUA chars, they are not
unified while loading .emacs.

> Is it really important to support adding unification rules
> after decoding took place?  If so, why?

As I wrote, I can't tell how important it is.  It may be
very important for those (but I guess very few) who need the
above operation, but not important for the majority.

I'm ok to remove such a feature if the maintainers decide
that.

> And also, what about removing unification rules after
> decoding?

When one tells Emacs to unify some chars, and then reads a
file containing those chars, there's no way to dis-unify
them.

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Wed, 04 Apr 2012 01:18:01 GMT) Full text and rfc822 format available.

Message #50 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Kenichi Handa <handa <at> m17n.org>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Tue, 03 Apr 2012 21:17:16 -0400

>> But isn't this (unify-charset 'big5 "MyBig5.map") performed in the .emacs?
> Usually yes.  But, in that case, if .emacs is encoded in
> Big5 and it contains some Big5 PUA chars, they are not
> unified while loading .emacs.

Hmm... that doesn't sound like it would be a very common problem, but
it's not completely hypothetical either.  Would this problem also come
up in a BIG5 locale?  If not, then I think we can ignore this problem.

>> Is it really important to support adding unification rules
>> after decoding took place?  If so, why?
> As I wrote, I can't tell how important it is.  It may be very
> important for those (but I guess very few) who need the above
> operation, but not important for the majority.
> I'm ok to remove such a feature if the maintainers decide that.

The problem with it is that it costs all the time for everyone, and it
makes the behavior of some macros subtly more complex/different and
hence adds a nasty complexity.
So if at all possible, I'd rather find a way to remove it (not for
24.1, obviously).

>> And also, what about removing unification rules after decoding?
> When one tells Emacs to unify some chars, and then reads a file
> containing those chars, there's no way to dis-unify them.

But I guess this problem is even less common.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 06 Apr 2012 01:14:02 GMT) Full text and rfc822 format available.

Message #53 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> m17n.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 06 Apr 2012 10:13:12 +0900

In article <jwvd37os3k5.fsf-monnier+INBOX <at> gnu.org>, Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

>>> But isn't this (unify-charset 'big5 "MyBig5.map") performed in the .emacs?
> > Usually yes.  But, in that case, if .emacs is encoded in
> > Big5 and it contains some Big5 PUA chars, they are not
> > unified while loading .emacs.

> Hmm... that doesn't sound like it would be a very common problem, but
> it's not completely hypothetical either.  Would this problem also come
> up in a BIG5 locale?  If not, then I think we can ignore this problem.

If it ever comes up, it is mostly for people in a BIG5 locale.
But, please note that the reason I used BIG5 as an example
is just because that charset name is short.  Almost all CJK
charsets have PUA (officially or just by convention).

>>> Is it really important to support adding unification rules
>>> after decoding took place?  If so, why?
> > As I wrote, I can't tell how important it is.  It may be very
> > important for those (but I guess very few) who need the above
> > operation, but not important for the majority.
> > I'm ok to remove such a feature if the maintainers decide that.

> The problem with it is that it costs all the time for everyone, and it

I believe the extra cost is almost negligible because such
(dynamic) unification happens only for characters that are
greater than MAX_UNICODE_CHAR.
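
To illustrate the cost argument, here is a minimal sketch in C.  It
assumes MAX_UNICODE_CHAR is 0x10FFFF and MAX_CHAR is 0x3FFFFF, and the
names `maybe_unify' and `lookup_unification', as well as the one-entry
table, are hypothetical stand-ins rather than the actual Emacs macros:

```c
#include <stddef.h>

/* Assumed values; the table below is purely illustrative.  */
#define MAX_UNICODE_CHAR 0x10FFFF
#define MAX_CHAR 0x3FFFFF

struct unify_entry { int from; int to; };

/* Hypothetical unification table: one non-Unicode code mapped to a
   Unicode private-use code point.  */
static const struct unify_entry unify_table[] = {
  { 0x110000, 0xE000 },
};

/* Slow path: search the table; return C unchanged if no rule exists.  */
static int
lookup_unification (int c)
{
  size_t n = sizeof unify_table / sizeof unify_table[0];
  for (size_t i = 0; i < n; i++)
    if (unify_table[i].from == c)
      return unify_table[i].to;
  return c;
}

/* Fast path: a character within the Unicode range costs only one
   comparison; only larger codes reach the table lookup.  */
static int
maybe_unify (int c)
{
  if (c <= MAX_UNICODE_CHAR)
    return c;
  return lookup_unification (c);
}
```

Under these assumptions, ordinary Unicode text never reaches the
lookup; only codes above MAX_UNICODE_CHAR pay for it.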

> makes the behavior of some macros subtly more complex/different and
> hence adds a nasty complexity.

That's mostly because I didn't write proper comments on
the relevant macros, and didn't provide better macros for
a case such as Eli's.

> So if at all possible, I'd rather find a way to remove it (not for
> 24.1, obviously).

I myself think that it doesn't cause much of a problem even if
we keep this functionality, but I also won't raise a strong
objection to removing it for 24.2.

>>> And also, what about removing unification rules after decoding?
> > When one tells Emacs to unify some chars, and then reads a file
> > containing those chars, there's no way to dis-unify them.

> But I guess this problem is even less common.

Yes.  That's why I didn't implement such a feature.

---
Kenichi Handa
handa <at> m17n.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Fri, 06 Apr 2012 13:16:02 GMT) Full text and rfc822 format available.

Message #56 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Kenichi Handa <handa <at> m17n.org>
Cc: 11073 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Fri, 06 Apr 2012 16:13:33 +0300

> From: Kenichi Handa <handa <at> m17n.org>
> Cc: eliz <at> gnu.org, 11073 <at> debbugs.gnu.org
> Date: Fri, 06 Apr 2012 10:13:12 +0900
> 
> > makes the behavior of some macros subtly more complex/different and
> > hence adds a nasty complexity.
> 
> That's mostly because I didn't write proper comments on
> the relevant macros, and didn't provide better macros for
> a case such as Eli's.

I added comments to the relevant macros (as trunk revision 107781) to
warn about these subtleties.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#11073; Package emacs. (Mon, 09 Apr 2012 05:42:02 GMT) Full text and rfc822 format available.

Message #59 received at 11073 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa.kenichi <at> aist.go.jp>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 11073 <at> debbugs.gnu.org, monnier <at> iro.umontreal.ca
Subject: Re: bug#11073: 24.0.94;
	BIDI-related crash in redisplay with certain byte sequences
Date: Mon, 09 Apr 2012 13:14:43 +0900

In article <837gxtatqa.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
> > That's mostly because I didn't write proper comments on
> > the relevant macros, and didn't provide better macros for
> > a case such as Eli's.

> I added comments to the relevant macros (as trunk revision 107781) to
> warn about these subtleties.

Thank you!!

---
Kenichi Handa
handa <at> m17n.org

bug closed, send any further explanations to 11073 <at> debbugs.gnu.org and Eli Zaretskii <eliz <at> gnu.org> Request was from Glenn Morris <rgm <at> gnu.org> to control <at> debbugs.gnu.org. (Sun, 17 Feb 2013 03:25:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 17 Mar 2013 11:24:12 GMT) Full text and rfc822 format available.

This bug report was last modified 12 years and 108 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
