GNU bug report logs - #13084
boyer_moore crashes with certain characters in the case table

Package: emacs;

Reported by: Juri Linkov <juri <at> jurta.org>

Date: Wed, 5 Dec 2012 00:37:02 UTC

Severity: normal

Done: Juri Linkov <juri <at> jurta.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13084 in the body.
You can then email your comments to 13084 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 05 Dec 2012 00:37:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Juri Linkov <juri <at> jurta.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 05 Dec 2012 00:37:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: bug-gnu-emacs <at> gnu.org
Subject: boyer_moore crashes with certain characters in the case table
Date: Wed, 05 Dec 2012 02:34:39 +0200

The minimal reproducible recipe for crashes in boyer_moore noticed in bug#13041:

1. emacs -Q

2. Eval in *scratch*:

(let ((table (standard-case-table)) canon)
  (setq canon (copy-sequence table))
  (aset canon #xff59 ?y)
  (set-char-table-extra-slot table 1 canon)
  (set-char-table-extra-slot table 2 nil)
  (set-standard-case-table table))

3. Start an activity that includes a search, e.g. `C-x 8 RET TAB'

The crash in boyer_moore is caused by fullwidth characters like #xff59
whose Unicode properties are:

  name: FULLWIDTH LATIN SMALL LETTER Y
  decomposition: (wide 121) (wide 'y')

However, the crash doesn't occur when the same fullwidth characters are
set to their downcase counterparts in lisp/international/characters.el:

  ;; Fullwidth Latin
  (setq c #xff21)
  (while (<= c #xff3a)
    (set-case-syntax-pair c (+ c #x20) tbl)
    (modify-category-entry c ?l)
    (modify-category-entry (+ c #x20) ?l)
    (setq c (1+ c)))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Tue, 11 Dec 2012 15:39:02 GMT) Full text and rfc822 format available.

Message #8 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>, Kenichi Handa <handa <at> gnu.org>
Cc: 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Tue, 11 Dec 2012 17:37:09 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Date: Wed, 05 Dec 2012 02:34:39 +0200
> 
> The minimal reproducible recipe for crashes in boyer_moore noticed in bug#13041:
> 
> 1. emacs -Q
> 
> 2. Eval in *scratch*:
> 
> (let ((table (standard-case-table)) canon)
>   (setq canon (copy-sequence table))
>   (aset canon #xff59 ?y)
>   (set-char-table-extra-slot table 1 canon)
>   (set-char-table-extra-slot table 2 nil)
>   (set-standard-case-table table))
> 
> 3. Start an activity that includes a search, e.g. `C-x 8 RET TAB'

Thanks.  I think i fixed this (revision 111021 on the emacs-24
branch), please test.

In addition, I'd suggest that Handa-san (or someone else) takes a good
look at the code that sets up the simple_translate table in
boyer_moore, because the constants there, like 0200 and 0x3F, and all
the talk about characters that belong "to the same charset and row"
smell of pre-Unicode (a.k.a. "MULE") representation of characters.
For now, I disabled boyer_moore for unibyte characters beyond 160,
because my reading of the code is that simple_translate and the
supporting code cannot handle that.  Maybe I'm wrong.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Tue, 11 Dec 2012 23:25:01 GMT) Full text and rfc822 format available.

Message #11 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Kenichi Handa <handa <at> gnu.org>, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 01:17:04 +0200

> I think i fixed this (revision 111021 on the emacs-24 branch),
> please test.

Thanks, there are no more crashes when using code from
http://debbugs.gnu.org/13041#41

Does this mean there are no more obstacles to filling a translation table
for ignoring equivalence with all character mappings according to the
`decomposition' property?  This would be the first step in this direction.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 12 Dec 2012 03:57:01 GMT) Full text and rfc822 format available.

Message #14 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: handa <at> gnu.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 05:55:04 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Cc: Kenichi Handa <handa <at> gnu.org>,  13084 <at> debbugs.gnu.org
> Date: Wed, 12 Dec 2012 01:17:04 +0200
> 
> > I think i fixed this (revision 111021 on the emacs-24 branch),
> > please test.
> 
> Thanks, there are no more crashes when using code from
> http://debbugs.gnu.org/13041#41
> 
> Does this mean there are no more obstacles to filling a translation table
> for ignoring equivalence with all character mappings according to the
> `decomposition' property?  This would be the first step in this direction.

I'm not sure I understand what you are asking.  Please show more
details.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 12 Dec 2012 09:36:02 GMT) Full text and rfc822 format available.

Message #17 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: handa <at> gnu.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 11:27:50 +0200

>> Does this mean there are no more obstacles to filling a translation table
>> for ignoring equivalence with all character mappings according to the
>> `decomposition' property?  This would be the first step in this direction.
>
> I'm not sure I understand what you are asking.  Please show more details.

There is confusion with the word `equivalence'.  Currently there
exists the case equivalence table in the case table (`case_eqv_table').
Implementing a diacritic search in bug#13041 requires adding a new
similar table.  I don't know what would be a good name:
`decomposition_eqv_table' or `normalization_eqv_table' or something better.

I'm unfamiliar with the details of `search_buffer', but in principle
using two tables in the macro `TRANSLATE' could implement a diacritic
search where at the first step the character will be translated using
`decomposition_eqv_table', and after that the resulting character
will be translated using `case_eqv_table'.

So the dataflow to get the canonical character will be Á -> A -> a.
If `case-fold-search' is nil, then Á -> A.  If a new variable
`decomposition-search' (or `normalized-search') is nil then Á -> á.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 12 Dec 2012 10:23:01 GMT) Full text and rfc822 format available.

Message #20 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 11:21:36 +0100

> So the dataflow to get the canonical character will be Á -> A -> a.
> If `case-fold-search' is nil, then Á -> A.  If a new variable
> `decomposition-search' (or `normalized-search') is nil then Á -> á.

Any such table should allow handling asymmetric searches: That is,
searching for "ába" should match "ába" "ábà" and "ábá" but not "aba" or
"àbá".  Can we do that?

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 12 Dec 2012 10:39:01 GMT) Full text and rfc822 format available.

Message #23 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 12:31:57 +0200

>> So the dataflow to get the canonical character will be Á -> A -> a.
>> If `case-fold-search' is nil, then Á -> A.  If a new variable
>> `decomposition-search' (or `normalized-search') is nil then Á -> á.
>
> Any such table should allow handling asymmetric searches: That is,
> searching for "ába" should match "ába" "ábà" and "ábá" but not "aba" or
> "àbá".  Can we do that?

IIUC what you mean is something like `search-upper-case'
where upper case chars disable case fold searching,
so "Aba" should match "Aba" and "AbA" but not "aba".

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 12 Dec 2012 12:45:02 GMT) Full text and rfc822 format available.

Message #26 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: Eli Zaretskii <eliz <at> gnu.org>, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 13:43:25 +0100

> IIUC what you mean is something like `search-upper-case'
> where upper case chars disable case fold searching,
> so "Aba" should match "Aba" and "AbA" but not "aba".

Yes. I think that's a very good explanation in Emacs terms.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Wed, 12 Dec 2012 16:49:01 GMT) Full text and rfc822 format available.

Message #29 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: handa <at> gnu.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Wed, 12 Dec 2012 18:47:16 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Cc: handa <at> gnu.org,  13084 <at> debbugs.gnu.org
> Date: Wed, 12 Dec 2012 11:27:50 +0200
> 
> >> Does this mean there are no more obstacles to filling a translation table
> >> for ignoring equivalence with all character mappings according to the
> >> `decomposition' property?  This would be the first step in this direction.
> >
> > I'm not sure I understand what you are asking.  Please show more details.
> 
> There is confusion with the word `equivalence'.  Currently there
> exists the case equivalence table in the case table (`case_eqv_table').
> Implementing a diacritic search in bug#13041 requires adding a new
> similar table.  I don't know what would be a good name:
> `decomposition_eqv_table' or `normalization_eqv_table' or something better.
> 
> I'm unfamiliar with the details of `search_buffer', but in principle
> using two tables in the macro `TRANSLATE' could implement a diacritic
> search where at the first step the character will be translated using
> `decomposition_eqv_table', and after that the resulting character
> will be translated using `case_eqv_table'.
> 
> So the dataflow to get the canonical character will be Á -> A -> a.
> If `case-fold-search' is nil, then Á -> A.  If a new variable
> `decomposition-search' (or `normalized-search') is nil then Á -> á.

OK, all this is now clear and agreed.  So what did you mean by "no
more obstacles" above?  The obstacles I see is that case tables aren't
up to the job because they don't support ignoring of characters, and
the code in search.c cannot handle ignoring even if the table did
support that.  These obstacles still stand.

Reply sent to Juri Linkov <juri <at> jurta.org>:
You have taken responsibility. (Wed, 12 Dec 2012 23:11:01 GMT) Full text and rfc822 format available.

Notification sent to Juri Linkov <juri <at> jurta.org>:
bug acknowledged by developer. (Wed, 12 Dec 2012 23:11:01 GMT) Full text and rfc822 format available.

Message #34 received at 13084-done <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: handa <at> gnu.org, 13084-done <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Thu, 13 Dec 2012 01:05:09 +0200

> So what did you mean by "no more obstacles" above?

By obstacles I meant crashes that you fixed.
Thanks for that.  I'm closing this bug.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Thu, 13 Dec 2012 13:43:02 GMT) Full text and rfc822 format available.

Message #37 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Thu, 13 Dec 2012 22:39:29 +0900

In article <831uewa9cq.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> In addition, I'd suggest that Handa-san (or someone else) takes a good
> look at the code that sets up the simple_translate table in
> boyer_moore, because the constants there, like 0200 and 0x3F, and all
> the talk about characters that belong "to the same charset and row"
> smell of pre-Unicode (a.k.a. "MULE") representation of characters.
> For now, I disabled boyer_moore for unibyte characters beyond 160,
> because my reading of the code is that simple_translate and the
> supporting code cannot handle that.  Maybe I'm wrong.

I have not yet checked the code, but what I remember is that
search_buffer checks the search string and decides which to
use; boyer_moore or simple_search.  If all equivalent
characters of all non-ASCII characters in the search string
are in the same character group, we can use boyer_moore.
Here, A and B belongs to the same character group iff A and
B has the same multibyte sequence except for the last byte.
In this condition, we should be able to use the table
simple_translate.

---
Kenichi Handa
handa <at> gnu.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Thu, 13 Dec 2012 17:34:02 GMT) Full text and rfc822 format available.

Message #40 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Kenichi Handa <handa <at> gnu.org>
Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Thu, 13 Dec 2012 19:32:08 +0200

> From: Kenichi Handa <handa <at> gnu.org>
> Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
> Date: Thu, 13 Dec 2012 22:39:29 +0900
> 
> I have not yet checked the code, but what I remember is that
> search_buffer checks the search string and decides which to
> use; boyer_moore or simple_search.  If all equivalent
> characters of all non-ASCII characters in the search string
> are in the same character group, we can use boyer_moore.

Yes, that's my reading of the code as well.

> Here, A and B belongs to the same character group iff A and
> B has the same multibyte sequence except for the last byte.
> In this condition, we should be able to use the table
> simple_translate.

OK, then maybe just the comments need to be fixed.  They shouldn't
talk about "charset" and "row", which are undefined in Unicode Emacs.
They should instead use terminology that correspond to UTF-8 multibyte
representation of characters we use today.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Sat, 15 Dec 2012 13:22:01 GMT) Full text and rfc822 format available.

Message #43 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> gnu.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Sat, 15 Dec 2012 22:17:17 +0900

In article <83obhxoo2v.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:

> > Here, A and B belongs to the same character group iff A and
> > B has the same multibyte sequence except for the last byte.
> > In this condition, we should be able to use the table
> > simple_translate.

> OK, then maybe just the comments need to be fixed.  They shouldn't
> talk about "charset" and "row", which are undefined in Unicode Emacs.
> They should instead use terminology that correspond to UTF-8 multibyte
> representation of characters we use today.

I've just committed this change.  How is it?

=== modified file 'src/search.c'
--- src/search.c	2012-10-10 20:09:47 +0000
+++ src/search.c	2012-12-15 13:04:46 +0000
@@ -1313,8 +1313,11 @@
 	     non-nil, we can use boyer-moore search only if TRT can be
 	     represented by the byte array of 256 elements.  For that,
 	     all non-ASCII case-equivalents of all case-sensitive
-	     characters in STRING must belong to the same charset and
-	     row.  */
+	     characters in STRING must belong to the same character
+	     group (two characters belong to the same group iff their
+	     multibyte forms are the same except for the last byte;
+	     i.e. every 64 characters form a group; U+0000..U+003F,
+	     U+0040..U+007F, U+0080..U+00BF, ...).  */
 
 	  while (--len >= 0)
 	    {

---
Kenichi Handa
handa <at> gnu.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13084; Package emacs. (Sat, 15 Dec 2012 13:58:01 GMT) Full text and rfc822 format available.

Message #46 received at 13084 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Kenichi Handa <handa <at> gnu.org>
Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
Subject: Re: bug#13084: boyer_moore crashes with certain characters in the
	case	table
Date: Sat, 15 Dec 2012 15:55:06 +0200

> From: Kenichi Handa <handa <at> gnu.org>
> Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
> Date: Sat, 15 Dec 2012 22:17:17 +0900
> 
> In article <83obhxoo2v.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> > > Here, A and B belongs to the same character group iff A and
> > > B has the same multibyte sequence except for the last byte.
> > > In this condition, we should be able to use the table
> > > simple_translate.
> 
> > OK, then maybe just the comments need to be fixed.  They shouldn't
> > talk about "charset" and "row", which are undefined in Unicode Emacs.
> > They should instead use terminology that correspond to UTF-8 multibyte
> > representation of characters we use today.
> 
> I've just committed this change.  How is it?

Clear, thanks.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 13 Jan 2013 12:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 12 years and 163 days ago.

Previous Next

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.

GNU bug report logs - #13084 boyer_moore crashes with certain characters in the case table

GNU bug report logs - #13084
boyer_moore crashes with certain characters in the case table