GNU bug report logs - #13084
boyer_moore crashes with certain characters in the case table
Reported by: Juri Linkov <juri <at> jurta.org>
Date: Wed, 5 Dec 2012 00:37:02 UTC
Severity: normal
Done: Juri Linkov <juri <at> jurta.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13084 in the body.
You can then email your comments to 13084 AT debbugs.gnu.org in the normal way.
Report forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 05 Dec 2012 00:37:02 GMT)
Acknowledgement sent to Juri Linkov <juri <at> jurta.org>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Wed, 05 Dec 2012 00:37:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
The minimal reproducible recipe for crashes in boyer_moore noticed in bug#13041:
1. emacs -Q
2. Eval in *scratch*:
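(let ((table (standard-case-table)) canon)
  ;; Copy the case table to use as the canonicalization table.
  (setq canon (copy-sequence table))
  ;; Canonicalize FULLWIDTH LATIN SMALL LETTER Y to ASCII y.
  (aset canon #xff59 ?y)
  ;; Extra slot 1 holds the canonicalize table, slot 2 the equivalences table.
  (set-char-table-extra-slot table 1 canon)
  (set-char-table-extra-slot table 2 nil)
  (set-standard-case-table table))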
3. Start an activity that includes a search, e.g. `C-x 8 RET TAB'
The crash in boyer_moore is caused by fullwidth characters like #xff59
whose Unicode properties are:
name: FULLWIDTH LATIN SMALL LETTER Y
decomposition: (wide 121) (wide 'y')
However, the crash doesn't occur when the same fullwidth characters are
set to their downcase counterparts in lisp/international/characters.el:
;; Fullwidth Latin
(setq c #xff21)
(while (<= c #xff3a)
  (set-case-syntax-pair c (+ c #x20) tbl)
  (modify-category-entry c ?l)
  (modify-category-entry (+ c #x20) ?l)
  (setq c (1+ c)))
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Tue, 11 Dec 2012 15:39:02 GMT)
Message #8 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> From: Juri Linkov <juri <at> jurta.org>
> Date: Wed, 05 Dec 2012 02:34:39 +0200
>
> The minimal reproducible recipe for crashes in boyer_moore noticed in bug#13041:
>
> 1. emacs -Q
>
> 2. Eval in *scratch*:
>
> (let ((table (standard-case-table)) canon)
> (setq canon (copy-sequence table))
> (aset canon #xff59 ?y)
> (set-char-table-extra-slot table 1 canon)
> (set-char-table-extra-slot table 2 nil)
> (set-standard-case-table table))
>
> 3. Start an activity that includes a search, e.g. `C-x 8 RET TAB'
Thanks. I think I fixed this (revision 111021 on the emacs-24
branch), please test.
In addition, I'd suggest that Handa-san (or someone else) takes a good
look at the code that sets up the simple_translate table in
boyer_moore, because the constants there, like 0200 and 0x3F, and all
the talk about characters that belong "to the same charset and row"
smell of pre-Unicode (a.k.a. "MULE") representation of characters.
For now, I disabled boyer_moore for unibyte characters beyond 160,
because my reading of the code is that simple_translate and the
supporting code cannot handle that. Maybe I'm wrong.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Tue, 11 Dec 2012 23:25:01 GMT)
Message #11 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> I think i fixed this (revision 111021 on the emacs-24 branch),
> please test.
Thanks, there are no more crashes when using code from
http://debbugs.gnu.org/13041#41
Does this mean there are no more obstacles to filling a translation table
for ignoring equivalence with all character mappings according to the
`decomposition' property? This would be the first step in this direction.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 12 Dec 2012 03:57:01 GMT)
Message #14 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> From: Juri Linkov <juri <at> jurta.org>
> Cc: Kenichi Handa <handa <at> gnu.org>, 13084 <at> debbugs.gnu.org
> Date: Wed, 12 Dec 2012 01:17:04 +0200
>
> > I think i fixed this (revision 111021 on the emacs-24 branch),
> > please test.
>
> Thanks, there are no more crashes when using code from
> http://debbugs.gnu.org/13041#41
>
> Does this mean there are no more obstacles to filling a translation table
> for ignoring equivalence with all character mappings according to the
> `decomposition' property? This would be the first step in this direction.
I'm not sure I understand what you are asking. Please show more
details.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 12 Dec 2012 09:36:02 GMT)
Message #17 received at 13084 <at> debbugs.gnu.org (full text, mbox):
>> Does this mean there are no more obstacles to filling a translation table
>> for ignoring equivalence with all character mappings according to the
>> `decomposition' property? This would be the first step in this direction.
>
> I'm not sure I understand what you are asking. Please show more details.
There is confusion with the word `equivalence'. Currently there
exists the case equivalence table in the case table (`case_eqv_table').
Implementing a diacritic search in bug#13041 requires adding a new
similar table. I don't know what would be a good name:
`decomposition_eqv_table' or `normalization_eqv_table' or something better.
I'm unfamiliar with the details of `search_buffer', but in principle
using two tables in the macro `TRANSLATE' could implement a diacritic
search where at the first step the character will be translated using
`decomposition_eqv_table', and after that the resulting character
will be translated using `case_eqv_table'.
So the dataflow to get the canonical character will be Á -> A -> a.
If `case-fold-search' is nil, then Á -> A. If a new variable
`decomposition-search' (or `normalized-search') is nil then Á -> á.
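The two-step lookup proposed here can be sketched in C. This is only an illustration under stated assumptions, not how `TRANSLATE' or `search_buffer' is written: the tiny hard-coded mappings stand in for the proposed decomposition table and the existing case table, and the names decomposition_translate, case_translate and canonical_char are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the proposed decomposition table:
   maps a few precomposed Latin letters to their base letter.  */
static uint32_t
decomposition_translate (uint32_t c)
{
  switch (c)
    {
    case 0x00C0: case 0x00C1: return 'A';   /* À, Á */
    case 0x00E0: case 0x00E1: return 'a';   /* à, á */
    default: return c;
    }
}

/* Hypothetical stand-in for the case table: ASCII letters plus
   the two accented capitals used in this example.  */
static uint32_t
case_translate (uint32_t c)
{
  if (c >= 'A' && c <= 'Z')
    return c + ('a' - 'A');
  if (c == 0x00C0 || c == 0x00C1)
    return c + 0x20;                        /* À -> à, Á -> á */
  return c;
}

/* Apply the decomposition step first, then the case step,
   each only when its flag is set.  */
uint32_t
canonical_char (uint32_t c, bool decomposition_search, bool case_fold_search)
{
  assert (c <= 0x3FFFFF);   /* largest Emacs character code */
  if (decomposition_search)
    c = decomposition_translate (c);
  if (case_fold_search)
    c = case_translate (c);
  return c;
}
```

With both flags set, canonical_char (0x00C1, true, true) follows Á -> A -> a; turning off the case step stops at A, and turning off the decomposition step gives á.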
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 12 Dec 2012 10:23:01 GMT)
Message #20 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> So the dataflow to get the canonical character will be Á -> A -> a.
> If `case-fold-search' is nil, then Á -> A. If a new variable
> `decomposition-search' (or `normalized-search') is nil then Á -> á.
Any such table should allow handling asymmetric searches: That is,
searching for "ába" should match "ába" "ábà" and "ábá" but not "aba" or
"àbá". Can we do that?
martin
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 12 Dec 2012 10:39:01 GMT)
Message #23 received at 13084 <at> debbugs.gnu.org (full text, mbox):
>> So the dataflow to get the canonical character will be Á -> A -> a.
>> If `case-fold-search' is nil, then Á -> A. If a new variable
>> `decomposition-search' (or `normalized-search') is nil then Á -> á.
>
> Any such table should allow handling asymmetric searches: That is,
> searching for "ába" should match "ába" "ábà" and "ábá" but not "aba" or
> "àbá". Can we do that?
IIUC what you mean is something like `search-upper-case'
where upper case chars disable case fold searching,
so "Aba" should match "Aba" and "AbA" but not "aba".
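The asymmetric rule discussed here can be sketched in C, by analogy with `search-upper-case'. This is a sketch under assumptions, not existing Emacs code: base_letter is a hypothetical lookup that knows only a, à and á, the function names are made up for illustration, and strings are compared position by position as arrays of code points.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical base-letter lookup covering only a, à and á.  */
static uint32_t
base_letter (uint32_t c)
{
  return (c == 0x00E0 || c == 0x00E1) ? 'a' : c;
}

/* A pattern character that is itself a base letter matches any
   variant of that letter; a pattern character that carries a
   diacritic matches only itself.  */
static bool
char_matches (uint32_t pat, uint32_t text)
{
  if (base_letter (pat) == pat)
    return base_letter (text) == pat;
  return text == pat;
}

/* Compare two code point sequences of equal length N.  */
bool
asymmetric_match (const uint32_t *pat, const uint32_t *text, size_t n)
{
  assert (pat != NULL && text != NULL);
  for (size_t i = 0; i < n; i++)
    if (!char_matches (pat[i], text[i]))
      return false;
  return true;
}
```

Under this sketch the pattern "ába" matches "ába", "ábà" and "ábá", but not "aba" or "àbá", because the accented first character of the pattern only matches itself.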
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 12 Dec 2012 12:45:02 GMT)
Message #26 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> IIUC what you mean is something like `search-upper-case'
> where upper case chars disable case fold searching,
> so "Aba" should match "Aba" and "AbA" but not "aba".
Yes. I think that's a very good explanation in Emacs terms.
martin
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Wed, 12 Dec 2012 16:49:01 GMT)
Message #29 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> From: Juri Linkov <juri <at> jurta.org>
> Cc: handa <at> gnu.org, 13084 <at> debbugs.gnu.org
> Date: Wed, 12 Dec 2012 11:27:50 +0200
>
> >> Does this mean there are no more obstacles to filling a translation table
> >> for ignoring equivalence with all character mappings according to the
> >> `decomposition' property? This would be the first step in this direction.
> >
> > I'm not sure I understand what you are asking. Please show more details.
>
> There is confusion with the word `equivalence'. Currently there
> exists the case equivalence table in the case table (`case_eqv_table').
> Implementing a diacritic search in bug#13041 requires adding a new
> similar table. I don't know what would be a good name:
> `decomposition_eqv_table' or `normalization_eqv_table' or something better.
>
> I'm unfamiliar with the details of `search_buffer', but in principle
> using two tables in the macro `TRANSLATE' could implement a diacritic
> search where at the first step the character will be translated using
> `decomposition_eqv_table', and after that the resulting character
> will be translated using `case_eqv_table'.
>
> So the dataflow to get the canonical character will be Á -> A -> a.
> If `case-fold-search' is nil, then Á -> A. If a new variable
> `decomposition-search' (or `normalized-search') is nil then Á -> á.
OK, all this is now clear and agreed. So what did you mean by "no
more obstacles" above? The obstacles I see are that case tables aren't
up to the job because they don't support ignoring of characters, and
the code in search.c cannot handle ignoring even if the table did
support that. These obstacles still stand.
Reply sent to Juri Linkov <juri <at> jurta.org>: You have taken responsibility. (Wed, 12 Dec 2012 23:11:01 GMT)
Notification sent to Juri Linkov <juri <at> jurta.org>: bug acknowledged by developer. (Wed, 12 Dec 2012 23:11:01 GMT)
Message #34 received at 13084-done <at> debbugs.gnu.org (full text, mbox):
> So what did you mean by "no more obstacles" above?
By obstacles I meant crashes that you fixed.
Thanks for that. I'm closing this bug.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Thu, 13 Dec 2012 13:43:02 GMT)
Message #37 received at 13084 <at> debbugs.gnu.org (full text, mbox):
In article <831uewa9cq.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
> In addition, I'd suggest that Handa-san (or someone else) takes a good
> look at the code that sets up the simple_translate table in
> boyer_moore, because the constants there, like 0200 and 0x3F, and all
> the talk about characters that belong "to the same charset and row"
> smell of pre-Unicode (a.k.a. "MULE") representation of characters.
> For now, I disabled boyer_moore for unibyte characters beyond 160,
> because my reading of the code is that simple_translate and the
> supporting code cannot handle that. Maybe I'm wrong.
I have not yet checked the code, but what I remember is that
search_buffer checks the search string and decides which to
use: boyer_moore or simple_search. If all equivalent
characters of all non-ASCII characters in the search string
are in the same character group, we can use boyer_moore.
Here, A and B belong to the same character group iff A and
B have the same multibyte sequence except for the last byte.
In this condition, we should be able to use the table
simple_translate.
---
Kenichi Handa
handa <at> gnu.org
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Thu, 13 Dec 2012 17:34:02 GMT)
Message #40 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> From: Kenichi Handa <handa <at> gnu.org>
> Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
> Date: Thu, 13 Dec 2012 22:39:29 +0900
>
> I have not yet checked the code, but what I remember is that
> search_buffer checks the search string and decides which to
> use; boyer_moore or simple_search. If all equivalent
> characters of all non-ASCII characters in the search string
> are in the same character group, we can use boyer_moore.
Yes, that's my reading of the code as well.
> Here, A and B belongs to the same character group iff A and
> B has the same multibyte sequence except for the last byte.
> In this condition, we should be able to use the table
> simple_translate.
OK, then maybe just the comments need to be fixed. They shouldn't
talk about "charset" and "row", which are undefined in Unicode Emacs.
They should instead use terminology that corresponds to UTF-8 multibyte
representation of characters we use today.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Sat, 15 Dec 2012 13:22:01 GMT)
Message #43 received at 13084 <at> debbugs.gnu.org (full text, mbox):
In article <83obhxoo2v.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
> > Here, A and B belongs to the same character group iff A and
> > B has the same multibyte sequence except for the last byte.
> > In this condition, we should be able to use the table
> > simple_translate.
> OK, then maybe just the comments need to be fixed. They shouldn't
> talk about "charset" and "row", which are undefined in Unicode Emacs.
> They should instead use terminology that correspond to UTF-8 multibyte
> representation of characters we use today.
I've just committed this change. How is it?
=== modified file 'src/search.c'
--- src/search.c 2012-10-10 20:09:47 +0000
+++ src/search.c 2012-12-15 13:04:46 +0000
@@ -1313,8 +1313,11 @@
non-nil, we can use boyer-moore search only if TRT can be
represented by the byte array of 256 elements. For that,
all non-ASCII case-equivalents of all case-sensitive
- characters in STRING must belong to the same charset and
- row. */
+ characters in STRING must belong to the same character
+ group (two characters belong to the same group iff their
+ multibyte forms are the same except for the last byte;
+ i.e. every 64 characters form a group; U+0000..U+003F,
+ U+0040..U+007F, U+0080..U+00BF, ...). */
while (--len >= 0)
{
---
Kenichi Handa
handa <at> gnu.org
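The grouping rule in the revised comment can be sketched in C. This is an illustration only, not the code in src/search.c: the names char_group, same_char_group and last_byte are hypothetical, and the sketch assumes standard UTF-8 for code points up to U+10FFFF. In UTF-8 the last byte of a multibyte form carries the low six bits of the code point, so two characters whose forms differ only in the last byte share everything above those six bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Group index of C: every 64 consecutive code points form one
   group, so the index is the code point without its low six bits.  */
uint32_t
char_group (uint32_t c)
{
  assert (c <= 0x10FFFF);
  return c >> 6;
}

/* True iff A and B fall in the same group of 64 characters.  */
bool
same_char_group (uint32_t a, uint32_t b)
{
  return char_group (a) == char_group (b);
}

/* Last byte of the UTF-8 form of C: an ASCII character is its own
   single byte; otherwise it is a trailing byte 10xxxxxx holding
   the low six bits (0200 is 0x80, 0x3F masks six bits).  */
unsigned char
last_byte (uint32_t c)
{
  assert (c <= 0x10FFFF);
  if (c < 0x80)
    return (unsigned char) c;
  return (unsigned char) (0x80 | (c & 0x3F));
}
```

Under this sketch U+FF41 and U+FF59 share a group, while U+FF59 and ASCII y do not, so a mapping between the latter two cannot be expressed by changing the last byte alone.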
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#13084; Package emacs. (Sat, 15 Dec 2012 13:58:01 GMT)
Message #46 received at 13084 <at> debbugs.gnu.org (full text, mbox):
> From: Kenichi Handa <handa <at> gnu.org>
> Cc: juri <at> jurta.org, 13084 <at> debbugs.gnu.org
> Date: Sat, 15 Dec 2012 22:17:17 +0900
>
> In article <83obhxoo2v.fsf <at> gnu.org>, Eli Zaretskii <eliz <at> gnu.org> writes:
>
> > > Here, A and B belongs to the same character group iff A and
> > > B has the same multibyte sequence except for the last byte.
> > > In this condition, we should be able to use the table
> > > simple_translate.
>
> > OK, then maybe just the comments need to be fixed. They shouldn't
> > talk about "charset" and "row", which are undefined in Unicode Emacs.
> > They should instead use terminology that correspond to UTF-8 multibyte
> > representation of characters we use today.
>
> I've just committed this change. How is it?
Clear, thanks.
Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 13 Jan 2013 12:24:04 GMT)
This bug report was last modified 12 years and 163 days ago.