GNU bug report logs - #17130
24.4.50; Deficient Unicode case folding


Package: emacs;

Reported by: Nathan Trapuzzano <nbtrap <at> nbtrap.com>

Date: Fri, 28 Mar 2014 12:08:02 UTC

Severity: wishlist

Found in version 24.4.50

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 17130 in the body.
You can then email your comments to 17130 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Fri, 28 Mar 2014 12:08:02 GMT)

Acknowledgement sent to Nathan Trapuzzano <nbtrap <at> nbtrap.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 28 Mar 2014 12:08:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.4.50; Deficient Unicode case folding
Date: Fri, 28 Mar 2014 08:07:20 -0400
M-: (compare-strings "σ" nil nil "ς" nil nil t)

==> -1  ;; should be t

Can someone that knows a thing about Unicode and emacs case tables speak
to whether the latter could suffice for implementing full Unicode case
folding?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Fri, 28 Mar 2014 15:52:02 GMT)

Message #8 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Fri, 28 Mar 2014 18:51:49 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Date: Fri, 28 Mar 2014 08:07:20 -0400
> 
> M-: (compare-strings "σ" nil nil "ς" nil nil t)
> 
> ==> -1  ;; should be t

No, because these characters are not a case pair.

> Can someone that knows a thing about Unicode and emacs case tables speak
> to whether the latter could suffice for implementing full Unicode case
> folding?

What is "full Unicode case folding"?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Fri, 28 Mar 2014 19:32:02 GMT)

Message #11 received at 17130 <at> debbugs.gnu.org:

From: nbtrap <at> nbtrap.com
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Fri, 28 Mar 2014 15:31:09 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>> 
>> ==> -1  ;; should be t
>
> No, because these characters are not a case pair.

They're not a case pair in Emacs, but they should compare equally under
Unicode case folding.

>> Can someone that knows a thing about Unicode and emacs case tables speak
>> to whether the latter could suffice for implementing full Unicode case
>> folding?
>
> What is "full Unicode case folding"?

Something that implements this:
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

And perhaps more.  I don't know, but someone on this list probably does.

If you look about a third of the way down, there's a line saying that
U+03C2 (ς) should fold into U+03C3 (σ).
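
To see what that CaseFolding.txt line implies for a comparison, here is a minimal sketch in Python rather than Emacs Lisp. It assumes Python 3's `str.casefold`, which applies Unicode full case folding; the helper name `fold_equal` is made up for illustration and is not an Emacs or Unicode API.

```python
def fold_equal(a: str, b: str) -> bool:
    """Compare two strings after context-free Unicode case folding."""
    return a.casefold() == b.casefold()


# Small sigma, final sigma and capital sigma: U+03C3, U+03C2, U+03A3.
sigma_forms = ["\u03c3", "\u03c2", "\u03a3"]

# Fold each form on its own and collect the distinct results.
folded = {ch.casefold() for ch in sigma_forms}

print(folded == {"\u03c3"})            # all three fold to U+03C3
print(fold_equal("\u03c3", "\u03c2"))  # True
```

Under this kind of folding the comparison in the original report would succeed, because both operands are reduced to σ before they are compared.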




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 06:46:02 GMT)

Message #14 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: nbtrap <at> nbtrap.com
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 09:45:10 +0300
> From: nbtrap <at> nbtrap.com
> Cc: 17130 <at> debbugs.gnu.org
> Date: Fri, 28 Mar 2014 15:31:09 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> M-: (compare-strings "σ" nil nil "ς" nil nil t)
> >> 
> >> ==> -1  ;; should be t
> >
> > No, because these characters are not a case pair.
> 
> They're not a case pair in Emacs, but they should compare equally under
> Unicode case folding.

Emacs doesn't currently support that.

> >> Can someone that knows a thing about Unicode and emacs case tables speak
> >> to whether the latter could suffice for implementing full Unicode case
> >> folding?
> >
> > What is "full Unicode case folding"?
> 
> Something that implements this:
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> 
> And perhaps more.  I don't know, but someone on this list probably does.
> 
> If you look about a third of the way down, there's a line saying that
> U+03C2 (ς) should fold into U+03C3 (σ).

Patches are welcome to import those tables into Emacs, and make case
folding support them.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 12:38:02 GMT)

Message #17 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 08:37:35 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> > What is "full Unicode case folding"?
>> 
>> Something that implements this:
>> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>> 
>> And perhaps more.  I don't know, but someone on this list probably does.
>> 
>> If you look about a third of the way down, there's a line saying that
>> U+03C2 (ς) should fold into U+03C3 (σ).
>
> Patches are welcome to import those tables into Emacs, and make case
> folding support them.

Reading through the manual section on case tables, it seems that this
could be supported via the extra "canonicalize" slot:

    CANONICALIZE
      The canonicalize table maps all of a set of case-related
      characters into a particular member of that set.

If this isn't already used for Unicode case folding, what _is_ it used
for?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 13:16:02 GMT)

Message #20 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 16:15:53 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Cc: 17130 <at> debbugs.gnu.org
> Date: Sat, 29 Mar 2014 08:37:35 -0400
> 
> Reading through the manual section on case tables, it seems that this
> could be supported via the extra "canonicalize" slot:
> 
>     CANONICALIZE
>       The canonicalize table maps all of a set of case-related
>       characters into a particular member of that set.

Not efficiently, no.  E.g., how will you find ς from σ, using this
method?

Besides, don't we also need to know that ς can only be present at the
end of a word?

Or maybe I'm misunderstanding what you meant?

> If this isn't already used for Unicode case folding, what _is_ it used
> for?

It is used for case-insensitive regexp matching, see search.c.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 14:04:02 GMT)

Message #23 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 10:03:32 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> Reading through the manual section on case tables, it seems that this
>> could be supported via the extra "canonicalize" slot:
>> 
>>     CANONICALIZE
>>       The canonicalize table maps all of a set of case-related
>>       characters into a particular member of that set.
>
> Not efficiently, no.  E.g., how will you find ς from σ, using this
> method?

σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
fold to σ.  (By the way, ς should upcase to Σ--that much I know the case
tables can handle.)

> Besides, don't we also need to know that ς can only be present at the
> end of a word?

Don't think so.  AFAIK, Unicode says nothing about ordering except when
it comes to combining characters.  But even if it did prescribe such a
rule, I don't think it would have anything to do with case folding.

>> If this isn't already used for Unicode case folding, what _is_ it used
>> for?
>
> It is used for case-insensitive regexp matching, see search.c.

Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
what is the purpose of the CANONICALIZE slot except as a kind of
placeholder that gets autofilled?  Are there other kinds of case
folding--other than traditional upper/lower and Unicode--that I'm not
aware of?  I understand that Emacs autofills the CANONICALIZE slot from
the other slots, but only when the CANONICALIZE slot is not already set
to non-nil.  What if the CANONICALIZE slot on ς were set to σ?  I think
that's all that would have to happen for the Unicode folding to work.
It seems the machinery is already in place.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 14:46:02 GMT)

Message #26 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 17:45:47 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Cc: 17130 <at> debbugs.gnu.org
> Date: Sat, 29 Mar 2014 10:03:32 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> Reading through the manual section on case tables, it seems that this
> >> could be supported via the extra "canonicalize" slot:
> >> 
> >>     CANONICALIZE
> >>       The canonicalize table maps all of a set of case-related
> >>       characters into a particular member of that set.
> >
> > Not efficiently, no.  E.g., how will you find ς from σ, using this
> > method?
> 
> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> fold to σ.

So you would need to search all characters to find those which have σ
in the CANONICALIZE slot -- not very efficient, to say the least.

IOW, what you suggest will provide a one-way mapping, whereas we need
a two-way mapping.

> > Besides, don't we also need to know that ς can only be present at the
> > end of a word?
> 
> Don't think so.  AFAIK, Unicode says nothing about ordering except when
> it comes to combining characters.  But even if it did prescribe such a
> rule, I don't think it would have anything to do with case folding.

Who said this is only about case folding?  Emacs should use this data
for up-casing and down-casing as well, for example, so that M-l
downcases Σ to ς, not σ, when it is at the end of the word.  Wouldn't
users of Greek expect that?

> >> If this isn't already used for Unicode case folding, what _is_ it used
> >> for?
> >
> > It is used for case-insensitive regexp matching, see search.c.
> 
> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
> what is the purpose of the CANONICALIZE slot except as a kind of
> placeholder that gets autofilled?

Whenever you need the canonical equivalent of a character, such as in
case-insensitive search, you need that slot.

> Are there other kinds of case folding--other than traditional
> upper/lower and Unicode--that I'm not aware of?

There's "title case", of course.  There are also characters whose case
pair is not a single character, but several, like the upper-case
variant of ß in German.  Basically, any character not marked "C" in
the Unicode CaseFolding.txt is special in some way.

> I understand that Emacs autofills the CANONICALIZE slot from
> the other slots, but only when the CANONICALIZE slot is not already set
> to non-nil.  What if the CANONICALIZE slot on ς were set to σ?  I think
> that's all that would have to happen for the Unicode folding to work.
> It seems the machinery is already in place.

For this case, maybe (and even then it doesn't handle Σ correctly, I think,
when downcased at the end of the word).  For other cases, not
necessarily.

Personally, I think we need an additional slot for what you want, and
code to use it.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 15:31:02 GMT)

Message #29 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 11:29:43 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
>> fold to σ.
>
> So you would need to search all characters to find those which have σ
> in the CANONICALIZE slot -- not very efficient, to say the least.

Doesn't this already happen?  If not, then what is the CANONICALIZE slot
doing that couldn't be done with the regular upcase/downcase slots by
themselves?

> IOW, what you suggest will provide a one-way mapping, whereas we need
> a two-way mapping.

Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
least in principle.

>> > Besides, don't we also need to know that ς can only be present at the
>> > end of a word?
>> 
>> Don't think so.  AFAIK, Unicode says nothing about ordering except when
>> it comes to combining characters.  But even if it did prescribe such a
>> rule, I don't think it would have anything to do with case folding.
>
> Who said this is only about case folding?

I should have said just "case", not "case folding".

> Emacs should use this data for up-casing and down-casing as well, for
> example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> the word.  Wouldn't users of Greek expect that?

Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
recommend such behavior.  It defines case conversions independently of
ordering.

That said, making M-l downcase terminal Σ to ς would be a nice feature
that could be enabled, e.g., by enabling a minor mode or by modifying
some *-functions variable of functions that get called before the normal
behavior of M-l is applied, etc.  But it shouldn't have anything to do
with Unicode-compliant case-insensitive searching.

>> Right, but what I'm asking is: if Emacs doesn't do Unicode case folding,
>> what is the purpose of the CANONICALIZE slot except as a kind of
>> placeholder that gets autofilled?
>
> Whenever you need the canonical equivalent of a character, such as in
> case-insensitive search, you need that slot.

But there's nothing about the slot that mandates that only _pairs_ can
be case-equivalent under case folding.  Indeed, the manual speaks of
"sets" of characters that might be equivalent under case-folding, hence
my understanding that σ, ς, and Σ can all have σ in their CANONICALIZE
slot, and that's all it would take.

(Btw, I'm using "case-insensitive" to mean the same as "under
case-folding".)

>> Are there other kinds of case folding--other than traditional
>> upper/lower and Unicode--that I'm not aware of?
>
> There's "title case", of course.  

I think title case would require an extra slot in the case table.

> There are also characters whose case pair is not a single character,
> but several, like the upper-case variant of ß in German.

Good point.  "ß" should fold to "ss".  I guess for the CANONICALIZE slot
to suffice, it would have to map to a string, not a code point.

> Personally, I think we need an additional slot for what you want, and
> code to use it.

Given the point about ß, you're probably right.  Unless we can make
entries in the CANONICALIZE slot be strings rather than code points.
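
The ß point can be made concrete with the data file's own layout. The sketch below, in Python rather than Emacs Lisp, parses a handful of entries written in the CaseFolding.txt format and builds two tables: simple folding (statuses C and S), where every value is one character, and full folding (statuses C and F), where a value may be a string. The excerpt is a small hand-picked sample, and `parse_folding`, `fold`, `SIMPLE` and `FULL` are hypothetical names, not part of Emacs or of any Unicode library.

```python
# Sample entries in the CaseFolding.txt layout:
#   <code>; <status>; <mapping>; # <name>
EXCERPT = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
03A3; C; 03C3; # GREEK CAPITAL LETTER SIGMA
03C2; C; 03C3; # GREEK SMALL LETTER FINAL SIGMA
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""


def parse_folding(text, statuses):
    """Return {char: folded string} for entries whose status is in `statuses`."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            table[chr(int(code, 16))] = "".join(
                chr(int(cp, 16)) for cp in mapping.split())
    return table


def fold(s, table):
    """Fold a string character by character; unmapped characters are kept."""
    return "".join(table.get(ch, ch) for ch in s)


SIMPLE = parse_folding(EXCERPT, {"C", "S"})   # one code point to one code point
FULL = parse_folding(EXCERPT, {"C", "F"})     # may expand to several code points

print(fold("\u00df", SIMPLE))   # ß is left alone by simple folding
print(fold("\u00df", FULL))     # ss
```

A table whose values are single characters could hold the simple mappings, including ς to σ. The full mappings are the ones that need string values, which is the difficulty raised about the CANONICALIZE slot.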




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 17:38:02 GMT)

Message #32 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 20:37:38 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Cc: 17130 <at> debbugs.gnu.org
> Date: Sat, 29 Mar 2014 11:29:43 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> σ, ς, and Σ would all have σ in the CANONICALIZE slot, since they all
> >> fold to σ.
> >
> > So you would need to search all characters to find those which have σ
> > in the CANONICALIZE slot -- not very efficient, to say the least.
> 
> Doesn't this already happen?

No, not when that slot is used for case-insensitive search.  You just
use it to get the canonical equivalent, i.e. use the one-way mapping
that it provides.

> If not, then what is the CANONICALIZE slot doing that couldn't be
> done with the regular upcase/downcase slots by themselves?

If that slot is "trivial", i.e. contains the lower-case variant of the
character, then indeed this slot doesn't add information, I think,
only utility.  But it doesn't have to contain the lower-case variant.

> > IOW, what you suggest will provide a one-way mapping, whereas we need
> > a two-way mapping.
> 
> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
> least in principle.

It is sufficient for mapping a character to its canonical equivalent,
but not finding the non-canonical variants of a canonical character.
IOW, it is not well suited to finding ς given just σ.

> > Emacs should use this data for up-casing and down-casing as well, for
> > example, so that M-l downcases Σ to ς, not σ, when it is at the end of
> > the word.  Wouldn't users of Greek expect that?
> 
> Maybe.  I'm just saying that Unicode itself doesn't prescribe or even
> recommend such behavior.  It defines case conversions independently of
> ordering.
> 
> That said, making M-l downcase terminal Σ to ς would be a nice feature
> that could be enabled, e.g., by enabling a minor mode or by modifying
> some *-functions variable of functions that get called before the normal
> behavior of M-l is applied, etc.  But it shouldn't have anything to do
> with Unicode-compliant case-insensitive searching.

For searching, you only need the CANONICALIZE slot.  But what about
replacing the search string while keeping the letter case in the
replacement?  For that, CANONICALIZE alone is not enough, you need the
reverse mapping.

> > Personally, I think we need an additional slot for what you want, and
> > code to use it.
> 
> Given the point about ß, you're probably right.  Unless we can make
> entries in the CANONICALIZE slot be strings rather than code points.

This is Lisp; a vector slot can contain any Lisp object.  But using
CANONICALIZE for what you want would be wrong, I think, because it
will screw up case-insensitive search, which expects to find there a
single character.
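
The one-way versus two-way distinction can be sketched as follows. A fold function only goes from a character to its canonical form; to go from σ back to its variants without scanning every character on each lookup, the fold map can be inverted once into equivalence classes. This is a Python illustration that assumes `str.casefold` as the fold function and scans only the Greek and Coptic block; `build_equivalence_classes` is a hypothetical helper and does not model the Emacs case table.

```python
from collections import defaultdict


def build_equivalence_classes(chars):
    """Group characters by folded form: canonical form -> set of variants."""
    classes = defaultdict(set)
    for ch in chars:
        classes[ch.casefold()].add(ch)
    return classes


# Scan the Greek and Coptic block (U+0370..U+03FF) once, up front.
greek = [chr(cp) for cp in range(0x0370, 0x0400)]
classes = build_equivalence_classes(greek)

# Forward direction: one table lookup per character.
canonical = "\u03c2".casefold()          # final sigma folds to small sigma

# Reverse direction: one dictionary lookup, no rescan of all characters.
variants = sorted(classes[canonical])
print([f"U+{ord(ch):04X}" for ch in variants])
```

The reverse table costs one pass to build, after which finding ς from σ is a single lookup.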




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 18:33:02 GMT)

Message #35 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 14:31:52 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> > So you would need to search all characters to find those which have σ
>> > in the CANONICALIZE slot -- not very efficient, to say the least.
>> 
>> Doesn't this already happen?
>
> No, not when that slot is used for case-insensitive search.  You just
> use it to get the canonical equivalent, i.e. use the one-way mapping
> that it provides.

I still don't get it.  What I say below may explain why.

>> If not, then what is the CANONICALIZE slot doing that couldn't be
>> done with the regular upcase/downcase slots by themselves?
>
> If that slot is "trivial", i.e. contains the lower-case variant of the
> character, then indeed this slot doesn't add information, I think,
> only utility.  But it doesn't have to contain the lower-case variant.

I know.  But if Emacs doesn't do Unicode folding, what is there other
than lower/upper variants?

>> > IOW, what you suggest will provide a one-way mapping, whereas we need
>> > a two-way mapping.
>> 
>> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
>> least in principle.
>
> It is sufficient for mapping a character to its canonical equivalent,
> but not finding the non-canonical variants of a canonical character.
> IOW, it is not well suited to finding ς given just σ.

Finding the non-canonical variants is not something that happens (at
least in principle) during case-insensitive matching.  You convert both
the matching string and the string being matched into their canonical
equivalents and see if they match.  You never UNfold.  Case folding is
by definition a one-way operation.

>> That said, making M-l downcase terminal Σ to ς would be a nice feature
>> that could be enabled, e.g., by enabling a minor mode or by modifying
>> some *-functions variable of functions that get called before the normal
>> behavior of M-l is applied, etc.  But it shouldn't have anything to do
>> with Unicode-compliant case-insensitive searching.
>
> For searching, you only need the CANONICALIZE slot.  But what about
> replacing the search string while keeping the letter case in the
> replacement?  For that, CANONICALIZE alone is not enough, you need the
> reverse mapping.

There is no reverse mapping when it comes to folding.  There can't be,
since multiple characters can fold into the same character.

I don't fully understand what "case-replace" does (e.g. case being a
property of characters and not strings, what does it mean to "preserve
case" when replacing a string of length x with a string of length y
where x != y), but I don't think Unicode folding would complicate it.
There are three cases in Unicode: lower, upper, and title.  Upper and
title already overlap for the vast majority of codepoints, so there you
already have problems with a case-preserving replace.  That said "fold"
is not a case in Unicode; it's a one-way mapping of non-overlapping sets
of characters to a canonical equivalent, so it makes no sense to talk
about preserving case with respect to case folding.

Notandum: I was wrong about Unicode saying nothing about character
ordering for non-combining characters.  The "special casing" document
(ftp://ftp.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt) contains
context- and language- dependent case rules for certain characters,
including final sigma.  Notably, the document says that Σ in terminal
position should (or "may"--I'm not really sure about how to interpret
the document) downcase to ς.  That said, the document has _nothing_ to
do with case _folding_, which is always context- and language-
independent.

Rightly interpreted, therefore, case _conversion_ (such as in
case-preserving replace) and case-insensitive _searching_ (i.e. case
folding), according to Unicode, are orthogonal.  We don't have to
address both at the same time.

>> Given the point about ß, you're probably right.  Unless we can make
>> entries in the CANONICALIZE slot be strings rather than code points.
>
> This is Lisp; a vector slot can contain any Lisp object.  But using
> CANONICALIZE for what you want would be wrong, I think, because it
> will screw up case-insensitive search, which expects to find there a
> single character.

Right, that's what I meant.  Putting strings there would break
something.
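
The difference between context-sensitive conversion and context-free folding can be observed outside Emacs. The sketch below is Python and assumes that CPython's `str.lower` applies the final-sigma condition from SpecialCasing.txt while `str.casefold` does not look at context; it says nothing about what Emacs itself does.

```python
word = "\u039f\u0394\u039f\u03a3"   # ΟΔΟΣ, capital sigma in final position

lowered = word.lower()       # case conversion, sensitive to word position
folded = word.casefold()     # case folding, same result in any position

print(lowered.endswith("\u03c2"))    # final sigma after lowercasing
print(folded.endswith("\u03c3"))     # plain small sigma after folding

# An isolated capital sigma has no preceding letter, so it becomes U+03C3.
print("\u03a3".lower() == "\u03c3")

# Folding the lowercased word gives the same key as folding the original,
# so a folded comparison does not depend on which sigma was typed.
print(lowered.casefold() == folded)
```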




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 18:37:01 GMT)

Message #38 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 14:36:42 -0400
Nathan Trapuzzano <nbtrap <at> nbtrap.com> writes:

> Rightly interpreted, therefore, case _conversion_ (such as in
> case-preserving replace) and case-insensitive _searching_ (i.e. case
> folding), according to Unicode, are orthogonal.  We don't have to
> address both at the same time.

Er, let me rephrase.  Case _conversion_ (such as in case-preserving
replace) and case _folding_ (such as ought to be used in case-insensitive
searching) are orthogonal.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 19:51:01 GMT)

Message #41 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 22:50:40 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Cc: 17130 <at> debbugs.gnu.org
> Date: Sat, 29 Mar 2014 14:31:52 -0400
> 
> >> If not, then what is the CANONICALIZE slot doing that couldn't be
> >> done with the regular upcase/downcase slots by themselves?
> >
> > If that slot is "trivial", i.e. contains the lower-case variant of the
> > character, then indeed this slot doesn't add information, I think,
> > only utility.  But it doesn't have to contain the lower-case variant.
> 
> I know.  But if Emacs doesn't do Unicode folding, what is there other
> than lower/upper variants?

You can make it have whatever you like, because you can set up
buffer-specific tables.

> >> Not sure I follow.  Seems to me the CANONICALIZE slot is sufficient, at
> >> least in principle.
> >
> > It is sufficient for mapping a character to its canonical equivalent,
> > but not finding the non-canonical variants of a canonical character.
> > IOW, it is not well suited to finding ς given just σ.
> 
> Finding the non-canonical variants is not something that happens (at
> least in principle) during case-insensitive matching.

The case database is not only for searching.

> > For searching, you only need the CANONICALIZE slot.  But what about
> > replacing the search string while keeping the letter case in the
> > replacement?  For that, CANONICALIZE alone is not enough, you need the
> > reverse mapping.
> 
> There is no reverse mapping when it comes to folding.  There can't be,
> since multiple characters can fold into the same character.

You can use the case of the string being replaced as guidelines.
E.g., if the replaced string was capitalized, you can capitalize the
replacement.
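
The "use the case of the replaced string as a guideline" idea can be sketched as below. This is a simplified Python illustration, not the logic Emacs uses for case-preserving replacement; `match_case` is a hypothetical helper that recognizes only three patterns (all upper-case, capitalized, anything else). It consults only upper/lower conversion and never a folding table.

```python
def match_case(replaced: str, replacement: str) -> str:
    """Shape `replacement` after the case pattern of `replaced`."""
    if replaced.isupper():
        # The whole match was upper-case: upcase the replacement.
        return replacement.upper()
    if replaced[:1].isupper() and replaced[1:].islower():
        # The match was capitalized: capitalize the replacement.
        return replacement.capitalize()
    # Otherwise leave the replacement as typed.
    return replacement


print(match_case("Foo", "bar"))   # Bar
print(match_case("FOO", "bar"))   # BAR
print(match_case("foo", "bar"))   # bar
```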




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 19:52:02 GMT)

Message #44 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 22:51:20 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Cc: 17130 <at> debbugs.gnu.org
> Date: Sat, 29 Mar 2014 14:36:42 -0400
> 
> Er, let me rephrase.  Case _conversion_ (such as in case-preserving
> replace) and case _folding_ (such as ought to be used in case-insensitive
> searching) are orthogonal.

But they can very well use the same database.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 20:05:01 GMT)

Message #47 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 16:01:10 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> I know.  But if Emacs doesn't do Unicode folding, what is there other
>> than lower/upper variants?
>
> You can make it have whatever you like, because you can set up
> buffer-specific tables.

Makes me wonder if whoever implemented the CANONICALIZE slot had Unicode
folding in mind.

>> Finding the non-canonical variants is not something that happens (at
>> least in principle) during case-insensitive matching.
>
> The case database is not only for searching.
>
>> There is no reverse mapping when it comes to folding.  There can't be,
>> since multiple characters can fold into the same character.
>
> You can use the case of the string being replaced as guidelines.
> E.g., if the replaced string was capitalized, you can capitalize the
> replacement.

I think you're still conflating case conversion and case folding.  As I
said, there is no case called "fold".  There's just upper, lower, and
title.  And the fact that these three overlap is already a problem for
case-preserving replace.  I spent most of my last email trying to
explain this.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sat, 29 Mar 2014 20:16:01 GMT)

Message #50 received at 17130 <at> debbugs.gnu.org:

From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sat, 29 Mar 2014 16:15:34 -0400
Eli Zaretskii <eliz <at> gnu.org> writes:

>> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
>> Cc: 17130 <at> debbugs.gnu.org
>> Date: Sat, 29 Mar 2014 14:36:42 -0400
>> 
>> Er, let me rephrase.  Case _conversion_ (such as in case-preserving
>> replace) and case _folding_ (such as ought to be used in case-insensitive
>> searching) are orthogonal.
>
> But they can very well use the same database.

It's not clear what you mean.

We already have a place to store upper- and lower- case variants.  What
I'm proposing is to use the CANONICALIZE slot as a place to store the
case-folding mapping.  If this would mess up Emacs' case-preserving
replace, then I think that would just mean that case-preserving replace
is broken.  There is no such case as "canonicalize"--you can't say, "Oh,
this string is in the canonical case, so I want to replace it with
this other string in canonical case".  A case-preserving replace should
only consult the upper- and lower-case slots (and perhaps the title-case
slot if it existed).




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sun, 30 Mar 2014 02:46:01 GMT)

Message #53 received at 17130 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sun, 30 Mar 2014 05:45:39 +0300
> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> Cc: 17130 <at> debbugs.gnu.org
> Date: Sat, 29 Mar 2014 16:15:34 -0400
> 
> Eli Zaretskii <eliz <at> gnu.org> writes:
> 
> >> From: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
> >> Cc: 17130 <at> debbugs.gnu.org
> >> Date: Sat, 29 Mar 2014 14:36:42 -0400
> >> 
> >> Er, let me rephrase.  Case _conversion_ (such as in case-preserving
> >> replace) and case _folding_ (such as ought to be used in case-insensitive
> >> searching) are orthogonal.
> >
> > But they can very well use the same database.
> 
> It's not clear what you mean.

You keep asking questions about the purpose of the CANONICALIZE slot,
and I keep trying to explain that purpose.

> We already have a place to store upper- and lower- case variants.  What
> I'm proposing is to use the CANONICALIZE slot as a place to store the
> case-folding mapping.  If this would mess up Emacs' case-preserving
> replace, then I think that would just mean that case-preserving replace
> is broken.  There is no such case as "canonicalize"--you can't say, "Oh,
> this string is in the canonical case, so I want to replace it with
> this other string in canonical case".  A case-preserving replace should
> only consult the upper- and lower-case slots (and perhaps the title-case
> slot if it existed).

Perhaps you should explain what this means in practice, from the POV
of populating the CANONICALIZE slot, and how that content would be
used under your proposal.  That should make the discussion more
useful, I hope.




Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Tue, 15 Apr 2014 04:01:01 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#17130; Package emacs. (Sun, 29 Sep 2019 14:24:01 GMT)

Message #58 received at 17130 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Nathan Trapuzzano <nbtrap <at> nbtrap.com>
Cc: 17130 <at> debbugs.gnu.org
Subject: Re: bug#17130: 24.4.50; Deficient Unicode case folding
Date: Sun, 29 Sep 2019 16:23:21 +0200
Nathan Trapuzzano <nbtrap <at> nbtrap.com> writes:

> M-: (compare-strings "σ" nil nil "ς" nil nil t)
>
> ==> -1  ;; should be t

(compare-strings "σ" nil nil "ς" nil nil t)
=> t

I'm unable to reproduce this in Emacs 27, so I'm going to go ahead and
guess that this has been fixed in the years since this bug was reported,
and I'm closing this bug report.  If this is still a problem, please
reopen.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




bug closed, send any further explanations to 17130 <at> debbugs.gnu.org and Nathan Trapuzzano <nbtrap <at> nbtrap.com> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 29 Sep 2019 14:24:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 28 Oct 2019 11:24:10 GMT)

This bug report was last modified 4 years and 182 days ago.


