GNU bug report logs - #13041
24.2; diacritic-fold-search

Package: emacs;

Reported by: perin <at> acm.org

Date: Fri, 30 Nov 2012 18:31:02 UTC

Severity: wishlist

Found in version 24.2

Fixed in version 25.1

Done: Michael Albinus <michael.albinus <at> gmx.de>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 13041 in the body.
You can then email your comments to 13041 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 30 Nov 2012 18:31:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to perin <at> acm.org:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Fri, 30 Nov 2012 18:31:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Lewis Perin <perin <at> panix.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 13:22:05 -0500 (EST)

This is not a bug report but a feature request, so I am omitting
diagnostic information.

Emacs search has long been able to toggle between (a) ignoring the
distinction between upper- and lower-case characters
(case-fold-search) and (b) searching for only one of the pair.  One
could say Emacs offers the choice between (a) searching for all
members of a (2-member) equivalence class and (b) searching for only
one member.

There are larger equivalence classes of characters with practical use
which Emacs is currently unaware of: the groups of characters
consisting of an unadorned (ASCII) character plus all its
diacritic-adorned versions.  Currently, if I want to search for both
“apres” and “après”, I need an additive regular expression.  I would
like to do this as easily as I can search for “apres” and “Apres”.  I
would be delighted if Emacs implemented the equivalence classes
spelled out here:

  http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

I might add that diacritics folding is the default in web search
engines.  It is also a feature of at least one Web browser in
searching the text of a displayed page (Chrome.)

I’m sure that maintaining the core of Emacs is a big job, and I’m
grateful for the skill and effort that go into that task, including
your consideration of this request!

/Lew
---
Lew Perin | perin <at> acm.org | http://babelcarp.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 30 Nov 2012 18:56:02 GMT) Full text and rfc822 format available.

Message #8 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Lewis Perin <perin <at> panix.com>
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 20:51:44 +0200

> Currently, if I want to search for both “apres” and “après”,
> I need an additive regular expression.  I would like to do this as
> easily as I can search for “apres” and “Apres”.  I would be delighted
> if Emacs implemented the equivalence classes spelled out here:
>
>   http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

This could be implemented in isearch using a recipe from

http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959

Instead of hard-coding a list of equivalent characters
I guess it should be possible to do this automatically
using Unicode information about characters.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 30 Nov 2012 19:34:02 GMT) Full text and rfc822 format available.

Message #11 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Lewis Perin <perin <at> panix.com>
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 14:31:08 -0500

severity 13041 wishlist
thanks

> diacritic-adorned versions.  Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression.  I would
> like to do this as easily as I can search for “apres” and “Apres”.

That would be a very welcome feature, indeed.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 30 Nov 2012 21:10:01 GMT) Full text and rfc822 format available.

Message #14 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Lewis Perin <perin <at> panix.com>
To: Juri Linkov <juri <at> jurta.org>
Cc: 13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 16:07:44 -0500

Juri Linkov writes:
> > Currently, if I want to search for both “apres” and “après”,
> > I need an additive regular expression.  I would like to do this as
> > easily as I can search for “apres” and “Apres”.  I would be delighted
> > if Emacs implemented the equivalence classes spelled out here:
> >
> >   http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
> 
> This could be implemented in isearch using a recipe from
> 
> http://thread.gmane.org/gmane.emacs.devel/117003/focus=117959
> 
> Instead of hard-coding a list of equivalent characters
> I guess it should be possible to do this automatically
> using Unicode information about characters.

I never thought I was the first to wonder about this!

In the last message of that thread, you say “Provided it doesn’t make
the search slow, it would be nice to add it to Emacs activating on
some user settings.”  Do you remember if that technique turned out to
be tolerably speedy?

/Lew
---
Lew Perin | perin <at> acm.org | http://babelcarp.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 00:42:02 GMT) Full text and rfc822 format available.

Message #17 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Lewis Perin <perin <at> panix.com>
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 01 Dec 2012 02:27:40 +0200

> In the last message of that thread, you say “Provided it doesn’t make
> the search slow, it would be nice to add it to Emacs activating on
> some user settings.”  Do you remember if that technique turned out to
> be tolerably speedy?

Yes, I have no problems with the speed.  The problem is how to
disable this feature when it is active.  We need a special key
to toggle it in Isearch.  One variant is M-s ~ where the easy-to-type
TILDE character represents diacritics.  Also it's unclear whether the
Isearch prompt should indicate its active state as e.g.

  Diacritic I-search:

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 00:50:01 GMT) Full text and rfc822 format available.

Message #20 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Juri Linkov'" <juri <at> jurta.org>, "'Lewis Perin'" <perin <at> panix.com>
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 16:47:26 -0800

> it's unclear whether the Isearch prompt should indicate
> its active state

Ǐsearch

(But perhaps that suggests recognizing, rather than ignoring, diacritics.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 00:52:01 GMT) Full text and rfc822 format available.

Message #23 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Juri Linkov'" <juri <at> jurta.org>, "'Lewis Perin'" <perin <at> panix.com>
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 16:49:24 -0800

> > it's unclear whether the Isearch prompt should indicate
> > its active state
> 
> Isearch
> 
> (But perhaps that suggests recognizing, rather than ignoring, 
> diacritics.)

Hm. That was a capital I with caron when I sent it...

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 01:24:02 GMT) Full text and rfc822 format available.

Message #26 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Lew Perin <perin <at> panix.com>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: Juri Linkov <juri <at> jurta.org>,
	"<13041 <at> debbugs.gnu.org>" <13041 <at> debbugs.gnu.org>,
	"<perin <at> acm.org>" <perin <at> acm.org>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 20:20:59 -0500

On Nov 30, 2012, at 7:49 PM, "Drew Adams" <drew.adams <at> oracle.com> wrote:

>>> it's unclear whether the Isearch prompt should indicate
>>> its active state
>> 
>> Isearch
>> 
>> (But perhaps that suggests recognizing, rather than ignoring, 
>> diacritics.)
> 
> Hm. That was a capital I with caron when I sent it...

A caron-topped capital I is exactly what I got (on my iPhone.)

/Lew
---
Lew Perin | perin <at> acm.org | http://babelcarp.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 06:54:02 GMT) Full text and rfc822 format available.

Message #29 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Lew Perin'" <perin <at> panix.com>
Cc: 'Juri Linkov' <juri <at> jurta.org>, 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 30 Nov 2012 22:50:48 -0800

> >>> it's unclear whether the Isearch prompt should indicate
> >>> its active state
> >> 
> >> Isearch
> >> 
> >> (But perhaps that suggests recognizing, rather than ignoring, 
> >> diacritics.)
> > 
> > Hm. That was a capital I with caron when I sent it...
> 
> A caron-topped capital I is exactly what I got (on my iPhone.)

Great.  I guess it's the encoding used in my mail client that's showing it with
no marks.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 08:36:02 GMT) Full text and rfc822 format available.

Message #32 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 01 Dec 2012 10:32:35 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Date: Sat, 01 Dec 2012 02:27:40 +0200
> Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
> > In the last message of that thread, you say “Provided it doesn’t make
> > the search slow, it would be nice to add it to Emacs activating on
> > some user settings.”  Do you remember if that technique turned out to
> > be tolerably speedy?
> 
> Yes, I have no problems with the speed.  The problem is how to
> disable this feature when it is active.  We need a special key
> to toggle it in Isearch.  One variant is M-s ~ where the easy-to-type
> TILDE character represents diacritics.  Also it's unclear whether the
> Isearch prompt should indicate its active state as e.g.

I don't understand why this thread is talking only about Latin
characters with diacritics.  That is a special case of what Unicode
calls "compatibility equivalence" (q.e.).  For example, even in the
Latin environments, don't you want to find "sniﬀ" when searching for
"sniff", and vice versa? And there are similar issues in many
non-Latin scripts.

The decomposition of a character such as 'ﬀ' is given by the Unicode
database, for example:

  FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
                                      ^^^^^^^^^^^^^^^^^^

(66 hex, or 102 decimal, is the codepoint of 'f').

Emacs already supports these decomposition properties.  E.g.:

  (get-char-code-property ?ﬀ 'decomposition) => (compat 102 102)

Another example, closer to the issue that triggered this thread:

  (get-char-code-property ?è 'decomposition) => (101 768)

(If you want to understand why the previous example included "compat"
in the result, while this one doesn't, read more about Unicode
normalization forms.  The distinction is irrelevant for the current
discussion.)

Using these properties, every search string can be converted to a
sequence of non-decomposable characters (this process is recursive,
because the 'decomposition' property can use characters that
themselves are decomposable).  If the user wants to ignore diacritics,
then the diacritics should be dropped from the decomposition sequence
before starting the search.  E.g., for the decomposition of è above,
we will drop the 768 and will be left with 101, which is 'e'.  Then
searching for that string should apply the same decomposition
transformation to the text being searched, when comparing them.

This would be the most general way of solving this issue, a way that
is not limited to diacritics nor to Latin scripts.  And doing that
will move Emacs closer to the goal of being Unicode compatible, since
support for this is required by the Unicode Standard.

By contrast, building and using custom data bases of equivalences that
are limited to diacritics in Latin scripts is not moving Emacs towards
that goal.  It's just a hack, IMO.
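
A minimal sketch of the procedure described above, assuming that
`get-char-code-property' returns either a list of characters or such a
list preceded by a tag symbol like `compat', as in the two examples
quoted in this message.  The names `my-decompose-char' and
`my-strip-diacritics' are hypothetical and are not part of Emacs.  The
sketch only folds a string; it does not touch the search primitives.

```elisp
;; Hypothetical helpers; neither function is an Emacs API.

(defun my-decompose-char (char)
  "Return the list of fully decomposed characters for CHAR."
  (let ((d (get-char-code-property char 'decomposition)))
    ;; Drop a leading non-character tag such as `compat'.
    (unless (characterp (car d))
      (setq d (cdr d)))
    (if (or (null d) (equal d (list char)))
        ;; CHAR does not decompose any further.
        (list char)
      ;; The parts may themselves be decomposable, so recurse.
      (apply #'append (mapcar #'my-decompose-char d)))))

(defun my-strip-diacritics (string)
  "Return STRING decomposed, with nonspacing marks (Mn) removed."
  (let (out)
    (dolist (c (string-to-list string))
      (dolist (part (my-decompose-char c))
        (unless (eq (get-char-code-property part 'general-category) 'Mn)
          (push part out))))
    (concat (nreverse out))))
```

With the decompositions quoted above, folding "après" this way should
give "apres", and the ligature in "sniﬀ" should expand to "ff".
Applying the same function to both the search string and the text
being compared is the Lisp-level equivalent of the transformation
described here.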

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 09:12:02 GMT) Full text and rfc822 format available.

Message #35 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: juri <at> jurta.org, perin <at> panix.com
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 01 Dec 2012 11:09:20 +0200

> Date: Sat, 01 Dec 2012 10:32:35 +0200
> From: Eli Zaretskii <eliz <at> gnu.org>
> Cc: perin <at> panix.com, 13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
> I don't understand why this thread is talking only about Latin
> characters with diacritics.  That is a special case of what Unicode
> calls "compatibility equivalence" (q.e.).
                                     ^^^^
I meant "q.v.", of course.  Sorry.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 01 Dec 2012 16:42:01 GMT) Full text and rfc822 format available.

Message #38 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Eli Zaretskii'" <eliz <at> gnu.org>, "'Juri Linkov'" <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 1 Dec 2012 08:38:45 -0800

> I don't understand why this thread is talking only about Latin
> characters with diacritics.  That is a special case of what Unicode
> calls "compatibility equivalence" (q.e.).  For example, even in the
> Latin environments, don't you want to find "sniﬀ" when searching for
> "sniff", and vice versa? And there are similar issues in many
> non-Latin scripts.

Actually, in the original thread I made the same point.  
Please see that discussion for this and other points.
http://lists.gnu.org/archive/html/help-gnu-emacs/2012-11/msg00429.html

> The decomposition of a character such as 'ﬀ' is given by
> the Unicode database...  Emacs already supports these
> decomposition properties.

That's good news (new to me).  So it sounds like even the most hopeful
wanna-haves of the discussion could perhaps be realized without too much
trouble.

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable).  If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search.  E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'.  Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.
> 
> This would be the most general way of solving this issue, a way that
> is not limited to diacritics nor to Latin scripts.  And doing that
> will move Emacs closer to the goal of being Unicode compatible, since
> support for this is required by the Unicode Standard.

This sounds great.  I really hope someone with the time and knowledge adds such
a feature soon (even though, to be clear, I personally do not have much need for
it).  I think it would be very handy for many users - most welcome.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 02 Dec 2012 00:50:02 GMT) Full text and rfc822 format available.

Message #41 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 02:27:32 +0200

> Using these properties, every search string can be converted to a
> sequence of non-decomposable characters (this process is recursive,
> because the 'decomposition' property can use characters that
> themselves are decomposable).  If the user wants to ignore diacritics,
> then the diacritics should be dropped from the decomposition sequence
> before starting the search.  E.g., for the decomposition of è above,
> we will drop the 768 and will be left with 101, which is 'e'.  Then
> searching for that string should apply the same decomposition
> transformation to the text being searched, when comparing them.

Yes, using the `decomposition' property would be better than hard-coding
these decomposition mappings.  Though I'm surprised to see case mappings
hard-coded in lisp/international/characters.el instead of using the
properties `uppercase' and `lowercase' during creation of case tables.

But nevertheless the `decomposition' property should be used to find
all decomposable characters.  The question is how to use them in the search.
One solution is to use the case tables.  I tried to build the case table
with the decomposed characters retrieved using the `decomposition' property
recursively:

(defvar decomposition-table nil)

(defun make-decomposition-table ()
  (let ((table (standard-case-table))
        canon)
    (setq canon (copy-sequence table))
    (let ((c #x0000) d)
      (while (<= c #xFFFD)
        (make-decomposition-table-1 canon c c)
        (setq c (1+ c))))
    (set-char-table-extra-slot table 1 canon)
    (set-char-table-extra-slot table 2 nil)
    (setq decomposition-table table)))

(defun make-decomposition-table-1 (canon c0 c1)
  (let ((d (get-char-code-property c1 'decomposition)))
    (when d
      (unless (characterp (car d)) (pop d))
      (if (eq c1 (car d))
          (aset canon c0 (car d))
        (make-decomposition-table-1 canon c0 (car d))))))

(make-decomposition-table)

Then a new Isearch command (the existing `isearch-toggle-case-fold'
can't be used because it enables/disables the standard case table)
could toggle between the current case table and the decomposition
case table using

  (set-case-table decomposition-table)

After evaluating this, Isearch correctly finds all related characters
in every row of this example:

  http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html

But it seems using the case table for decomposition has one limitation.
I see no way to ignore combining accent characters in the case table,
i.e. to map combining accent characters to nothing.  These characters
have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
in the search.

An alternative would be to build a regexp from the search string
like building a regexp for word-search:

(define-key isearch-mode-map "\M-sd" 'isearch-toggle-decomposition)

(defun isearch-toggle-decomposition ()
  "Toggle Unicode decomposition searching on or off."
  (interactive)
  (setq isearch-word (unless (eq isearch-word 'isearch-decomposition-regexp)
		       'isearch-decomposition-regexp))
  (if isearch-word (setq isearch-regexp nil))
  (setq isearch-success t isearch-adjusted t)
  (isearch-update))

(defun isearch-decomposition-regexp (string &optional _lax)
  "Return a regexp that matches decomposed Unicode characters in STRING."
  (mapconcat
   (lambda (c0)
     (if (eq (get-char-code-property c0 'general-category) 'Mn)
         ;; Mark-Nonspacing chars like COMBINING ACUTE ACCENT are optional.
         (concat (string c0) "?")
       (let ((c1 c0) c2 chars)
         (while (and (setq c2 (aref (char-table-extra-slot
                                     decomposition-table 2) c1))
                     (not (eq c2 c0)))
           (push c2 chars)
           (setq c1 c2))
         (if chars
             ;; Character alternatives from the case equivalences table.
             (concat "[" (string c0) chars "]")
           (string c0)))))
   string ""))

(put 'isearch-decomposition-regexp 'isearch-message-prefix "deco ")

This uses the decomposition table created above but instead of activating it,
it's necessary to "shuffle" the equivalences table with the following code
that prepares the table but doesn't enable it in the current buffer:

  (with-temp-buffer (set-case-table decomposition-table))

The advantage of the regexp-based approach is making combining accents
optional in the search string.  But there is another problem: how to ignore
combining accents in the buffer when the search string doesn't contain them.
With regexps this means adding a group of all possible combining accents
after every character in the search string like turning a search string
like "abc" into "a[́̂̃̄̆]?b[́̂̃̄̆]?c[́̂̃̄̆]?".
This would make the search slow, and I have no better idea.
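
The transformation mentioned in the last paragraph could be sketched
as follows.  The function name `my-accent-tolerant-regexp' is
hypothetical, and two simplifications are assumptions of this sketch:
the character class covers only the Combining Diacritical Marks block
(U+0300..U+036F) rather than every Mn character, and it uses "*"
instead of "?" so that several stacked marks are also skipped.

```elisp
;; Hypothetical helper; not an Emacs API.
(defun my-accent-tolerant-regexp (string)
  "Return a regexp matching STRING with optional combining marks.
After every character of STRING, allow any run of characters from
the Combining Diacritical Marks block (U+0300..U+036F)."
  (mapconcat
   (lambda (c)
     ;; Quote the character itself, then allow trailing marks.
     (concat (regexp-quote (string c)) "[\u0300-\u036f]*"))
   string ""))
```

For example, (string-match-p (my-accent-tolerant-regexp "abc") TEXT)
should succeed both for plain "abc" and for "abc" with a combining
acute accent after the "a".  The sketch says nothing about speed,
which is the concern raised above.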

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 02 Dec 2012 17:49:01 GMT) Full text and rfc822 format available.

Message #44 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 18:45:38 +0100

> But nevertheless the `decomposition' property should be used to find
> all decomposable characters.  The question is how to use them in the search.

Whatever solution you find most suitable here, it would be nice to come
up with a similar solution for sorting.  I've been playing around with a
function like

(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
	 (length2 (length string2))
	 (min-length (min length1 length2))
	 (index 0)
	 type1 type2)
    (catch 'found
      (while (< index min-length)
	(setq type1 (car (get-char-code-property
			  (elt string1 index) 'decomposition)))
	(setq type2 (car (get-char-code-property
			  (elt string2 index) 'decomposition)))
	(cond
	 ((< type1 type2)
	  (throw 'found t))
	 ((> type1 type2)
	  (throw 'found nil)))
	;; Continue.
	(setq index (1+ index)))
      ;; Shorter is less.
      (< length1 length2))))

but am not sure whether I'm missing something wrt the return value of
`get-char-code-property'.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 02 Dec 2012 18:06:01 GMT) Full text and rfc822 format available.

Message #47 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 20:02:59 +0200

> Date: Sun, 02 Dec 2012 18:45:38 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com, 13041 <at> debbugs.gnu.org, 
>  perin <at> acm.org
> 
> 	(setq type1 (car (get-char-code-property
> 			  (elt string1 index) 'decomposition)))
> 	(setq type2 (car (get-char-code-property
> 			  (elt string2 index) 'decomposition)))
> 	(cond
> 	 ((< type1 type2)
> 	  (throw 'found t))
> 	 ((> type1 type2)
> 	  (throw 'found nil)))
> 	;; Continue.
> 	(setq index (1+ index)))
>        ;; Shorter is less.
>        (< length1 length2))))
> 
> but am not sure whether I'm missing something wrt the return value of
> `get-char-code-property'.

Maybe only the fact that it can return a list whose car is 'compat',
see the examples I posted.
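
One way to hide that difference from callers is a small accessor that
strips the tag, in the same spirit as the `(unless (characterp (car d))
(pop d))' test used earlier in this thread.  This is only a sketch;
the name `my-decomposition-chars' is hypothetical.

```elisp
;; Hypothetical helper; not an Emacs API.
(defun my-decomposition-chars (char)
  "Return CHAR's decomposition as a list of characters, minus any tag.
A leading symbol such as `compat' is dropped."
  (let ((d (get-char-code-property char 'decomposition)))
    (if (characterp (car d))
        d
      (cdr d))))
```

With the values quoted earlier, this should return (102 102) for the
ff ligature and (101 768) for e with grave accent, so a comparison
function can always treat the car of the result as a character.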

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 02 Dec 2012 18:20:01 GMT) Full text and rfc822 format available.

Message #50 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 20:16:17 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Cc: perin <at> panix.com,  13041 <at> debbugs.gnu.org,  perin <at> acm.org
> Date: Sun, 02 Dec 2012 02:27:32 +0200
> 
> I'm surprised to see case mappings hard-coded in
> lisp/international/characters.el instead of using the properties
> `uppercase' and `lowercase' during creation of case tables.

My guess is that this is because the code in characters.el was written
long before we had access to Unicode character properties in Emacs,
and in fact before Emacs was switched to character representation
based on Unicode codepoints.  And no one bothered to rewrite that code
since then; volunteers are welcome.

> (defvar decomposition-table nil)
> 
> (defun make-decomposition-table ()
>   (let ((table (standard-case-table))
>         canon)
>     (setq canon (copy-sequence table))
>     (let ((c #x0000) d)
>       (while (<= c #xFFFD)
>         (make-decomposition-table-1 canon c c)
>         (setq c (1+ c))))
>     (set-char-table-extra-slot table 1 canon)
>     (set-char-table-extra-slot table 2 nil)
>     (setq decomposition-table table)))
> 
> (defun make-decomposition-table-1 (canon c0 c1)
>   (let ((d (get-char-code-property c1 'decomposition)))
>     (when d
>       (unless (characterp (car d)) (pop d))
>       (if (eq c1 (car d))
>           (aset canon c0 (car d))
>         (make-decomposition-table-1 canon c0 (car d))))))
> 
> (make-decomposition-table)
> 
> Then a new Isearch command (the existing `isearch-toggle-case-fold'
> can't be used because it enables/disables the standard case table)
> could toggle between the current case table and the decomposition
> case table using
> 
>   (set-case-table decomposition-table)
> 
> After evaluating this, Isearch correctly finds all related characters
> in every row of this example:
> 
>   http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
> 
> But it seems using the case table for decomposition has one limitation.
> I see no way to ignore combining accent characters in the case table,
> i.e. to map combining accent characters to nothing.  These characters
> have the general-category "Mn (Mark, Nonspacing)", so they should be ignored
> in the search.

IMO, using case tables for this is evil.  If I want to "fold"
diacritics in search, that doesn't necessarily mean I want to fold the
letter-case as well.  I might want doing that, or I might not; these
are two orthogonal features.

So we need a separate kind of char-table, one that could be installed
in addition to the case table, and one that will interpret nil as
an indication to ignore the character during search.  Then we will be
able to ignore combining accents, as we indeed should.  We also need
to modify the searching primitives to consult this new table, in
addition to case table.

IOW, I don't think we can implement this feature entirely in Lisp.
Some changes are needed on the C level as well.
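
To illustrate what such a separate table might look like, here is a
Lisp-only sketch.  Everything in it is hypothetical: `my-fold-table',
`my-fold-with-table', and the `drop' marker are not Emacs APIs, and
the search primitives do not consult any such table.  The sketch also
departs from the proposal in one respect: it marks ignorable
characters with the symbol `drop' instead of nil, because a freshly
made char-table already returns nil for every character.

```elisp
;; Hypothetical table: CHAR -> folded character, or the symbol `drop'
;; for characters that should be ignored.  Unmapped characters are nil
;; and are kept unchanged.
(defvar my-fold-table (make-char-table 'my-fold-table))

;; Two sample entries: e with grave accent folds to plain e, and the
;; Combining Diacritical Marks block is ignored.
(aset my-fold-table #xE8 ?e)
(set-char-table-range my-fold-table '(#x300 . #x36F) 'drop)

(defun my-fold-with-table (string)
  "Return STRING folded through `my-fold-table'."
  (let (out)
    (dolist (c (string-to-list string))
      (let ((m (aref my-fold-table c)))
        (cond ((eq m 'drop))            ; ignore this character
              (m (push m out))          ; mapped to another character
              (t (push c out)))))       ; not in the table: keep as is
    (concat (nreverse out))))
```

Because the table is independent of the case table, letter-case is
left alone: "E" stays "E" while both the precomposed and the
decomposed form of e with grave accent fold to "e".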

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 02 Dec 2012 22:11:02 GMT) Full text and rfc822 format available.

Message #53 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 23:31:20 +0200

> IMO, using case tables for this is evil.  If I want to "fold"
> diacritics in search, that doesn't necessarily mean I want to fold the
> letter-case as well.  I might want doing that, or I might not; these
> are two orthogonal features.

`decomposition-table' is a separate char-table that has the
subtype `case-table'.  It should not conflict with the standard
case table, so using `isearch-toggle-case-fold' should still
toggle the usage of the standard case table.

To toggle folding in the diacritics search perhaps requires
having two decomposition tables: one where upper and lower case
letters belong to one equivalence set, and another where
they are in different sets, so `isearch-toggle-decomposition'
could toggle between them.

Or should the standard case table and the decomposition table
be combined some other way?  Maybe like the existing variable
`case-fold-search' to add a new variable `decomposition-search'
to enable/disable diacritics in search.

> So we need a separate kind of char-table, one that could be installed
> in addition to the case table, and one that will interpret nil as
> an indication to ignore the character during search.

I believe this kind of char-table should be based on the existing
subtype `case-table' because it provides the features necessary for
decomposition search such as extra table EQUIVALENCES (that permutes
each equivalence class) and the extra table CANONICALIZE (where
the canonical character is the final character in the recursion
that traverses the `decomposition' property).

> Then we will be able to ignore combining accents, as we indeed should.
> We also need to modify the searching primitives to consult this new
> table, in addition to case table.

Yes, it seems the feature of ignoring combining accents (i.e. mapping
some characters to nil) can't be added to existing case tables
because for the case table this would mean that converting a string
to upper case might delete some characters (like combining accents)
and converting a string to lower case might add combining accents
to the string that of course makes no sense.

> IOW, I don't think we can implement this feature entirely in Lisp.
> Some changes are needed on the C level as well.

A hack that abuses the standard case table is already possible
in Lisp.  A complete implementation requires changes on the C level.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 02 Dec 2012 22:11:02 GMT) Full text and rfc822 format available.

Message #56 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: perin <at> acm.org, Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 02 Dec 2012 23:39:17 +0200

> Whatever solution you find most suitable here, it would be nice to come
> up with a similar solution for sorting.  I've been playing around with a
> function like

Did you try to build the case table with the diacritics mappings?  It should
affect the sorting as well without requiring any changes in sorting functions.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 03 Dec 2012 10:19:02 GMT) Full text and rfc822 format available.

Message #59 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 03 Dec 2012 11:16:21 +0100

> Maybe only the fact that it can return a list whose car is 'compat',
> see the examples I posted.

So I need two indices for looping.  But what are the guidelines to
interpret `compat'?  Does every list starting with a `compat' mean that
the remaining entries of that list represent the constituents of that
composite?

And how do I now call `put-char-code-property' to make the German sharp
"s" ("ß") equivalent to "ss"?  Or am I not supposed to do such a thing?

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 03 Dec 2012 10:20:03 GMT) Full text and rfc822 format available.

Message #62 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 03 Dec 2012 11:16:42 +0100

> Did you try to build the case table with the diacritics mappings?  It should
> affect the sorting as well without requiring any changes in sorting functions.

I tried but it didn't work out.  I have to understand your code first
before I can tell what happens.  In any case, doing your

(set-case-table decomposition-table)

permanently for a buffer crashed Emacs here.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 03 Dec 2012 16:51:02 GMT) Full text and rfc822 format available.

Message #65 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 03 Dec 2012 18:47:30 +0200

> Date: Mon, 03 Dec 2012 11:16:21 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: juri <at> jurta.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org, 
>  perin <at> acm.org
> 
> But what are the guidelines to interpret `compat'?

For the purposes of comparing strings, both 'compatibility' and
'canonical' decompositions should be treated the same, AFAIU.  You can
find the details here:

    http://unicode.org/reports/tr15/

>  Does every list starting with a `compat' mean that the remaining
> entries of that list represent the constituents of that composite?

Yes.  This comes directly from UnicodeData.txt, e.g.:

  0132;LATIN CAPITAL LIGATURE IJ;Lu;0;L;<compat> 0049 004A;;;;N;LATIN CAPITAL LETTER I J;;;0133;
                                        ^^^^^^^^^^^^^^^^^^

> And how do I now call `put-char-code-property' to make the German sharp
> "s" ("ß") equivalent to "ss"?  Or am I not supposed to do such a thing?

That's already set up in the appropriate case table, I think.  But it
is not a compatibility decomposition, AFAIK.
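
To illustrate how the marked field of that UnicodeData.txt line relates
to a list whose car is `compat', here is a small sketch.  The function
name `my-parse-decomposition-field' is hypothetical, and this is not
the code Emacs itself uses to build its tables; it only mirrors the
shape of the value discussed above.

```elisp
(defun my-parse-decomposition-field (field)
  "Parse a UnicodeData.txt decomposition FIELD into a list.
A token like \"<compat>\" becomes the symbol `compat'; every other
token is read as a hexadecimal code point."
  (mapcar (lambda (token)
            (if (string-match "\\`<\\(.+\\)>\\'" token)
                (intern (match-string 1 token))
              (string-to-number token 16)))
          (split-string field " " t)))

(defun my-parse-unicode-data-line (line)
  "Return the parsed decomposition (sixth field) of LINE."
  ;; With explicit separators `split-string' keeps empty fields,
  ;; so the field positions stay stable.
  (my-parse-decomposition-field (nth 5 (split-string line ";"))))
```

Applied to the line quoted above, the sixth field yields the tag
followed by the code points of "I" and "J".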

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 03 Dec 2012 17:46:01 GMT) Full text and rfc822 format available.

Message #68 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 03 Dec 2012 18:42:53 +0100

>> And how do I now call `put-char-code-property' to make the German sharp
>> "s" ("ß") equivalent to "ss"?  Or am I not supposed to do such a thing?
>
> That's already set up in the appropriate case table, I think.

Why in a case table?  Both "ß" and "ss" are lower case.

> But it
> is not a compatibility decomposition, AFAIK.

But I can make it one?

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 03 Dec 2012 18:03:02 GMT) Full text and rfc822 format available.

Message #71 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 03 Dec 2012 19:59:28 +0200

> Date: Mon, 03 Dec 2012 18:42:53 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: juri <at> jurta.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org, 
>  perin <at> acm.org
> 
>  >> And how do I now call `put-char-code-property' to make the German sharp
>  >> "s" ("ß") equivalent to "ss"?  Or am I not supposed to do such a thing?
>  >
>  > That's already set up in the appropriate case table, I think.
> 
> Why in a case table?  Both "ß" and "ss" are lower case.

I meant the relation  "ß" => "SS".

>  > But it
>  > is not a compatibility decomposition, AFAIK.
> 
> But I can make it one?

Yes, you can modify the table set up by uni-decomposition.el.  I
think.
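
A minimal sketch of such a modification, assuming that the
`decomposition' table accepts new values through
`put-char-code-property' and that the change is meant to last for the
whole session.  The helper name is hypothetical.

```elisp
(defun my-make-sharp-s-decompose ()
  "Give U+00DF (German sharp s) a `compat' decomposition into \"ss\".
Assumption: the `decomposition' table can be modified with
`put-char-code-property'.  The change is global for the session."
  (put-char-code-property ?\u00df 'decomposition (list 'compat ?s ?s))
  ;; Return the value that is now stored, for inspection.
  (get-char-code-property ?\u00df 'decomposition))
```

Any code that reads the property afterwards, such as a comparison
function, would then see the new value instead of the default.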

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 04 Dec 2012 00:21:02 GMT) Full text and rfc822 format available.

Message #74 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: perin <at> acm.org, Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 04 Dec 2012 02:17:04 +0200

> In any case, doing your
>
> (set-case-table decomposition-table)
>
> permanently for a buffer crashed Emacs here.

With more use I see crashes too.  The backtrace says that crashes are in
boyer_moore.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 04 Dec 2012 03:44:02 GMT) Full text and rfc822 format available.

Message #77 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: rudalics <at> gmx.at, 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 04 Dec 2012 05:41:04 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Cc: Eli Zaretskii <eliz <at> gnu.org>,  perin <at> panix.com,  13041 <at> debbugs.gnu.org,  perin <at> acm.org
> Date: Tue, 04 Dec 2012 02:17:04 +0200
> 
> > In any case, doing your
> >
> > (set-case-table decomposition-table)
> >
> > permanently for a buffer crashed Emacs here.
> 
> With more use I see crashes too.  The backtrace says that crashes are in
> boyer_moore.

Please file a bug report with a minimal reproducible recipe.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 04 Dec 2012 17:58:01 GMT) Full text and rfc822 format available.

Message #80 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 04 Dec 2012 18:54:59 +0100

> Yes, you can modify the table set up by uni-decomposition.el.  I
> think.

Seems to work well.  The function I came up with goes as below.

Thanks for the hints, martin


(defun decomposed-string-lessp (string1 string2)
  "Return t if STRING1 is decomposition-less than STRING2."
  (let* ((length1 (length string1))
	 (length2 (length string2))
	 (min-length (min length1 length2))
	 (index1 0)
	 (index2 0)
	 prop1 prop2 type1 type2 compat1 compat2)
    (catch 'found
      (while (and (< index1 length1) (< index2 length2))
	(setq prop1 (get-char-code-property
		     (downcase (elt string1 index1)) 'decomposition))
	(setq type1 (car prop1))
	(setq prop2 (get-char-code-property
		     (downcase (elt string2 index2)) 'decomposition))
	(setq type2 (car prop2))
	(cond
	 ((and (eq type1 'compat) (eq type2 'compat))
	  (setq compat1 (concat (cdr prop1)))
	  (setq compat2 (concat (cdr prop2)))
	  (let ((value (compare-strings compat1 0 nil compat2 0 nil t)))
	    (cond
	     ((eq value t)
	      (setq index1 (1+ index1))
	      (setq index2 (1+ index2)))
	     ((< value 0)
	      (throw 'found t))
	     ((< value 0)
	      (throw 'found nil)))))
	 ((eq type1 'compat)
	  (setq compat1 (concat (cdr prop1)))
	  (let ((value
		 (compare-strings
		  compat1 0 nil
		  string2 index2 (min (+ index2 (length compat1)) length2) t)))
	    (cond
	     ((eq value t)
	      (setq index1 (1+ index1))
	      (setq index2 (+ index2 (length compat1))))
	     ((< value 0)
	      (throw 'found t))
	     ((< value 0)
	      (throw 'found nil)))))
	 ((eq type2 'compat)
	  (setq compat2 (concat (cdr prop2)))
	  (let ((value
		 (compare-strings
		  string1 index1 (min (+ index1 (length compat2)) length1)
		  compat2 0 nil t)))
	    (cond
	     ((eq value t)
	      (setq index1 (+ index1 (length compat2)))
	      (setq index2 (1+ index2)))
	     ((< value 0)
	      (throw 'found t))
	     ((< value 0)
	      (throw 'found nil)))))
	 ((< type1 type2)
	  (throw 'found t))
	 ((> type1 type2)
	  (throw 'found nil))
	 (t
	  (setq index1 (1+ index1))
	  (setq index2 (1+ index2)))))
      ;; Shorter is less.
      (< length1 length2))))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 04 Dec 2012 19:29:01 GMT) Full text and rfc822 format available.

Message #83 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 04 Dec 2012 21:28:32 +0200

> Date: Tue, 04 Dec 2012 18:54:59 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: juri <at> jurta.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org, 
>  perin <at> acm.org
> 
>  > Yes, you can modify the table set up by uni-decomposition.el.  I
>  > think.
> 
> Seems to work well.  The function I came up with goes as below.

How about putting it in subr.el?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 04 Dec 2012 20:14:02 GMT) Full text and rfc822 format available.

Message #86 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>, "'Eli Zaretskii'" <eliz <at> gnu.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 4 Dec 2012 12:12:57 -0800

> The function [Martin] came up with goes as below.
> (defun decomposed-string-lessp (string1 string2)
>    "Return t if STRING1 is decomposition-less than STRING2."
> ...

I know nothing about character composition and have not tested this with
anything but a few western accents.  But this seems like good stuff.


1. Assuming this or similar is added to Emacs (please do), please consider
modifying it to respect `case-fold-search'.  These modified lines do that.

(setq prop1 (get-char-code-property
              (if case-fold-search
                  (downcase (elt string1 index1))
                (elt string1 index1))
              'decomposition))

[Same thing for prop2 with string2 and index2.]

(let ((value (compare-strings compat1 0 nil
                              compat2 0 nil case-fold-search)))


2. In addition, consider updating `string-lessp' to be sensitive to a variable
such as this:

(defvar ignore-diacritics nil
  "Non-nil means ignore diacritics for string comparisons.")

With that, an alternative to hard-coding a call to `decomposed-string-lessp' is
to bind `ignore-diacritics' and use `string-lessp'.

A similar change could be made for `compare-strings': reflect the value of
`ignore-diacritics'.  Or since that function has made the choice to pass
case-sensitivity as a parameter instead of respecting `case-fold-search', pass
another parameter for diacritic sensitivity.


3. More general than #2 would be a function like this, which is sensitive to
both `ignore-diacritics' and `case-fold-search' (this assumes the change
suggested above in #1 for `decomposed-string-lessp').

(defun my-string-lessp (s1 s2)
  "..."
  (if ignore-diacritics
      (decomposed-string-lessp s1 s2)
    (when case-fold-search (setq s1  (upcase s1)
                                 s2  (upcase s2)))
    (string-lessp s1 s2)))

Dunno a good name for this.  It's too late to let `string-lessp' itself act like
this - that would break stuff.


4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and
`decomposed-string-lessp' would be to have those functions be sensitive to a
variable such as this:

(defvar string-case-variable 'case-fold-search
  "Value is a case-sensitivity variable such as `case-fold-search'.
The values of that variable must be like those for `case-fold-search':
nil means case-sensitive, non-nil means case-insensitive.")

Code could then bind `string-case-variable' to, say, `(not
completion-ignore-case)' or to any other case-sensitivity controlling sexp, when
appropriate.

This would have the advantages offered by passing an explicit case-sensitivity
parameter, as in `compare-strings', but also the advantages of dynamic scope:
binding `string-case-variable' to affect all comparisons within scope.

Comparers such as `(my-)string-lessp' are often used as arguments to
higher-order functions that treat them as (only) binary predicates, i.e.,
predicates where any additional parameters specifying case or diacritic
sensitivity are ignored.
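
The dynamic-binding idea can be sketched like this.  It is only an
illustration under stated assumptions: all `my-' names are
hypothetical, diacritics are removed by dropping characters whose
general category is Mn after canonical decomposition, and tagged
(compatibility) decompositions are left alone.

```elisp
(defvar my-ignore-diacritics nil
  "Non-nil means `my-string-lessp' ignores diacritics.")

(defun my-decompose-char (char)
  "Return the list of characters in CHAR's full canonical decomposition."
  (let ((decomp (get-char-code-property char 'decomposition)))
    (if (or (symbolp (car decomp))        ; tagged or empty: keep CHAR
            (equal decomp (list char)))   ; no decomposition
        (list char)
      (apply #'append (mapcar #'my-decompose-char decomp)))))

(defun my-strip-marks (string)
  "Return STRING decomposed, with nonspacing marks (Mn) removed."
  (let (result)
    (dolist (c (string-to-list string))
      (dolist (d (my-decompose-char c))
        (unless (eq (get-char-code-property d 'general-category) 'Mn)
          (push d result))))
    (concat (nreverse result))))

(defun my-string-lessp (s1 s2)
  "Binary predicate that respects `my-ignore-diacritics' and `case-fold-search'."
  (when my-ignore-diacritics
    (setq s1 (my-strip-marks s1)
          s2 (my-strip-marks s2)))
  ;; `compare-strings' returns t for equal strings, else a number.
  (let ((value (compare-strings s1 0 nil s2 0 nil case-fold-search)))
    (and (integerp value) (< value 0))))
```

Because the predicate takes exactly two arguments, it can be passed to
a higher-order function unchanged, with the sensitivity chosen by the
caller:

```elisp
(let ((my-ignore-diacritics t)
      (case-fold-search t))
  (sort (list "été" "eta" "Etc") #'my-string-lessp))
```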

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 04 Dec 2012 23:17:02 GMT) Full text and rfc822 format available.

Message #89 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>, "'Eli Zaretskii'" <eliz <at> gnu.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 4 Dec 2012 15:15:56 -0800

BTW, there are a couple of minor things to check wrt the code you sent, Martin:

* `min-length' is not used.

* The `cond's all repeat condition (< value 0) twice, with different actions.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 06:51:02 GMT) Full text and rfc822 format available.

Message #92 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>, "'Eli Zaretskii'" <eliz <at> gnu.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 4 Dec 2012 22:50:30 -0800

This version of Martin's function (but respecting `case-fold-search') is maybe a
tiny bit simpler.  It could also be a bit slower because of `substring'
returning a copy (vs just incrementing an offset).  It should also be checked
for correctness - not really tested.  FWIW/HTH.

(It does correct the two double `(< value 0)' typos I mentioned earlier.
That should be done in any case.)

(defun decomposed-string-lessp (string1 string2)
  "Return non-nil if decomposed STRING1 is less than decomposed STRING2.
Comparison respects `case-fold-search'."
  (let ((s1  string1)
        (s2  string2)
        prop1  prop2  type1  type2)
    (catch 'found
      (while (and (> (length s1) 0)  (> (length s2) 0))
        (setq prop1  (get-char-code-property (if case-fold-search
                                                 (downcase (elt s1 0))
                                               (elt s1 0))
                                             'decomposition)
              prop2  (get-char-code-property (if case-fold-search
                                                 (downcase (elt s2 0))
                                               (elt s2 0))
                                             'decomposition)
              type1  (car prop1)
              type2  (car prop2))
        (when (eq type1 'compat) (setq s1  (concat (cdr prop1))))
        (when (eq type2 'compat) (setq s2  (concat (cdr prop2))))
        (cond ((eq type1 'compat)
               (let ((cs  (compare-strings
                           s1 0 nil
                           s2 0 (and (not (eq type2 'compat))
                                     (min (length s1) (length s2)))
                           case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((eq type2 'compat)
               (let ((cs  (compare-strings
                           s1 0 (min (length s2) (length s1))
                           s2 0 nil
                           case-fold-search)))
                 (unless (eq cs t) (throw 'found (< cs 0)))))
              ((= type1 type2)
               (setq s1  (substring s1 1)
                     s2  (substring s2 1)))
              (t (throw 'found (< type1 type2)))))
      (< (length string1) (length string2)))))

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 09:42:01 GMT) Full text and rfc822 format available.

Message #95 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 10:41:40 +0100

> How about putting it in subr.el?

If I correctly understand Juri, I next have to deal with things like

(get-char-code-property #xff59 'decomposition)

and related issues we might unearth in the course of this.

Also, sorting is currently stable in the sense that text differing
only in diacritics keeps its original order, which is not nice when
sorting larger pieces of text.  So I'd rather use the second list
element returned by `get-char-code-property' to make sure that, for
example, "e" always gets sorted before "è" before "é".

martin
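
A rough sketch of that tie-breaking idea at the level of single
characters, assuming the first element of the (untagged) decomposition
is the base character and the second, if present, is the combining
mark.  The `my-' names are hypothetical and only one mark is
considered.

```elisp
(defun my-char-sort-key (char)
  "Return (BASE . MARK) for CHAR, where MARK is 0 if there is none."
  (let ((decomp (get-char-code-property char 'decomposition)))
    ;; Drop a leading tag symbol such as `compat'.
    (when (symbolp (car decomp))
      (setq decomp (cdr decomp)))
    (cons (or (car decomp) char)
          (or (cadr decomp) 0))))

(defun my-char-lessp (char1 char2)
  "Compare CHAR1 and CHAR2 by base character, then by combining mark."
  (let ((key1 (my-char-sort-key char1))
        (key2 (my-char-sort-key char2)))
    (if (/= (car key1) (car key2))
        (< (car key1) (car key2))
      ;; Same base: the unadorned character (mark 0) comes first.
      (< (cdr key1) (cdr key2)))))
```

With this ordering "e" sorts before "è" (mark U+0300), which sorts
before "é" (mark U+0301), and all three sort before "f".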

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 09:43:01 GMT) Full text and rfc822 format available.

Message #98 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: perin <at> acm.org, 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 10:42:26 +0100

> 1. Assuming this or similar is added to Emacs (please do), please consider
> modifying it to respect `case-fold-search'.  These modified lines do that.
>
> (setq prop1 (get-char-code-property
>               (if case-fold-search
>                   (downcase (elt string1 index1))
>                 (elt string1 index1))
>               'decomposition))
>
> [Same thing for prop2 with string2 and index2.]

This would have to be done, yes.

> (let ((value (compare-strings compat1 0 nil
>                               compat2 0 nil case-fold-search)))
>
>
> 2. In addition, consider updating `string-lessp' to be sensitive to a variable
> such as this:
>
> (defvar ignore-diacritics nil
>   "Non-nil means ignore diacritics for string comparisons.")
>
> With that, an alternative to hard-coding a call to `decomposed-string-lessp' is
> to bind `ignore-diacritics' and use `string-lessp'.

`ignore-diacritics' is misleading.  The variable would have to be called
`observe-decompositions' or something like that.

> A similar change could be made for `compare-strings': reflect the value of
> `ignore-diacritics'.  Or since that function has made the choice to pass
> case-sensitivity as a parameter instead of respecting `case-fold-search', pass
> another parameter for diacritic sensitivity.

Indeed, `string-lessp' is too weak - we'd need a function to tell
whether two strings are equal disregarding "certain" decomposition
properties.

> 3. More general than #2 would be a function like this, which is sensitive to
> both `ignore-diacritics' and `case-fold-search' (this assumes the change
> suggested above in #1 for `decomposed-string-lessp').
>
> (defun my-string-lessp (s1 s2)
>   "..."
>   (if ignore-diacritics
>       (decomposed-string-lessp s1 s2)
>     (when case-fold-search (setq s1  (upcase s1)
>                                  s2  (upcase s2)))
>     (string-lessp s1 s2)))
>
> Dunno a good name for this.  It's too late to let `string-lessp' itself act like
> this - that would break stuff.

`string-lessp' is in C.  I wouldn't touch it anyway.

> 4. Even better than hard-coding `case-fold-search' in `my-string-lessp' and
> `decomposed-string-lessp' would be to have those functions be sensitive to a
> variable such as this:
>
> (defvar string-case-variable 'case-fold-search
>   "Value is a case-sensitivity variable such as `case-fold-search'.
> The values of that variable must be like those for `case-fold-search':
> nil means case-sensitive, non-nil means case-insensitive.")
>
> Code could then bind `string-case-variable' to, say, `(not
> completion-ignore-case)' or to any other case-sensitivity controlling sexp, when
> appropriate.
>
> This would have the advantages offered by passing an explicit case-sensitivity
> parameter, as in `compare-strings', but also the advantages of dynamic scope:
> binding `string-case-variable' to affect all comparisons within scope.
>
> Comparers such as `(my-)string-lessp' are often used as arguments to
> higher-order functions that treat them as (only) binary predicates, i.e.,
> predicates where any additional parameters specifying case or diacritic
> sensitivity are ignored.

I first have to solve the problems with the values returned by
`get-char-code-property'.  Then I will look into this.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 09:43:02 GMT) Full text and rfc822 format available.

Message #101 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: perin <at> acm.org, 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 10:42:41 +0100

> BTW, there are a couple of minor things to check wrt the code you sent, Martin:
> 
> * `min-length' is not used.

Leftover from a previous version.

> * The `cond's all repeat condition (< value 0) twice, with different actions.

These are clearly silly, yes.  Funnily, they don't affect the result since
they are never taken and the return value is nil as intended.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 09:44:01 GMT) Full text and rfc822 format available.

Message #104 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: perin <at> acm.org, 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 10:42:54 +0100

> This version of Martin's function (but respecting `case-fold-search') is maybe a
> tiny bit simpler.  It could also be a bit slower because of `substring'
> returning a copy (vs just incrementing an offset).  It should also be checked
> for correctness - not really tested.  FWIW/HTH.

The most important application I see for this is within `sort-subr'
where I want to compare buffer substrings in situ by passing their
boundaries.  Hence I plan to provide a version working in terms of
buffer positions.  For simple string checking your version might be
preferable.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 15:39:02 GMT) Full text and rfc822 format available.

Message #107 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 07:38:10 -0800

> `ignore-diacritics' is misleading.  The variable would have 
> to be called `observe-decompositions' or something like that.


1. "Observe decompositions" doesn't mean anything to me.  The verb should
probably be more active - what does it mean to observe the char decompositions
here?

BTW, if we use "decomposition" in the name and description then we should
probably also use "char" - this is not about decomposing strings in some way
(whatever that might mean); it involves decomposing Unicode characters.


2. But my confusion over the name/description is in fact wrt function
`decomposed-string-lessp': I guess it's not 100% clear to me what it does.

Your doc string said "STRING1 is decomposition-less than STRING2", which
confuses me.  And it is a bit ambiguous wrt "-less":

 a. decomposition-less as in comparing the strings only after
    removing (some parts of) their decompositions (i.e., "-less"
    as in "sans")?

or

 b. -lessp as in `string<': a comparison ordering relation?

In the version of `decomposed-string-lessp' that I sent, I changed the doc
string to this: "decomposed STRING1 is less than decomposed STRING2".  But that
is no doubt incorrect (less correct than yours, if perhaps clearer).  In
particular, it says nothing about how we compare the two decompositions.

In practical (use) terms, this is typically about ignoring diacritics, keeping
only the "base" characters.  Something about that should at least be mentioned
in the doc, so that users know they can use this for that.

But IIUC this is not just about diacritics; it sometimes might not be about
diacritics at all; and diacritics present are sometimes not ignored.  E.g., the
ligature ffi gets treated the same as the 3 chars f f i.  There are no
diacritics present in that case.

IIUC, we convert the two strings to their Unicode decompositions and then use
the Unicode char compatibility specs to compare the decompositions.  IOW, we
treat equivalent chars, as defined by Unicode, as the same.

Perhaps the name/description should speak in terms of Unicode char compatibility
or equivalence.  Perhaps a name like `string-less-compat-p'?  Or
`Unicode-equivalent-p'?  Or `string-equivalent-p'?

How would you characterize what the function does?  No doubt Eli can help here.
It is important to try to get the function name and description right from the
outset, if we can.  If the Unicode standard has some terminology that applies
here then perhaps we can/should leverage that.

Beyond the name and an accurate description, the doc should, as I say, at least
mention that you can use this to ignore diacritics (such as accents), as that
will be a common use case.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 15:39:03 GMT) Full text and rfc822 format available.

Message #110 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 07:38:20 -0800

> The most important application I see for this is within `sort-subr'
> where I want to compare buffer substrings in situ by passing their
> boundaries.  Hence I plan to provide a version working in terms of
> buffer positions.  For simple string checking your version might be
> preferable.

Please do whatever is right - using positions as you intended.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 15:52:01 GMT) Full text and rfc822 format available.

Message #113 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Lewis Perin <perin <at> panix.com>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 'martin rudalics' <rudalics <at> gmx.at>, 'Eli Zaretskii' <eliz <at> gnu.org>,
	13041 <at> debbugs.gnu.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 10:51:12 -0500

Drew Adams writes:
> > `ignore-diacritics' is misleading.  The variable would have 
> > to be called `observe-decompositions' or something like that.
> 
> 
> 1. "Observe decompositions" doesn't mean anything to me.  The verb
> should probably be more active - what does it mean to observe the
> char decompositions here?

What about “heed”?

/Lew
---
Lew Perin | perin <at> acm.org | http://babelcarp.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 16:22:02 GMT) Full text and rfc822 format available.

Message #116 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: <perin <at> acm.org>
Cc: 'martin rudalics' <rudalics <at> gmx.at>, 'Eli Zaretskii' <eliz <at> gnu.org>,
	13041 <at> debbugs.gnu.org
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 08:20:39 -0800

> > > `ignore-diacritics' is misleading.  The variable would have 
> > > to be called `observe-decompositions' or something like that.
> > 
> > 1. "Observe decompositions" doesn't mean anything to me.  The verb
> > should probably be more active - what does it mean to observe the
> > char decompositions here?
> 
> What about "heed"?

"Respect" is a more common term with that meaning.

But the point (to me) is that we are not conveying much by that - too vague.
"Heed" meaning what?  Heed how?

Those are terms, like "treat", "handle" and "process" (verb), that are generally
signs, in computer science as elsewhere, of insufficient understanding or
laziness in communication.  They say essentially, "it does something".

Sometimes (not here though) such words can even be signals that the function in
question is a congeries of things that do not necessarily belong together.

We should be able to do better here.  If I understood better what the function
does I might be able to offer better name suggestions.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 16:38:01 GMT) Full text and rfc822 format available.

Message #119 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 18:37:09 +0200

> Date: Wed, 05 Dec 2012 10:41:40 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: juri <at> jurta.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org, 
>  perin <at> acm.org
> 
>  > How about putting it in subr.el?
> 
> If I correctly understand Juri, I next have to deal with things like
> 
> (get-char-code-property #xff59 'decomposition)
> 
> and related issues we might unearth in the course of this.

My reading of the table in

  http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings

is that you should ignore any car of the list returned by
get-char-code-property if it does not pass the characterp test (or
those that do pass the symbolp test).  That is, the character #xff59
should sort exactly like lower-case y.
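
As a minimal sketch of the rule described above, the following helper drops a leading formatting tag such as `wide' or `compat' from the value of the `decomposition' property.  The name `my-char-decomposition-chars' is hypothetical and is not part of Emacs; only `get-char-code-property' and `characterp' are existing functions.

```elisp
;; Sketch only: `my-char-decomposition-chars' is a hypothetical helper,
;; not an existing Emacs function.
(defun my-char-decomposition-chars (char)
  "Return the decomposition of CHAR as a list of characters.
A leading formatting tag such as `wide' or `compat' is dropped."
  (let ((decomposition (get-char-code-property char 'decomposition)))
    ;; The tag, when present, is a symbol in the car of the list,
    ;; so it fails the `characterp' test.
    (if (characterp (car decomposition))
        decomposition
      (cdr decomposition))))

;; U+FF59 (fullwidth y) carries the `wide' tag; only the character
;; part of its decomposition is returned.
(my-char-decomposition-chars #xff59)

;; U+00E9 (e with acute) has a canonical decomposition and no tag.
(my-char-decomposition-chars #xe9)
```

With the tag removed, U+FF59 and a plain lower-case y yield the same list, which is what lets them compare equal.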

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 17:17:01 GMT) Full text and rfc822 format available.

Message #122 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 09:16:11 -0800

> Perhaps the name/description should speak in terms of Unicode 
> char compatibility or equivalence.  Perhaps a name like
> `string-less-compat-p'?  Or `Unicode-equivalent-p'?  Or
> `string-equivalent-p'?

In the last two suggestions I forgot about the "less" part.

Taking a quick look at the Unicode specs, it seems that what we do involves
(Unicode) "compatibility equivalence".  But it also seemed that Eli was saying
that for us this is not distinguished from (Unicode) "canonical equivalence".

So perhaps `unicode-equivalence-less-p'?  Or if there is a risk of confusion
with char (not string) comparison, then perhaps `unicode-equiv-string-less-p'?
Or just `equiv-string-less-p'?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 18:01:02 GMT) Full text and rfc822 format available.

Message #125 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 10:00:14 -0800

FWIW - Some more browsing on the topic tells me that what we are trying to come
up with here is a predicate for the NFKD canonical ordering (as applied to a
char sequence, not to a single char).

IOW, a string-ordering predicate that uses the canonical ordering for a
character's decomposed normal code point sequence.

We are using compatibility normalization, not canonical normalization.  So a
search (or a string comparison test) for `f' will match the ligature `ffi'
(whereas it would not match wrt canonical normalization).

Someone please correct me if any of this is wrong.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 18:28:01 GMT) Full text and rfc822 format available.

Message #128 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: rudalics <at> gmx.at, 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 20:27:34 +0200

> From: "Drew Adams" <drew.adams <at> oracle.com>
> Date: Wed, 5 Dec 2012 10:00:14 -0800
> Cc: perin <at> panix.com, 13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
> We are using compatibility normalization, not canonical normalization.  So a
> search (or a string comparison test) for `f' will match the ligature `ffi'
> (whereas it would not match wrt canonical normalization).
> 
> Someone please correct me if any of this is wrong.

I'm not sure who is wrong ;-), but I think when compatibility
decomposition exists, it should be used; if not, the canonical
decomposition should be used.
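
One way to see the difference between the two normalizations is to run both on the same input with the functions from ucs-normalize.el.  This is only an illustration; `my-both-decompositions' is a hypothetical name, and the sketch assumes that NFKD is an acceptable reading of "use the compatibility decomposition when it exists, else the canonical one".

```elisp
(require 'ucs-normalize)

;; Sketch only: `my-both-decompositions' is a hypothetical helper.
(defun my-both-decompositions (string)
  "Return a list (NFD NFKD) with both decomposed forms of STRING."
  (list (ucs-normalize-NFD-string string)
        (ucs-normalize-NFKD-string string)))

;; U+FB03 (LATIN SMALL LIGATURE FFI) has only a compatibility
;; decomposition: NFD leaves it alone, NFKD expands it to "ffi".
(my-both-decompositions (string #xfb03))

;; U+00E9 (e with acute) has a canonical decomposition, so both
;; forms give e followed by U+0301.
(my-both-decompositions (string #xe9))
```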

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 19:18:01 GMT) Full text and rfc822 format available.

Message #131 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Eli Zaretskii'" <eliz <at> gnu.org>, "'Juri Linkov'" <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 5 Dec 2012 11:17:04 -0800

> > I'm surprised to see case mappings hard-coded in
> > lisp/international/characters.el instead of using the properties
> > `uppercase' and `lowercase' during creation of case tables.
> 
> My guess is that this is because the code in characters.el was written
> long before we had access to Unicode character properties in Emacs,
> and in fact before Emacs was switched to character representation
> based on Unicode codepoints.  And no one bothered to rewrite that code
> since then; volunteers are welcome.

Doesn't file CaseFolding.txt contain all the info needed?

If so, what about populating the case tables from the latest CaseFolding.txt
file at Emacs build time?  Or if no Internet access during build, populate from
a copy of the file to be distributed with Emacs.

And provide the same population code as a Lisp function, in case someone wants
to refresh an old Emacs release to use a more recent CaseFolding.txt file.

Would this make any sense?

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 21:20:02 GMT) Full text and rfc822 format available.

Message #134 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 05 Dec 2012 23:19:35 +0200

> From: "Drew Adams" <drew.adams <at> oracle.com>
> Cc: <perin <at> panix.com>, <13041 <at> debbugs.gnu.org>, <perin <at> acm.org>
> Date: Wed, 5 Dec 2012 11:17:04 -0800
> 
> > > I'm surprised to see case mappings hard-coded in
> > > lisp/international/characters.el instead of using the properties
> > > `uppercase' and `lowercase' during creation of case tables.
> > 
> > My guess is that this is because the code in characters.el was written
> > long before we had access to Unicode character properties in Emacs,
> > and in fact before Emacs was switched to character representation
> > based on Unicode codepoints.  And no one bothered to rewrite that code
> > since then; volunteers are welcome.
> 
> Doesn't file CaseFolding.txt contain all the info needed?

You don't need CaseFolding.txt, because UnicodeData.txt includes the
same information, and uni-lowercase.el, uni-uppercase.el, and
uni-titlecase.el already read that information into char-tables.

> If so, what about populating the case tables from the latest CaseFolding.txt
> file at Emacs build time?  Or if no Internet access during build, populate from
> a copy of the file to be distributed with Emacs.
> 
> And provide the same population code as a Lisp function, in case someone wants
> to refresh an old Emacs release to use a more recent CaseFolding.txt file.
> 
> Would this make any sense?

It would make sense to load case tables from uni-*.el at Emacs build
time.  Volunteers are welcome.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 23:15:01 GMT) Full text and rfc822 format available.

Message #137 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	Drew Adams <drew.adams <at> oracle.com>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 01:04:02 +0200

> `ignore-diacritics' is misleading.  The variable would have to be called
> `observe-decompositions' or something the like.

Since the existing variable that corresponds to the
Unicode file CaseFolding.txt is `case-fold-search',
its counterpart variable that corresponds to the Unicode file
Decomposition.txt could be called `decomposition-search'.

Also like the existing `sort-fold-case', its counterpart could be called
`sort-decomposition'.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 05 Dec 2012 23:15:02 GMT) Full text and rfc822 format available.

Message #140 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: perin <at> acm.org, Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 01:05:42 +0200

> If I correctly understand Juri, I next have to deal with things like
>
> (get-char-code-property #xff59 'decomposition)
>
> and related issues we might unearth in the course of this.

Only until bug#13084 is fixed, which is a separate problem.

> Also, while currently sorting is stable in the sense that with respect
> to diacritics text remains unchanged from the original order, this is
> not nice for sorting larger pieces of text.  So I'd rather have to use
> the second list element returned by `get-char-code-property' to make
> sure that, for example, "e" gets always sorted before "è" before "é".

In principle, you could do this by let-binding a new variable
`sort-decomposition' to non-nil for stable sorting.

And later let-bind `sort-decomposition' to nil for
last-resort comparison where equal lines
(equal according to non-nil `sort-decomposition')
will be sorted without regard to decomposition.
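
The two-level ordering can also be sketched as a single predicate instead of two let-bindings of `sort-decomposition' (a variable that does not exist yet).  The predicate compares the base characters first and falls back to a plain `string-lessp' only for ties.  Both function names below are hypothetical, and stripping characters of general category Mn, Mc or Me after NFKD normalization is an assumption about what "ignoring decomposition" should mean here.

```elisp
(require 'ucs-normalize)

;; Sketch only: both functions are hypothetical helpers.
(defun my-base-string (string)
  "Return the NFKD form of STRING with combining marks removed."
  (let (chars)
    (dolist (char (string-to-list (ucs-normalize-NFKD-string string)))
      ;; General categories Mn, Mc and Me are the combining marks.
      (unless (memq (get-char-code-property char 'general-category)
                    '(Mn Mc Me))
        (push char chars)))
    (apply #'string (nreverse chars))))

(defun my-decomposition-string-lessp (string1 string2)
  "Return non-nil if STRING1 sorts before STRING2.
Base characters are compared first; the original strings are
compared with `string-lessp' only when the base strings are equal."
  (let ((base1 (my-base-string string1))
        (base2 (my-base-string string2)))
    (if (string= base1 base2)
        (string-lessp string1 string2)
      (string-lessp base1 base2))))

;; The accented word sorts right after its unaccented twin
;; rather than after "zebra".
(sort (list "zebra" (concat (string #xe9) "cole") "ecu" "ecole")
      #'my-decomposition-string-lessp)
```

Normalizing on every comparison is expensive; a real implementation would more likely cache the base strings or compare character by character.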

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 09:28:02 GMT) Full text and rfc822 format available.

Message #143 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Kenichi Handa <handa <at> gnu.org>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: rudalics <at> gmx.at, eliz <at> gnu.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 18:25:12 +0900

In article <707786B35E94470FB727BCF7F3DDA41A <at> us.oracle.com>, "Drew Adams" <drew.adams <at> oracle.com> writes:

> This version of Martin's function (but respecting `case-fold-search') is maybe a
> tiny bit simpler.  It could also be a bit slower because of `substring'
> returning a copy (vs just incrementing an offset).  It should also be checked
> for correctness - not really tested.  FWIW/HTH.

Emacs contains ucs-normalize package which provides various
normalization functions.  For instance,

(require 'ucs-normalize)
(ucs-normalize-NFKD-string "Äffin") => "Äffin"

Isn't it usable?

---
Kenichi Handa
handa <at> gnu.org

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 10:29:02 GMT) Full text and rfc822 format available.

Message #146 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: perin <at> acm.org, 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:28:05 +0100

>> `ignore-diacritics' is misleading.  The variable would have
>> to be called `observe-decompositions' or something the like.
>
>
> 1. "Observe decompositions" doesn't mean anything to me.  The verb should
> probably be more active - what does it mean to observe the char decompositions
> here?
>
> BTW, if we use "decomposition" in the name and description then we should
> probably also use "char" - this is not about decomposing strings in some way
> (whatever that might mean); it involves decomposing Unicode characters.

`ignore-diacritics' is misleading because when we, for example,
sort/match ligatures we already do more than ignore diacritics.  A
variable using the term `observe-decompositions' would express what the
underlying algorithm does - observe the decomposition properties
provided by `get-char-code-property'.

Bear in mind that a "correct" solution for searching and sorting would
have to be based on a correct implementation of a collation table (see
bug#12008) plus some options that make searching more convenient (aka
"asymmetric searching" http://www.unicode.org/reports/tr10/#Searching).
In that sense, Juri's approach for searching and my function can be
considered only as poor man's variants of what should be eventually
done.

For example my Austrian locale sorts

  o < ö < p

while IIUC Swedish has

  o < p ... < z < ö

which IIUC can't be done via the decomposition table.  I don't know
whether this implies that searching for "o" in Swedish means to _not_
list results for "ö" either.

> 2. But my confusion over the name/description is in fact wrt function
> `decomposed-string-lessp': I guess it's not 100% clear to me what it does.
>
> Your doc string said "STRING1 is decomposition-less than STRING2", which
> confuses me.  And it is a bit ambiguous wrt "-less":
>
>  a. decomposition-less as in comparing the strings only after
>     removing (some parts of) their decompositions (i.e., "-less"
>     as in "sans")?
>
> or
>
>  b. -lessp as in `string<': a comparison ordering relation?

I didn't think much about the wording.  But I can't, in general, talk
about comparing characters because in the ligature case (or the "ß" vs
"ss" case) I do compare substrings.

> In the version of `decomposed-string-lessp' that I sent, I changed the doc
> string to this: "decomposed STRING1 is less than decomposed STRING2".  But that
> is no doubt incorrect (less correct than yours, if perhaps clearer).  In
> particular, it says nothing about how we compare the two decompositions.
>
> In practical (use) terms, this is typically about ignoring diacritics, keeping
> only the "base" characters.  Something about that should at least be mentioned
> in the doc, so that users know they can use this for that.

Yes.

> But IIUC this is not just about diacritics; it sometimes might not be about
> diacritics at all; and diacritics present are sometimes not ignored.  E.g., the
> ligature ffi gets treated the same as the 3 chars f f i.  There are no
> diacritics present in that case.

That's why I want to just talk about decompositions for the moment.

> IIUC, we convert the two strings to their Unicode decompositions and then use
> the Unicode char compatibility specs to compare the decompositions.  IOW, we
> treat equivalent chars, as defined by Unicode, as the same.

Character sequences, IIUC.

> Perhaps the name/description should speak in terms of Unicode char compatibility
> or equivalence.  Perhaps a name like `string-less-compat-p'?  Or
> `Unicode-equivalent-p'?  Or `string-equivalent-p'?
>
> How would you characterize what the function does?  No doubt Eli can help here.
> It is important to try to get the function name and description right from the
> outset, if we can.  If the Unicode standard has some terminology that applies
> here then perhaps we can/should leverage that.

I'm not sure whether we can ever fully support Unicode here - the
weights you find in http://www.unicode.org/Public/UCA/6.2.0/allkeys.txt
appear hardly digestible for me (and my machine, presumably).

> Beyond the name and an accurate description, the doc should, as I say, at least
> mention that you can use this to ignore diacritics (such as accents), as that
> will be a common use case.

Sure.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 10:32:01 GMT) Full text and rfc822 format available.

Message #149 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:31:31 +0100

> My reading of the table in
>
>   http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings
>
> is that you should ignore any car of the list returned by
> get-char-code-property if it does not pass the characterp test (or
> those that do pass the symbolp test).  That is, the character #xff59
> should sort exactly like lower-case y.

That is, `wide' and `compat' are completely equivalent in this regard?

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 10:33:01 GMT) Full text and rfc822 format available.

Message #152 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:31:44 +0100

> We are using compatibility normalization, not canonical normalization.  So a
> search (or a string comparison test) for `f' will match the ligature `ffi'
> (whereas it would not match wrt canonical normalization).

If it can be done, searching for "f" should match ligatures like "ff"
and "fi".

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 10:33:02 GMT) Full text and rfc822 format available.

Message #155 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	Drew Adams <drew.adams <at> oracle.com>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:31:53 +0100

> Since the existing variable that corresponds to the
> Unicode file CaseFolding.txt is `case-fold-search',
> its counterpart variable that corresponds to the Unicode file
> Decomposition.txt

Where is this file?

> could be called `decomposition-search'.
>
> Also like the existing `sort-fold-case', its counterpart could be called
> `sort-decomposition'.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 10:33:02 GMT) Full text and rfc822 format available.

Message #158 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, Eli Zaretskii <eliz <at> gnu.org>, perin <at> panix.com,
	13041 <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:32:37 +0100

> And later let-bind `sort-decomposition' to nil for
> last-resort comparison where equal lines
> (equal according to non-nil `sort-decomposition')
> will be sorted without regard to decomposition.

Indeed.  In any case, equal lines shouldn't be the rule - especially with
functions that remove duplicates ;-)

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 10:35:01 GMT) Full text and rfc822 format available.

Message #161 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Kenichi Handa <handa <at> gnu.org>
Cc: perin <at> acm.org, eliz <at> gnu.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org,
	Drew Adams <drew.adams <at> oracle.com>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 11:34:26 +0100

> Emacs contains ucs-normalize package which provides various
> normalization functions.  For instance,
>
> (require 'ucs-normalize)
> (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>
> Isn't it usable?

Actually, the function should do what we need.  But I have no idea how
to integrate it into a searching algorithm.  And when sorting, it seems
expensive for comparing buffer substrings.  Also, the use of a temporary
buffer for normalizing every single string makes its weight quite heavy.

In any case, I would probably steal the entire decomposition property
handling part from it.  So thanks a lot for this hint.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 16:01:02 GMT) Full text and rfc822 format available.

Message #164 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 6 Dec 2012 07:59:59 -0800

>  > We are using compatibility normalization, not canonical 
>  > normalization.  So a search (or a string comparison test)
>  > for `f' will match the ligature `ffi'
>  > (whereas it would not match wrt canonical normalization).
> 
> If it can be done, searching for "f" should match ligatures like "ff"
> and "fi".

That's what I thought you were planning/preparing to do.

On the other hand, as the Unicode spec points out (for level 2), sometimes
someone wants to distinguish searching for f from searching for the ligature.
Ideally (we might never get there), that would be possible as an alternative
(choice).

The spec also points to hybrid situations regarding case conversion (see sect
RL2.4) where, e.g., you might want to do full case matching on ß in a literal
name such as Strauß but simple case folding on ß when used in a character class,
such as [ß].  Dunno whether we would ever get there either.

There seems to be a lot in the Unicode regexp spec
(http://www.unicode.org/reports/tr18/) that could be food for thought for Emacs.
I imagine that some Emacs Dev folks have already taken a close look and given it
some thought.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 17:50:01 GMT) Full text and rfc822 format available.

Message #167 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 19:48:47 +0200

> Date: Thu, 06 Dec 2012 11:31:31 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: juri <at> jurta.org, perin <at> panix.com, 13041 <at> debbugs.gnu.org, 
>  perin <at> acm.org
> 
>  > My reading of the table in
>  >
>  >   http://www.unicode.org/reports/tr44/#Character_Decomposition_Mappings
>  >
>  > is that you should ignore any car of the list returned by
>  > get-char-code-property if it does not pass the characterp test (or
>  > those that do pass the symbolp test).  That is, the character #xff59
>  > should sort exactly like lower-case y.
> 
> That is, `wide' and `compat' are completely equivalent in this regard?

Yes.  They are all different forms of the same character, which should
all compare equal in this context.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 17:52:01 GMT) Full text and rfc822 format available.

Message #170 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: handa <at> gnu.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org,
	drew.adams <at> oracle.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 19:50:48 +0200

> Date: Thu, 06 Dec 2012 11:34:26 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: Drew Adams <drew.adams <at> oracle.com>, eliz <at> gnu.org, perin <at> panix.com, 
>  13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
>  > Emacs contains ucs-normalize package which provides various
>  > normalization functions.  For instance,
>  >
>  > (require 'ucs-normalize)
>  > (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>  >
>  > Isn't it usable?
> 
> Actually, the function should do what we need.  But I have no idea how
> to integrate it into a searching algorithm.  And when sorting, it seems
> expensive for comparing buffer substrings.  Also, the use of a temporary
> buffer for normalizing every single string makes its weight quite heavy.

Yes, I don't think this will be possible without changes on the C
level.  Those changes should use code very similar to what we
currently do for case-insensitive search.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Thu, 06 Dec 2012 17:54:02 GMT) Full text and rfc822 format available.

Message #173 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	drew.adams <at> oracle.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Thu, 06 Dec 2012 19:53:20 +0200

> Date: Thu, 06 Dec 2012 11:28:05 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: 'Eli Zaretskii' <eliz <at> gnu.org>, perin <at> panix.com, 
>  13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
>  >> `ignore-diacritics' is misleading.  The variable would have
>  >> to be called `observe-decompositions' or something the like.
>  >
>  >
>  > 1. "Observe decompositions" doesn't mean anything to me.  The verb should
>  > probably be more active - what does it mean to observe the char decompositions
>  > here?
>  >
>  > BTW, if we use "decomposition" in the name and description then we should
>  > probably also use "char" - this is not about decomposing strings in some way
>  > (whatever that might mean); it involves decomposing Unicode characters.
> 
> `ignore-diacritics' is misleading because when we, for example,
> sort/match ligatures we already do more than ignore diacritics.  A
> variable using the term `observe-decompositions' would express what the
> underlying algorithm does - observe the decomposition properties
> provided by `get-char-code-property'.

I would suggest something like equivalence-search or maybe
loose-match-search.  The latter is slightly less suitable, since loose
matches include not just decompositions, see the Unicode Regular
Expressions report.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 07 Dec 2012 01:33:01 GMT) Full text and rfc822 format available.

Message #176 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	Drew Adams <drew.adams <at> oracle.com>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 07 Dec 2012 02:52:12 +0200

>> Since the existing variable that corresponds to the
>> Unicode file CaseFolding.txt is `case-fold-search',
>> its counterpart variable that corresponds to the Unicode file
>> Decomposition.txt
>
> Where is this file?

There was a reference to
http://www.unicode.org/Public/UNIDATA/extracted/DerivedDecompositionType.txt
from http://www.unicode.org/faq/casemap_charprop.html
but it seems this file is redundant since you can get
the same information from admin/unidata/UnicodeData.txt
using (get-char-code-property ?? 'decomposition)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 07 Dec 2012 01:33:02 GMT) Full text and rfc822 format available.

Message #179 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Kenichi Handa <handa <at> gnu.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	Drew Adams <drew.adams <at> oracle.com>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 07 Dec 2012 02:58:17 +0200

> Emacs contains ucs-normalize package which provides various
> normalization functions.  For instance,
>
> (require 'ucs-normalize)
> (ucs-normalize-NFKD-string "Äffin") => "Äffin"
>
> Isn't it usable?

This is usable to sort and compare strings, but I don't see
how ucs-normalize.el could help in the search.  I suppose the
searched buffer can't be normalized before starting a search.
So the search function somehow should be able to skip combining
characters in the buffer.  But to do this, the translation table needs
to contain additional information about certain characters to ignore.
Also the translation table should be able to map a sequence of
characters like "ss" to "ß".

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 07 Dec 2012 06:34:01 GMT) Full text and rfc822 format available.

Message #182 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: handa <at> gnu.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 07 Dec 2012 08:33:04 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Date: Fri, 07 Dec 2012 02:58:17 +0200
> Cc: perin <at> panix.com, 13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
> > Emacs contains ucs-normalize package which provides various
> > normalization functions.  For instance,
> >
> > (require 'ucs-normalize)
> > (ucs-normalize-NFKD-string "Äffin") => "Äffin"
> >
> > Isn't it usable?
> 
> This is usable to sort and compare strings, but I don't see
> how ucs-normalize.el could help in the search.

I agree.

> I suppose the searched buffer can't be normalized before starting a
> search.

Yes, that's not acceptable.

> So the search function somehow should be able to skip combining
> characters in the buffer.  But to do this, the translation table needs
> to contain additional information about certain characters to ignore.

Right.  This is very similar to how the search primitives currently
use the case tables, except that they don't skip characters.  But
adding such a skip operation should be easy.

> Also the translation table should be able to map a sequence of
> characters like "ss" to "ß".

I'd say the other way around: map ß to ss.
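
The change discussed here would live in the C search primitives.  As a rough Lisp-level approximation only, and not the approach proposed above, a search string can be turned into a regexp that accepts precomposed equivalents of each character and skips any combining marks that follow.  The names `my-strip-marks' and `my-fold-regexp' are hypothetical; the sketch assumes that only precomposed characters from U+00C0 to U+017F and combining marks from U+0300 to U+036F matter, and it does not handle multi-character mappings such as ß and ss.

```elisp
(require 'ucs-normalize)
(require 'regexp-opt)

;; Sketch only: both functions are hypothetical helpers.
(defun my-strip-marks (text)
  "Return the NFKD form of TEXT with combining marks removed."
  (let (chars)
    (dolist (char (string-to-list (ucs-normalize-NFKD-string text)))
      (unless (memq (get-char-code-property char 'general-category)
                    '(Mn Mc Me))
        (push char chars)))
    (apply #'string (nreverse chars))))

(defun my-fold-regexp (text)
  "Return a regexp that matches TEXT while ignoring diacritics.
Only precomposed characters from U+00C0 to U+017F are considered."
  (let ((char #xc0)
        candidates)
    ;; Compute the base string of every candidate character once.
    (while (< char #x180)
      (push (cons char (my-strip-marks (string char))) candidates)
      (setq char (1+ char)))
    (mapconcat
     (lambda (base)
       (let ((class (list base))
             (base-string (string base)))
         ;; Collect the precomposed characters that reduce to BASE.
         (dolist (candidate candidates)
           (when (equal (cdr candidate) base-string)
             (push (car candidate) class)))
         ;; Match any equivalent, then skip trailing combining marks.
         (concat (regexp-opt-charset class) "[\u0300-\u036f]*")))
     (string-to-list text)
     "")))

;; Matches both the precomposed and the decomposed spelling.
(string-match-p (my-fold-regexp "apres") (concat "apr" (string #xe8) "s"))
(string-match-p (my-fold-regexp "apres") (concat "apre" (string #x300) "s"))
```

Because the match is made against the unmodified text, the match data refers to real buffer or string positions, so nothing has to be normalized and restored.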

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Fri, 07 Dec 2012 10:38:01 GMT) Full text and rfc822 format available.

Message #185 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: Kenichi Handa <handa <at> gnu.org>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Fri, 07 Dec 2012 11:37:00 +0100

> This is usable to sort and compare strings, but I don't see
> how ucs-normalize.el could help in the search.  I suppose the
> searched buffer can't be normalized before starting a search.

You can either temporarily

- leave the text alone but give each string that should be handled
  specially a text property with the normalized form.  In this case
  searching has to pay attention to these properties, if present.

- normalize the text and give each normalized string a text property
  with the original text.  In this case searching will proceed as usual
  but you have to restore the original text when done.

I don't know how feasible these are for searching.  But I used the
second approach for sorting without problems.

Also I don't know how to handle the return value and/or highlighting
when, for example, finding a match for "suf" within "suﬀer".  For
example, replacing each occurrence of "suf" with the empty string should
leave us with "fer" here.  So in this case, we have to deal with the
normalized string anyway.  OTOH replacing a match for "res" in "résumé"
with the empty string should probably leave us with "umé".

> So the search function somehow should be able to skip combining
> characters in the buffer.  But to do this, the translation table needs
> to contain additional information about certain characters to ignore.
> Also the translation table should be able to map a sequence of
> characters like "ss" to "ß".

I have no idea how many mappings like "ß" -> "ss" exist.  The problem is
that we don't get them from UnicodeData.txt IIUC.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 00:06:04 GMT) Full text and rfc822 format available.

Message #188 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: Kenichi Handa <handa <at> gnu.org>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 08 Dec 2012 01:55:22 +0200

> - leave the text alone but give each string that should be handled
>   specially a text property with the normalized form.  In this case
>   searching has to pay attention to these properties, if present.
>
> - normalize the text and give each normalized string a text property
>   with the original text.  In this case searching will proceed as usual
>   but you have to restore the original text when done.

This reminds me of the idea that searching should take into account the text
displayed with the `display' property and other display-related properties.
It seems this is more difficult to implement.

> Also I don't know how to handle the return value and/or highlighting
> when, for example, finding a match for "suf" within "suﬀer".  For
> example, replacing each occurrence of "suf" with the empty string should
> leave us with "fer" here.

I believe such ligature characters should be handled as a whole,
i.e. "suf" doesn't match "suﬀer", only "suff" should match it.

> I have no idea how many mappings like "ß" -> "ss" exist.  The problem is
> that we don't get them from UnicodeData.txt IIUC.

I can't find them in UnicodeData.txt either.  Looking at the files in
http://www.unicode.org/Public/UNIDATA/ I can find them in the file

http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt

that is derived from

http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 08:22:01 GMT) Full text and rfc822 format available.

Message #191 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: rudalics <at> gmx.at, 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 08 Dec 2012 10:20:18 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Date: Sat, 08 Dec 2012 01:55:22 +0200
> Cc: 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org
> 
> This reminds me of the idea that searching should take into account the text
> displayed with the `display' property and other display-related properties.
> It seems this is more difficult to implement.

I don't know if it's more difficult.  After all, the primitives you
need to (a) find out whether there's a display string at given buffer
position, and (b) access its text, are already there, ready to be
used.  Moreover, there's even a C function that searches the current
buffer for a specific Lisp string, which you could use as a model for
this feature.

What is definitely true, though, is that searching display strings is a
separate feature, with an entirely different implementation.  I
therefore suggest keeping it in mind, but not mixing it with what's
being discussed here.

> > I have no idea how many mappings like "ß" -> "ss" exist.  The problem is
> > that we don't get them from UnicodeData.txt IIUC.
> 
> I can't find them in UnicodeData.txt either.  Looking at the files in
> http://www.unicode.org/Public/UNIDATA/ I can find them in the file
> 
> http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
> 
> that is derived from
> 
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

Maybe we should extend ucs-normalize.el to include that as well.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 11:23:02 GMT) Full text and rfc822 format available.

Message #194 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: Kenichi Handa <handa <at> gnu.org>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 08 Dec 2012 12:21:48 +0100

>> - leave the text alone but give each string that should be handled
>>   specially a text property with the normalized form.  In this case
>>   searching has to pay attention to these properties, if present.
>>
>> - normalize the text and give each normalized string a text property
>>   with the original text.  In this case searching will proceed as usual
>>   but you have to restore the original text when done.
>
> This reminds me of the idea that searching should take into account the text
> displayed with the `display' property and other display-related properties.
> It seems this is more difficult to implement.

... and probably should include searching for overlays too.

>> Also I don't know how to handle the return value and/or highlighting
>> when, for example, finding a match for "suf" within "suﬀer".  For
>> example, replacing each occurrence of "suf" with the empty string should
>> leave us with "fer" here.
>
> I believe such ligature characters should be handled as a whole,
> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.

This means that when you type the second "f" you might get a match
before the present one.  Consider a buffer containing the two lines

suﬀer
suffer

Typing "suf" as search string would go to "suffer".  Adding an "f" to
the search string now would go back to "suﬀer" (or not).  Disconcerting
in any case.
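
The jump described above can be reproduced with a small Python sketch of the "whole ligature only" policy.  This is an illustration only, not Emacs code: `fold_with_map` and `find_whole` are hypothetical names, and folding is assumed to be NFKD with combining marks removed.

```python
import unicodedata


def fold_with_map(text):
    """Return (folded, owners); owners[j] is the index in TEXT of the
    character that produced folded[j]."""
    folded, owners = [], []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(d):
                folded.append(d)
                owners.append(i)
    return "".join(folded), owners


def find_whole(text, needle):
    """Index in TEXT of the first folded match of NEEDLE that starts and
    ends on an original character boundary, or -1 if there is none."""
    folded, owners = fold_with_map(text)
    target, _ = fold_with_map(needle)
    if not target:
        return -1
    pos = folded.find(target)
    while pos != -1:
        end = pos + len(target)
        starts_clean = pos == 0 or owners[pos - 1] != owners[pos]
        ends_clean = end == len(folded) or owners[end - 1] != owners[end]
        if starts_clean and ends_clean:
            return owners[pos]
        pos = folded.find(target, pos + 1)  # try the next candidate
    return -1


# The two-line buffer from the message: the ligature line, then plain "suffer".
BUFFER = "su\ufb00er\nsuffer"
```

Here `find_whole(BUFFER, "suf")` returns 6 (the second line), and `find_whole(BUFFER, "suff")` returns 0 (the first line), so extending the search string moves the match backwards.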

>> I have no idea how many mappings like "ß" -> "ss" exist.  The problem is
>> that we don't get them from UnicodeData.txt IIUC.
>
> I can't find them in UnicodeData.txt either.  Looking at the files in
> http://www.unicode.org/Public/UNIDATA/ I can find them in the file
>
> http://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> that is derived from
>
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

Case folding "ß" to "SS" (upper case "S") is not what I had in mind.  I
was talking about the (weak?) equivalence of "ß" and "ss" (lower case
"s") which is much more important when searching.  In particular so,
because many German words that were earlier written with an "ß" are now
written with "ss".

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 11:37:02 GMT) Full text and rfc822 format available.

Message #197 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: Juri Linkov <juri <at> jurta.org>, perin <at> acm.org, 13041 <at> debbugs.gnu.org,
	perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 08 Dec 2012 12:35:37 +0100

> I don't know if it's more difficult.  After all, the primitives you
> need to (a) find out whether there's a display string at given buffer
> position, and (b) access its text, are already there, ready to be
> used.  Moreover, there's even a C function that searches the current
> buffer for a specific Lisp string, which you could use as a model for
> this feature.

I think that mirroring/cloning (part of) the current buffer in a special
search buffer would be the cheapest solution.  The search buffer would
contain the normalized text, be built only when normalization is
needed and be rebuilt whenever a search option or the buffer text
changes.  I don't know whether `buffer-swap-text' could be used here.
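
A rough Python sketch of such a lazily built mirror follows.  It is an assumption-laden illustration, not Emacs code: the class name `SearchMirror`, the `strip_marks` option and the `rebuilds` counter are all made up.  The whole text is used as the cache key for simplicity; a real editor would more likely compare a modification counter.

```python
import unicodedata


class SearchMirror:
    """Folded copy of a text, rebuilt only when the text or option changes."""

    def __init__(self):
        self._key = None     # (text, strip_marks) the mirror was built for
        self._folded = ""
        self.rebuilds = 0    # counts how often the mirror had to be rebuilt

    def folded(self, text, strip_marks=True):
        key = (text, strip_marks)
        if key != self._key:
            norm = unicodedata.normalize("NFKD", text)
            if strip_marks:
                # remove combining marks, i.e. fold away the diacritics
                norm = "".join(c for c in norm if not unicodedata.combining(c))
            self._key, self._folded = key, norm
            self.rebuilds += 1
        return self._folded
```

Repeated searches over unchanged text reuse the stored mirror; changing either the text or the option triggers one rebuild.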

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 12:41:02 GMT) Full text and rfc822 format available.

Message #200 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: juri <at> jurta.org, perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 08 Dec 2012 14:40:05 +0200

> Date: Sat, 08 Dec 2012 12:35:37 +0100
> From: martin rudalics <rudalics <at> gmx.at>
> CC: Juri Linkov <juri <at> jurta.org>, 13041 <at> debbugs.gnu.org, perin <at> panix.com, 
>  perin <at> acm.org
> 
>  > I don't know if it's more difficult.  After all, the primitives you
>  > need to (a) find out whether there's a display string at given buffer
>  > position, and (b) access its text, are already there, ready to be
>  > used.  Moreover, there's even a C function that searches the current
>  > buffer for a specific Lisp string, which you could use as a model for
>  > this feature.
> 
> I think that mirroring/cloning (part of) the current buffer in a special
> search buffer would be the cheapest solution.  The search buffer would
> contain the normalized text, be built only when normalization is
> needed and be rebuilt whenever a search option or the buffer text
> changes.

Maybe this is the cheapest, but it still needs the same support the
other alternatives do.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 23:21:05 GMT) Full text and rfc822 format available.

Message #203 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: Kenichi Handa <handa <at> gnu.org>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 01:07:12 +0200

> This means that when you type the second "f" you might get a match
> before the present one.  Consider a buffer containing the two lines
> suﬀer
> suffer
>
> Typing "suf" as search string would go to "suffer".  Adding an "f" to
> the search string now would go back to "suﬀer" (or not).

Going back looks like backtracking in the regexp search.

OTOH, instead of using an approach of matching only a full match
like in Chromium, we could do like GEdit and OpenOffice that
match the whole ligature character in a partial match
(i.e. to match "ﬀ" when the search string is just "f").

This has the problem of highlighting the whole character for
a partial match, which looks wrong, but perhaps no one can do better.

>> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt
>
> Case folding "ß" to "SS" (upper case "S") is not what I had in mind.  I
> was talking about the (weak?) equivalence of "ß" and "ss" (lower case
> "s") which is much more important when searching.  In particular so,
> because many German words that were earlier written with an "ß" are now
> written with "ss".

Yes, this is what I meant too.  It is surprising but
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
defines the equivalence of "ß" and "ss" (lower case "s")
instead of case-folding.  The following line in CaseFolding.txt:

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

maps 00DF (LATIN SMALL LETTER SHARP S) to two characters
0073 0073 (LATIN SMALL LETTER S) keeping the lower case.
Maybe this is a bug in Unicode data?
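
For reference, the quoted data line can be parsed with a few lines of Python.  This is only a sketch: `parse_casefolding_line` is a hypothetical helper, and it assumes the "code; status; mapping; # name" layout seen in the line above.  The status "F" marks a full case folding, that is, one whose result is longer than one character.

```python
def parse_casefolding_line(line):
    """Parse one CaseFolding.txt data line into (source, status, target).

    Returns None for blank lines and for lines that hold only a comment.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    code, status, mapping = [field.strip() for field in data.split(";")][:3]
    source = chr(int(code, 16))
    target = "".join(chr(int(cp, 16)) for cp in mapping.split())
    return source, status, target


LINE = "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S"
```

`parse_casefolding_line(LINE)` yields the sharp s, the status "F" and the lower-case string "ss".  Python's own `str.casefold()` gives the same lower-case result for "ß".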

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sat, 08 Dec 2012 23:55:01 GMT) Full text and rfc822 format available.

Message #206 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Juri Linkov <juri <at> jurta.org>
Cc: martin rudalics <rudalics <at> gmx.at>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 08 Dec 2012 18:54:15 -0500

> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.

I completely disagree here.  "suf" should match "suﬀer".


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 00:06:01 GMT) Full text and rfc822 format available.

Message #209 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Juri Linkov'" <juri <at> jurta.org>, "'martin rudalics'" <rudalics <at> gmx.at>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 8 Dec 2012 16:04:38 -0800

> > Typing "suf" as search string would go to "suffer".  Adding 
> > an "f" to the search string now would go back to "suﬀer" (or not).
>
> Going back looks like backtracking in the regexp search.
> 
> OTOH, instead of using an approach of matching only a full match
> like in Chromium, we could do like GEdit and OpenOffice that
> match the whole ligature character in a partial match
> (i.e. to match "ﬀ" when the search string is just "f").

Seems to me that the starting point should be the Unicode Regexp spec, which
outlines the behavior of level 1 and level 2 searches.  Emacs Dev can choose
what it wants to do, of course, but that is a good place to start, I think.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 00:16:02 GMT) Full text and rfc822 format available.

Message #212 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Stefan Monnier'" <monnier <at> iro.umontreal.ca>,
	"'Juri Linkov'" <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 8 Dec 2012 16:14:28 -0800

> > i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
> 
> I completely disagree here.  "suf" should match "suﬀer".

The Unicode Regexp spec says that it is best, if possible, to let users do
either.  It discusses such different search possibilities explicitly.

We might not be able to support that superior level (level 2) for Emacs search,
but the point is that each kind of matching can be useful here.

At this stage of the discussion it should not, I think, be a case of "I
completely disagree" (or completely agree), unless you have already decided
something wrt design/implementation etc.  Better to look at the possibilities
for users and then discuss what it might take to be able to support this or that
kind of search matching.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 00:54:02 GMT) Full text and rfc822 format available.

Message #215 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 02:35:46 +0200

>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>
> I completely disagree here.  "suf" should match "suﬀer".

AFAIS, there are more programs that find a partial match,
but none of them can do the right highlighting:
both possibilities (to highlight the whole ligature and not to highlight)
are wrong, and highlighting a part of the ligature is impossible.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 11:37:01 GMT) Full text and rfc822 format available.

Message #218 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Stephen Berman <stephen.berman <at> gmx.net>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 12:35:59 +0100

On Sun, 09 Dec 2012 02:35:46 +0200 Juri Linkov <juri <at> jurta.org> wrote:

>>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>>
>> I completely disagree here.  "suf" should match "suﬀer".
>
> AFAIS, there are more programs that find a partial match,
> but none of them can do the right highlighting:
> both possibilities (to highlight the whole ligature and not to highlight)
> are wrong, and highlighting a part of the ligature is impossible.

Could a ligature be highlighted in a different way (different color or
additional attribute such as underlining) to indicate a partial or
potential match?

Steve Berman

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 15:43:02 GMT) Full text and rfc822 format available.

Message #221 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: "Drew Adams" <drew.adams <at> oracle.com>
Cc: 'Juri Linkov' <juri <at> jurta.org>, perin <at> acm.org, 13041 <at> debbugs.gnu.org,
	perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 10:42:15 -0500

> The Unicode Regexp spec says that it is best, if possible, to let users do
> either.

We're talking about the (now misnamed) "diacritic-fold" search.  If the
user wants to be more strict, there's always going to be the
"non-diacritic-fold" search.


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 15:46:02 GMT) Full text and rfc822 format available.

Message #224 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: Juri Linkov <juri <at> jurta.org>
Cc: martin rudalics <rudalics <at> gmx.at>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 10:45:07 -0500

>>> i.e. "suf" doesn't match "suﬀer", only "suff" should match it.
>> I completely disagree here.  "suf" should match "suﬀer".
> AFAIS, there are more programs that find a partial match,
> but none of them can do the right highlighting:
> both possibilities (to highlight the whole ligature and not to highlight)
> are wrong, and highlighting a part of the ligature is impossible.

One step at a time: first, let's make sure we can match it.  Then we'll
worry about what the match-boundaries should be and how to display it
(when we get to this point, we can even consider displaying suﬀer as
suffer temporarily, just like we do when point is in the middle of
a composition).


        Stefan

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 17:54:02 GMT) Full text and rfc822 format available.

Message #227 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> jurta.org>
Cc: Kenichi Handa <handa <at> gnu.org>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 18:52:17 +0100

> OTOH, instead of using an approach of matching only a full match
> like in Chromium, we could do like GEdit and OpenOffice that
> match the whole ligature character in a partial match
> (i.e. to match "ﬀ" when the search string is just "f").

Strictly speaking, they should match the first "f" in "ﬀ".  When matching
"suf" against "suﬀer", the `match-string' would be "suf", with
`match-end' after "ﬀ".  That is, the match length would not increase
when adding an "f" to the search string now.  But I don't know what
`match-string' should return - "suﬀ" or "suff".
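
The match boundaries described here can be sketched in Python.  This is an illustration under assumptions, not Emacs code: `fold_with_map` and `match_span` are hypothetical names, and folding is taken to be NFKD with combining marks removed.  The match end is widened to cover the whole original character.

```python
import unicodedata


def fold_with_map(text):
    """Return (folded, owners); owners[j] is the index in TEXT of the
    character that produced folded[j]."""
    folded, owners = [], []
    for i, ch in enumerate(text):
        for d in unicodedata.normalize("NFKD", ch):
            if not unicodedata.combining(d):
                folded.append(d)
                owners.append(i)
    return "".join(folded), owners


def match_span(text, needle):
    """Return (start, end) of the first folded match in original indices,
    END exclusive and widened to whole characters, or None."""
    folded, owners = fold_with_map(text)
    target, _ = fold_with_map(needle)
    if not target:
        return None
    pos = folded.find(target)
    if pos == -1:
        return None
    last = pos + len(target) - 1
    return owners[pos], owners[last] + 1
```

Both `match_span("su\ufb00er", "suf")` and `match_span("su\ufb00er", "suff")` return (0, 3), so the span does not grow when the second "f" is typed.  Slicing the original text with that span gives "suﬀ", and folding that slice gives "suff"; those are the two candidates for the match string mentioned above.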

> This has the problem of highlighting the whole character for
> a partial match, which looks wrong, but perhaps no one can do better.

We would need a display string "ff" replacing "ﬀ" during highlighting and
would highlight only the first "f" in it.

> Yes, this is what I meant too.  It is surprising but
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt
> defines the equivalence of "ß" and "ss" (lower case "s")
> instead of case-folding.  The following line in CaseFolding.txt:
>
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
>
> maps 00DF (LATIN SMALL LETTER SHARP S) to two characters
> 0073 0073 (LATIN SMALL LETTER S) keeping the lower case.
> Maybe this is a bug in Unicode data?

Maybe it's explained here

  http://www.unicode.org/faq/idn.html

in the answer to

  Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map eszett (ß)
  to "ss", and delete ZWJ/ZWNJ?

One possible interpretation of this is that mapping "ß" to "SS" would
imply that downcasing "SS" should produce "ß" and this is unwanted.  But
I still wonder whether we are supposed to apply mappings recursively.

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 17:54:03 GMT) Full text and rfc822 format available.

Message #230 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: martin rudalics <rudalics <at> gmx.at>
To: Stephen Berman <stephen.berman <at> gmx.net>
Cc: Juri Linkov <juri <at> jurta.org>, perin <at> acm.org, 13041 <at> debbugs.gnu.org,
	perin <at> panix.com
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 09 Dec 2012 18:52:33 +0100

> Could a ligature be highlighted in a different way (different color or
> additional attribute such as underlining) to indicate a partial or
> potential match?

I think ligatures can be easily handled by displaying the corresponding
decomposed string.  But a different color could be used to highlight the
"ß" with an incremental search string "Mas" and a match in "Maße".

martin

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 18:02:01 GMT) Full text and rfc822 format available.

Message #233 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'Stefan Monnier'" <monnier <at> iro.umontreal.ca>
Cc: 'Juri Linkov' <juri <at> jurta.org>, perin <at> acm.org, 13041 <at> debbugs.gnu.org,
	perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 9 Dec 2012 10:00:16 -0800

> > The Unicode Regexp spec says that it is best, if possible, 
> > to let users do either.
> 
> We're talking about the (now misnamed) "diacritic-fold" search.
> If the user wants to be more strict, there's always going to be
> the "non-diacritic-fold" search.

Yes, and?  That ignoring of diacritics etc. is essentially what the Unicode
Regexp spec refers to as "loose matching", IIUC.  And that means "at least the
simple, default Unicode case folding."

You are considering, among other things, whether `f' should match the ﬀ ligature
or whether only `ff' should match it.  The standard deals with this question, I
believe.  

(BTW, I cannot actually see that ligature with my mail client.  So I copied the
char from another mail message and pasted it, above.  If that copy+paste didn't
work, what I meant was the ligature for ff.)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Sun, 09 Dec 2012 18:08:01 GMT) Full text and rfc822 format available.

Message #236 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: "Drew Adams" <drew.adams <at> oracle.com>
To: "'martin rudalics'" <rudalics <at> gmx.at>, "'Juri Linkov'" <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com
Subject: RE: bug#13041: 24.2; diacritic-fold-search
Date: Sun, 9 Dec 2012 10:06:44 -0800

> Maybe it's explained here
>    http://www.unicode.org/faq/idn.html
> in the answer to
> 
>    Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map 
>       eszett (ß) to "ss", and delete ZWJ/ZWNJ?
> 
> One possible interpretation of this is that mapping "ß" to "SS" would
> imply that downcasing "SS" should produce "ß" and this is 
> unwanted.

This is also covered in the Unicode Regexp spec.
http://www.unicode.org/reports/tr18/

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 10 Dec 2012 08:11:02 GMT) Full text and rfc822 format available.

Message #239 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Juri Linkov <juri <at> jurta.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: martin rudalics <rudalics <at> gmx.at>, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 10 Dec 2012 09:57:49 +0200

> One step at a time: first, let's make sure we can match it.  Then we'll
> worry about what the match-boundaries should be and how to display it
> (when we get to this point, we can even consider displaying suﬀer as
> suffer temporarily, just like we do when point is in the middle of
> a composition).

Isearch used to decompose a composition of a character with a combining
accent and display them separately in the middle of a composition
in Emacs 23.  But as I see now, in the latest version Isearch in the
middle of a composition doesn't decompose them.  It highlights the
matched character with the still unmatched combining accent as a whole.
It seems the current behavior is better than the earlier one because it
doesn't change the displayed characters.  This is more WYSIWYG.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Mon, 10 Dec 2012 08:23:01 GMT) Full text and rfc822 format available.

Message #242 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> jurta.org>
Cc: perin <at> acm.org, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	monnier <at> iro.umontreal.ca
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Mon, 10 Dec 2012 10:20:55 +0200

> From: Juri Linkov <juri <at> jurta.org>
> Date: Mon, 10 Dec 2012 09:57:49 +0200
> Cc: 13041 <at> debbugs.gnu.org, perin <at> panix.com, perin <at> acm.org
> 
> Isearch used to decompose a composition of a character with a combining
> accent and display them separately in the middle of a composition
> in Emacs 23.

AFAIR, this was due to problems in the display engine wrt composite
characters, and problems with composition support in general, problems
which are now solved.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Tue, 11 Dec 2012 07:21:01 GMT) Full text and rfc822 format available.

Message #245 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Eli Zaretskii <eliz <at> gnu.org>
To: Drew Adams <drew.adams <at> oracle.com>
Cc: juri <at> jurta.org, rudalics <at> gmx.at, 13041 <at> debbugs.gnu.org, perin <at> panix.com,
	perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Tue, 11 Dec 2012 09:19:55 +0200

> From: "Drew Adams" <drew.adams <at> oracle.com>
> Date: Sun, 9 Dec 2012 10:06:44 -0800
> Cc: perin <at> panix.com, 13041 <at> debbugs.gnu.org, perin <at> acm.org
> 
> > Maybe it's explained here
> >    http://www.unicode.org/faq/idn.html
> > in the answer to
> > 
> >    Q: Why does IDNA2003 map final sigma (ς) to sigma (σ), map 
> >       eszett (ß) to "ss", and delete ZWJ/ZWNJ?
> > 
> > One possible interpretation of this is that mapping "ß" to "SS" would
> > imply that downcasing "SS" should produce "ß" and this is 
> > unwanted.
> 
> This is also covered in the Unicode Regexp spec.
> http://www.unicode.org/reports/tr18/

Another relevant Unicode document is the Unicode Collation Algorithm.
For the latest (yet unapproved) draft, see

  http://www.unicode.org/reports/tr10/proposed.html

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#13041; Package emacs. (Wed, 31 Aug 2016 14:47:01 GMT) Full text and rfc822 format available.

Message #248 received at 13041 <at> debbugs.gnu.org (full text, mbox):

From: Michael Albinus <michael.albinus <at> gmx.de>
To: Lewis Perin <perin <at> panix.com>
Cc: 13041 <at> debbugs.gnu.org, perin <at> acm.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Wed, 31 Aug 2016 16:45:44 +0200

Lewis Perin <perin <at> panix.com> writes:

> Emacs search has long been able to toggle between (a) ignoring the
> distinction between upper- and lower-case characters
> (case-fold-search) and (b) searching for only one of the pair.  One
> could say Climacs offers the choice between (a) searching for all
> members of a (2-member) equivalence class and (b) searching for only
> one member.
>
> There are larger equivalence classes of characters with practical use
> which Climacs is currently unaware of: the groups of characters
> consisting of an unadorned (ASCII) character plus all its
> diacritic-adorned versions.  Currently, if I want to search for both
> “apres” and “après”, I need an additive regular expression.  I would
> like to do this as easily as I can search for “apres” and “Apres”.  I
> would be delighted if Emacs implemented the equivalence classes
> spelled out here:
>
>   http://hex-machina.com/scripts/yui/3.3.0pr1/api/unicode-data-accentfold.js.html
>
> I might add that diacritics folding is the default in web search
> engines.  It is also a feature of at least one Web browser in
> searching the text of a displayed page (Chrome.)

Emacs 25.1 has introduced the new user option `search-default-mode'. If
set to `char-fold-to-regexp', the requested feature is available. See
etc/NEWS for further information.

So I propose to close this bug. There was a long discussion in the bug's
log back in 2012, but AFAICS, all proposals have been implemented.

> /Lew

Best regards, Michael.

Reply sent to Michael Albinus <michael.albinus <at> gmx.de>:
You have taken responsibility. (Sat, 03 Sep 2016 07:07:02 GMT) Full text and rfc822 format available.

Notification sent to perin <at> acm.org:
bug acknowledged by developer. (Sat, 03 Sep 2016 07:07:02 GMT) Full text and rfc822 format available.

Message #253 received at 13041-done <at> debbugs.gnu.org (full text, mbox):

From: Michael Albinus <michael.albinus <at> gmx.de>
To: perin <at> acm.org
Cc: 13041-done <at> debbugs.gnu.org
Subject: Re: bug#13041: 24.2; diacritic-fold-search
Date: Sat, 03 Sep 2016 09:06:21 +0200

Version: 25.1

nobody writes:

> This is great news!  I’m afraid I’m not in a position to use 25.1 yet,
> but I look forward to it eagerly.  Closing the bug seems right to me;
> if the new functionality has flaws, then they would be *new* bugs.

So I'm closing the bug.

> Thanks very much for letting me know!
>
> /Lew

Best regards, Michael.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 01 Oct 2016 11:24:03 GMT) Full text and rfc822 format available.

This bug report was last modified 8 years and 267 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
