GNU bug report logs - #45660
28.0.50; Changed word/whitespace syntax


Package: emacs;

Reported by: Juri Linkov <juri <at> linkov.net>

Date: Mon, 4 Jan 2021 18:09:02 UTC

Severity: normal

Found in version 28.0.50

Done: Eli Zaretskii <eliz <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 45660 in the body.
You can then email your comments to 45660 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Mon, 04 Jan 2021 18:09:02 GMT)

Acknowledgement sent to Juri Linkov <juri <at> linkov.net>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 04 Jan 2021 18:09:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Juri Linkov <juri <at> linkov.net>
To: bug-gnu-emacs <at> gnu.org
Subject: 28.0.50; Changed word/whitespace syntax
Date: Mon, 04 Jan 2021 19:25:23 +0200
Some unidentified recent change during the last week broke the
definition of word syntax and whitespace syntax.  I noticed the
change of behavior in markchars-mode that now disregards the character
"NARROW NO-BREAK SPACE" as the word separator between thousands, i.e.:

In Emacs 27:
(and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
1

In Emacs 28:
(and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
5

Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".

Please close this bug report if this change was intentional:
if it provides more correct definitions, then other code
can be adapted to the change.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Mon, 04 Jan 2021 18:45:02 GMT)

Message #8 received at 45660 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> linkov.net>
Cc: 45660 <at> debbugs.gnu.org
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Mon, 04 Jan 2021 20:44:37 +0200
> From: Juri Linkov <juri <at> linkov.net>
> Date: Mon, 04 Jan 2021 19:25:23 +0200
> 
> Some unidentified recent change during the last week broke the
> definition of word syntax and whitespace syntax.

It's this commit:

  commit 70484f92a1807897dcd16189442a45385c6e7bbb
  Author:     Eli Zaretskii <eliz <at> gnu.org>
  AuthorDate: Sat Jan 2 12:42:16 2021 +0200
  Commit:     Eli Zaretskii <eliz <at> gnu.org>
  CommitDate: Sat Jan 2 12:42:16 2021 +0200

      Fix syntax of symbol and punctuation characters

      * lisp/international/characters.el: Adjust syntax of punctuation
      and symbol characters to follow that of Unicode properties.
      (Bug#44974)


diff --git a/lisp/international/characters.el b/lisp/international/characters.el
index 64460b4..88f2e20 100644
--- a/lisp/international/characters.el
+++ b/lisp/international/characters.el
@@ -317,6 +317,7 @@ ?L
 (modify-syntax-entry #x5be ".") ; MAQAF
 (modify-syntax-entry #x5c0 ".") ; PASEQ
 (modify-syntax-entry #x5c3 ".") ; SOF PASUQ
+(modify-syntax-entry #x5c6 ".") ; NUN HAFUKHA
 (modify-syntax-entry #x5f3 ".") ; GERESH
 (modify-syntax-entry #x5f4 ".") ; GERSHAYIM
 
@@ -521,6 +522,9 @@ ?L
   ;; syntax: ¢£¤¥¨ª¯²³´¶¸¹º.)  There should be a well-defined way of
   ;; relating Unicode categories to Emacs syntax codes.
 
+  ;; FIXME: We should probably just use the Unicode properties to set
+  ;; up the syntax table.
+
   ;; NBSP isn't semantically interchangeable with other whitespace chars,
   ;; so it's more like punctuation.
   (set-case-syntax ?  "." tbl)
@@ -558,7 +562,7 @@ ?L
     (setq c (1+ c)))
 
   ;; Latin Extended Additional
-  (modify-category-entry '(#x1e00 . #x1ef9) ?l)
+  (modify-category-entry '(#x1E00 . #x1EF9) ?l)
 
   ;; Latin Extended-C
   (setq c #x2C60)
@@ -579,13 +583,13 @@ ?L
     (setq c (1+ c)))
 
   ;; Greek
-  (modify-category-entry '(#x0370 . #x03ff) ?g)
+  (modify-category-entry '(#x0370 . #x03FF) ?g)
 
   ;; Armenian
   (setq c #x531)
 
   ;; Greek Extended
-  (modify-category-entry '(#x1f00 . #x1fff) ?g)
+  (modify-category-entry '(#x1F00 . #x1FFF) ?g)
 
   ;; cyrillic
   (modify-category-entry '(#x0400 . #x04FF) ?y)
@@ -605,40 +609,43 @@ ?L
   (while (<= c #x200F)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (setq c #x2010)
-  (while (<= c #x2016)
-    (set-case-syntax c "_" tbl)
+  ;; Fixme: What to do with characters that have Pi and Pf
+  ;; Unicode properties?
+  (while (<= c #x2017)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
   ;; Punctuation syntax for quotation marks (like `)
-  (while (<= c #x201f)
+  (while (<= c #x201F)
     (set-case-syntax  c "." tbl)
     (setq c (1+ c)))
-  ;; Fixme: These aren't all right:
   (while (<= c #x2027)
-    (set-case-syntax c "_" tbl)
+    (set-case-syntax c "." tbl)
     (setq c (1+ c)))
-  (while (<= c #x206F)
+  (setq c #x2030)
+  (while (<= c #x205E)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
+  (let ((chars '(?‹ ?› ?⁄ ?⁒)))
+    (while chars
+      (modify-syntax-entry (car chars) "_")
+      (setq chars (cdr chars))))
 
-  ;; Fixme: The following blocks might be better as symbol rather than
-  ;; punctuation.
   ;; Arrows
   (setq c #x2190)
   (while (<= c #x21FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Mathematical Operators
   (while (<= c #x22FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Miscellaneous Technical
   (while (<= c #x23FF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
   ;; Control Pictures
-  (while (<= c #x243F)
+  (while (<= c #x244F)
     (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
@@ -652,13 +659,13 @@ ?L
   ;; Supplemental Mathematical Operators
   (setq c #x2A00)
   (while (<= c #x2AFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Miscellaneous Symbols and Arrows
   (setq c #x2B00)
   (while (<= c #x2BFF)
-    (set-case-syntax c "." tbl)
+    (set-case-syntax c "_" tbl)
     (setq c (1+ c)))
 
   ;; Coptic
@@ -676,17 +683,34 @@ ?L
 
   ;; Symbols for Legacy Computing
   (setq c #x1FB00)
+  (while (<= c #x1FBCA)
+    (set-case-syntax c "_" tbl)
+    (setq c (1+ c)))
+  ;; FIXME: Should these be digits?
   (while (<= c #x1FBFF)
     (set-case-syntax c "." tbl)
     (setq c (1+ c)))
 
   ;; Fullwidth Latin
-  (setq c #xff21)
-  (while (<= c #xff3a)
+  (setq c #xFF01)
+  (while (<= c #xFF0F)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF04 "_" tbl)
+  (set-case-syntax #xFF0B "_" tbl)
+  (setq c #xFF21)
+  (while (<= c #xFF3A)
     (modify-category-entry c ?l)
     (modify-category-entry (+ c #x20) ?l)
     (setq c (1+ c)))
 
+  ;; Halfwidth Latin
+  (setq c #xFF64)
+  (while (<= c #xFF65)
+    (set-case-syntax c "." tbl)
+    (setq c (1+ c)))
+  (set-case-syntax #xFF61 "." tbl)
+
   ;; Combining diacritics
   (modify-category-entry '(#x300 . #x362) ?^)
   ;; Combining marks


> I noticed the change of behavior in markchars-mode that now
> disregards the character "NARROW NO-BREAK SPACE" as the word
> separator between thousands, i.e.:
> 
> In Emacs 27:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 1
> 
> In Emacs 28:
> (and (string-match "\\<\\w+\\>" "4 096") (match-end 0))
> 5
> 
> Note there is the character "NARROW NO-BREAK SPACE" between "4" and "096".

Previously, many characters, including u+202F, had the punctuation
('.') syntax.  I modified that to be closer to the Unicode
Character Database (UCD), and u+202F is not a punctuation character
according to the UCD.  It has the Zs general category, which means
"space separator", the same as SPC, NBSP, EN SPACE, and others.

Removing u+202F and other similar characters from the "punctuation"
group had the side effect of leaving it at the default 'w' syntax.

Should we make all Zs characters have the ' ' (whitespace) syntax?
That should be easy, but we should try being consistent in this
regard.
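
For readers who want to check this on their own build, here is a minimal
sketch (not taken from the thread) that lists the Unicode general category
next to the Emacs syntax class for a few space characters.  The helper name
`my-char-syntax-info' is hypothetical; `get-char-code-property',
`char-syntax', `with-syntax-table' and `standard-syntax-table' are standard
Emacs functions.  The syntax classes reported depend on the Emacs version,
so no expected output is shown.

```elisp
;; Hypothetical helper, not part of Emacs: pair a character's Unicode
;; general category with its syntax class in the current syntax table.
(defun my-char-syntax-info (char)
  "Return (CHAR GENERAL-CATEGORY SYNTAX-CLASS) for CHAR.
SYNTAX-CLASS is a one-character string such as \".\", \"w\" or \" \"."
  (list char
        (get-char-code-property char 'general-category)
        ;; `char-syntax' returns the class designator as a character.
        (string (char-syntax char))))

;; NO-BREAK SPACE, EN SPACE and NARROW NO-BREAK SPACE, looked up in
;; the standard syntax table rather than a mode-specific one.
(with-syntax-table (standard-syntax-table)
  (mapcar #'my-char-syntax-info '(#xA0 #x2002 #x202F)))
```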




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Mon, 04 Jan 2021 18:55:02 GMT)

Message #11 received at 45660 <at> debbugs.gnu.org:

From: martin rudalics <rudalics <at> gmx.at>
To: Eli Zaretskii <eliz <at> gnu.org>, Juri Linkov <juri <at> linkov.net>
Cc: 45660 <at> debbugs.gnu.org
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Mon, 4 Jan 2021 19:54:33 +0100
> Should we make all Zs characters have the ' ' (whitespace) syntax?
> That should be easy, but we should try being consistent in this
> regard.

What would be the downside of doing that?

martin




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Mon, 04 Jan 2021 19:20:02 GMT)

Message #14 received at 45660 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: 45660 <at> debbugs.gnu.org, juri <at> linkov.net
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Mon, 04 Jan 2021 21:19:30 +0200
> Cc: 45660 <at> debbugs.gnu.org
> From: martin rudalics <rudalics <at> gmx.at>
> Date: Mon, 4 Jan 2021 19:54:33 +0100
> 
>  > Should we make all Zs characters have the ' ' (whitespace) syntax?
>  > That should be easy, but we should try being consistent in this
>  > regard.
> 
> What would be the downside of doing that?

As always, it would change the syntax of at least some of those characters.
What that would cause is anyone's guess.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Tue, 05 Jan 2021 18:32:02 GMT)

Message #17 received at 45660 <at> debbugs.gnu.org:

From: Juri Linkov <juri <at> linkov.net>
To: Eli Zaretskii <eliz <at> gnu.org>
Cc: 45660 <at> debbugs.gnu.org
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Tue, 05 Jan 2021 20:20:44 +0200
> Previously, many characters, including u+202F, had the punctuation
> ('.') syntax.  I modified that to be closer to the Unicode
> Character Database (UCD), and u+202F is not a punctuation character
> according to the UCD.  It has the Zs general category, which means
> "space separator", the same as SPC, NBSP, EN SPACE, and others.

So according to the Unicode standard it should have whitespace syntax?

And indeed, I see no reason for similar characters to have different syntax:

  name: NO-BREAK SPACE
  general-category: Zs (Separator, Space)
  syntax:   	which means: whitespace

  name: NARROW NO-BREAK SPACE
  general-category: Zs (Separator, Space)
  syntax: w 	which means: word

> Removing u+202F and other similar characters from the "punctuation"
> group had the side effect of leaving it at the default 'w' syntax.
>
> Should we make all Zs characters have the ' ' (whitespace) syntax?
> That should be easy, but we should try being consistent in this
> regard.

Should the word characters separated by NO-BREAK SPACE be treated as one word?
If there is no reason to treat space characters as part of words, then all
characters with the Zs general category could have the same whitespace syntax.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Tue, 05 Jan 2021 18:46:02 GMT)

Message #20 received at 45660 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> linkov.net>
Cc: 45660 <at> debbugs.gnu.org
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Tue, 05 Jan 2021 20:45:13 +0200
> From: Juri Linkov <juri <at> linkov.net>
> Cc: 45660 <at> debbugs.gnu.org
> Date: Tue, 05 Jan 2021 20:20:44 +0200
> 
> > Previously, many characters, including u+202F, had the punctuation
> > ('.') syntax.  I modified that to be closer to the Unicode
> > Character Database (UCD), and u+202F is not a punctuation character
> > according to the UCD.  It has the Zs general category, which means
> > "space separator", the same as SPC, NBSP, EN SPACE, and others.
> 
> So according to the Unicode standard it should have whitespace syntax?

Unicode doesn't have the concept of "syntax", it's our invention.  For
some syntactic categories, it makes sense to follow the corresponding
Unicode general category.  Two examples are "punctuation" and
"symbols".

The question whether to treat Zs as whitespace syntax is on the
table.  We previously treated many of such characters as
"punctuation", which doesn't seem right to me.  Which is why I removed
them from the "punctuation" syntax, and you got bitten by the result
(because the default syntax is "word-constituent").

> Should the word characters separated by NO-BREAK SPACE be treated as one word?

That's a good question.  Do we currently treat them as such?  I don't
think so, because NBSP has the '.' syntax, i.e. "punctuation".

> If there is no reason to treat space characters as part of words, then all
> characters with the Zs general category could have the same whitespace syntax.

I tend to agree.  If no objections or new issues arise, I will do that
in a couple of days.

Thanks.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Tue, 05 Jan 2021 18:54:01 GMT)

Message #23 received at 45660 <at> debbugs.gnu.org:

From: martin rudalics <rudalics <at> gmx.at>
To: Juri Linkov <juri <at> linkov.net>, Eli Zaretskii <eliz <at> gnu.org>
Cc: 45660 <at> debbugs.gnu.org
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Tue, 5 Jan 2021 19:53:13 +0100
> Should the word characters separated by NO-BREAK SPACE be treated as one word?

'forward-word' should stop but a line should not be broken there.  So
IIUC this is a question of what's cheaper in terms of implementation.

martin




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#45660; Package emacs. (Tue, 05 Jan 2021 19:27:02 GMT)

Message #26 received at 45660 <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: martin rudalics <rudalics <at> gmx.at>
Cc: 45660 <at> debbugs.gnu.org, juri <at> linkov.net
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Tue, 05 Jan 2021 21:26:12 +0200
> Cc: 45660 <at> debbugs.gnu.org
> From: martin rudalics <rudalics <at> gmx.at>
> Date: Tue, 5 Jan 2021 19:53:13 +0100
> 
>  > Should the word characters separated by NO-BREAK SPACE be treated as one word?
> 
> 'forward-word' should stop but a line should not be broken there.  So
> IIUC this is a question of what's cheaper in terms of implementation.

We don't break lines according to syntax, we break them according to
"line breakable" category and other rules.
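
As a small illustration of the distinction drawn here, syntax classes and
character categories are stored in separate tables and can be queried
independently.  The sketch below assumes that the "line breakable" category
mentioned above is the one with mnemonic `|' in the standard category
table; `my-line-breakable-p' is a hypothetical helper, not an Emacs
function.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun my-line-breakable-p (char)
  "Return t if CHAR is in category `|' (assumed: line breakable)."
  ;; `char-category-set' returns a bool-vector indexed by category
  ;; mnemonic characters.
  (aref (char-category-set char) ?|))

;; Syntax class and category membership are looked up separately.
(list (string (char-syntax #x202F))
      (my-line-breakable-p #x202F))
```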




Reply sent to Eli Zaretskii <eliz <at> gnu.org>:
You have taken responsibility. (Fri, 08 Jan 2021 12:07:02 GMT)

Notification sent to Juri Linkov <juri <at> linkov.net>:
bug acknowledged by developer. (Fri, 08 Jan 2021 12:07:02 GMT)

Message #31 received at 45660-done <at> debbugs.gnu.org:

From: Eli Zaretskii <eliz <at> gnu.org>
To: Juri Linkov <juri <at> linkov.net>
Cc: 45660-done <at> debbugs.gnu.org
Subject: Re: bug#45660: 28.0.50; Changed word/whitespace syntax
Date: Fri, 08 Jan 2021 14:06:11 +0200
> From: Juri Linkov <juri <at> linkov.net>
> Cc: 45660 <at> debbugs.gnu.org
> Date: Tue, 05 Jan 2021 20:20:44 +0200
> 
> > Previously, many characters, including u+202F, had the punctuation
> > ('.') syntax.  I modified that to be closer to the Unicode
> > Character Database (UCD), and u+202F is not a punctuation character
> > according to the UCD.  It has the Zs general category, which means
> > "space separator", the same as SPC, NBSP, EN SPACE, and others.
> 
> So according to the Unicode standard it should have whitespace syntax?
> 
> And indeed, I see no reason for similar characters to have different syntax:
> 
>   name: NO-BREAK SPACE
>   general-category: Zs (Separator, Space)
>   syntax:   	which means: whitespace
> 
>   name: NARROW NO-BREAK SPACE
>   general-category: Zs (Separator, Space)
>   syntax: w 	which means: word
> 
> > Removing u+202F and other similar characters from the "punctuation"
> > group had the side effect of leaving it at the default 'w' syntax.
> >
> > Should we make all Zs characters have the ' ' (whitespace) syntax?
> > That should be easy, but we should try being consistent in this
> > regard.
> 
> Should the word characters separated by NO-BREAK SPACE be treated as one word?
> If there is no reason to treat space characters as part of words, then all
> characters with the Zs general category could have the same whitespace syntax.

No further comments, so I've now made the change on master whereby all
characters with Zs general category are given the whitespace syntax.

I'm therefore closing this bug; please reopen if there are any leftovers
or undesired effects.
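
Code that has to behave the same before and after this change could avoid
depending on the global default by matching under its own syntax table.
What follows is a minimal sketch, assuming the Zs code points listed below
(SPACE, NO-BREAK SPACE, OGHAM SPACE MARK, U+2000..U+200A, NARROW NO-BREAK
SPACE, MEDIUM MATHEMATICAL SPACE, IDEOGRAPHIC SPACE); each one is
re-checked against the Unicode data bundled with the running Emacs.  The
names `my-zs-ranges' and `my-zs-syntax-table' are hypothetical, and this is
not the change that was committed to characters.el.

```elisp
;; Hypothetical names; not the code committed for this bug.
(defconst my-zs-ranges
  '((#x0020 . #x0020) (#x00A0 . #x00A0) (#x1680 . #x1680)
    (#x2000 . #x200A) (#x202F . #x202F) (#x205F . #x205F)
    (#x3000 . #x3000))
  "Code point ranges assumed to have the Zs general category.")

(defvar my-zs-syntax-table
  (let ((table (copy-syntax-table (standard-syntax-table))))
    (dolist (range my-zs-ranges)
      (let ((c (car range)))
        (while (<= c (cdr range))
          ;; Only touch characters the bundled Unicode data marks as Zs.
          (when (eq (get-char-code-property c 'general-category) 'Zs)
            (modify-syntax-entry c " " table))
          (setq c (1+ c)))))
    table)
  "Copy of the standard syntax table with Zs characters as whitespace.")

;; The regexp from the report, matched under the private table.
;; "\u202F" is NARROW NO-BREAK SPACE.
(with-syntax-table my-zs-syntax-table
  (and (string-match "\\<\\w+\\>" "4\u202F096") (match-end 0)))
```

Under this table the match should end at position 1, the same value the
report shows for Emacs 27.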




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Fri, 05 Feb 2021 12:24:06 GMT)

This bug report was last modified 3 years and 81 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.