GNU bug report logs - #48192
forward-word and friends have inconsistent behaviour with Unicode and ASCII punctuation

Package: emacs;

Reported by: Daphne Preston-Kendal <dpk <at> nonceword.org>

Date: Mon, 3 May 2021 15:02:02 UTC

Severity: normal

Tags: moreinfo

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 48192 in the body.
You can then email your comments to 48192 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#48192; Package emacs. (Mon, 03 May 2021 15:02:02 GMT)

Acknowledgement sent to Daphne Preston-Kendal <dpk <at> nonceword.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Mon, 03 May 2021 15:02:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Daphne Preston-Kendal <dpk <at> nonceword.org>
To: bug-gnu-emacs <at> gnu.org
Subject: forward-word and friends have inconsistent behaviour with Unicode and
 ASCII punctuation
Date: Mon, 3 May 2021 16:37:51 +0200

forward-word, backward-word etc. have inconsistent behaviour when
applied to text containing ASCII straight quotation marks vs. Unicode
quotation marks. The word
    don't
with a straight quote (U+0027) counts as a single word, and forward-word
and backward-word will move over the whole thing. Meanwhile,
    don’t
with a curly quote (U+2019) counts as two words, and the cursor will
stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)

This also means count-words/count-words-region give surprising results
when applied to text containing Unicode curly apostrophes, since they
work by counting the number of times the cursor can move
forward-word-strictly between given start and end points. (Since it uses
forward-word-strictly and not forward-word, the problem can’t be solved
by customizing find-word-boundary-function-table.)

The Right Thing in my view would be for Emacs to use the Unicode TR29
word boundary rules to work out where to put the cursor when
forward-word and backward-word are invoked. They handle punctuation
characters correctly, and the rules are not too complicated.
<http://www.unicode.org/reports/tr29/#Word_Boundaries>
However, how this would interact with the existing
find-word-boundary-function-table customization method, I don’t know.
CLDR makes customizations of the rules for specific (human) languages;
perhaps they could be ported into Emacs somehow.

As a temporary workaround to get correct-ish word counts for my
documents, I’ve hacked up a function that uses how-many instead of
forward-word to count the number of words in a region.
<https://gitlab.com/dpk/dotfiles/-/blob/master/.emacs.d/lisp/wc-mode.el>
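
The workaround described above could look roughly like the following
sketch. It is not the code behind the link: the function name and the
regexp are assumptions made for illustration. It relies only on
how-many, which returns the number of regexp matches in a region, and
treats a straight (U+0027) or curly (U+2019) apostrophe between runs of
letters or digits as part of the word.

```elisp
;; Hypothetical helper; the name and the regexp are assumptions made
;; for illustration, not the code from the link above.
(defun my/count-words-apostrophe-aware (start end)
  "Count words between START and END, keeping apostrophes inside words.
A straight (U+0027) or curly (U+2019) apostrophe between two runs of
letters or digits does not split the word."
  (interactive "r")
  (let ((count (how-many "[[:alnum:]]+\\(?:['\u2019][[:alnum:]]+\\)*"
                         start end)))
    ;; Only report in the echo area when invoked as a command.
    (when (called-interactively-p 'interactive)
      (message "Region has %d word%s" count (if (= count 1) "" "s")))
    count))
```

Because the match ignores syntax tables entirely, the result should not
depend on the major mode. The same rule also counts forms such as
"l'allemand" as one word, so it is no more than a language-specific
approximation.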





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48192; Package emacs. (Mon, 03 May 2021 15:50:02 GMT)

Message #8 received at 48192 <at> debbugs.gnu.org:

From: Andreas Schwab <schwab <at> linux-m68k.org>
To: Daphne Preston-Kendal <dpk <at> nonceword.org>
Cc: 48192 <at> debbugs.gnu.org
Subject: Re: bug#48192: forward-word and friends have inconsistent behaviour
 with Unicode and ASCII punctuation
Date: Mon, 03 May 2021 17:49:32 +0200

On May 03 2021, Daphne Preston-Kendal wrote:

> forward-word, backward-word etc. have inconsistent behaviour when
> applied to text containing ASCII straight quotation marks vs. Unicode
> quotation marks. The word
>     don't
> with a straight quote (U+0027) counts as a single word, and forward-word
> and backward-word will move over the whole thing. Meanwhile,
>     don’t
> with a curly quote (U+2019) counts as two words, and the cursor will
> stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)

Looks like you have customized the syntax table, because by default,
both ' and ’ have punctuation syntax, thus are not part of a word.  But
text-mode uses a different syntax table, where ' has word syntax.
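
One way to check this in a given session is to ask for the syntax class
of both characters under each mode. The sketch below uses only
char-syntax and with-temp-buffer; the function name is hypothetical.
In the returned strings "." stands for punctuation and "w" for word
constituent. No particular result is asserted here, since a customized
syntax table or mode hook will change it.

```elisp
;; Report the syntax class of the straight and curly apostrophes in a
;; given major mode.  `my/apostrophe-syntax' is a hypothetical name.
(defun my/apostrophe-syntax (mode)
  "Return syntax class strings for U+0027 and U+2019 under MODE."
  (with-temp-buffer
    (funcall mode)
    (list (string (char-syntax ?\'))
          (string (char-syntax ?\u2019)))))

;; Evaluate each form (for example with C-x C-e) to compare the modes.
(my/apostrophe-syntax #'fundamental-mode)
(my/apostrophe-syntax #'text-mode)
```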

Andreas.

-- 
Andreas Schwab, schwab <at> linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48192; Package emacs. (Mon, 03 May 2021 15:51:01 GMT)

Message #11 received at submit <at> debbugs.gnu.org:

From: Daphne Preston-Kendal <dpk <at> nonceword.org>
To: bug-gnu-emacs <at> gnu.org
Subject: Re: forward-word and friends have inconsistent behaviour with Unicode
 and ASCII punctuation
Date: Mon, 3 May 2021 17:26:44 +0200

I should note that I just tried to reproduce this bug in a different
buffer in emacs -q, and the behaviour this time was consistently the one
I describe for the curly quotes below; then when I restarted again
without -q, it was behaving like that consistently in all buffers again.
Pfui. (Sorry, I should have documented my environment more thoroughly
before submitting this bug report. I don’t know any more what was
causing the inconsistency.)

However, the behaviour of considering "don't", "can't" etc. and almost
any English possessive as two words for the purposes of count-words etc.
is undoubtedly wrong for most users in my book. That said, I appreciate
there are cross-linguistic issues here, and French speakers would be
equally annoyed if "l'allemand" started to count as one word, not two.
(Thanks to John Cowan for this example.)

On 3 May 2021, at 16:37, Daphne Preston-Kendal <dpk <at> nonceword.org> wrote:

> forward-word, backward-word etc. have inconsistent behaviour when
> applied to text containing ASCII straight quotation marks vs. Unicode
> quotation marks. The word
>    don't
> with a straight quote (U+0027) counts as a single word, and forward-word
> and backward-word will move over the whole thing. Meanwhile,
>    don’t
> with a curly quote (U+2019) counts as two words, and the cursor will
> stop at ‘don’ and ‘t’ separately. (Fundamental mode, Emacs 27.2.)
> 
> This also means count-words/count-words-region give surprising results
> when applied to text containing Unicode curly apostrophes, since they
> work by counting the number of times the cursor can move
> forward-word-strictly between given start and end points. (Since it uses
> forward-word-strictly and not forward-word, the problem can’t be solved
> by customizing find-word-boundary-function-table.)
> 
> The Right Thing in my view would be for Emacs to use the Unicode TR29
> word boundary rules to work out where to put the cursor when
> forward-word and backward-word are invoked. They handle punctuation
> characters correctly, and the rules are not too complicated.
> <http://www.unicode.org/reports/tr29/#Word_Boundaries>
> However, how this would interact with the existing
> find-word-boundary-function-table customization method, I don’t know.
> CLDR makes customizations of the rules for specific (human) languages;
> perhaps they could be ported into Emacs somehow.
> 
> As a temporary workaround to get correct-ish word counts for my
> documents, I’ve hacked up a function that uses how-many instead of
> forward-word to count the number of words in a region.
> <https://gitlab.com/dpk/dotfiles/-/blob/master/.emacs.d/lisp/wc-mode.el>





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48192; Package emacs. (Fri, 01 Jul 2022 11:35:02 GMT)

Message #14 received at 48192 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Daphne Preston-Kendal <dpk <at> nonceword.org>
Cc: 48192 <at> debbugs.gnu.org
Subject: Re: bug#48192: forward-word and friends have inconsistent behaviour
 with Unicode and ASCII punctuation
Date: Fri, 01 Jul 2022 13:34:36 +0200

Daphne Preston-Kendal <dpk <at> nonceword.org> writes:

> However, the behaviour of considering "don't", "can't" etc. and almost
> any English possessive as two words for the purposes of count-words etc.
> is undoubtedly wrong for most users in my book.

(I'm going through old bug reports that unfortunately weren't resolved
at the time.)

I think it would make sense to make text-mode give ’ (RIGHT SINGLE
QUOTATION MARK) a word constituent syntax, because many people use that
character interchangeably with ' (APOSTROPHE).

But that's not really the intention behind that character.  RIGHT SINGLE
QUOTATION MARK is to allow quoting like ‘this’ -- i.e., the ’ is not
meant to be used inside words.

So changing the syntax here would be controversial since it's "wrong" to
use the ’ character instead of APOSTROPHE, even though it's common.

Does anybody have an opinion here?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
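
For users who want this behaviour locally, whatever Emacs does by
default, a personal init-file setting along the following lines might
be enough. This is a sketch of a user customization, not a proposed
patch. The "w p" descriptor is an assumption intended to mirror the
entry text-mode is believed to use for the straight apostrophe; plain
"w" should also work.

```elisp
;; Give U+2019 RIGHT SINGLE QUOTATION MARK word syntax in text-mode and
;; in modes whose syntax tables inherit from it.
;; Assumption: "w p" (word constituent plus the `p' prefix flag) mirrors
;; the entry text-mode uses for the straight apostrophe.
(modify-syntax-entry ?\u2019 "w p" text-mode-syntax-table)
```

The trade-off is the one raised above: a closing quote, as in ‘this’,
would then also be treated as part of the preceding word.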




Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 01 Jul 2022 11:35:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48192; Package emacs. (Fri, 01 Jul 2022 15:15:01 GMT)

Message #19 received at 48192 <at> debbugs.gnu.org:

From: Robert Pluim <rpluim <at> gmail.com>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 48192 <at> debbugs.gnu.org, Daphne Preston-Kendal <dpk <at> nonceword.org>
Subject: Re: bug#48192: forward-word and friends have inconsistent behaviour
 with Unicode and ASCII punctuation
Date: Fri, 01 Jul 2022 17:14:37 +0200

>>>>> On Fri, 01 Jul 2022 13:34:36 +0200, Lars Ingebrigtsen <larsi <at> gnus.org> said:

    Lars> Daphne Preston-Kendal <dpk <at> nonceword.org> writes:
    >> However, the behaviour of considering "don't", "can't" etc. and almost
    >> any English possessive as two words for the purposes of count-words etc.
    >> is undoubtedly wrong for most users in my book.

    Lars> (I'm going through old bug reports that unfortunately weren't resolved
    Lars> at the time.)

    Lars> I think it would make sense to make text-mode give ’ (RIGHT SINGLE
    Lars> QUOTATION MARK) a word constituent syntax, because many people use that
    Lars> character interchangeably with ' (APOSTROPHE).

    Lars> But that's not really the intention behind that character.  RIGHT SINGLE
    Lars> QUOTATION MARK is to allow quoting like ‘this’ -- i.e., the ’ is not
    Lars> meant to be used inside words.

    Lars> So changing the syntax here would be controversial since it's "wrong" to
    Lars> use the ’ character instead of APOSTROPHE, even though it's common.

    Lars> Does anybody have an opinion here?

What people should do is use U+02BC, MODIFIER LETTER APOSTROPHE, since
that has word-constituent syntax already.

(what, the worldʼs not going to change to suit me, you say? 😼)

Robert




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48192; Package emacs. (Sat, 30 Jul 2022 14:08:02 GMT)

Message #22 received at 48192 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Robert Pluim <rpluim <at> gmail.com>
Cc: Daphne Preston-Kendal <dpk <at> nonceword.org>, 48192 <at> debbugs.gnu.org
Subject: Re: bug#48192: forward-word and friends have inconsistent behaviour
 with Unicode and ASCII punctuation
Date: Sat, 30 Jul 2022 16:07:13 +0200

Robert Pluim <rpluim <at> gmail.com> writes:

> What people should do is use U+02BC, MODIFIER LETTER APOSTROPHE, since
> that has word-constituent syntax already.
>
> (what, the worldʼs not going to change to suit me, you say? 😼)

😀

In any case, I think the conclusion here is that we don't want to change
anything here, and I'm therefore closing this bug report.





bug closed, send any further explanations to 48192 <at> debbugs.gnu.org and Daphne Preston-Kendal <dpk <at> nonceword.org> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sat, 30 Jul 2022 14:08:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 28 Aug 2022 11:24:08 GMT)
