GNU bug report logs - #34524
wc: word count incorrect when words separated only by no-break space

Package: coreutils;

Reported by: vampyrebat <at> gmail.com

Date: Mon, 18 Feb 2019 08:13:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 34524 in the body.
You can then email your comments to 34524 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Mon, 18 Feb 2019 08:13:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to vampyrebat <at> gmail.com:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Mon, 18 Feb 2019 08:13:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: vampyrebat <at> gmail.com
To: bug-coreutils <at> gnu.org
Subject: wc: word count incorrect when words separated only by no-break space
Date: Mon, 18 Feb 2019 02:12:15 -0600
$ wc --version
wc (GNU coreutils) 8.29
Packaged by Gentoo (8.29-r1 (p1.0))

The man page for wc states: "A word is a... sequence of characters delimited by white space."

But its concept of white space only seems to include ASCII white space.  U+00A0 NO-BREAK SPACE, for instance, is not recognized.

If your terminal displays UTF-8 encoding:

printf 'how are\xC2\xA0you\n'

or if your terminal displays ISO 8859-1 encoding:

printf 'how are\xA0you\n'

the visible output of this printf is "how are you".  In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:

$ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
2
$ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
2




Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Fri, 22 Feb 2019 23:35:02 GMT) Full text and rfc822 format available.

Message #8 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Bob Proulx <bob <at> proulx.com>
To: vampyrebat <at> gmail.com
Cc: 34524 <at> debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Fri, 22 Feb 2019 16:34:04 -0700
vampyrebat <at> gmail.com wrote:
> The man page for wc states: "A word is a... sequence of characters delimited by white space."
> 
> But its concept of white space only seems to include ASCII white
> space.  U+00A0 NO-BREAK SPACE, for instance, is not recognized.

Indeed.  This is because wc, like the other coreutils programs and many
other programs, uses the libc locale definition of the character classes.

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 od -tx1 -c
  0000000  c2  a0  0a
          302 240  \n
  0000003

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
  0
  $ printf '\xC2\xA0 \n' | env LC_ALL=en_US.UTF-8 grep '[[:space:]]' | wc -l
  1

This shows that grep does not recognize \xC2\xA0 as a character in the
class of space characters either.

  $ printf '\xC2\xA0\n' | env LC_ALL=en_US.UTF-8 tr '[[:space:]]' x | od -tx1 -c
  0000000  c2  a0  78
          302 240   x
  0000003

Here the trailing newline, which is in the space class, is translated to
x, while the \xC2\xA0 sequence passes through unchanged.

Since character classes are defined as part of the locale table there
isn't really anything we can do about it on the coreutils wc side of
things.  It would need to be redefined upstream there.

Bob




Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Sun, 24 Feb 2019 05:23:01 GMT) Full text and rfc822 format available.

Message #11 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: vampyrebat <at> gmail.com, 34524 <at> debbugs.gnu.org
Cc: Bruno Haible <bruno <at> clisp.org>
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sat, 23 Feb 2019 21:22:51 -0800
On 18/02/19 00:12, vampyrebat <at> gmail.com wrote:
> $ wc --version
> wc (GNU coreutils) 8.29
> Packaged by Gentoo (8.29-r1 (p1.0))
> 
> The man page for wc states: "A word is a... sequence of characters delimited by white space."
> 
> But its concept of white space only seems to include ASCII white space.  U+00A0 NO-BREAK SPACE, for instance, is not recognized.
> 
> If your terminal displays UTF-8 encoding:
> 
> printf 'how are\xC2\xA0you\n'
> 
> or if your terminal displays ISO 8859-1 encoding:
> 
> printf 'how are\xA0you\n'
> 
> the visible output of this printf is "how are you".  In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:
> 
> $ printf 'how are\xC2\xA0you\n' | LC_ALL=en_US.utf8 wc -w
> 2
> $ printf 'how are\xA0you\n' | LC_ALL=en_US.iso88591 wc -w
> 2

wc does support multi-byte locales, and it uses iswspace()
to test whether a character is a word separator.
On glibc, however, NBSP is not classified as a space.
I wrote a small program to list what glibc locales do classify as space:

0009 HORIZONTAL TAB
000A NEW LINE (not blank)
000B VERTICAL TAB (not blank)
000C FORM FEED (not blank)
000D CARRIAGE RETURN (not blank)
0020 SPACE
1680 OGHAM SPACE MARK
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR (not blank)
2029 PARAGRAPH SEPARATOR (not blank)
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE

In the non breaking space class we have:

00A0 NON BREAKING SPACE
2007 FIGURE SPACE
202F NARROW NO-BREAK SPACE
2060 WORD JOINER
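
A listing like the first one above could be produced along the following
lines.  This is only a sketch, not the program used for this message: the
name list_locale_spaces is made up, and the code assumes that wchar_t
values are Unicode code points, as they are on glibc.  The result depends
entirely on the locale in effect, so the caller would normally run
setlocale (LC_ALL, "") before calling it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper (not from the thread): print one line for every
   code point in [first, last] that the current locale classifies as
   space, and return how many were found.  */
size_t
list_locale_spaces (FILE *out, wint_t first, wint_t last)
{
  assert (out != NULL && first <= last);
  size_t found = 0;
  for (wint_t wc = first; ; wc++)
    {
      if (iswspace (wc))
        {
          /* iswblank() tests the narrower "blank" class.  */
          fprintf (out, "%04lX%s\n", (unsigned long) wc,
                   iswblank (wc) ? "" : " (not blank)");
          found++;
        }
      if (wc == last)
        break;
    }
  return found;
}
```

For example, list_locale_spaces (stdout, 0, 0x3000) would scan the range
covered by the listing above.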

Maybe we should consider these as word separators?
I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'`
into libreoffice writer and it treated all but the last
as a word separator in its word count tool.

There is some discussion of POSIX and unicode classes at:
http://unicode.org/L2/L2003/03139-posix-classes.htm

I guess POSIX is defining lower level functionality
and has to be compat with all uses of iswspace()
which might be used for line reformatting etc.
but wc(1) being higher level, perhaps should consider
the non breaking variants as word separators?
The following change would do that:

diff --git a/src/wc.c b/src/wc.c
index 179abbe..ca990b4 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
   exit (status);
 }

+static int _GL_ATTRIBUTE_PURE
+iswnbspace (wint_t wc)
+{
+  return  wc == L'\u00A0' || wc == L'\u2007' \
+       || wc == L'\u202F' || wc == L'\u2060';
+}
+
 /* FILE is the name of the file (or NULL for standard input)
    associated with the specified counters.  */
 static void
@@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
                           if (width > 0)
                             linepos += width;
                         }
-                      if (iswspace (wide_char))
+                      if (iswspace (wide_char) || iswnbspace (wide_char))
                         goto mb_word_separator;
                       in_word = true;
                     }
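
To see the effect of the extra test in isolation, here is a small
stand-alone sketch of the same idea.  It is not the wc.c code: it works on
an array of already decoded code points, uses a fixed ASCII white space
test instead of the locale's iswspace(), and the names is_ascii_space,
is_nb_space and count_words_cp are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII white space: SPACE plus TAB, LF, VT, FF and CR.  */
bool
is_ascii_space (uint32_t cp)
{
  return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D);
}

/* The four code points named in the patch above.  */
bool
is_nb_space (uint32_t cp)
{
  return cp == 0x00A0 || cp == 0x2007 || cp == 0x202F || cp == 0x2060;
}

/* Count maximal runs of non-separator code points.  When
   NBSP_SEPARATES is true, the no-break variants also end a word.  */
size_t
count_words_cp (const uint32_t *cps, size_t n, bool nbsp_separates)
{
  assert (cps != NULL || n == 0);
  size_t words = 0;
  bool in_word = false;
  for (size_t i = 0; i < n; i++)
    {
      bool sep = is_ascii_space (cps[i])
                 || (nbsp_separates && is_nb_space (cps[i]));
      if (sep)
        in_word = false;
      else if (!in_word)
        {
          in_word = true;
          words++;
        }
    }
  return words;
}
```

With the code points of "how are\u00A0you\n" this gives 2 when
nbsp_separates is false and 3 when it is true.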


Note general word boundary handling is complicated:
https://www.unicode.org/reports/tr29/#Word_Boundaries
Consider this number with figure space:
  $ printf "1\u2007234,56\n"
  1 234,56
That would be considered as one word rather than two.
For more sophisticated contextual processing we would need
to use some of the word break functionality from libunistring.

cheers,
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Sun, 24 Feb 2019 13:59:01 GMT) Full text and rfc822 format available.

Message #14 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Pádraig Brady <P <at> draigbrady.com>,
 bug-libunistring <at> gnu.org
Cc: vampyrebat <at> gmail.com, 34524 <at> debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sun, 24 Feb 2019 14:58:02 +0100
[Ccing bug-libunistring, because this is about Unicode handling in GNU. The
 original thread is in <https://debbugs.gnu.org/cgi/bugreport.cgi?bug=34524>.]

> > The man page for wc states: "A word is a... sequence of characters delimited by white space."
> > 
> > But its concept of white space only seems to include ASCII white space.  U+00A0 NO-BREAK SPACE, for instance, is not recognized.
> > 
> > If your terminal displays UTF-8 encoding:
> > 
> > printf 'how are\xC2\xA0you\n'
> > 
> > or if your terminal displays ISO 8859-1 encoding:
> > 
> > printf 'how are\xA0you\n'
> > 
> > the visible output of this printf is "how are you".  In either case, wc does not recognize the second space as white space, resulting in an incorrect word count:

It is a complicated issue.

I) Relax. Don't be religious about it.
II) POSIX char classes
III) User expectations
IV) The Unicode standard
V) Implementation issues


I) Relax. Don't be religious about it.
======================================

Unicode is an effort to make programs work *reasonably well* with as many
kinds of text as possible.

For example, the Unicode standard, section 23.2
<http://www.unicode.org/versions/Unicode11.0.0/ch23.pdf> page 859
says:
  "The effect of layout controls is specific to particular text processes.
   As much as possible, layout controls are transparent to those text processes
   for which they were not intended."

Or, Unicode TR 29 <https://www.unicode.org/reports/tr29/tr29-33.html> says:
  "The precise determination of text elements may vary according to
   orthographic conventions for a given script or language. The goal of
   matching user perceptions cannot always be met exactly because the text
   alone does not always contain enough information to unambiguously decide
   boundaries. For example, the period (U+002E FULL STOP) is used ambiguously,
   sometimes for end-of-sentence purposes, sometimes for abbreviations, and
   sometimes for numbers. In most cases, however, programmatic text boundaries
   can match user perceptions quite closely, although sometimes the best that
   can be done is not to surprise the user."

Or, there is criticism: <http://jkorpela.fi/unicode/linebr.html>

Therefore, this is a reminder that sometimes no optimal solution can be found.
Relax.


II) POSIX char classes
======================

> There is some discussion of POSIX and unicode classes at:
> http://unicode.org/L2/L2003/03139-posix-classes.htm
> 
> I guess POSIX is defining lower level functionality
> and has to be compat with all uses of iswspace()
> which might be used for line reformatting etc.
> but wc(1) being higher level, perhaps should consider
> the non breaking variants as word separators?

Exactly, that's the right approach. The POSIX char classes are defined in
glibc/localedata/unicode-gen/unicode_utils.py; in this case what matters is
the is_space function, and it has a comment:
    # Don’t make U+00A0 a space. Non-breaking space means that all programs
    # should treat it like a punctuation character, not like a space.
If U+00A0 was made a space, most programs would treat NO-BREAK SPACE like
SPACE, which is against the purpose of NO-BREAK SPACE. So, in general,
users should be aware that NO-BREAK SPACE is not a space. (And likewise,
the SOFT HYPHEN is not to be treated like HYPHEN, because that would be
against the purpose of the SOFT HYPHEN.)

But 'wc' is a specific program, with a specific purpose, and that might
warrant exceptions.

> I pasted `printf '=\u00A0=\u2007=\u202F=\u2060=\n'`
> into libreoffice writer and it treated all but the last
> as a word separator in its word count tool.

This is a good approach, because text processors usually deal with Unicode
in more detail and with more thought than we usually do in the command-line
/ monospaced world.


III) User expectations
======================

On one hand, user expectation that a no-break space separates words is
justified: In "Dr.\u00A0Pinkwart" a user sees two words.

On the other hand, the opposite user expectation is justified as well.
The English sentence "Look: here he is" is translated into French as
"Regarde\u00A0: le voilà". (It is customary to put a space before colon,
question mark, and exclamation mark in French. And to avoid line breaking
at these points, it must be a NO-BREAK space.) When a translator counts
the words they have translated, "Regarde : le voilà" should count as
3 words, not 4 words. OTOH, it could be argued that in this case, the
problem is that a word (":") consisting only of punctuation characters
should not be counted as a word.
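
The last argument, that a token made only of punctuation should not count,
can be sketched as follows.  Everything here is an assumption made for
illustration: the function names are invented, the input is an array of
decoded code points, U+00A0 is treated as a separator, and the "word-like"
test is a crude range check rather than a real Unicode property lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Separators for this sketch: ASCII white space plus U+00A0.  */
bool
sketch_is_separator (uint32_t cp)
{
  return cp == 0x20 || (cp >= 0x09 && cp <= 0x0D) || cp == 0x00A0;
}

/* Crude "letter or digit" test: ASCII alphanumerics, plus everything
   from U+00C0 upwards.  This over-accepts (for example U+00D7) and is
   only meant to keep the sketch short.  */
bool
sketch_is_wordlike (uint32_t cp)
{
  return (cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z')
         || (cp >= 'a' && cp <= 'z') || cp >= 0x00C0;
}

/* Count tokens that contain at least one word-like code point, so a
   token such as ":" is skipped.  */
size_t
count_words_skip_punct (const uint32_t *cps, size_t n)
{
  assert (cps != NULL || n == 0);
  size_t words = 0;
  bool in_token = false;
  bool counted = false;   /* current token already counted? */
  for (size_t i = 0; i < n; i++)
    {
      if (sketch_is_separator (cps[i]))
        {
          in_token = false;
          continue;
        }
      if (!in_token)
        {
          in_token = true;
          counted = false;
        }
      if (!counted && sketch_is_wordlike (cps[i]))
        {
          counted = true;
          words++;
        }
    }
  return words;
}
```

Under these assumptions "Regarde\u00A0: le voilà" counts as 3 words and
"Dr.\u00A0Pinkwart" as 2.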

But again: relax. Translators are being paid according to word counts,
but a word count that is 1 too high or 1 too low is not dramatic.


IV) The Unicode standard
========================

On one hand, the Unicode standard makes it clear in several places that
  1) NO-BREAK SPACE prohibits line breaking,
  2) line breaking and words are related.

See for example, the Unicode standard section 5.12
<http://www.unicode.org/versions/Unicode11.0.0/ch05.pdf>
page 219:
  "Line breaking algorithms generally use state machines for determining
   word breaks."

Or the Unicode standard section 23.2
<http://www.unicode.org/versions/Unicode11.0.0/ch23.pdf> page 859
  "Word Joiner. U+2060 word joiner behaves like U+00A0 no-break space
   in that it indicates the absence of line breaks; ..."

On the other hand, in the same section 23.2 it says
  "Line breaking and word breaking are distinct text processes.
   Although a candidate position for a line break in text often coincides
   with a candidate position for a word break, there are also many
   situations where candidate break positions of different types do not
   coincide."

And in the Unicode TR 29 section 4 "Word boundaries"
<https://www.unicode.org/reports/tr29/tr29-33.html#Word_Boundaries>
it treats NO-BREAK SPACE as a word boundary by default - this can be
verified through the program below - but also says that SPACE and
NO-BREAK SPACE "may be tailored to be in MidNum, depending on the environment".

Here's an example program, that uses GNU libunistring:
==============================================================
#include <stdio.h>
#include <uniwbrk.h>

int main ()
{
  printf ("%d\n", uc_wordbreak_property (0x00A0));
  {
    uint8_t string[] = "Regarde : le voilà";
    char p[19];
    u8_wordbreaks (string, 19, p);
    puts ((char *) string);
    for (int i = 0; i < 19; i++)
      if (p[i])
        printf ("word break at position %d\n", i);
  }
  {
    uint8_t string[] = "Regarde\u00A0: le voilà";
    char p[20];
    u8_wordbreaks (string, 20, p);
    puts ((char *) string);
    for (int i = 0; i < 20; i++)
      if (p[i])
        printf ("word break at position %d\n", i);
  }
}
==============================================================
and its output:
0                                (means: WBP_OTHER)
Regarde : le voilà
word break at position 7
word break at position 8
word break at position 9
word break at position 10
word break at position 12
word break at position 13
Regarde : le voilà
word break at position 7
word break at position 9
word break at position 10
word break at position 11
word break at position 13
word break at position 14


V) Implementation issues
========================

> The following change would do that:
> 
> diff --git a/src/wc.c b/src/wc.c
> index 179abbe..ca990b4 100644
> --- a/src/wc.c
> +++ b/src/wc.c
> @@ -147,6 +147,13 @@ the following order: newline, word, character, byte, maximum line length.\n\
>    exit (status);
>  }
> 
> +static int _GL_ATTRIBUTE_PURE
> +iswnbspace (wint_t wc)
> +{
> +  return  wc == L'\u00A0' || wc == L'\u2007' \
> +       || wc == L'\u202F' || wc == L'\u2060';
> +}
> +
>  /* FILE is the name of the file (or NULL for standard input)
>     associated with the specified counters.  */
>  static void
> @@ -455,7 +462,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus, off_t current_pos)
>                            if (width > 0)
>                              linepos += width;
>                          }
> -                      if (iswspace (wide_char))
> +                      if (iswspace (wide_char) || iswnbspace (wide_char))
>                          goto mb_word_separator;
>                        in_word = true;
>                      }
> 
> ...
> For more sophisticated contextual processing we would need
> to use some of the word break functionality from libunistring.

I don't think you will be able to satisfactorily blend POSIX behaviour
with Unicode behaviour without introducing a command-line option.

On the POSIX side: POSIX says
<https://pubs.opengroup.org/onlinepubs/9699919799/utilities/wc.html>
  "The wc utility shall consider a word to be a non-zero-length string
   of characters delimited by white space."
and
<https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html>
  "space
   Define characters to be classified as white-space characters."
So, when 'wc' operates according to POSIX expectations, it MUST use
  iswspace (wide_char)
not
  iswspace (wide_char) || iswnbspace (wide_char)

On the Unicode side: It is reasonable to see two "words" in
"Regardez\u00A0:", and the GNU libunistring library implements it like
this. It is also reasonable to expect that 'wc' counts words in the Thai
language, which does not use spaces to delimit words. GNU libunistring
may implement this in the future as well.

For this reason, I would find it best to introduce an option '--unicode'
to 'wc', that would produce Unicode compliant results, at the cost of
  - not following POSIX to the letter,
  - being slower.

Bruno





Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Sun, 24 Feb 2019 17:48:01 GMT) Full text and rfc822 format available.

Message #17 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Bruno Haible <bruno <at> clisp.org>, Pádraig Brady
 <P <at> draigbrady.com>, bug-libunistring <at> gnu.org
Cc: vampyrebat <at> gmail.com, 34524 <at> debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sun, 24 Feb 2019 09:47:02 -0800
Bruno Haible wrote:
> I would find it best to introduce an option '--unicode'
> to 'wc', that would produce Unicode compliant results, at the cost of
>    - not following POSIX to the letter,

It'd make sense to have an option. How about a more-general option --words, that 
would let the user define what a word is? This option's operand could use ERE 
syntax, or a shorthand beginning with '+' for common combinations. For example, 
the command:

wc --words='[[:alnum:]]+'

would say that a word consists of the longest contiguous sequence of 
alphanumeric characters. And

wc --words='+unicode'

would use the Unicode definition of word, whatever it is.
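
As a rough illustration of the ERE variant, counting words would amount to
counting non-overlapping matches of the pattern.  The sketch below uses the
POSIX regcomp()/regexec() interface; the function name count_ere_matches is
invented, no --words option exists in wc as of this discussion, and the
matching follows whatever locale the calling program has set.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical sketch of "a word is whatever matches this ERE".
   Return the number of non-overlapping, non-empty matches of ERE in
   TEXT, or -1 if ERE does not compile.  */
long
count_ere_matches (const char *ere, const char *text)
{
  assert (ere != NULL && text != NULL);
  regex_t re;
  if (regcomp (&re, ere, REG_EXTENDED) != 0)
    return -1;

  long count = 0;
  const char *p = text;
  regmatch_t m;
  int flags = 0;
  while (*p != '\0' && regexec (&re, p, 1, &m, flags) == 0)
    {
      if (m.rm_eo > m.rm_so)
        {
          count++;
          p += m.rm_eo;        /* continue after this match */
        }
      else if (p[m.rm_so] == '\0')
        break;                 /* empty match at end of text */
      else
        p += m.rm_so + 1;      /* empty match: step past it */
      flags = REG_NOTBOL;      /* P no longer points at the text start */
    }
  regfree (&re);
  return count;
}
```

For instance, count_ere_matches ("[[:alnum:]]+", "how are you") counts the
three alphanumeric runs.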




Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Mon, 25 Feb 2019 01:08:01 GMT) Full text and rfc822 format available.

Message #20 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>, bug-libunistring <at> gnu.org
Cc: vampyrebat <at> gmail.com, Paul Eggert <eggert <at> CS.UCLA.EDU>,
 34524 <at> debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sun, 24 Feb 2019 17:07:18 -0800
Wow thanks for all that deep info.

So non break space is generally considered a word delimiter,
though there are complications you detail from unicode.

In regard to options for enabling various behaviors for wc(1),
I'm thinking we might keep the strict POSIX isspace() behavior
with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
by default, since that's the most common operation one would want,
and is consistent with libreoffice for example.
I'll adjust the patch along those lines.
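
In outline, the plan might look like the sketch below.  This is not the
attached patch: the function names are invented, the LC_CTYPE=C case is
left out, and treating POSIXLY_CORRECT as "set" whenever the variable
exists, whatever its value, is an assumption of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <wchar.h>
#include <wctype.h>

/* Decide from the value of POSIXLY_CORRECT (NULL when unset) whether
   the no-break variants should separate words.  */
bool
nbsp_separates_words (const char *posixly_correct_value)
{
  return posixly_correct_value == NULL;
}

/* The four no-break code points from the earlier patch, written as
   numbers rather than L'\u....' literals.  */
bool
sketch_iswnbspace (wint_t wc)
{
  return wc == 0x00A0 || wc == 0x2007 || wc == 0x202F || wc == 0x2060;
}

/* Separator test: iswspace() always applies; the no-break variants
   apply only when strict POSIX behaviour was not requested.  */
bool
sketch_is_word_separator (wint_t wc, bool allow_nbsp)
{
  assert (wc != WEOF);
  return iswspace (wc) != 0 || (allow_nbsp && sketch_iswnbspace (wc));
}
```

A caller would evaluate nbsp_separates_words (getenv ("POSIXLY_CORRECT"))
once at startup and pass the result as allow_nbsp for every character.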

I like the --words=unicode idea to give us control over various
more contextual behaviors in future.

thank you!
Pádraig




Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Mon, 25 Feb 2019 03:56:02 GMT) Full text and rfc822 format available.

Message #23 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: vampyrebat <at> gmail.com, Paul Eggert <eggert <at> CS.UCLA.EDU>,
 34524 <at> debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sun, 24 Feb 2019 19:55:39 -0800
On 24/02/19 17:07, Pádraig Brady wrote:
> So non break space is generally considered a word delimiter,
> though there are complications you detail from unicode.
> 
> In regard to options for enabling various behaviors for wc(1),
> I'm thinking we might keep the strict POSIX isspace() behavior
> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
> by default, since that's the most common operation one would want,
> and is consistent with libreoffice for example.
> I'll adjust the patch along those lines.

Full patch attached.

cheers,
Pádraig

[wc-nbsp.patch (text/x-patch, attachment)]

Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Tue, 26 Feb 2019 04:28:02 GMT) Full text and rfc822 format available.

Notification sent to vampyrebat <at> gmail.com:
bug acknowledged by developer. (Tue, 26 Feb 2019 04:28:02 GMT) Full text and rfc822 format available.

Message #28 received at 34524-done <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: vampyrebat <at> gmail.com, 34524-done <at> debbugs.gnu.org,
 Paul Eggert <eggert <at> CS.UCLA.EDU>
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Mon, 25 Feb 2019 20:26:55 -0800
On 24/02/19 19:55, Pádraig Brady wrote:
> On 24/02/19 17:07, Pádraig Brady wrote:
>> So non break space is generally considered a word delimiter,
>> though there are complications you detail from unicode.
>>
>> In regard to options for enabling various behaviors for wc(1),
>> I'm thinking we might keep the strict POSIX isspace() behavior
>> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
>> by default, since that's the most common operation one would want,
>> and is consistent with libreoffice for example.
>> I'll adjust the patch along those lines.
> 
> Full patch attached.

Updated patch attached. I'll push in a few hours.
Marking this bug as done.

cheers,
Pádraig.

[wc-nbsp.patch (text/x-patch, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Sat, 09 Mar 2019 13:53:01 GMT) Full text and rfc822 format available.

Message #31 received at 34524-done <at> debbugs.gnu.org (full text, mbox):

From: Bruno Haible <bruno <at> clisp.org>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: vampyrebat <at> gmail.com, 34524-done <at> debbugs.gnu.org,
 Paul Eggert <eggert <at> cs.ucla.edu>
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sat, 09 Mar 2019 14:52:49 +0100
Hi Pádraig,

> >> In regard to options for enabling various behaviors for wc(1),
> >> I'm thinking we might keep the strict POSIX isspace() behavior
> >> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
> >> by default

Since you plan to add a --words=... option in the future (as suggested
by Paul or me), it would make sense to add this option now, instead
of testing POSIXLY_CORRECT. If you introduce POSIXLY_CORRECT-dependent
behaviour now (and need to keep it for backward compatibility), you'll
have a hard-to-understand interface: What will the following do?

  env POSIXLY_CORRECT=1 wc --words=unicode
  wc --words=unicode

Bruno





Information forwarded to bug-coreutils <at> gnu.org:
bug#34524; Package coreutils. (Sun, 10 Mar 2019 03:32:02 GMT) Full text and rfc822 format available.

Message #34 received at 34524 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: Bruno Haible <bruno <at> clisp.org>
Cc: vampyrebat <at> gmail.com, Paul Eggert <eggert <at> cs.ucla.edu>,
 34524 <at> debbugs.gnu.org
Subject: Re: bug#34524: wc: word count incorrect when words separated only by
 no-break space
Date: Sat, 9 Mar 2019 19:31:43 -0800
On 09/03/19 05:52, Bruno Haible wrote:
> Hi Pádraig,
> 
>>>> In regard to options for enabling various behaviors for wc(1),
>>>> I'm thinking we might keep the strict POSIX isspace() behavior
>>>> with LC_CTYPE=C and/or POSIXLY_CORRECT=1, and use iswnbspace()
>>>> by default
> 
> Since you plan to add a --words=... option in the future (as suggested
> by Paul or me), it would make sense to add this option now, instead
> of testing POSIXLY_CORRECT. If you introduce POSIXLY_CORRECT-dependent
> behaviour now (and need to keep it for backward compatibility), you'll
> have a hard-to-understand interface: What will the following do?
> 
>   env POSIXLY_CORRECT=1 wc --words=unicode
>   wc --words=unicode

Well, until we actually support more contextual
Unicode word separation, the --words option
would be a bit redundant.
Generally no one would need to set POSIXLY_CORRECT
directly for wc; rather, it would be set globally
on a system or in a script to minimize changes.

In the above example, --words=unicode would be
an explicit request to operate as an extension to POSIX,
and so POSIXLY_CORRECT would be ignored there.

cheers,
Pádraig
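
The precedence described in this reply can be sketched as a small Python model. It is illustrative only: --words is a proposed option that the thread does not show as implemented, resolve_word_mode() is a hypothetical name, and the mode labels "posix" and "default" are assumptions made for the example.

```python
import os


def resolve_word_mode(words_option=None, environ=None):
    """Pick the word-splitting mode for one invocation.

    words_option models the argument of a hypothetical --words=... option
    (None when the option is absent). environ is the environment mapping,
    defaulting to the real process environment.
    """
    env = os.environ if environ is None else environ
    if words_option is not None:
        # An explicit option is a request to extend POSIX,
        # so POSIXLY_CORRECT is ignored.
        return words_option
    if "POSIXLY_CORRECT" in env:
        # Strict isspace() behavior (label is an assumption).
        return "posix"
    # No-break spaces also separate words (label is an assumption).
    return "default"
```

Under this model, both of the commands quoted above resolve to the same mode: resolve_word_mode("unicode", {"POSIXLY_CORRECT": "1"}) and resolve_word_mode("unicode", {}) each return "unicode".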





bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 07 Apr 2019 11:24:05 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 14 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.