GNU bug report logs - #20751
wc -m doesn't count UTF-8 characters properly


Package: coreutils;

Reported by: valdis.vitolins <at> odo.lv

Date: Sat, 6 Jun 2015 17:12:03 UTC

Severity: normal

Tags: notabug

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.



Report forwarded to help-debbugs <at> gnu.org:
bug#20751; Package debbugs.gnu.org. (Sat, 06 Jun 2015 17:12:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to valdis.vitolins <at> odo.lv:
New bug report received and forwarded. Copy sent to help-debbugs <at> gnu.org. (Sat, 06 Jun 2015 17:12:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Valdis Vītoliņš <valdis.vitolins <at> odo.lv>
To: submit <at> debbugs.gnu.org
Subject: wc -m doesn't count UTF-8 characters properly
Date: Sat, 06 Jun 2015 14:12:29 +0300
Version: wc (GNU coreutils) 8.21

When 'wc -m' is invoked, it should print the character count, but it counts
UTF-8 encoded characters incorrectly. The attached files have 3, 4 and 6
bytes in them, but each has only two UTF-8 encoded characters, which you
can see with any modern text editor.

wc -c shows the correct number of bytes:
wc -c *
 3 3bytes.txt
 4 4bytes.txt
 6 6bytes.txt
13 total

But wc -m shows an incorrect number of characters:
wc -m *
 3 3bytes.txt
 3 4bytes.txt
 3 6bytes.txt
 9 total

But it should be:
wc -m *
 2 3bytes.txt
 2 4bytes.txt
 2 6bytes.txt
 6 total

I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
GNU/Linux, kernel 3.13.0-53-generic.

P.S.
If the attachments do not make it through the system, you can test this by
creating files with the following content:

3bytes.txt: aa
4bytes.txt: aā
6bytes.txt: a𐄈



[3bytes.txt (text/plain, attachment)]
[4bytes.txt (text/plain, attachment)]
[6bytes.txt (text/plain, attachment)]

Information forwarded to bug-coreutils <at> gnu.org:
bug#20751; Package coreutils. (Sat, 06 Jun 2015 18:11:02 GMT) Full text and rfc822 format available.

Message #8 received at 20751 <at> debbugs.gnu.org (full text, mbox):

From: Glenn Morris <rgm <at> gnu.org>
To: valdis.vitolins <at> odo.lv
Cc: 20751 <at> debbugs.gnu.org
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sat, 06 Jun 2015 14:10:23 -0400
You mailed submit <at> debbugs without specifying a Package:, so your bug
report ended up on the help-debbugs list. I have reassigned it to
coreutils. (Please note there is no "wc" package.)

(My mailer is messing up the UTF-8 characters in your report.
Interested parties can see the original at http://debbugs.gnu.org/20751#5 .)


Attachments at http://debbugs.gnu.org/20751#5




Information forwarded to bug-coreutils <at> gnu.org:
bug#20751; Package coreutils. (Sat, 06 Jun 2015 18:50:03 GMT) Full text and rfc822 format available.

Message #11 received at 20751 <at> debbugs.gnu.org (full text, mbox):

From: Valdis Vītoliņš <valdis.vitolins <at> odo.lv>
To: 20751 <at> debbugs.gnu.org
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sat, 06 Jun 2015 21:49:16 +0300
Note that UTF-8 characters can be counted by counting the bytes with bit
patterns 0xxxxxxx or 11xxxxxx (that is, every byte except the 10xxxxxx
continuation bytes):
https://en.wikipedia.org/wiki/UTF-8#Description

So the general logic should be: if
a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
b) the first three bytes of the file are the UTF-8 byte order mark 0xEF 0xBB 0xBF
https://en.wikipedia.org/wiki/Byte_order_mark

then count the bytes with bit patterns 0xxxxxxx and 11xxxxxx.
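
The counting rule above can be sketched in shell as follows. This is only an illustration, not how GNU wc is implemented: the function name count_utf8_lead_bytes is made up for this example, and od and awk are assumed to be available. It counts every byte that is not a 10xxxxxx continuation byte and does not check that the input is valid UTF-8, so stray bytes are counted too.

```shell
# Hypothetical helper (not part of coreutils): count UTF-8 characters on
# stdin by counting every byte that is NOT a continuation byte (10xxxxxx).
count_utf8_lead_bytes() {
  # od prints each input byte as an unsigned decimal number;
  # LC_ALL=C keeps both tools working on raw bytes.
  LC_ALL=C od -An -v -tu1 |
    LC_ALL=C awk '
      {
        for (i = 1; i <= NF; i++) {
          b = $i + 0
          # 0xxxxxxx is 0..127, 11xxxxxx is 192..255
          if (b < 128 || b >= 192) n++
        }
      }
      END { print n + 0 }'
}

# "a", a two-byte character (bytes C4 81) and a newline
printf 'a\304\201\n' | count_utf8_lead_bytes
```

The newline is a 0xxxxxxx byte, so it is counted like any other character.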







Information forwarded to bug-coreutils <at> gnu.org:
bug#20751; Package coreutils. (Sat, 06 Jun 2015 21:44:02 GMT) Full text and rfc822 format available.

Message #14 received at 20751 <at> debbugs.gnu.org (full text, mbox):

From: Pádraig Brady <P <at> draigBrady.com>
To: valdis.vitolins <at> odo.lv, 20751 <at> debbugs.gnu.org
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sat, 06 Jun 2015 22:43:28 +0100
tag 20751 notabug
close 20751
stop

On 06/06/15 19:49, Valdis Vītoliņš wrote:
>>> Version: wc (GNU coreutils) 8.21
>>>
>>> When 'wc -m' is invoked, it should print the character count, but it counts
>>> UTF-8 encoded characters incorrectly. The attached files have 3, 4 and 6
>>> bytes in them, but each has only two UTF-8 encoded characters, which you
>>> can see with any modern text editor.
>>>
>>> wc -c shows the correct number of bytes:
>>> wc -c *
>>>  3 3bytes.txt
>>>  4 4bytes.txt
>>>  6 6bytes.txt
>>> 13 total
>>>
>>> But wc -m shows incorrect number of characters:
>>> wc -m *
>>>  3 3bytes.txt
>>>  3 4bytes.txt
>>>  3 6bytes.txt
>>>  9 total
>>>
>>> But should be:
>>> wc -m *
>>>  2 3bytes.txt
>>>  2 4bytes.txt
>>>  2 6bytes.txt
>>>  6 total

I think it's working correctly.
I.e., the trailing \n in each file is included in the count.

thanks,
Pádraig.
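
The arithmetic behind this can be checked with a small sketch. It assumes that each sample file ends with a newline (inferred from the reported byte counts), and it uses the bytes F0 90 84 88 as a stand-in four-byte character for the third file; the helper names make_samples, byte_count and last_byte_hex are made up for this example. Only byte-level tools are used, so the result does not depend on the locale.

```shell
# Recreate the three sample files; the trailing newline in each one is an
# assumption inferred from the reported byte counts.
make_samples() {
  printf 'aa\n'                > "$1/3bytes.txt"  # 1 + 1 + 1 bytes
  printf 'a\304\201\n'         > "$1/4bytes.txt"  # 1 + 2 + 1 bytes
  printf 'a\360\220\204\210\n' > "$1/6bytes.txt"  # 1 + 4 + 1 bytes
}

# Byte count of a file, with any padding from wc removed.
byte_count() {
  wc -c < "$1" | tr -d ' '
}

# Last byte of a file in hex; 0a means the file ends with a newline.
last_byte_hex() {
  tail -c 1 "$1" | od -An -tx1 | tr -d ' \n'
}

dir=$(mktemp -d)
make_samples "$dir"
for f in "$dir"/*.txt; do
  printf '%s: %s bytes, last byte 0x%s\n' \
    "${f##*/}" "$(byte_count "$f")" "$(last_byte_hex "$f")"
done
rm -r "$dir"
```

Each file holds two visible characters plus one newline, which gives the three characters per file that wc -m reported.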





Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 06 Jun 2015 21:44:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 20751 <at> debbugs.gnu.org and valdis.vitolins <at> odo.lv Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 06 Jun 2015 21:44:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-coreutils <at> gnu.org:
bug#20751; Package coreutils. (Sun, 07 Jun 2015 20:51:02 GMT) Full text and rfc822 format available.

Message #21 received at 20751 <at> debbugs.gnu.org (full text, mbox):

From: Valdis Vītoliņš <valdis.vitolins <at> odo.lv>
To: 20751 <at> debbugs.gnu.org
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sun, 07 Jun 2015 23:50:27 +0300
Thanks for the clarification!

I tested it with a Bash script:
# read from stdin so that wc prints only the number
chars=$(wc -m < mylog)
lines=$(wc -l < mylog)
# subtract one newline per line
echo $((chars - lines))

and got the same number as given by vim
:%s/.//gn

(Which is where my confusion came from.)

Hopefully this bug description will help others.







Information forwarded to bug-coreutils <at> gnu.org:
bug#20751; Package coreutils. (Sun, 07 Jun 2015 21:48:01 GMT) Full text and rfc822 format available.

Message #24 received at 20751 <at> debbugs.gnu.org (full text, mbox):

From: Stephane Chazelas <stephane.chazelas <at> gmail.com>
To: Valdis Vītoliņš <valdis.vitolins <at> odo.lv>
Cc: 20751 <at> debbugs.gnu.org
Subject: Re: bug#20751: wc -m doesn't count UTF-8 characters properly
Date: Sun, 7 Jun 2015 22:47:29 +0100
2015-06-06 21:49:16 +0300, Valdis Vītoliņš:
> Note, that UTF-8 characters can be counted by counting bytes with bit
> patterns 0xxxxxxx or 11xxxxxx:
> https://en.wikipedia.org/wiki/UTF-8#Description
> 
> So, general logic should be, that, if:
> a) locale setting is utf-8 (e.g. LANG=xx_XX.UTF-8), or
> b) first three bytes of file are 0xEF 0xBB 0xBF
> https://en.wikipedia.org/wiki/Byte_order_mark
> 
> then count bytes with bits 0xxxxxxx and 11xxxxxx.
[...]


Except that only valid characters should be counted. And there,
the definition of a valid character is not always clear.

At the very least, an incorrect UTF-8 encoding can't count as a valid
character.

So

printf '\300' | wc -m

should return 0, as 11000000 alone is not a valid character, so we
can't use your algorithm without first verifying the validity of
the input.

Then the UTF-8 encoding of the UTF-16 surrogate code points (0xD800 to
0xDFFF) should probably be excluded as well:

printf '\355\240\200' | wc -m

should return 0, for instance.

And maybe code points above 0x10FFFF too, since Unicode seems to
have given up on ever defining characters above that (probably
because of the UTF-16 limitation).

Now even in the ranges 0 -> 0xD7FF and 0xE000 -> 0x10FFFF, there are
still thousands of code points that are not defined yet in the
latest Unicode version. I suppose we can imagine locale
definitions where each of the known characters is listed and
the rest rejected...

-- 
Stephane
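
The first three restrictions above (no malformed sequences, no surrogates, nothing above 0x10FFFF) can be sketched as a byte-level state machine that follows the well-formed UTF-8 byte ranges from the Unicode Standard. This is a sketch under assumptions, not what wc does: the function name count_valid_utf8 is made up, and silently skipping ill-formed bytes is just one possible policy. It does not attempt the last point, rejecting unassigned code points.

```shell
# Hypothetical helper: count only well-formed UTF-8 sequences on stdin.
# Ill-formed bytes, encoded surrogates and values above 0x10FFFF add nothing.
count_valid_utf8() {
  LC_ALL=C od -An -v -tu1 |
    LC_ALL=C awk '
      # Treat b as the first byte of a sequence and set how many
      # continuation bytes must follow and which range the next one must be in.
      function start(b) {
        need = 0
        if (b < 128) { n++; return }                                 # ASCII
        if (b >= 194 && b <= 223) { need = 1; lo = 128; hi = 191; return }
        if (b == 224) { need = 2; lo = 160; hi = 191; return }       # no overlongs
        if (b == 237) { need = 2; lo = 128; hi = 159; return }       # no surrogates
        if (b >= 225 && b <= 239) { need = 2; lo = 128; hi = 191; return }
        if (b == 240) { need = 3; lo = 144; hi = 191; return }       # no overlongs
        if (b >= 241 && b <= 243) { need = 3; lo = 128; hi = 191; return }
        if (b == 244) { need = 3; lo = 128; hi = 143; return }       # up to 0x10FFFF
        # any other byte cannot start a sequence and is skipped
      }
      {
        for (i = 1; i <= NF; i++) {
          b = $i + 0
          if (need > 0 && b >= lo && b <= hi) {
            lo = 128; hi = 191          # later continuation bytes: 10xxxxxx
            if (--need == 0) n++        # sequence complete: one character
          } else {
            start(b)                    # broken or new sequence: restart here
          }
        }
      }
      END { print n + 0 }'
}

printf '\300' | count_valid_utf8          # lone 11000000 byte
printf '\355\240\200' | count_valid_utf8  # UTF-8 encoding of U+D800
```

Both calls print 0, while well-formed input such as printf 'a\304\201\n' gives 3.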




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 06 Jul 2015 11:24:05 GMT) Full text and rfc822 format available.
