GNU bug report logs - #20751
wc -m doesn't count UTF-8 characters properly
Reported by: valdis.vitolins <at> odo.lv
Date: Sat, 6 Jun 2015 17:12:03 UTC
Severity: normal
Tags: notabug
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 20751 in the body.
You can then email your comments to 20751 AT debbugs.gnu.org in the normal way.
Report forwarded to help-debbugs <at> gnu.org: bug#20751; Package debbugs.gnu.org. (Sat, 06 Jun 2015 17:12:03 GMT)
Acknowledgement sent to valdis.vitolins <at> odo.lv: New bug report received and forwarded. Copy sent to help-debbugs <at> gnu.org. (Sat, 06 Jun 2015 17:12:03 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Version: wc (GNU coreutils) 8.21
When 'wc -m' is invoked, it should print the character count, but it counts
UTF-8 encoded characters incorrectly. The attached files contain 3, 4 and 6
bytes, but each has only two UTF-8 encoded characters, as you
can see with any modern text editor.
wc -c shows the correct number of bytes:
wc -c *
3 3bytes.txt
4 4bytes.txt
6 6bytes.txt
13 total
But wc -m shows an incorrect number of characters:
wc -m *
3 3bytes.txt
3 4bytes.txt
3 6bytes.txt
9 total
But it should be:
wc -m *
2 3bytes.txt
2 4bytes.txt
2 6bytes.txt
6 total
I am using Lubuntu 14.04.2 LTS (lsb_release reports Ubuntu), x86_64
GNU/Linux 3.13.0-53-generic kernel
P.S.
If the attachments do not make it through the system, you can test by creating
files with the following content:
3bytes.txt: aa
4bytes.txt: aā
6bytes.txt: a𐄈
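For readers without the attachments, the three files can be recreated byte for byte. This is a minimal sketch, assuming a POSIX printf that understands octal escapes. The helper name make_test_files is made up for illustration, and the last character is assumed to be U+10108 (bytes F0 90 84 88). Each file ends with a newline, which is what brings the byte counts to 3, 4 and 6.

```shell
# Recreate the report's test files (hypothetical helper, not part of the report).
# Octal escapes keep the result independent of the terminal encoding.
make_test_files() {
    dir=$1
    printf 'aa\n'                > "$dir/3bytes.txt"  # 'a' 'a' + newline
    printf 'a\304\201\n'         > "$dir/4bytes.txt"  # 'a' + U+0101 (2 bytes) + newline
    printf 'a\360\220\204\210\n' > "$dir/6bytes.txt"  # 'a' + assumed U+10108 (4 bytes) + newline
}
```

Calling make_test_files on an empty directory and then running wc -c on the files there should reproduce the byte counts listed above.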
[3bytes.txt (text/plain, attachment)]
[4bytes.txt (text/plain, attachment)]
[6bytes.txt (text/plain, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#20751; Package coreutils. (Sat, 06 Jun 2015 18:11:02 GMT)
Message #8 received at 20751 <at> debbugs.gnu.org (full text, mbox):
You mailed submit <at> debbugs without specifying a Package:, so your bug
report ended up on the help-debbugs list. I have reassigned it to
coreutils. (Please note there is no "wc" package.)
(My mailer is messing up the UTF-8 characters in your report.
Interested parties can see the original at http://debbugs.gnu.org/20751#5 .)
Attachments at http://debbugs.gnu.org/20751#5
Information forwarded to bug-coreutils <at> gnu.org: bug#20751; Package coreutils. (Sat, 06 Jun 2015 18:50:03 GMT)
Message #11 received at 20751 <at> debbugs.gnu.org (full text, mbox):
Note that UTF-8 characters can be counted by counting the bytes with bit
patterns 0xxxxxxx or 11xxxxxx:
https://en.wikipedia.org/wiki/UTF-8#Description
So the general logic should be that if:
a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
b) the file starts with the UTF-8 byte order mark 0xEF 0xBB 0xBF
https://en.wikipedia.org/wiki/Byte_order_mark
then count the bytes with bit patterns 0xxxxxxx and 11xxxxxx.
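The byte-pattern idea can be tried out with standard tools. The following is only a sketch of the proposal, not a description of how GNU wc works: it deletes the continuation bytes (10xxxxxx, octal 200 to 277) and counts what remains. The function name count_lead_bytes is hypothetical, and the result is only meaningful for well-formed UTF-8 input.

```shell
# Count bytes matching 0xxxxxxx or 11xxxxxx by deleting continuation
# bytes (10xxxxxx, octal 200-277) and counting the rest.
# LC_ALL=C makes tr treat its input as raw bytes.
count_lead_bytes() {
    LC_ALL=C tr -d '\200-\277' | wc -c
}
```

For example, printf 'a\304\201\n' | count_lead_bytes reports 3: the letter a, the two-byte character ā, and the newline.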
Information forwarded to bug-coreutils <at> gnu.org: bug#20751; Package coreutils. (Sat, 06 Jun 2015 21:44:02 GMT)
Message #14 received at 20751 <at> debbugs.gnu.org (full text, mbox):
tag 20751 notabug
close 20751
stop
On 06/06/15 19:49, Valdis Vītoliņš wrote:
>>> Version: wc (GNU coreutils) 8.21
>>>
>>> When 'wc -m' is invoked, it should print the character count, but it counts
>>> UTF-8 encoded characters incorrectly. The attached files contain 3, 4 and 6
>>> bytes, but each has only two UTF-8 encoded characters, as you
>>> can see with any modern text editor.
>>>
>>> wc -c shows the correct number of bytes:
>>> wc -c *
>>> 3 3bytes.txt
>>> 4 4bytes.txt
>>> 6 6bytes.txt
>>> 13 total
>>>
>>> But wc -m shows an incorrect number of characters:
>>> wc -m *
>>> 3 3bytes.txt
>>> 3 4bytes.txt
>>> 3 6bytes.txt
>>> 9 total
>>>
>>> But it should be:
>>> wc -m *
>>> 2 3bytes.txt
>>> 2 4bytes.txt
>>> 2 6bytes.txt
>>> 6 total
I think it's working correctly.
I.E. the \n is included in the count.
thanks,
Pádraig.
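The effect of the trailing newline is easy to check in isolation. A minimal sketch, using pure ASCII input so that the result does not depend on the locale; the variable names are only illustrative:

```shell
# Character count with and without the trailing newline.
with_newline=$(printf 'aa\n' | wc -m)    # 'a', 'a', newline
without_newline=$(printf 'aa' | wc -m)   # 'a', 'a'
echo "with newline: $with_newline, without: $without_newline"
```

The first count is 3 and the second is 2, which matches the difference between the reported and the expected numbers for 3bytes.txt.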
Added tag(s) notabug. Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 06 Jun 2015 21:44:02 GMT)
bug closed, send any further explanations to 20751 <at> debbugs.gnu.org and valdis.vitolins <at> odo.lv
Request was from Pádraig Brady <P <at> draigBrady.com> to control <at> debbugs.gnu.org. (Sat, 06 Jun 2015 21:44:03 GMT)
Information forwarded to bug-coreutils <at> gnu.org: bug#20751; Package coreutils. (Sun, 07 Jun 2015 20:51:02 GMT)
Message #21 received at 20751 <at> debbugs.gnu.org (full text, mbox):
Thanks for the clarification!
I tested it with a Bash script:
chars=$(wc -m < mylog)   # characters, including newlines
lines=$(wc -l < mylog)   # number of newline characters
echo $((chars - lines))  # characters excluding newlines
and got the same number as given by vim:
:%s/.//gn
(That is where my confusion came from.)
Hopefully this bug description will help others.
>
> I think it's working correctly.
> I.E. the \n is included in the count.
>
> thanks,
> Pádraig.
>
Information forwarded to bug-coreutils <at> gnu.org: bug#20751; Package coreutils. (Sun, 07 Jun 2015 21:48:01 GMT)
Message #24 received at 20751 <at> debbugs.gnu.org (full text, mbox):
2015-06-06 21:49:16 +0300, Valdis Vītoliņš:
> Note that UTF-8 characters can be counted by counting the bytes with bit
> patterns 0xxxxxxx or 11xxxxxx:
> https://en.wikipedia.org/wiki/UTF-8#Description
>
> So the general logic should be that if:
> a) the locale setting is UTF-8 (e.g. LANG=xx_XX.UTF-8), or
> b) the file starts with the UTF-8 byte order mark 0xEF 0xBB 0xBF
> https://en.wikipedia.org/wiki/Byte_order_mark
>
> then count the bytes with bit patterns 0xxxxxxx and 11xxxxxx.
[...]
Except that only valid characters should be counted. And there,
the definition of valid character is not always clear.
At least an incorrect UTF-8 encoding can't count as valid
characters.
So
printf '\300' | wc -m
should return 0 as 11000000 alone is not a valid character so we
can't use your algorithm without first verifying the validity of
the input.
Then the UTF-8 encoding of the UTF-16 surrogate code points (0xD800 to
0xDFFF) should probably be excluded as well:
printf '\355\240\200' | wc -m
should return 0, for instance.
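To illustrate the objection, here is what plain byte-pattern counting does with those two inputs. This is a sketch only: count_lead_bytes is a hypothetical helper that drops continuation bytes (octal 200 to 277) and counts the rest, without any validation.

```shell
# Naive counter: every byte that is not a continuation byte counts as one character.
count_lead_bytes() {
    LC_ALL=C tr -d '\200-\277' | wc -c
}

lone_lead=$(printf '\300' | count_lead_bytes)          # 0xC0 with no continuation byte
surrogate=$(printf '\355\240\200' | count_lead_bytes)  # ED A0 80, the encoding of U+D800
echo "lone lead byte: $lone_lead, surrogate: $surrogate"
```

Both inputs are counted as 1 character, whereas the argument here is that they should count as 0, so the input has to be validated before the byte patterns can be trusted.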
And maybe code points above 0x10FFFF, now that Unicode seems to
have given up on ever defining characters above that (probably
because of the UTF-16 limitation).
Now even in the ranges 0x0 to 0xD7FF and 0xE000 to 0x10FFFF, there are
still thousands of code points that are not defined yet in the
latest Unicode version. I suppose we can imagine locale
definitions where each of the known characters is listed and
the rest rejected...
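The range part of such a check is simple to express. A minimal sketch, using the limits from the Unicode Standard (surrogates 0xD800 to 0xDFFF, maximum code point 0x10FFFF); is_scalar_value is a hypothetical helper, and it says nothing about whether a code point is actually assigned:

```shell
# Succeeds if the argument is a Unicode scalar value: a code point in
# 0..0x10FFFF that is not a UTF-16 surrogate (0xD800..0xDFFF).
# Accepts decimal or 0x-prefixed hexadecimal input.
is_scalar_value() {
    cp=$(( $1 ))
    [ "$cp" -ge 0 ] && [ "$cp" -le $(( 0x10FFFF )) ] &&
        { [ "$cp" -lt $(( 0xD800 )) ] || [ "$cp" -gt $(( 0xDFFF )) ]; }
}
```

For instance, is_scalar_value 0x0101 succeeds, while is_scalar_value 0xD800 and is_scalar_value 0x110000 both fail.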
--
Stephane
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 06 Jul 2015 11:24:05 GMT)
This bug report was last modified 9 years and 208 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.