GNU bug report logs - #79824
fmt not correctly process text with UTF-8 characters encoding
To reply to this bug, email your comments to 79824 AT debbugs.gnu.org.
Report forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 17:08:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent to Воронов Андрей Александрович <a.voronov <at> fintech.ru>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 12 Nov 2025 17:08:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Good evening,
When I run fmt to reflow text to the default width of 75 columns, it handles only ASCII Latin letters correctly.
Lines containing Russian and possibly other non-Latin characters (Greek, Cyrillic) that occupy two bytes each in UTF-8
come out about half as long as they should.
Use the test case below.
Original text (last 20 lines):
=================================================
$ tail -20 Kolisnichenko_D._Komandnaia_stroka_Linux_2.md
### Пакет coreutils
Программа expand полезна для преобразования табуляций в пробелы.
Например, программу с табуляциями в начале строк (опция `-i`) в файле `hellocool.c`
преобразует табуляции в несколько пробелов и запишет в файл `hc.c`:
expand -i hellocool.c > hc.c
Печатный текст форматируется под страницу (72 символа в строке) утилитой `fmt`.
## Источники
* [CDRDAO](http://cdrdao.sourceforge.net/) ; Disk-At-Once Recording of Audio and Data CD-Rs/CD-RWs
* [BChunk](https://github.com/hessu/bchunk) ;
* [ccd2iso](https://sourceforge.net/projects/ccd2iso/) ;
===================================================
The same text after running it through the fmt utility with default options:
===================================================
$ fmt Kolisnichenko_D._Komandnaia_stroka_Linux_2.md
...
### Пакет coreutils
Программа expand полезна для
преобразования табуляций в пробелы.
Например, программу с табуляциями в
начале строк (опция `-i`) в файле `hellocool.c`
преобразует табуляции в несколько
пробелов и запишет в файл `hc.c`:
expand -i hellocool.c > hc.c
Печатный текст форматируется под
страницу (72 символа в строке) утилитой
`fmt`.
## Источники
* [CDRDAO](http://cdrdao.sourceforge.net/) ; Disk-At-Once Recording of
Audio and Data CD-Rs/CD-RWs * [BChunk](https://github.com/hessu/bchunk)
; * [ccd2iso](https://sourceforge.net/projects/ccd2iso/) ;
===================================================
Sorry for my English.
God bless you.
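Editorial illustration of the report above: a minimal Python sketch of why counting bytes instead of characters halves the line length for Cyrillic text. The `greedy_fill` helper and the `utf8_bytes` measure are hypothetical names introduced only for this example; they are a simple greedy stand-in and not fmt's actual line-breaking logic.

```python
def greedy_fill(words, width, measure):
    """Pack words into lines greedily; `measure` gives the width charged for a string."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > width:
            lines.append(current)   # candidate would overflow: close the line
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


def utf8_bytes(s):
    """Width as a byte-oriented tool sees it: each Cyrillic letter costs 2."""
    return len(s.encode("utf-8"))


# First paragraph of the reporter's test file.
text = "Программа expand полезна для преобразования табуляций в пробелы."
by_bytes = greedy_fill(text.split(), 75, utf8_bytes)
by_chars = greedy_fill(text.split(), 75, len)

for line in by_bytes:
    print(len(line), utf8_bytes(line), line)
print(by_chars)
```

Under these assumptions, the byte-based measure breaks the sentence after "для", the same place as in the reported fmt output, while the character-based measure keeps the whole 64-character sentence on one line.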
Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 18:48:02 GMT)
Message #8 received at 79824 <at> debbugs.gnu.org (full text, mbox):
On 12/11/2025 15:14, Воронов Андрей Александрович wrote:
> Good evening,
>
> When I run fmt to reflow text to the default width of 75 columns, it handles only ASCII Latin letters correctly.
> Lines containing Russian and possibly other non-Latin characters (Greek, Cyrillic) that occupy two bytes each in UTF-8
> come out about half as long as they should.
Yes this is a known issue which we're gradually getting to.
thanks,
Padraig
Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 22:30:02 GMT)
Message #11 received at 79824 <at> debbugs.gnu.org (full text, mbox):
Pádraig Brady <P <at> draigBrady.com> writes:
> On 12/11/2025 15:14, Воронов Андрей Александрович wrote:
>> Good evening,
>> When I run fmt to reflow text to the default width of 75 columns, it
>> handles only ASCII Latin letters correctly.
>> Lines containing Russian and possibly other non-Latin characters (Greek, Cyrillic) that occupy two bytes each in UTF-8
>> come out about half as long as they should.
>
> Yes this is a known issue which we're gradually getting to.
I can have a look at it using mbbuf_t in a similar way to 'fold'.
I think 'fmt' is similar, in that it does not matter much if it is a bit
slower. Handling Unicode characters is more important, IMO.
Collin
Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Mon, 17 Nov 2025 02:11:02 GMT)
Message #14 received at 79824 <at> debbugs.gnu.org (full text, mbox):
On Thu, 13 Nov 2025 at 04:47, Pádraig Brady <P <at> draigbrady.com> wrote:
> Yes this is a known issue which we're gradually getting to.
>
Dealing with *just* alphabetic scripts is relatively easy, but in general
the rules for flowing Unicode text into paragraphs are considerably more
complicated than for plain ASCII.
What's the plan for handling double-width, zero-width, and combining
characters?
"Shy" hyphens?
Scripts that don't put spaces between words?
Non-breaking and non-joiner codepoints, additional line & paragraph
terminators, etc.
Combining characters follow rather than precede the principal character in
the data stream, so scanning would need to continue even after the line is
apparently "full" to ensure that they're included.
I guess this should be coordinated with bug#79631 (UTF-8 support in the
"cut" utility), at least in terms of documenting whether the count applies
to code-points, to composed characters, or to cells (0, 1 or 2 per composed
character); if counting cells, it should document that the number of cells
would be rounded down because double-width characters can't be split.
-Martin
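Editorial illustration of the three possible units mentioned above (code points, composed characters, cells): a minimal Python sketch using only the standard `unicodedata` module. The function names are hypothetical, and the width rules are assumptions for the example: marks in categories Mn and Me or with a nonzero combining class take 0 cells, East Asian Wide and Fullwidth characters take 2, and everything else takes 1. A real implementation would more likely rely on the C library's `wcwidth`.

```python
import unicodedata


def cell_width(ch):
    """Assumed display width of one code point: 0, 1 or 2 cells."""
    if unicodedata.combining(ch) != 0 or unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def code_points(s):
    return len(s)


def composed_chars(s):
    """Count base characters, ignoring the combining marks attached to them."""
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


def cells(s):
    return sum(cell_width(ch) for ch in s)


def fit(s, limit):
    """Longest prefix of `s` that fits in `limit` cells, and the cells it uses."""
    used, end = 0, 0
    for i, ch in enumerate(s):
        w = cell_width(ch)
        if used + w > limit:
            break                  # a double-width character cannot be split
        used += w
        end = i + 1
    return s[:end], used


for sample in ("e\u0301", "абв", "日本語"):
    print(repr(sample), code_points(sample), composed_chars(sample), cells(sample))

prefix, used = fit("日" * 40, 75)
print(len(prefix), used)
```

With a limit of 75 cells and only double-width characters, 37 characters fit and use 74 cells, which is the rounding down described above.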
This bug report was last modified 21 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.