GNU bug report logs - #79824
fmt not correctly process text with UTF-8 characters encoding


Package: coreutils;

Reported by: Воронов Андрей Александрович <a.voronov <at> fintech.ru>

Date: Wed, 12 Nov 2025 17:08:02 UTC

Severity: normal

To reply to this bug, email your comments to 79824 AT debbugs.gnu.org.



Report forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 17:08:02 GMT)

Acknowledgement sent to Воронов Андрей Александрович <a.voronov <at> fintech.ru>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 12 Nov 2025 17:08:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Воронов Андрей Александрович
 <a.voronov <at> fintech.ru>
To: bug-coreutils <bug-coreutils <at> gnu.org>
Subject: fmt not correctly process text with UTF-8 characters encoding
Date: Wed, 12 Nov 2025 15:14:54 +0000
Good evening,

When I run fmt to reflow text to the default width of 75 columns, it wraps correctly only text made of ASCII Latin letters.
Russian, and possibly other non-Latin (Greek, Cyrillic) characters, which are stored as two bytes each in UTF-8,
produce lines that are about half as wide as they should be.

Test case:

Original text (last 20 lines):
=================================================
$ tail -20  Kolisnichenko_D._Komandnaia_stroka_Linux_2.md
### Пакет coreutils

Программа expand полезна для преобразования табуляций в пробелы.
Например, программу с табуляциями в начале строк (опция `-i`) в файле `hellocool.c`
преобразует табуляции в несколько пробелов и запишет в файл `hc.c`:

    expand -i hellocool.c > hc.c


Печатный текст форматируется под страницу (72 символа в строке) утилитой `fmt`.



## Источники

* [CDRDAO](http://cdrdao.sourceforge.net/) ; Disk-At-Once Recording of Audio and Data CD-Rs/CD-RWs
* [BChunk](https://github.com/hessu/bchunk) ;
* [ccd2iso](https://sourceforge.net/projects/ccd2iso/) ;
===================================================

The same text after running it through fmt with default options:

===================================================
$ fmt  Kolisnichenko_D._Komandnaia_stroka_Linux_2.md
...
### Пакет coreutils

Программа expand полезна для
преобразования табуляций в пробелы.
Например, программу с табуляциями в
начале строк (опция `-i`) в файле `hellocool.c`
преобразует табуляции в несколько
пробелов и запишет в файл `hc.c`:

    expand -i hellocool.c > hc.c


Печатный текст форматируется под
страницу (72 символа в строке) утилитой
`fmt`.



## Источники

* [CDRDAO](http://cdrdao.sourceforge.net/) ; Disk-At-Once Recording of
Audio and Data CD-Rs/CD-RWs * [BChunk](https://github.com/hessu/bchunk)
; * [ccd2iso](https://sourceforge.net/projects/ccd2iso/) ;
===================================================
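The short first line above is consistent with the width being measured in bytes rather than characters. As a rough check (a sketch only: fmt also applies a goal width and its own line-breaking optimization, which are not modeled here), the following compares the character count and the UTF-8 byte count of the first output line joined with the word that was pushed to the next line:

```python
# First output line from the report, and the word that fmt moved to line two.
line = "Программа expand полезна для"
next_word = "преобразования"

joined = line + " " + next_word
chars = len(joined)                   # Unicode code points
octets = len(joined.encode("utf-8"))  # bytes in UTF-8 (2 per Cyrillic letter)

print(chars, octets)  # prints: 43 76
```

At 43 characters the joined text would fit easily in 75 columns, but at 76 bytes it just exceeds a 75-byte limit, which matches where the line was broken.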

Sorry for my English.
God bless you.


Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 18:48:02 GMT)

Message #8 received at 79824 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Воронов Андрей Александрович <a.voronov <at> fintech.ru>, 79824 <at> debbugs.gnu.org
Subject: Re: bug#79824: fmt not correctly process text with UTF-8 characters
 encoding
Date: Wed, 12 Nov 2025 18:47:07 +0000
On 12/11/2025 15:14, Воронов Андрей Александрович wrote:
> Good evening,
> 
> When I run fmt to reflow text to the default width of 75 columns, it wraps correctly only text made of ASCII Latin letters.
> Russian, and possibly other non-Latin (Greek, Cyrillic) characters, which are stored as two bytes each in UTF-8,
> produce lines that are about half as wide as they should be.

Yes this is a known issue which we're gradually getting to.

thanks,
Padraig





Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 22:30:02 GMT)

Message #11 received at 79824 <at> debbugs.gnu.org:

From: Collin Funk <collin.funk1 <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 79824 <at> debbugs.gnu.org,
 Воронов Андрей Александрович <a.voronov <at> fintech.ru>
Subject: Re: bug#79824: fmt not correctly process text with UTF-8 characters
 encoding
Date: Wed, 12 Nov 2025 14:29:10 -0800
Pádraig Brady <P <at> draigBrady.com> writes:

> On 12/11/2025 15:14, Воронов Андрей Александрович wrote:
>> Good evening,
>> When I run fmt to reflow text to the default width of 75 columns, it
>> wraps correctly only text made of ASCII Latin letters.
>> Russian, and possibly other non-Latin (Greek, Cyrillic) characters, which are stored as two bytes each in UTF-8,
>> produce lines that are about half as wide as they should be.
>
> Yes this is a known issue which we're gradually getting to.

I can have a look at it using mbbuf_t in a similar way to 'fold'.

I think 'fmt' is similar, in that it does not matter much if it is a bit
slower. Handling Unicode characters is more important, IMO.
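To illustrate the difference a character-aware width measure makes, here is a minimal sketch in Python. It is not the mbbuf_t approach mentioned above and not fmt's actual algorithm (fmt optimizes line breaks toward a goal width); `greedy_wrap`, `display_width`, and `byte_length` are hypothetical names, and the width rules are a simplification (0 columns for combining marks, 2 for East Asian wide or fullwidth characters, 1 otherwise).

```python
import unicodedata


def byte_length(text):
    """Width as a byte-oriented tool sees it: UTF-8 bytes."""
    return len(text.encode("utf-8"))


def display_width(text):
    """Approximate terminal columns (simplified rules, see lead-in)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks occupy no column of their own
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def greedy_wrap(paragraph, max_width=75, measure=display_width):
    """Greedy word wrap: add words to a line until the measure exceeds max_width."""
    lines, current = [], ""
    for word in paragraph.split():
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


paragraph = "Программа expand полезна для преобразования табуляций в пробелы."
by_bytes = greedy_wrap(paragraph, 75, byte_length)
by_columns = greedy_wrap(paragraph, 75, display_width)
```

With the byte measure, the sentence from the report splits into two short lines; with the column measure it stays on one line, since it is only 64 columns wide.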

Collin




Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Mon, 17 Nov 2025 02:11:02 GMT)

Message #14 received at 79824 <at> debbugs.gnu.org:

From: Martin D Kealey <martin <at> kurahaupo.gen.nz>
To: Pádraig Brady <P <at> draigbrady.com>
Cc: 79824 <at> debbugs.gnu.org,
 Воронов Андрей Александрович
 <a.voronov <at> fintech.ru>
Subject: Re: bug#79824: fmt not correctly process text with UTF-8 characters
 encoding
Date: Mon, 17 Nov 2025 12:09:48 +1000
On Thu, 13 Nov 2025 at 04:47, Pádraig Brady <P <at> draigbrady.com> wrote:

> Yes this is a known issue which we're gradually getting to.
>

Dealing with *just* alphabetic scripts is relatively easy, but in general
the rules for flowing Unicode text into paragraphs are considerably more
complicated than for plain ASCII.

What's the plan for handling double-width, zero-width, and combining
characters?
Soft ("shy") hyphens?
Scripts that don't put spaces between words?
Non-breaking and non-joiner code points, additional line & paragraph
terminators, etc.?

Combining characters follow rather than precede the principal character in
the data stream, so scanning would need to continue even after the line is
apparently "full" to ensure that they're included.
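A small Python sketch of that point, under simplifying assumptions: `take_columns` is a hypothetical helper, every non-combining character is counted as one column, and only characters with a non-zero canonical combining class are treated as combining (full grapheme-cluster rules, as in Unicode UAX #29, are more involved). The scan keeps going after the limit is reached so that trailing combining marks stay attached to their base character.

```python
import unicodedata


def take_columns(text, limit):
    """Return the prefix of text holding `limit` base characters plus any
    combining marks that follow the last one taken."""
    taken = 0
    end = 0
    for i, ch in enumerate(text):
        if unicodedata.combining(ch):
            end = i + 1  # mark belongs to the preceding base character
            continue
        if taken == limit:
            break  # next base character would overflow the line
        taken += 1
        end = i + 1
    return text[:end]


text = "cafe\u0301 au lait"   # "e" followed by U+0301 COMBINING ACUTE ACCENT
naive = text[:4]              # stops as soon as 4 code points are taken
aware = take_columns(text, 4)
```

Here `naive` is "cafe" with the accent stranded at the start of the remainder, while `aware` keeps the accent with its base letter.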

I guess this should be coordinated with bug#79631 (UTF-8 support in the
"cut" utility), at least in terms of documenting whether the count applies
to code points, to composed characters, or to cells (0, 1 or 2 per composed
character). If counting cells, it should document that the number of cells
is rounded down, because double-width characters can't be split.
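The three ways of counting can give different answers for the same string. The Python sketch below shows them side by side; all function names are hypothetical, "composed characters" is approximated as "code points that are not combining marks", and the cell widths use a simplified rule (0 for combining marks, 2 for East Asian wide or fullwidth, 1 otherwise). It is not taken from fmt or cut.

```python
import unicodedata


def cell_width(ch):
    """Simplified terminal cell width of one code point."""
    if unicodedata.combining(ch):
        return 0
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def count_code_points(text):
    return len(text)


def count_composed(text):
    # Approximation: one composed character per non-combining code point.
    return sum(1 for ch in text if not unicodedata.combining(ch))


def count_cells(text):
    return sum(cell_width(ch) for ch in text)


def truncate_to_cells(text, limit):
    """Keep whole characters while they fit; a double-width character that
    would straddle the limit is dropped, so the result may be one cell short."""
    used, kept = 0, []
    for ch in text:
        width = cell_width(ch)
        if used + width > limit:
            break
        kept.append(ch)
        used += width
    return "".join(kept), used


sample = "e\u0301日本"  # "e" + combining acute accent + two wide characters
counts = (count_code_points(sample), count_composed(sample), count_cells(sample))
truncated = truncate_to_cells("日本語", 5)
```

For `sample` the counts are 4 code points, 3 composed characters, and 5 cells. Truncating "日本語" to 5 cells yields "日本" using only 4 cells, which is the rounding-down behaviour described above.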

-Martin

This bug report was last modified 21 days ago.
