GNU bug report logs - #79824
fmt not correctly process text with UTF-8 characters encoding

Package: coreutils;

Reported by: Воронов Андрей Александрович <a.voronov <at> fintech.ru>

Date: Wed, 12 Nov 2025 17:08:02 UTC

Severity: normal

To reply to this bug, email your comments to 79824 AT debbugs.gnu.org.


Report forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 17:08:02 GMT)

Acknowledgement sent to Воронов Андрей Александрович <a.voronov <at> fintech.ru>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Wed, 12 Nov 2025 17:08:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Воронов Андрей Александрович
 <a.voronov <at> fintech.ru>
To: bug-coreutils <bug-coreutils <at> gnu.org>
Subject: fmt not correctly process text with UTF-8 characters encoding
Date: Wed, 12 Nov 2025 15:14:54 +0000
Good evening,

When I run fmt to reflow text to the default width of 75 columns, it wraps correctly only text made of ASCII Latin letters.
Russian and possibly other non-Latin text (Greek, Cyrillic), whose characters are stored as two bytes each in UTF-8,
is wrapped at about half the intended width.

Test case:

Original text (last 20 lines):
=================================================
$ tail -20  Kolisnichenko_D._Komandnaia_stroka_Linux_2.md
### Пакет coreutils

Программа expand полезна для преобразования табуляций в пробелы.
Например, программу с табуляциями в начале строк (опция `-i`) в файле `hellocool.c`
преобразует табуляции в несколько пробелов и запишет в файл `hc.c`:

    expand -i hellocool.c > hc.c


Печатный текст форматируется под страницу (72 символа в строке) утилитой `fmt`.



## Источники

* [CDRDAO](http://cdrdao.sourceforge.net/) ; Disk-At-Once Recording of Audio and Data CD-Rs/CD-RWs
* [BChunk](https://github.com/hessu/bchunk) ;
* [ccd2iso](https://sourceforge.net/projects/ccd2iso/) ;
===================================================

The same text after running it through fmt with default options:

===================================================
$ fmt  Kolisnichenko_D._Komandnaia_stroka_Linux_2.md
...
### Пакет coreutils

Программа expand полезна для
преобразования табуляций в пробелы.
Например, программу с табуляциями в
начале строк (опция `-i`) в файле `hellocool.c`
преобразует табуляции в несколько
пробелов и запишет в файл `hc.c`:

    expand -i hellocool.c > hc.c


Печатный текст форматируется под
страницу (72 символа в строке) утилитой
`fmt`.



## Источники

* [CDRDAO](http://cdrdao.sourceforge.net/) ; Disk-At-Once Recording of
Audio and Data CD-Rs/CD-RWs * [BChunk](https://github.com/hessu/bchunk)
; * [ccd2iso](https://sourceforge.net/projects/ccd2iso/) ;
===================================================

Sorry for my English.
God bless you.
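
The first break in the fmt output above is consistent with line length being measured in bytes rather than characters. The following Python sketch only illustrates that arithmetic for the first wrapped line; it is not fmt's code, and the helper name utf8_len is introduced here for illustration.

```python
# First line that fmt emitted in the report, and the word it pushed to the next line.
line = "Программа expand полезна для"
next_word = "преобразования"


def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Length of the line if the next word had been appended after one space.
chars_with_next = len(line) + 1 + len(next_word)
bytes_with_next = utf8_len(line) + 1 + utf8_len(next_word)

print("emitted line:   chars =", len(line), " bytes =", utf8_len(line))
print("with next word: chars =", chars_with_next, " bytes =", bytes_with_next)
```

Counted in characters, the line with the next word appended would still fit well within 75 columns; counted in bytes, it would not, which matches where fmt broke the line.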

Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 18:48:02 GMT)

Message #8 received at 79824 <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Воронов Андрей Александрович <a.voronov <at> fintech.ru>, 79824 <at> debbugs.gnu.org
Subject: Re: bug#79824: fmt not correctly process text with UTF-8 characters
 encoding
Date: Wed, 12 Nov 2025 18:47:07 +0000
On 12/11/2025 15:14, Воронов Андрей Александрович wrote:
> Good evening,
> 
> When I run the fmt to make a text with default 75 columns width it properly convert only the Latin letters from ASCII.
> Russian & possible other not English/Latin (Greek, Cyrillic) characters which stored in two bytes in UTF-8 encoding
> are shorter 2 times accordingly.

Yes, this is a known issue which we're gradually getting to.

thanks,
Padraig





Information forwarded to bug-coreutils <at> gnu.org:
bug#79824; Package coreutils. (Wed, 12 Nov 2025 22:30:02 GMT)

Message #11 received at 79824 <at> debbugs.gnu.org:

From: Collin Funk <collin.funk1 <at> gmail.com>
To: Pádraig Brady <P <at> draigBrady.com>
Cc: 79824 <at> debbugs.gnu.org,
 Воронов Андрей Александрович <a.voronov <at> fintech.ru>
Subject: Re: bug#79824: fmt not correctly process text with UTF-8 characters
 encoding
Date: Wed, 12 Nov 2025 14:29:10 -0800
Pádraig Brady <P <at> draigBrady.com> writes:

> On 12/11/2025 15:14, Воронов Андрей Александрович wrote:
>> Good evening,
>> When I run the fmt to make a text with default 75 columns width it
>> properly convert only the Latin letters from ASCII.
>> Russian & possible other not English/Latin (Greek, Cyrillic) characters which stored in two bytes in UTF-8 encoding
>> are shorter 2 times accordingly.
>
> Yes this is a known issue which we're gradually getting to.

I can have a look at it using mbbuf_t in a similar way to 'fold'.

I think 'fmt' is similar, in that it does not matter much if it is a bit
slower. Handling Unicode characters is more important, IMO.

Collin
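
The idea of measuring lines in characters or display columns instead of bytes can be sketched as follows. This is a simplified Python illustration, not the coreutils implementation: it does not use mbbuf_t, it fills lines greedily rather than following fmt's own line-breaking logic, and the names display_width and greedy_wrap are hypothetical. The width estimate (0 for combining marks, 2 for East Asian wide and fullwidth characters, 1 otherwise) is an assumption that only approximates terminal behaviour.

```python
import unicodedata


def display_width(text):
    """Approximate the number of terminal columns the text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        # East Asian wide/fullwidth characters usually take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def greedy_wrap(paragraph, max_width=75):
    """Fill lines word by word, measuring in display columns, not bytes."""
    lines = []
    current = ""
    for word in paragraph.split():
        candidate = word if not current else current + " " + word
        if current and display_width(candidate) > max_width:
            lines.append(current)  # the word does not fit, start a new line
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


# First sentence of the reporter's sample text.
text = "Программа expand полезна для преобразования табуляций в пробелы."
for out in greedy_wrap(text):
    print(out)
```

With widths measured this way, the sample sentence stays on a single line at the default limit of 75, whereas its UTF-8 byte length exceeds 75.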




This bug report was last modified 1 day ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.