GNU bug report logs - #77410
term.el sometimes prints undecoded multibyte UTF-8 chars

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: emacs; Reported by: Stephane Zermatten <szermatt@HIDDEN>; Keywords: patch; dated Mon, 31 Mar 2025 17:46:02 UTC; Maintainer for emacs is bug-gnu-emacs@HIDDEN.

Message received at submit <at> debbugs.gnu.org:


From: Stephane Zermatten <szermatt@HIDDEN>
To: bug-gnu-emacs@HIDDEN
Subject: term.el sometimes prints undecoded multibyte UTF-8 chars
Date: Mon, 31 Mar 2025 17:18:35 +0300
Message-ID: <m2iknpthac.fsf@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="=-=-="
Cc: szermatt@HIDDEN

--=-=-=
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

Tags: patch

When I run a shell under M-x term with a very Unicode-heavy prompt
(fish 3.6 + tide), Unicode characters are sometimes printed
undecoded.

One possible cause is the output being split into chunks in the middle
of a multibyte character; the attached patch fixes that case.

Without the patch, if I type this into M-x term running /usr/bin/bash:

for j in $(seq 0 3); do
  for i in $(seq 0 30); do
    printf '\xf0\x9f'; sleep 0.1; printf '\x98\x80';
  done;
  echo;
done

I get
 \360\237\203\022\360\...

Instead of:
 😀😀😀😀😀😀😀...

With the patch included, I get the correct output.

The issue comes from an incorrect check, (> count partial 0), which is
false when the whole chunk is a partial character, that is, when count
equals partial. It should really be (and (>= count partial) (> partial 0)),
but I simplified that to (> partial 0) in the patch, because the while
loop guarantees (>= count partial).
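
As a rough illustration of the check described above, the following minimal sketch compares the two conditions outside of term.el. The helper names `my-trailing-partial-count` and `my-compare-checks` are hypothetical and exist only for this example. The counting loop mirrors the context lines of the patch, and the sketch assumes that `decode-coding-string` leaves an incomplete trailing UTF-8 sequence as eight-bit characters, which is the behavior the existing code relies on.

```elisp
;; Hypothetical helpers for illustration; they are not part of term.el.

(defun my-trailing-partial-count (decoded)
  "Return the number of trailing eight-bit characters in DECODED."
  (let ((partial 0)
        (count (length decoded)))
    ;; Walk backwards from the end while the characters are undecoded bytes.
    (while (and (< partial count)
                (eq (char-charset (aref decoded (- count 1 partial)))
                    'eight-bit))
      (setq partial (1+ partial)))
    partial))

(defun my-compare-checks (bytes)
  "Decode BYTES as UTF-8 and return (COUNT PARTIAL OLD-CHECK NEW-CHECK)."
  (let* ((decoded (decode-coding-string bytes 'utf-8-unix t))
         (count (length decoded))
         (partial (my-trailing-partial-count decoded)))
    (list count partial
          (> count partial 0)   ; condition before the patch
          (> partial 0))))      ; condition after the patch

;; The chunk holds one complete character plus a dangling first byte:
;; both conditions notice the dangling byte.
(my-compare-checks "\321\210\321")   ; => (2 1 t t)

;; The chunk holds only the first byte of a character: count equals
;; partial, so only the patched condition notices the dangling byte.
(my-compare-checks "\321")           ; => (1 1 nil t)
```

The second call corresponds to the reproduction above, where the two halves of each character arrive in separate writes.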

I rewrote the existing test to cover this case and to try several
different combinations of chunks.
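
To show what "combinations of chunks" can look like, here is a small sketch that enumerates every way to cut a byte string into two non-empty chunks. The helper `my-two-chunk-splits` is hypothetical and is not taken from the patch or from term-tests.el; the patched test hard-codes three of these splits instead of generating them.

```elisp
;; Hypothetical helper for illustration; not part of term-tests.el.
(defun my-two-chunk-splits (bytes)
  "Return every way to split BYTES into two non-empty chunks.
Each element of the result is a list (HEAD TAIL)."
  (let (splits)
    ;; Split points run from 1 to (length BYTES) - 1.
    (dotimes (i (1- (length bytes)))
      (push (list (substring bytes 0 (1+ i))
                  (substring bytes (1+ i)))
            splits))
    (nreverse splits)))

;; The six UTF-8 bytes of three copies of U+0448, as used in the test.
(my-two-chunk-splits "\321\210\321\210\321\210")
;; Five pairs; the first, third and fifth cut in the middle of a character.
```

Each pair has the same shape as the chunk lists that the patched test passes to `term-test-screen-from-input`, so a loop over the result would be one way to cover all split points, assuming that helper accepts any list of chunks.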

I'm still looking into other causes of the issue, but this, at least,
seems like an easy fix.

In GNU Emacs 30.1 (build 2, x86_64-apple-darwin23.6.0, NS appkit-2487.70
 Version 14.7.4 (Build 23H420)) of 2025-03-24 built on boomer.zia
Windowing system distributor 'Apple', version 10.3.2487
System Description:  macOS 14.7.4

Configured using:
 'configure --disable-dependency-tracking --disable-silent-rules
 --enable-locallisppath=/usr/local/share/emacs/site-lisp
 --infodir=/usr/local/Cellar/emacs-plus@30/30.1/share/info/emacs
 --prefix=/usr/local/Cellar/emacs-plus@30/30.1
 --with-native-compilation=aot --with-xml2 --with-gnutls
 --without-compress-install --without-dbus --without-imagemagick
 --with-modules --with-rsvg --with-webp --with-ns
 --disable-ns-self-contained 'CFLAGS=-O2 -DFD_SETSIZE=10000
 -DDARWIN_UNLIMITED_SELECT -I/usr/local/opt/sqlite/include
 -I/usr/local/opt/gcc/include -I/usr/local/opt/libgccjit/include'
 'LDFLAGS=-L/usr/local/opt/sqlite/lib -L/usr/local/lib/gcc/14
 -I/usr/local/opt/gcc/include -I/usr/local/opt/libgccjit/include''


--=-=-=
Content-Type: text/patch; charset=utf-8
Content-Disposition: attachment;
 filename=0001-Fix-issue-with-very-short-multibyte-character-chunk.patch
Content-Transfer-Encoding: quoted-printable

From 2bb6cec8f4f72009bcde1edab367f90ab82e5e2a Mon Sep 17 00:00:00 2001
From: Stephane Zermatten <szermatt@HIDDEN>
Date: Mon, 31 Mar 2025 16:41:08 +0300
Subject: [PATCH] Fix issue with very short multibyte character chunk.

Before this change, a chunk containing only a part
of a multibyte character would be discarded and
displayed undecoded on the terminal.

* lisp/term.el
---
 lisp/term.el            |  2 +-
 test/lisp/term-tests.el | 15 ++++++++-------
 2 files changed, 9 insertions(+), 8 deletions(-)

diff --git a/lisp/term.el b/lisp/term.el
index 862103d88e6..a971300c055 100644
--- a/lisp/term.el
+++ b/lisp/term.el
@@ -3116,7 +3116,7 @@ term-emulate-terminal
                                                           (- count 1 partial)))
                                       'eight-bit))
                         (incf partial))
-                      (when (> count partial 0)
+                      (when (> partial 0)
                         (setq term-terminal-undecoded-bytes
                               (substring decoded-substring (- partial)))
                         (setq decoded-substring
diff --git a/test/lisp/term-tests.el b/test/lisp/term-tests.el
index 5ef8c1174df..aad84e171b2 100644
--- a/test/lisp/term-tests.el
+++ b/test/lisp/term-tests.el
@@ -402,13 +402,14 @@ term-to-margin
 (ert-deftest term-decode-partial () ;; Bug#25288.
   "Test multibyte characters sent into multiple chunks."
   ;; Set `locale-coding-system' so test will be deterministic.
-  (let* ((locale-coding-system 'utf-8-unix)
-         (string (make-string 7 ?ш))
-         (bytes (encode-coding-string string locale-coding-system)))
-    (should (equal string
-                   (term-test-screen-from-input
-                    40 1 `(,(substring bytes 0 (/ (length bytes) 2))
-                           ,(substring bytes (/ (length bytes) 2))))))))
+  (let ((locale-coding-system 'utf-8-unix))
+    (should (equal "шшш" (term-test-screen-from-input
+                          40 1 '("\321" "\210\321\210\321\210"))))
+    (should (equal "шшш" (term-test-screen-from-input
+                          40 1 '("\321\210\321" "\210\321\210"))))
+    (should (equal "шшш" (term-test-screen-from-input
+                          40 1 '("\321\210\321\210\321" "\210"))))))
+
 (ert-deftest term-undecodable-input () ;; Bug#29918.
   "Undecodable bytes should be passed through without error."
   (let* ((locale-coding-system 'utf-8-unix) ; As above.
-- 
2.47.0


--=-=-=--




Acknowledgement sent to Stephane Zermatten <szermatt@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs@HIDDEN. Full text available.
Report forwarded to bug-gnu-emacs@HIDDEN:
bug#77410; Package emacs. Full text available.
Last modified: Mon, 31 Mar 2025 18:00:01 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.