GNU bug report logs - #39506
mm-shr unibyte assumption

Previous Next

Packages: emacs, gnus;

Reported by: dick <dick.r.chiang <at> gmail.com>

Date: Sat, 8 Feb 2020 00:41:02 UTC

Severity: normal

Tags: notabug

Done: dick <dick.r.chiang <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 39506 in the body.
You can then email your comments to 39506 AT debbugs.gnu.org in the normal way.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#39506; Package emacs. (Sat, 08 Feb 2020 00:41:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to dick <dick.r.chiang <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sat, 08 Feb 2020 00:41:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: dick <dick.r.chiang <at> gmail.com>
To: bug-gnu-emacs <bug-gnu-emacs <at> gnu.org>
Subject: patch
Date: Fri, 07 Feb 2020 19:40:34 -0500
[0001-Question-assumption-that-mm-source-buffer-is-unibyte.patch (text/x-diff, inline)]
From 2674b6a08a90b9a97d3adf2b3e4497b61880e173 Mon Sep 17 00:00:00 2001
From: dickmao <none>
Date: Fri, 7 Feb 2020 19:33:13 -0500
Subject: [PATCH] Question assumption that mm source buffer is unibyte

In my Gnus experience, the source buffer that mm-shr acts upon can be
multibyte.

* lisp/gnus/mm-decode.el (mm-with-part): propagate multibyte/unibyte setting
from source buffer to target.
* test/lisp/gnus/mm-decode-tests.el (test-mm-decode-multibyte): add test
---
 lisp/gnus/mm-decode.el            |  8 ++-----
 test/lisp/gnus/mm-decode-tests.el | 39 +++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 6 deletions(-)
 create mode 100644 test/lisp/gnus/mm-decode-tests.el

diff --git a/lisp/gnus/mm-decode.el b/lisp/gnus/mm-decode.el
index d33bb56dc9..19a18b4f45 100644
--- a/lisp/gnus/mm-decode.el
+++ b/lisp/gnus/mm-decode.el
@@ -1255,16 +1255,12 @@ mm-handle-displayed-p
 
 (defmacro mm-with-part (handle &rest forms)
   "Run FORMS in the temp buffer containing the contents of HANDLE."
-  ;; The handle-buffer's content is a sequence of bytes, not a sequence of
-  ;; chars, so the buffer should be unibyte.  It may happen that the
-  ;; handle-buffer is multibyte for some reason, in which case now is a good
-  ;; time to adjust it, since we know at this point that it should
-  ;; be unibyte.
   `(let* ((handle ,handle))
      (when (and (mm-handle-buffer handle)
 		(buffer-name (mm-handle-buffer handle)))
        (with-temp-buffer
-	 (mm-disable-multibyte)
+         (set-buffer-multibyte (buffer-local-value 'enable-multibyte-characters
+                                                   (mm-handle-buffer handle)))
 	 (insert-buffer-substring (mm-handle-buffer handle))
 	 (mm-decode-content-transfer-encoding
 	  (mm-handle-encoding handle)
diff --git a/test/lisp/gnus/mm-decode-tests.el b/test/lisp/gnus/mm-decode-tests.el
new file mode 100644
index 0000000000..8a9d471b74
--- /dev/null
+++ b/test/lisp/gnus/mm-decode-tests.el
@@ -0,0 +1,39 @@
+;;; mm-decode-tests.el --- tests for gnus/mm-decode.el    -*- lexical-binding:t -*-
+
+;; Copyright (C) 2019-2020 Free Software Foundation, Inc.
+
+;; This file is part of GNU Emacs.
+
+;; GNU Emacs is free software: you can redistribute it and/or modify
+;; it under the terms of the GNU General Public License as published by
+;; the Free Software Foundation, either version 3 of the License, or
+;; (at your option) any later version.
+
+;; GNU Emacs is distributed in the hope that it will be useful,
+;; but WITHOUT ANY WARRANTY; without even the implied warranty of
+;; MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+;; GNU General Public License for more details.
+
+;; You should have received a copy of the GNU General Public License
+;; along with GNU Emacs.  If not, see <https://www.gnu.org/licenses/>.
+
+;;; Code:
+
+(require 'ert)
+(require 'mm-decode)
+
+(ert-deftest test-mm-decode-multibyte ()
+  (should
+   (or (not (fboundp 'libxml-parse-html-region))
+       (with-temp-buffer
+         (set-buffer-multibyte t)
+         (save-excursion
+           (insert "<p>最近也想尝试,但是感觉蛮难的,比如不知道如何在"))
+         (let ((handle (mm-make-handle
+                        (current-buffer)
+                        (rfc2231-parse-qp-string
+                         "Content-Type: text/html; charset=UTF-8"))))
+           (not (zerop (length (with-temp-buffer (mm-shr handle)
+                                                 (buffer-string))))))))))
+
+;;; mm-decode-tests.el ends here
-- 
2.24.1





Changed bug title to 'mm-shr unibyte assumption' from 'patch' Request was from dick <dick.r.chiang <at> gmail.com> to control <at> debbugs.gnu.org. (Sat, 08 Feb 2020 15:03:01 GMT) Full text and rfc822 format available.

bug reassigned from package 'emacs' to 'emacs,gnus'. Request was from dick <dick.r.chiang <at> gmail.com> to control <at> debbugs.gnu.org. (Sat, 08 Feb 2020 15:03:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org, bugs <at> gnus.org:
bug#39506; Package emacs,gnus. (Sat, 08 Feb 2020 15:11:02 GMT) Full text and rfc822 format available.

Message #12 received at 39506 <at> debbugs.gnu.org (full text, mbox):

From: dick.r.chiang <at> gmail.com
To: 39506 <at> debbugs.gnu.org, smonnier <at> iro.umontreal.ca
Subject: Re: bug#39506: patch
Date: Sat, 08 Feb 2020 10:10:38 -0500
In the changelog message of commit d4eb2b7, you claim to "set the buffer to unibyte
before inserting" but it's not clear where.  Then in a later comment, you say
"may happen that the handle-buffer is multibyte... in which case now is a good
time to adjust it, since we know ... it should be unibyte."  The revision before that
preserves the multibyte-p setting, so I do the same in this patch.




Information forwarded to bug-gnu-emacs <at> gnu.org, bugs <at> gnus.org:
bug#39506; Package emacs,gnus. (Sat, 08 Feb 2020 18:33:02 GMT) Full text and rfc822 format available.

Message #15 received at 39506 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: dick.r.chiang <at> gmail.com
Cc: 39506 <at> debbugs.gnu.org
Subject: Re: bug#39506: patch
Date: Sat, 08 Feb 2020 13:32:11 -0500
> In the changelog message of commit d4eb2b7, you claim to "set the
> buffer to unibyte before inserting" but it's not clear where.

The focus of the sentence is on "before": the previous code already set
the buffer to unibyte, but it did it afterwards.

> Then in a later comment, you say "may happen that the handle-buffer is
> multibyte... in which case now is a good time to adjust it, since we
> know ... it should be unibyte."  The revision before that preserves
> the multibyte-p setting, so I do the same in this patch.

As the motivation for your patch, you write:
> In my Gnus experience, the source buffer that mm-shr acts upon can be
> multibyte.

Two questions:
- How does `mm-with-part` relate to `mm-shr`?
- Before deciding whether unibyte or multibyte is the right choice, the
  main question is whether the buffer contains bytes or chars.
  AFAIK `mm-with-part` should only ever handle bytes (otherwise calling
  `mm-decode-content-transfer-encoding` doesn't make much sense).
  Your patch suggests you have a use case where this is not true.
  Can you give more details about this case?
  Maybe a backtrace showing how we got to this particular call to
  `mm-with-part`?


-- Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org, bugs <at> gnus.org:
bug#39506; Package emacs,gnus. (Sat, 08 Feb 2020 19:02:01 GMT) Full text and rfc822 format available.

Message #18 received at 39506 <at> debbugs.gnu.org (full text, mbox):

From: dick.r.chiang <at> gmail.com
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 39506 <at> debbugs.gnu.org
Subject: Re: bug#39506: patch
Date: Sat, 08 Feb 2020 14:01:44 -0500
> The focus of the sentence is on "before": the previous code already set the
> buffer to unibyte, but it did it afterwards.

Ah, so the buffer in question is merely the working temp-buffer, not the
handle-buffer.

Indeed, the previous code did it afterwards, and therein lies my salvation
because insert-buffer-substring of multibytes to a unibyte buffer corrupts.
I think the Miles Bader code replicated the multibyteness of the
handle-buffer to the temp-buffer, called insert-buffer-substring, and then
converted the temp-buffer to unibyte.

> - How does `mm-with-part` relate to `mm-shr`?

mm-shr calls mm-with-part.

> - Before deciding whether unibyte or multibyte is the right choice, the
>   main question is whether the buffer contains bytes or chars.

My buffer contained some Chinese multibytes.  You can see my unit test in the
patch.

>   AFAIK `mm-with-part` should only ever handle bytes (otherwise calling
>   `mm-decode-content-transfer-encoding` doesn't make much sense).

Okay, maybe mm-decode-content-transfer-encoding is a noop when it doesn't make
sense.

I'm not completely on top of all this, but my individual use case rather
prefers the old Miles Bader order of ops.  I can easily work around it if you feel
your 2008 change genuinely fixed something that was broken (my patch is largely
speculative and inconsiderate of broader ramifications).




Information forwarded to bug-gnu-emacs <at> gnu.org, bugs <at> gnus.org:
bug#39506; Package emacs,gnus. (Sat, 08 Feb 2020 19:52:02 GMT) Full text and rfc822 format available.

Message #21 received at 39506 <at> debbugs.gnu.org (full text, mbox):

From: Stefan Monnier <monnier <at> iro.umontreal.ca>
To: dick.r.chiang <at> gmail.com
Cc: 39506 <at> debbugs.gnu.org
Subject: Re: bug#39506: patch
Date: Sat, 08 Feb 2020 14:51:02 -0500
>> - Before deciding whether unibyte or multibyte is the right choice, the
>>   main question is whether the buffer contains bytes or chars.
> My buffer contained some Chinese multibytes.

That suggests it contains characters rather than bytes.  How did
that happen?  Where does this buffer ('s contents) come from?

> You can see my unit test in the patch.

In your unit test, you artificially create a multibyte buffer with
chinese chars, so that doesn't answer my question ;-)

AFAIK `mm-with-part` is designed for MIME parts and MIME parts can only
contain bytes at that point.  Only after we extract them as bytes and
apply `mm-decode-content-transfer-encoding` to it can we consider
decoding those bytes into chars.

So I suspect that the source of your problem is earlier, where some code
incorrectly decodes some content too early.  Hence the need to better
understand where those chinese chars come from.


        Stefan





Information forwarded to bug-gnu-emacs <at> gnu.org, bugs <at> gnus.org:
bug#39506; Package emacs,gnus. (Thu, 20 Feb 2020 13:26:01 GMT) Full text and rfc822 format available.

Message #24 received at 39506 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Stefan Monnier <monnier <at> iro.umontreal.ca>
Cc: 39506 <at> debbugs.gnu.org, dick.r.chiang <at> gmail.com
Subject: Re: bug#39506: patch
Date: Thu, 20 Feb 2020 14:24:55 +0100
Stefan Monnier <monnier <at> iro.umontreal.ca> writes:

> So I suspect that the source of your problem is earlier, where some code
> incorrectly decodes some content too early.  Hence the need to better
> understand where those chinese chars come from.

Yes, I think that's correct -- the mm handle buffers should contain
bytes only, so it would be interesting to know how an mm handle buffer
ended up containing characters.

Dick, do you have a way to reproduce that bug?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Added tag(s) notabug. Request was from dick <dick.r.chiang <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 21 Feb 2020 14:33:01 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 39506 <at> debbugs.gnu.org and dick <dick.r.chiang <at> gmail.com> Request was from dick <dick.r.chiang <at> gmail.com> to control <at> debbugs.gnu.org. (Fri, 21 Feb 2020 14:33:01 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org, bugs <at> gnus.org:
bug#39506; Package emacs,gnus. (Fri, 21 Feb 2020 14:35:01 GMT) Full text and rfc822 format available.

Message #31 received at 39506 <at> debbugs.gnu.org (full text, mbox):

From: dick.r.chiang <at> gmail.com
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 39506 <at> debbugs.gnu.org, Stefan Monnier <monnier <at> iro.umontreal.ca>
Subject: Re: bug#39506: patch
Date: Fri, 21 Feb 2020 09:34:48 -0500
[Message part 1 (text/html, inline)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 21 Mar 2020 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 36 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.