GNU bug report logs - #50391
28.0.50; json-read non-ascii data results in malformed string

Package: emacs;

Reported by: Zhiwei Chen <condy0919 <at> gmail.com>

Date: Sun, 5 Sep 2021 04:21:02 UTC

Severity: normal

Tags: notabug

Found in version 28.0.50

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 50391 in the body.
You can then email your comments to 50391 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#50391; Package emacs. (Sun, 05 Sep 2021 04:21:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Zhiwei Chen <condy0919 <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 05 Sep 2021 04:21:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Zhiwei Chen <condy0919 <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: 28.0.50; json-read non-ascii data results in malformed string
Date: Sun, 05 Sep 2021 12:19:56 +0800
When fetching JSON from Youdao (a dictionary service in China):

#+begin_src elisp
(url-retrieve
  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
  (lambda (_status)
    (goto-char (1+ url-http-end-of-headers))
    (write-region (point) (point-max) "/tmp/acc1.json")))
#+end_src

Then C-x C-f "/tmp/acc1.json", the file is correctly encoded without 

But if I call `json-read' and then `json-insert', the file is malformed,
even though uchardet reports the encoding of the file as utf-8.

#+begin_src elisp
(url-retrieve
  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
  (lambda (_status)
    (goto-char (1+ url-http-end-of-headers))
    (let ((j (json-read)))
    (with-temp-buffer
      (json-insert j)
      (write-region (point-min) (point-max) "/tmp/acc2.json")))))
#+end_src

#+begin_src shell
diff -u <(hexdump -C /tmp/acc1.json | head -n10) <(hexdump -C /tmp/acc2.json | head -n10) | diff-so-fancy
#+end_src

Screenshot: https://pb.nichi.co/jazz-estate-brave

The diff shows that the first word, "累积", is encoded incorrectly in
"/tmp/acc2.json". (It uses `c3 a7 c2 b4 c2 af')

Actually,

#+begin_src shell
echo -n "累积" | hexdump -C
#+end_src

should be `e7 b4 af e7 a7 af' in utf-8 where "累" is represented with
`e7 b4 af' and "积" is represented with `e7 a7 af'
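
One way to read the byte pattern above, offered as a sketch rather than a
diagnosis: `c3 a7 c2 b4 c2 af' is what results when the three UTF-8 bytes
of "累" (`e7 b4 af') are each treated as a Latin-1 character and then
encoded as UTF-8 a second time.  The helper name `demo-double-encode-hex'
is hypothetical and exists only for this illustration.

```elisp
(defun demo-double-encode-hex (text)
  "Return the hex bytes of TEXT after a UTF-8, Latin-1 misread, UTF-8 round trip.
Hypothetical helper for illustration only."
  (let* ((utf8-bytes (encode-coding-string text 'utf-8))
         ;; Misinterpret every UTF-8 byte as one Latin-1 character.
         (misread (decode-coding-string utf8-bytes 'latin-1))
         ;; Encode the misread characters as UTF-8 again.
         (twice (encode-coding-string misread 'utf-8)))
    (mapconcat (lambda (byte) (format "%02x" byte)) twice " ")))

(demo-double-encode-hex "累") ; => "c3 a7 c2 b4 c2 af"
```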

The environment variable LANG is `en_US.UTF-8', all tested in `emacs -Q'

-- 
Zhiwei Chen




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#50391; Package emacs. (Sun, 05 Sep 2021 07:32:02 GMT) Full text and rfc822 format available.

Message #8 received at 50391 <at> debbugs.gnu.org (full text, mbox):

From: Philipp <p.stephani2 <at> gmail.com>
To: Zhiwei Chen <condy0919 <at> gmail.com>
Cc: 50391 <at> debbugs.gnu.org
Subject: Re: bug#50391: 28.0.50; json-read non-ascii data results in malformed
 string
Date: Sun, 5 Sep 2021 09:31:40 +0200

> Am 05.09.2021 um 06:19 schrieb Zhiwei Chen <condy0919 <at> gmail.com>:
> 
> 
> When fetching JSON from Youdao (a dictionary service in China):
> 
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (write-region (point) (point-max) "/tmp/acc1.json")))
> #+end_src
> 
> Then C-x C-f "/tmp/acc1.json", the file is correctly encoded without 
> 
> But if I call `json-read' and then `json-insert', the file is malformed,
> even though uchardet reports the encoding of the file as utf-8.
> 
> #+begin_src elisp
> (url-retrieve
>  "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>  (lambda (_status)
>    (goto-char (1+ url-http-end-of-headers))
>    (let ((j (json-read)))
>    (with-temp-buffer
>      (json-insert j)
>      (write-region (point-min) (point-max) "/tmp/acc2.json")))))
> #+end_src

Does it work if you use the C JSON function (json-parse-buffer) for parsing?  At least for me the two files are then identical.
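
A minimal offline sketch of that suggestion, assuming an Emacs with native
JSON support (`json-parse-buffer' is available from Emacs 27).  The sample
payload and the helper name `demo-parse-raw-json' are made up for
illustration; a unibyte temp buffer stands in for the undecoded HTTP
response buffer.

```elisp
(defun demo-parse-raw-json (raw-bytes)
  "Parse RAW-BYTES (undecoded UTF-8 JSON) with the C parser.
Hypothetical helper; returns the parsed object as a hash table."
  (with-temp-buffer
    ;; Mimic the HTTP response buffer, which holds raw octets.
    (set-buffer-multibyte nil)
    (insert raw-bytes)
    (goto-char (point-min))
    (json-parse-buffer)))

;; Made-up sample payload, encoded to bytes as a server would send it.
(let ((bytes (encode-coding-string "{\"entry\":\"累积\"}" 'utf-8)))
  (gethash "entry" (demo-parse-raw-json bytes)))
```

With this parser the string under "entry" should come back as decoded
characters, so no separate decoding step is needed before parsing.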



Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#50391; Package emacs. (Sun, 05 Sep 2021 08:09:01 GMT) Full text and rfc822 format available.

Message #11 received at 50391 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Zhiwei Chen <condy0919 <at> gmail.com>
Cc: 50391 <at> debbugs.gnu.org
Subject: Re: bug#50391: 28.0.50; json-read non-ascii data results in
 malformed string
Date: Sun, 05 Sep 2021 10:08:35 +0200
Zhiwei Chen <condy0919 <at> gmail.com> writes:

> When fetching JSON from Youdao (a dictionary service in China):
>
> #+begin_src elisp
> (url-retrieve
>   "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>   (lambda (_status)
>     (goto-char (1+ url-http-end-of-headers))
>     (write-region (point) (point-max) "/tmp/acc1.json")))
> #+end_src
>
> Then C-x C-f "/tmp/acc1.json", the file is correctly encoded without 
>
> But if I call `json-read' and then `json-insert', the file is malformed,
> even though uchardet reports the encoding of the file as utf-8.

When you do the `write-region', Emacs writes the octets you received
from the web server to a file.  When Emacs loads that file in again, it
guesses that it's utf-8 and decodes it that way, so that's why that
works correctly.

> #+begin_src elisp
> (url-retrieve
>   "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
>   (lambda (_status)
>     (goto-char (1+ url-http-end-of-headers))
>     (let ((j (json-read)))
>     (with-temp-buffer
>       (json-insert j)
>       (write-region (point-min) (point-max) "/tmp/acc2.json")))))
> #+end_src

But here you're asking Emacs to use json-read on a buffer that's not
been decoded.  The http buffer at this point looks like this:

[Message part 2 (image/png, inline)]

You have to say (decode-coding-region (point) (point-max) 'utf-8) first
for that to work.  I.e.,

  (url-retrieve
   "https://dict.youdao.com/suggest?q=accumulate&le=eng&num=80&doctype=json"
   (lambda (_status)
     (goto-char (1+ url-http-end-of-headers))
     (let ((buf (current-buffer))
	   (end (1+ url-http-end-of-headers)))
       (with-temp-buffer
	 (insert-buffer-substring buf end)
	 (goto-char (point-min))
	 (let ((j (json-read)))
	   (erase-buffer)
	   (json-insert j)
	   (write-region (point-min) (point-max) "/tmp/acc2.json"))))))
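
The snippet above, as archived, does not show the decode-coding-region
call that the text asks for.  A minimal offline sketch of where that call
could go, assuming Emacs 27 or later with native JSON support for
`json-insert'; the helper name `demo-decode-then-roundtrip' and the sample
payload are hypothetical:

```elisp
(require 'json)

(defun demo-decode-then-roundtrip (raw-bytes)
  "Decode RAW-BYTES as UTF-8, `json-read' them, and re-serialize.
Hypothetical helper; returns the re-serialized JSON as UTF-8 bytes."
  (with-temp-buffer
    (insert raw-bytes)
    ;; Turn the raw octets into characters before parsing.
    (decode-coding-region (point-min) (point-max) 'utf-8)
    (goto-char (point-min))
    (let ((j (json-read)))
      (erase-buffer)
      (json-insert j)
      (encode-coding-string (buffer-string) 'utf-8))))

;; Made-up sample payload; the round trip is expected to preserve the bytes.
(let ((bytes (encode-coding-string "{\"entry\":\"累积\"}" 'utf-8)))
  (equal bytes (demo-decode-then-roundtrip bytes)))
```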


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no

Added tag(s) notabug. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 05 Sep 2021 08:09:01 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 50391 <at> debbugs.gnu.org and Zhiwei Chen <condy0919 <at> gmail.com> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 05 Sep 2021 08:09:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 03 Oct 2021 11:24:09 GMT) Full text and rfc822 format available.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.