GNU bug report logs - #7017
Suggestion: (url-retrieve-internal) hexify multibyte URL string first

Package: emacs

Reported by: William Xu <william.xwl <at> gmail.com>

Date: Sun, 12 Sep 2010 01:03:02 UTC

Severity: normal

Tags: fixed, patch

Done: Chong Yidong <cyd <at> gnu.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 7017 in the body.
You can then email your comments to 7017 AT debbugs.gnu.org in the normal way.

Message #1 received at quiet <at> debbugs.gnu.org (full text, mbox):

From: William Xu <william.xwl <at> gmail.com>
To: quiet <at> debbugs.gnu.org
Subject: Suggestion: (url-retrieve-internal) hexify multibyte URL string first 
Date: Thu, 01 Jul 2010 02:31:05 +0800

Package: emacs

[ resent from emacs-devel ]

Currently, url-retrieve does not handle multibyte URL strings at all,
so the following example simply fails:

  ;; url containing some Chinese characters here
  (url-retrieve
   "http://a1.twimg.com/profile_images/65068764/我的头像_normal.png"
   (lambda (&rest args) (switch-to-buffer (current-buffer))))

Feeding the same url to `wget', it would first hexify it, then download
it successfully.  I suggest we do the same in url-retrieve, like this: 

(url-retrieve-internal): Hexify multibyte URL string first when necessary.

diff --git a/lisp/url/url.el b/lisp/url/url.el
index 6f7b810..15445ef 100644
--- a/lisp/url/url.el
+++ b/lisp/url/url.el
@@ -164,6 +164,9 @@ the list of events, as described in the docstring of `url-retrieve'."
   (url-gc-dead-buffers)
   (if (stringp url)
        (set-text-properties 0 (length url) nil url))
+  (when (multibyte-string-p url)
+    (let ((url-unreserved-chars (append '(?: ?/) url-unreserved-chars)))
+      (setq url (url-hexify-string url))))
   (if (not (vectorp url))
       (setq url (url-generic-parse-url url)))
   (if (not (functionp callback))

-- 
William

http://xwl.appspot.com

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#7017; Package emacs. (Tue, 10 Apr 2012 11:24:01 GMT) Full text and rfc822 format available.

Message #4 received at 7017 <at> debbugs.gnu.org (full text, mbox):

From: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
To: William Xu <william.xwl <at> gmail.com>
Cc: 7017 <at> debbugs.gnu.org
Subject: Re: Suggestion: (url-retrieve-internal) hexify multibyte URL string
	first
Date: Tue, 10 Apr 2012 13:22:34 +0200

William Xu <william.xwl <at> gmail.com> writes:

> Feeding the same url to `wget', it would first hexify it, then download
> it successfully.  I suggest we do the same in url-retrieve, like this: 
>
> (url-retrieve-internal): Hexify multibyte URL string first when necessary.

Thanks; applied to the Emacs trunk.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/

Added tag(s) fixed. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 10 Apr 2012 11:24:02 GMT) Full text and rfc822 format available.

bug marked as fixed in version 24.2, send any further explanations to 7017 <at> debbugs.gnu.org and William Xu <william.xwl <at> gmail.com> Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 10 Apr 2012 11:24:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#7017; Package emacs. (Mon, 07 May 2012 21:55:02 GMT) Full text and rfc822 format available.

Message #11 received at 7017 <at> debbugs.gnu.org (full text, mbox):

From: Seth Mason <seth <at> edgecast.com>
To: 7017 <at> debbugs.gnu.org
Subject: url-retrieve seems busted
Date: Mon, 07 May 2012 14:51:29 -0700

If you put the following in a buffer and eval it, you'll get a 404:

    ;; http://httpbin.org/get?x=1
    ;; eval this buffer
    (url-retrieve (buffer-substring-no-properties 4 30) (lambda (&rest args) (switch-to-buffer (current-buffer))))

If you curl/wget the same URL, it'll work fine.

If you look at the request, it's going to "/get%3fx%3d1". It seems to me
that the URL is getting improperly encoded for multibyte strings.
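
The "/get%3fx%3d1" request path can be reproduced outside url-retrieve.
A minimal sketch, assuming the let-binding from the applied patch is
what runs on the URL; `my-patched-hexify' is a hypothetical helper
name, not part of url.el, and newer Emacs versions emit upper-case hex
digits rather than lower-case ones:

```elisp
(require 'url-util)

;; Hypothetical helper: mirrors the let-binding of the applied patch,
;; which exempts only ?: and ?/ from hexification.
(defun my-patched-hexify (url)
  "Hexify URL the way the first patch does for multibyte strings."
  (let ((url-unreserved-chars (append '(?: ?/) url-unreserved-chars)))
    (url-hexify-string url)))

;; "?" and "=" are not exempt, so the query separator is escaped and
;; the server sees one long path segment instead of a path and a query.
(my-patched-hexify "http://httpbin.org/get?x=1")
```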

bug No longer marked as fixed in versions 24.2 and reopened. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 08 May 2012 02:16:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#7017; Package emacs. (Tue, 08 May 2012 04:55:01 GMT) Full text and rfc822 format available.

Message #16 received at 7017 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: Seth Mason <seth <at> edgecast.com>
Cc: 7017 <at> debbugs.gnu.org
Subject: Re: bug#7017: url-retrieve seems busted
Date: Tue, 08 May 2012 12:52:01 +0800

Seth Mason <seth <at> edgecast.com> writes:

> If you put the following in a buffer and eval it, you'll get a 404:
>
>     ;; http://httpbin.org/get?x=1
>     ;; eval this buffer
>     (url-retrieve (buffer-substring-no-properties 4 30) (lambda (&rest args) (switch-to-buffer (current-buffer))))
>
> If you curl/wget the same URL, it'll work fine.
>
> If you look at the request, it's going to "/get%3fx%3d1". It seems to me
> that the URL is getting improperly encoded for multibyte strings.

Thanks for pointing this out.

Applying url-hexify-string on the entire URL, as the previous patch did,
is wrong.  We mustn't hexify reserved characters that are being used in
their special role.  Unfortunately, figuring out when those characters
are being used in their special role requires an implementation of
RFC2396, which I don't think we currently have in Emacs.

Alternatively, the following not-strictly-correct hack exempts reserved
characters from hexification.


=== modified file 'lisp/url/url.el'
*** lisp/url/url.el	2012-04-26 12:43:28 +0000
--- lisp/url/url.el	2012-05-08 04:46:45 +0000
***************
*** 180,188 ****
    (url-gc-dead-buffers)
    (if (stringp url)
         (set-text-properties 0 (length url) nil url))
    (when (multibyte-string-p url)
!     (let ((url-unreserved-chars (append '(?: ?/) url-unreserved-chars)))
        (setq url (url-hexify-string url))))
    (if (not (vectorp url))
        (setq url (url-generic-parse-url url)))
    (if (not (functionp callback))
--- 180,193 ----
    (url-gc-dead-buffers)
    (if (stringp url)
         (set-text-properties 0 (length url) nil url))
+ 
    (when (multibyte-string-p url)
!     (let* ((reserved-chars '(?! ?# ?$ ?& ?' ?( ?) ?* ?+ ?, ?/ ?: ?\;
! 			     ?= ?? ?@ ?[ ?]))
! 	   (url-unreserved-chars (append reserved-chars
! 					 url-unreserved-chars)))
        (setq url (url-hexify-string url))))
+ 
    (if (not (vectorp url))
        (setq url (url-generic-parse-url url)))
    (if (not (functionp callback))
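
The effect of this hack can be checked in isolation.  A minimal
sketch: `my-hexify-keeping-reserved' is a hypothetical helper that
only reproduces the let* binding from the patch above, and the
example.com URL is made up for illustration:

```elisp
(require 'url-util)

;; Hypothetical helper: same idea as the let* in the patch above, with
;; the reserved characters spelled as a string for readability.
(defun my-hexify-keeping-reserved (url)
  "Hexify URL, leaving reserved characters untouched."
  (let* ((reserved-chars (string-to-list "!#$&'()*+,/:;=?@[]"))
         (url-unreserved-chars (append reserved-chars
                                       url-unreserved-chars)))
    (url-hexify-string url)))

;; Reserved characters now pass through, so the query string survives.
(my-hexify-keeping-reserved "http://httpbin.org/get?x=1")

;; "%" is neither reserved nor unreserved, so a URL that is already
;; escaped gets escaped a second time ("%20" becomes "%2520").
(my-hexify-keeping-reserved "http://example.com/a%20b")
```

The second call shows one way in which the approach is not strictly
correct: it cannot tell an existing escape from a literal "%".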

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#7017; Package emacs. (Tue, 08 May 2012 05:28:02 GMT) Full text and rfc822 format available.

Message #19 received at 7017 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: Seth Mason <seth <at> edgecast.com>
Cc: 7017 <at> debbugs.gnu.org
Subject: Re: bug#7017: url-retrieve seems busted
Date: Tue, 08 May 2012 13:25:10 +0800

Chong Yidong <cyd <at> gnu.org> writes:

> Applying url-hexify-string on the entire URL, as the previous patch did,
> is wrong.  We mustn't hexify reserved characters that are being used in
> their special role.  Unfortunately, figuring out when those characters
> are being used in their special role requires an implementation of
> RFC2396, which I don't think we currently have in Emacs.

Actually, I think we could use url-generic-parse-url for this.
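
A minimal sketch of that idea, not necessarily what the final fix
does: parse the URL first, then percent-encode only the non-ASCII
characters of the path-and-query component.  Both
`my-hexify-non-ascii' and `my-encode-url' are hypothetical helper
names, and restricting the encoding to `url-filename' is an assumption
made for brevity.

```elisp
(require 'url-parse)

;; Hypothetical helper: escape only non-ASCII characters, as UTF-8 bytes.
(defun my-hexify-non-ascii (string)
  "Percent-encode the non-ASCII characters of STRING, leaving ASCII alone."
  (mapconcat
   (lambda (ch)
     (if (< ch 128)
         (char-to-string ch)
       ;; Encode the character as UTF-8 and escape each resulting byte.
       (mapconcat (lambda (byte) (format "%%%02X" byte))
                  (encode-coding-string (char-to-string ch) 'utf-8)
                  "")))
   string ""))

;; Hypothetical helper: the parser decides which characters are
;; delimiters; only the path-and-query part is then rewritten.
(defun my-encode-url (url)
  "Return URL with non-ASCII characters in its path and query hexified."
  (let ((parsed (url-generic-parse-url url)))
    (setf (url-filename parsed)
          (my-hexify-non-ascii (url-filename parsed)))
    (url-recreate-url parsed)))

;; The query separator is left alone; only the Chinese characters in
;; the second URL are escaped.
(my-encode-url "http://httpbin.org/get?x=1")
(my-encode-url
 "http://a1.twimg.com/profile_images/65068764/我的头像_normal.png")
```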

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#7017; Package emacs. (Wed, 09 May 2012 08:38:01 GMT) Full text and rfc822 format available.

Message #22 received at 7017 <at> debbugs.gnu.org (full text, mbox):

From: Chong Yidong <cyd <at> gnu.org>
To: Seth Mason <seth <at> edgecast.com>
Cc: 7017 <at> debbugs.gnu.org
Subject: Re: bug#7017: url-retrieve seems busted
Date: Wed, 09 May 2012 16:34:53 +0800

Chong Yidong <cyd <at> gnu.org> writes:

> Chong Yidong <cyd <at> gnu.org> writes:
>
>> Applying url-hexify-string on the entire URL, as the previous patch did,
>> is wrong.  We mustn't hexify reserved characters that are being used in
>> their special role.  Unfortunately, figuring out when those characters
>> are being used in their special role requires an implementation of
>> RFC2396, which I don't think we currently have in Emacs.
>
> Actually, I think we could use url-generic-parse-url for this.

Fixed in trunk (revision 108172).

bug closed, send any further explanations to 7017 <at> debbugs.gnu.org and William Xu <william.xwl <at> gmail.com> Request was from Chong Yidong <cyd <at> gnu.org> to control <at> debbugs.gnu.org. (Wed, 09 May 2012 08:38:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 06 Jun 2012 11:24:02 GMT) Full text and rfc822 format available.

This bug report was last modified 13 years and 29 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.