GNU bug report logs - #6252
Emacs does not implement URL (aka "percent") decoding correctly.

Package: emacs;

Reported by: José A. Romero L. <escherdragon <at> gmail.com>

Date: Sun, 23 May 2010 00:52:02 UTC

Severity: normal

Tags: fixed

Fixed in version 24.2

Done: Lars Magne Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6252 in the body.
You can then email your comments to 6252 AT debbugs.gnu.org in the normal way.


Report forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Sun, 23 May 2010 00:52:02 GMT)

Acknowledgement sent to José A. Romero L. <escherdragon <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 23 May 2010 00:52:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: José A. Romero L. <escherdragon <at> gmail.com>
To: bug-gnu-emacs <at> gnu.org
Subject: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Sun, 23 May 2010 01:46:54 +0200

On May 18, 20:14, Xah Lee <xah...@gmail.com> wrote:

> is there emacs lisp function that decode the url percent encoding?
> e.g.http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
> should become
> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
> that's a EN DASH (unicode 8211, #o20023, #x2013).
> I know there's a
>   (require 'gnus-util)
>  gnus-url-unhex-string
> but that just unhex, and generate gibberish if the url contain unicode
> chars.
(...)

Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
that is an important hole you have found there. The standard requires
that all unreserved characters be encoded/decoded as UTF8 bytes. Even
though the encoding part looks OK (in url-util.el), the decoding does
not go that last mile to interpret the decoded bytes as UTF-8.

Until a proper implementation is done, I guess you could work around
the problem with something like this:

    (decode-coding-string
     (apply 'unibyte-string
            (string-to-list
             (url-unhex-string
              "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")))
     'utf-8)

(yes, it's ugly as hell but hey, it's free ;])

I've just sent this very message as a bug report to the Emacs team.

Cheers,
-- 
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Mon, 24 May 2010 03:34:02 GMT)

Message #8 received at 6252 <at> debbugs.gnu.org:

From: YAMAMOTO Mitsuharu <mituharu <at> math.s.chiba-u.ac.jp>
To: José A. Romero L. <escherdragon <at> gmail.com>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Mon, 24 May 2010 12:33:46 +0900

>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escherdragon <at> gmail.com> said:

> Seems that RFC 3986 has not been implemented correctly in
> Emacs. IMHO that is an important hole you have found there. The
> standard requires that all unreserved characters be encoded/decoded
> as UTF8 bytes.

If you are referring to the following part of RFC 3986, it doesn't say
anything about existing URI schemes (as opposed to "a new URI
scheme"), those defining a component that does NOT represent textual
data, or even for textual data, those NOT consisting of characters
from the Universal Character Sets.

  When a new URI scheme defines a component that represents textual
  data consisting of characters from the Universal Character Set
  [UCS], the data should first be encoded as octets according to the
  UTF-8 character encoding [STD63]; then only those octets that do not
  correspond to characters in the unreserved set should be percent-
  encoded.

(See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)

Though returning a multibyte string decoded as UTF-8 would be useful
for many cases, I think some "unhex"ing function should also provide a
functionality to return a unibyte string.

				     YAMAMOTO Mitsuharu
				mituharu <at> math.s.chiba-u.ac.jp




Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Tue, 25 May 2010 12:34:02 GMT)

Message #11 received at 6252 <at> debbugs.gnu.org:

From: José A. Romero L. <escherdragon <at> gmail.com>
To: 6252 <at> debbugs.gnu.org
Subject: Fwd: bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Tue, 25 May 2010 10:56:36 +0200

(sorry, forgot to fwd this to the bugtrack)
---------- Forwarded message ----------
From: José A. Romero L. <escherdragon <at> gmail.com>
Date: 2010/5/24
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent")
decoding correctly.
To: YAMAMOTO Mitsuharu <mituharu <at> math.s.chiba-u.ac.jp>


2010/5/24 YAMAMOTO Mitsuharu <mituharu <at> math.s.chiba-u.ac.jp>:
>>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escherdragon <at> gmail.com> said:
(...)
> If you are referring to the following part of RFC 3986, it doesn't say
> anything about existing URI schemes (as opposed to "a new URI
> scheme"), those defining a component that does NOT represent textual
> data, or even for textual data, those NOT consisting of characters
> from the Universal Character Sets.

You are right. The standard *doesn't say anything* about existing URI
schemes on that matter. Thus the question is rather whether to make
the language more or less useful, especially in light of the fragment
you've just quoted:

     >  When a new URI scheme defines a component that represents textual
     >  data consisting of characters from the Universal Character Set
     >  [UCS], the data should first be encoded as octets according to the
     >  UTF-8 character encoding [STD63]; then only those octets that do not
     >  correspond to characters in the unreserved set should be percent-
     >  encoded.

and the example that immediately follows:

   (...) For example, the character A would be represented as "A",
   the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
   as "%C3%80", and the character KATAKANA LETTER A would be represented
   as "%E3%82%A2".
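
The RFC's three examples can be checked against url-util.el. The
snippet below is only a sketch: `my-rfc3986-samples' and
`my-rfc3986-check' are made-up names, and it assumes that
`url-hexify-string' encodes a multibyte string as UTF-8 before
percent-encoding it, which is the behaviour described as looking OK
earlier in this report. Hex digits are upcased before comparing, since
RFC 3986 treats them as case-insensitive.

```elisp
(require 'url-util)

;; Characters from the RFC 3986 example, written as escapes so the
;; snippet does not depend on the encoding of the file it is saved in.
;; The variable name is hypothetical.
(defconst my-rfc3986-samples
  '(("A"      . "A")            ; unreserved, left as-is
    ("\u00C0" . "%C3%80")       ; LATIN CAPITAL LETTER A WITH GRAVE
    ("\u30A2" . "%E3%82%A2"))   ; KATAKANA LETTER A
  "Alist of (CHARACTER . EXPECTED-PERCENT-ENCODING).")

;; Hypothetical helper, not part of url-util.el.
(defun my-rfc3986-check ()
  "Return non-nil if `url-hexify-string' reproduces every sample."
  (let ((ok t))
    (dolist (sample my-rfc3986-samples ok)
      ;; Normalize hex digit case before comparing.
      (unless (string= (upcase (url-hexify-string (car sample)))
                       (cdr sample))
        (setq ok nil)))))
```

Evaluating (my-rfc3986-check) should return non-nil if the encoding
side matches the RFC example.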

>
> (See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
>
> Though returning a multibyte string decoded as UTF-8 would be useful
> for many cases, I think some "unhex"ing function should also provide a
> functionality to return a unibyte string.
(...)

That's perfectly valid. OTOH some other "unhex"-ing function (or even
the same one) could also provide the functionality to return a
multibyte string, and even allow choosing the character encoding (UCS
or not) for the resulting string. After all, don't you think there
should be a better way to decode a Katakana A than using a kludge like
this?

 (decode-coding-string
    (apply 'unibyte-string
           (string-to-list
            (url-unhex-string "%E3%82%A2")))
    'utf-8)

Cheers,
--
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Wed, 21 Sep 2011 20:37:01 GMT)

Message #14 received at 6252 <at> debbugs.gnu.org:

From: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
To: José A. Romero L. <escherdragon <at> gmail.com>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Wed, 21 Sep 2011 22:17:52 +0200

José A. Romero L. <escherdragon <at> gmail.com> writes:

> On May 18, 20:14, Xah Lee <xah...@gmail.com>  wrote:
>
>> is there emacs lisp function that decode the url percent encoding?
>> e.g.http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
>> should become
>> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
>> that's a EN DASH (unicode 8211, #o20023, #x2013).
>> I know there's a
>>   (require 'gnus-util)
>>  gnus-url-unhex-string
>> but that just unhex, and generate gibberish if the url contain unicode
>> chars.
> (...)
>
> Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
> that is an important hole you have found there. The standard requires
> that all unreserved characters be encoded/decoded as UTF8 bytes. Even
> though the encoding part looks OK (in url-util.el), the decoding does
> not go that last mile to interpret the decoded bytes as UTF-8.

I'm not quite sure I understand what the problem is.  Do you have a test
case that illustrates what url.el does wrong?

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Thu, 22 Sep 2011 07:44:03 GMT)

Message #17 received at 6252 <at> debbugs.gnu.org:

From: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
To: José A. Romero L. <escherdragon <at> gmail.com>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Thu, 22 Sep 2011 09:38:31 +0200

José A. Romero L. <escherdragon <at> gmail.com> writes:

> in short, there seems to be currently no way to perform the opposite
> of url-hexify-string for UTF-8 encoded strings:
>
>     (url-unhex-string (url-hexify-string "ä"))
>     => "ä"

`url-unhex-string' can't know what encoding the %xx-encoding is in, can
it?  The local part of a URL can use a different encoding, I think.
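
The ambiguity can be seen by decoding the same character from two
different octet sequences. This is a minimal sketch, assuming a recent
Emacs in which `url-unhex-string' returns the raw octets for an ASCII
input string; `my-unhex-ambiguity-demo' is a made-up name.

```elisp
(require 'url-util)

;; Hypothetical demo function, not part of url-util.el.
(defun my-unhex-ambiguity-demo ()
  "Decode percent-encoded octets under right and wrong coding systems.
Return a list of three strings."
  (let ((from-utf-8   (url-unhex-string "%C3%A4"))  ; octets C3 A4
        (from-latin-1 (url-unhex-string "%E4")))    ; octet  E4
    (list
     ;; Correct guess: LATIN SMALL LETTER A WITH DIAERESIS.
     (decode-coding-string from-utf-8 'utf-8)
     ;; Different octets, different coding system, same character.
     (decode-coding-string from-latin-1 'iso-8859-1)
     ;; Wrong guess: no error is signaled, the text is just different.
     (decode-coding-string from-utf-8 'iso-8859-1))))
```

The first two elements are both "ä"; the third is the two-character
string "Ã¤". Nothing in the %xx sequences themselves says which
reading is intended.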

But is that the test case for the bug?  I thought somebody had problems
retrieving something...

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Fri, 23 Sep 2011 08:39:04 GMT)

Message #20 received at 6252 <at> debbugs.gnu.org:

From: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
To: José A. Romero L. <escherdragon <at> gmail.com>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Fri, 23 Sep 2011 10:34:00 +0200

José A. Romero L. <escherdragon <at> gmail.com> writes:

>>>     (url-unhex-string (url-hexify-string "ä"))
>>>     => "ä"

[...]

> Well, if you write a script that transforms URLs to/from strings
> (especially round-trip) you will probably encounter problems
> retrieving stuff from the web if you're not aware of this issue.

So this bug report is purely about the return value of
`url-unhex-string'?  It sounded at the beginning that url.el had
problems fetching something.

If this is just about `url-unhex-string', the obvious solution would be
to add a CODING-SYSTEM parameter to that function.
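
For illustration, here is roughly what such a parameter could look
like when written as a wrapper around the existing function. This is a
hypothetical sketch, not the change that was eventually committed:
the name `my-url-unhex-string' and its argument list are assumptions,
and it relies on `url-unhex-string' returning raw octets, as in a
recent Emacs. Leaving CODING-SYSTEM nil keeps the unibyte result that
was asked for earlier in this thread.

```elisp
(require 'url-util)

;; Hypothetical wrapper: neither the name nor the argument list is
;; taken from url-util.el.
(defun my-url-unhex-string (str &optional coding-system)
  "Percent-decode STR.
With CODING-SYSTEM nil, return whatever `url-unhex-string' returns,
i.e. the undecoded octets.  Otherwise decode those octets using
CODING-SYSTEM and return the resulting text."
  (let ((octets (url-unhex-string str)))
    (if coding-system
        (decode-coding-string octets coding-system)
      octets)))
```

With this, (my-url-unhex-string (url-hexify-string "ä") 'utf-8)
round-trips back to "ä", and (my-url-unhex-string "%E3%82%A2" 'utf-8)
gives the Katakana A without the `unibyte-string' kludge.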

And please don't keep removing the debbugs address from the Cc list.
Your messages aren't going to the bug tracker if you do that.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Fri, 23 Sep 2011 16:39:06 GMT)

Message #23 received at 6252 <at> debbugs.gnu.org:

From: José A. Romero L. <escherdragon <at> gmail.com>
To: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Fri, 23 Sep 2011 13:12:11 +0200

2011/9/23 Lars Magne Ingebrigtsen <larsi <at> gnus.org>:
(...)
> If this is just about `url-unhex-string', the obvious solution would be
> to add a CODING-SYSTEM parameter to that function.

Yes, as I see it, that's definitely it.

> And please don't keep removing the debbugs address from the Cc list.
> Your messages aren't going to the bug tracker if you do that.
(...)

Oops, sorry, I didn't notice it before -- won't happen again.

Cheers,
-- 
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)




Added tag(s) pending. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 25 Sep 2011 22:18:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Sun, 25 Sep 2011 22:24:02 GMT)

Message #28 received at 6252 <at> debbugs.gnu.org:

From: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
To: José A. Romero L. <escherdragon <at> gmail.com>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Mon, 26 Sep 2011 00:16:14 +0200

José A. Romero L. <escherdragon <at> gmail.com> writes:

>> If this is just about `url-unhex-string', the obvious solution would be
>> to add a CODING-SYSTEM parameter to that function.
>
> Yes, as I see it, that's definitely it.

I think that's a reasonable thing to add, but Emacs is in a feature
freeze, so it'll probably have to wait until after Emacs 24 has been
released.  I'll mark the bug report as "pending".

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Sun, 25 Sep 2011 22:27:02 GMT)

Message #31 received at 6252 <at> debbugs.gnu.org:

From: José A. Romero L. <escherdragon <at> gmail.com>
To: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Mon, 26 Sep 2011 00:25:39 +0200

2011/9/26 Lars Magne Ingebrigtsen <larsi <at> gnus.org>:
(...)
> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released.  I'll mark the bug report as "pending".
(...)

Cool, thanks a lot :)

Cheers,
-- 
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#6252; Package emacs. (Tue, 10 Apr 2012 02:17:04 GMT)

Message #34 received at 6252 <at> debbugs.gnu.org:

From: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
To: José A. Romero L. <escherdragon <at> gmail.com>
Cc: 6252 <at> debbugs.gnu.org
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent") decoding correctly.
Date: Tue, 10 Apr 2012 04:14:58 +0200

Lars Magne Ingebrigtsen <larsi <at> gnus.org> writes:

> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released.  I'll mark the bug report as "pending".

I've now added an optional coding-system parameter to the function in
the Emacs trunk.

-- 
(domestic pets only, the antidote for overdose, milk.)
  bloggy blog http://lars.ingebrigtsen.no/




Added tag(s) fixed. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 10 Apr 2012 02:17:05 GMT)

bug marked as fixed in version 24.2, send any further explanations to 6252 <at> debbugs.gnu.org and José A. Romero L. <escherdragon <at> gmail.com>. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 10 Apr 2012 02:17:06 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 08 May 2012 11:24:03 GMT)

This bug report was last modified 13 years and 19 days ago.
