GNU bug report logs - #6252
Emacs does not implement URL (aka "percent") decoding correctly.
Reported by: José A. Romero L. <escherdragon <at> gmail.com>
Date: Sun, 23 May 2010 00:52:02 UTC
Severity: normal
Tags: fixed
Fixed in version 24.2
Done: Lars Magne Ingebrigtsen <larsi <at> gnus.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 6252 in the body.
You can then email your comments to 6252 AT debbugs.gnu.org in the normal way.
Report forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Sun, 23 May 2010 00:52:02 GMT)
Acknowledgement sent to José A. Romero L. <escherdragon <at> gmail.com>: New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Sun, 23 May 2010 00:52:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
On May 18, 20:14, Xah Lee <xah...@gmail.com> wrote:
> Is there an Emacs Lisp function that decodes URL percent-encoding?
> e.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
> should become
> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
> That's an EN DASH (Unicode 8211, #o20023, #x2013).
> I know there's
> (require 'gnus-util)
> gnus-url-unhex-string
> but that just unhexes, and generates gibberish if the URL contains
> Unicode chars.
(...)
Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
that is an important hole you have found there. The standard requires
that all unreserved characters be encoded/decoded as UTF8 bytes. Even
though the encoding part looks OK (in url-util.el), the decoding does
not go that last mile to interpret the decoded bytes as UTF-8.
Until a proper implementation is done, I guess you could work around
the problem with something like this:
(decode-coding-string
 (apply 'unibyte-string
        (string-to-list
         (url-unhex-string
          "http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem")))
 'utf-8)
(yes, it's ugly as hell but hey, it's free ;])
I've just sent this very message as a bug report to the Emacs team.
Cheers,
--
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)
Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Mon, 24 May 2010 03:34:02 GMT)
Message #8 received at 6252 <at> debbugs.gnu.org (full text, mbox):
>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escherdragon <at> gmail.com> said:
> Seems that RFC 3986 has not been implemented correctly in
> Emacs. IMHO that is an important hole you have found there. The
> standard requires that all unreserved characters be encoded/decoded
> as UTF8 bytes.
If you are referring to the following part of RFC 3986, it doesn't say
anything about existing URI schemes (as opposed to "a new URI
scheme"), those defining a component that does NOT represent textual
data, or even for textual data, those NOT consisting of characters
from the Universal Character Sets.
When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set
[UCS], the data should first be encoded as octets according to the
UTF-8 character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be percent-
encoded.
(See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
Though returning a multibyte string decoded as UTF-8 would be useful
for many cases, I think some "unhex"ing function should also provide a
functionality to return a unibyte string.
YAMAMOTO Mitsuharu
mituharu <at> math.s.chiba-u.ac.jp
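The distinction drawn above between a unibyte and a multibyte result can be sketched as follows. This is only an illustration, assuming a reasonably recent Emacs in which `url-unhex-string' returns the raw octets as a unibyte string; the `my-' helper names are hypothetical and not part of url-util.el.

```elisp
(require 'url-util)

(defun my-unhex-octets (encoded)
  "Return the octets named by the %XX escapes in ENCODED, undecoded.
Hypothetical helper: it only wraps `url-unhex-string'."
  (url-unhex-string encoded))

(defun my-unhex-utf-8 (encoded)
  "Return ENCODED with its %XX escapes unhexed and read as UTF-8.
Hypothetical helper: assumes the octets really are UTF-8."
  (decode-coding-string (url-unhex-string encoded) 'utf-8))

;; Three octets before decoding, one character (KATAKANA LETTER A) after.
(length (my-unhex-octets "%E3%82%A2"))
(length (my-unhex-utf-8 "%E3%82%A2"))
```

Keeping the two steps separate leaves the caller free to use the octets as they are when the component does not represent UTF-8 text.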
Information forwarded to owner <at> debbugs.gnu.org, bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Tue, 25 May 2010 12:34:02 GMT)
Message #11 received at 6252 <at> debbugs.gnu.org (full text, mbox):
(sorry, forgot to fwd this to the bugtrack)
---------- Forwarded message ----------
From: José A. Romero L. <escherdragon <at> gmail.com>
Date: 2010/5/24
Subject: Re: bug#6252: Emacs does not implement URL (aka "percent")
decoding correctly.
To: YAMAMOTO Mitsuharu <mituharu <at> math.s.chiba-u.ac.jp>
2010/5/24 YAMAMOTO Mitsuharu <mituharu <at> math.s.chiba-u.ac.jp>:
>>>>>> On Sun, 23 May 2010 01:46:54 +0200, José A. Romero L. <escherdragon <at> gmail.com> said:
(...)
> If you are referring to the following part of RFC 3986, it doesn't say
> anything about existing URI schemes (as opposed to "a new URI
> scheme"), those defining a component that does NOT represent textual
> data, or even for textual data, those NOT consisting of characters
> from the Universal Character Sets.
You are right. The standard *doesn't say anything* about existing URI
schemes on that matter. Thus the question is rather whether to make
the language more or less useful, especially in light of the
fragment you've just quoted:
> When a new URI scheme defines a component that represents textual
> data consisting of characters from the Universal Character Set
> [UCS], the data should first be encoded as octets according to the
> UTF-8 character encoding [STD63]; then only those octets that do not
> correspond to characters in the unreserved set should be percent-
> encoded.
and the example that immediately follows:
(...) For example, the character A would be represented as "A",
the character LATIN CAPITAL LETTER A WITH GRAVE would be represented
as "%C3%80", and the character KATAKANA LETTER A would be represented
as "%E3%82%A2".
>
> (See also http://lists.gnu.org/archive/html/emacs-devel/2006-08/msg00065.html)
>
> Though returning a multibyte string decoded as UTF-8 would be useful
> for many cases, I think some "unhex"ing function should also provide a
> functionality to return a unibyte string.
(...)
That's perfectly valid. OTOH some other "unhex"-ing function (or even
the same one) could also provide the functionality to return a
multibyte string, and even allow choosing the character encoding (UCS
or not) for the resulting string. After all, don't you think there
should be a better way to decode a Katakana A than using a kludge like this?
(decode-coding-string
 (apply 'unibyte-string
        (string-to-list
         (url-unhex-string "%E3%82%A2")))
 'utf-8)
Cheers,
--
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)
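The RFC 3986 example quoted above can be reproduced with a short sketch. It assumes UTF-8 as the encoding and treats only the RFC's unreserved set (letters, digits, "-", ".", "_", "~") as safe; `my-percent-encode' is a hypothetical name, not a function from url-util.el.

```elisp
(defun my-percent-encode (text)
  "Percent-encode TEXT as UTF-8 octets.
Octets for RFC 3986 unreserved characters are left as they are.
Hypothetical helper for illustration only."
  (mapconcat
   (lambda (byte)
     (if (or (<= ?a byte ?z)
             (<= ?A byte ?Z)
             (<= ?0 byte ?9)
             (memq byte '(?- ?. ?_ ?~)))
         ;; Unreserved ASCII octet: emit the character itself.
         (char-to-string byte)
       ;; Anything else: emit %XX with uppercase hex digits.
       (format "%%%02X" byte)))
   ;; Iterating over a unibyte string yields octet values 0..255.
   (encode-coding-string text 'utf-8)
   ""))

;; The three cases from the RFC text:
;; (my-percent-encode "A")        ; "A"
;; (my-percent-encode "\u00C0")   ; "%C3%80"     (LATIN CAPITAL LETTER A WITH GRAVE)
;; (my-percent-encode "\u30A2")   ; "%E3%82%A2"  (KATAKANA LETTER A)
```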
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Wed, 21 Sep 2011 20:37:01 GMT)
Message #14 received at 6252 <at> debbugs.gnu.org (full text, mbox):
José A. Romero L. <escherdragon <at> gmail.com> writes:
> On May 18, 20:14, Xah Lee <xah...@gmail.com> wrote:
>
>> Is there an Emacs Lisp function that decodes URL percent-encoding?
>> e.g. http://en.wikipedia.org/wiki/Sylvester%E2%80%93Gallai_theorem
>> should become
>> http://en.wikipedia.org/wiki/Sylvester–Gallai_theorem
>> That's an EN DASH (Unicode 8211, #o20023, #x2013).
>> I know there's
>> (require 'gnus-util)
>> gnus-url-unhex-string
>> but that just unhexes, and generates gibberish if the URL contains
>> Unicode chars.
> (...)
>
> Seems that RFC 3986 has not been implemented correctly in Emacs. IMHO
> that is an important hole you have found there. The standard requires
> that all unreserved characters be encoded/decoded as UTF8 bytes. Even
> though the encoding part looks OK (in url-util.el), the decoding does
> not go that last mile to interpret the decoded bytes as UTF-8.
I'm not quite sure I understand what the problem is. Do you have a test
case that illustrates what url.el does wrong?
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Thu, 22 Sep 2011 07:44:03 GMT)
Message #17 received at 6252 <at> debbugs.gnu.org (full text, mbox):
José A. Romero L. <escherdragon <at> gmail.com> writes:
> in short, there seems to be currently no way to perform the opposite
> of url-hexify-string for UTF-8 encoded strings:
>
> (url-unhex-string (url-hexify-string "ä"))
> => "ä"
`url-unhex-string' can't know what encoding the %xx-encoding is in, can
it? The local part of a URL can use a different encoding, I think.
But is that the test case for the bug? I thought somebody had problems
retrieving something...
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/
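A round trip of the kind in the test case above can be sketched like this. It assumes a recent Emacs, where `url-hexify-string' encodes multibyte text as UTF-8 before hexifying and `url-unhex-string' returns raw octets; `my-url-round-trip' is a hypothetical name. The explicit `decode-coding-string' step is what supplies the encoding that `url-unhex-string' cannot know on its own.

```elisp
(require 'url-util)

(defun my-url-round-trip (text)
  "Percent-encode TEXT, then unhex it and read the octets as UTF-8.
Hypothetical helper: UTF-8 is an assumption made by the caller here,
not something recorded in the percent-encoded string."
  (decode-coding-string
   (url-unhex-string (url-hexify-string text))
   'utf-8))

;; Under the assumptions above this should hand back its argument,
;; e.g. for a string holding LATIN SMALL LETTER A WITH DIAERESIS:
;; (my-url-round-trip "\u00e4")
```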
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Fri, 23 Sep 2011 08:39:04 GMT)
Message #20 received at 6252 <at> debbugs.gnu.org (full text, mbox):
José A. Romero L. <escherdragon <at> gmail.com> writes:
>>> (url-unhex-string (url-hexify-string "ä"))
>>> => "ä"
[...]
> Well, if you write a script that transforms URLs to/from strings
> (especially round-trip) you will probably encounter problems
> retrieving stuff from the web if you're not aware of this issue.
So this bug report is purely about the return value of
`url-unhex-string'? It sounded at the beginning as if url.el had
problems fetching something.
If this is just about `url-unhex-string', the obvious solution would be
to add a CODING-SYSTEM parameter to that function.
And please don't keep removing the debbugs address from the Cc list.
Your messages aren't going to the bug tracker if you do that.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Fri, 23 Sep 2011 16:39:06 GMT)
Message #23 received at 6252 <at> debbugs.gnu.org (full text, mbox):
2011/9/23 Lars Magne Ingebrigtsen <larsi <at> gnus.org>:
(...)
> If this is just about `url-unhex-string', the obvious solution would be
> to add a CODING-SYSTEM parameter to that function.
Yes, as I see it, that's definitely it.
> And please don't keep removing the debbugs address from the Cc list.
> Your messages aren't going to the bug tracker if you do that.
(...)
Oops, sorry, I didn't notice it before -- won't happen again.
Cheers,
--
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)
Added tag(s) pending. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 25 Sep 2011 22:18:02 GMT)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Sun, 25 Sep 2011 22:24:02 GMT)
Message #28 received at 6252 <at> debbugs.gnu.org (full text, mbox):
José A. Romero L. <escherdragon <at> gmail.com> writes:
>> If this is just about `url-unhex-string', the obvious solution would be
>> to add a CODING-SYSTEM parameter to that function.
>
> Yes, as I see it, that's definitely it.
I think that's a reasonable thing to add, but Emacs is in a feature
freeze, so it'll probably have to wait until after Emacs 24 has been
released. I'll mark the bug report as "pending".
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Sun, 25 Sep 2011 22:27:02 GMT)
Message #31 received at 6252 <at> debbugs.gnu.org (full text, mbox):
2011/9/26 Lars Magne Ingebrigtsen <larsi <at> gnus.org>:
(...)
> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released. I'll mark the bug report as "pending".
(...)
Cool, thanks a lot :)
Cheers,
--
José A. Romero L.
escherdragon <at> gmail.com
"We who cut mere stones must always be envisioning cathedrals."
(Quarry worker's creed)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#6252; Package emacs. (Tue, 10 Apr 2012 02:17:04 GMT)
Message #34 received at 6252 <at> debbugs.gnu.org (full text, mbox):
Lars Magne Ingebrigtsen <larsi <at> gnus.org> writes:
> I think that's a reasonable thing to add, but Emacs is in a feature
> freeze, so it'll probably have to wait until after Emacs 24 has been
> released. I'll mark the bug report as "pending".
I've now added an optional coding-system parameter to the function in
the Emacs trunk.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog http://lars.ingebrigtsen.no/
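The thread does not show the resulting signature, so the following is only a sketch of the idea of an optional coding-system argument, written as a separate wrapper rather than as the change that was committed. `my-url-unhex-string' is a hypothetical name; it assumes `url-unhex-string' returns raw octets, as it does in recent Emacs versions.

```elisp
(require 'url-util)

(defun my-url-unhex-string (str &optional coding-system)
  "Unhex the %XX escapes in STR.
With CODING-SYSTEM non-nil, decode the resulting octets with it;
otherwise return the octets unchanged.
Hypothetical wrapper, not the function from url-util.el."
  (let ((octets (url-unhex-string str)))
    (if coding-system
        (decode-coding-string octets coding-system)
      octets)))

;; The same character reached from two different encodings:
;; (my-url-unhex-string "%C3%A4" 'utf-8)      ; two octets read as UTF-8
;; (my-url-unhex-string "%E4" 'iso-8859-1)    ; one octet read as Latin-1
```

Leaving the argument optional keeps the raw-octet behaviour available for components that are not text.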
Added tag(s) fixed. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 10 Apr 2012 02:17:05 GMT)
bug marked as fixed in version 24.2, send any further explanations to 6252 <at> debbugs.gnu.org and José A. Romero L. <escherdragon <at> gmail.com>. Request was from Lars Magne Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Tue, 10 Apr 2012 02:17:06 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 08 May 2012 11:24:03 GMT)
This bug report was last modified 13 years and 19 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.