GNU bug report logs - #40794
26.3; HTML entities &star; and &starf; (inter alia) are not parsed by libxml-parse-html-region

Package: emacs

Reported by: Tim Landscheidt <tim <at> tim-landscheidt.de>

Date: Thu, 23 Apr 2020 13:25:01 UTC

Severity: normal

Tags: notabug

Found in version 26.3

Done: Stefan Kangas <stefan <at> marxist.se>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 40794 in the body.
You can then email your comments to 40794 AT debbugs.gnu.org in the normal way.


Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#40794; Package emacs. (Thu, 23 Apr 2020 13:25:01 GMT)

Acknowledgement sent to Tim Landscheidt <tim <at> tim-landscheidt.de>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 23 Apr 2020 13:25:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Tim Landscheidt <tim <at> tim-landscheidt.de>
To: bug-gnu-emacs <at> gnu.org
Subject: 26.3; HTML entities &star; and &starf;
 (inter alia) are not parsed by libxml-parse-html-region
Date: Thu, 23 Apr 2020 13:24:12 +0000

(Prologue: This bug showed up in the "ALT" attribute of an
"IMG" element of an HTML mail in Gnus.  I am reasonably certain
that this stems from libxml-parse-html-region and should be
fixed there, but there may be more prudent solutions.)

With GNU Emacs 26.3 on Fedora:

| ELISP> (with-temp-buffer
|          (insert "<!DOCTYPE html>
| <html lang=\"en\">
| <head><title>Title</title></head>
| <body>
|   <p>Hello world</p>
|   <p>&auml;</p>
|   <p>&star;</p>
|   <p>&starf;</p>
| </body>
| </html>")
|          (libxml-parse-html-region (point-min) (point-max)))
| (html
|  ((lang . "en"))
|  (head nil
|        (title nil "Title"))
|  (body nil "\n  "
|        (p nil "Hello world")
|        "\n  "
|        (p nil "ä")
|        "\n  "
|        (p nil "&star;")
|        "\n  "
|        (p nil "&starf;")
|        "\n"))

| ELISP>

These should instead yield "ä" (228), "☆" (9734) and
"★" (9733).

lisp/leim/quail/sgml-input.el seems to contain the necessary
data for &star; and &starf; that could probably be fed to
libxml.
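
For illustration, a caller-side fallback along these lines might look
roughly like the C sketch below.  It assumes a small supplementary table
(here only the two entities from this report, with the code points given
above) and replaces any "&name;" reference that survives parsing with its
UTF-8 text.  `struct extra_entity`, `extra_entities`, and
`decode_extra_entities` are hypothetical names; nothing here is taken
from Emacs or libxml2.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical supplementary table entry (not an Emacs or libxml2 type). */
struct extra_entity {
    const char *name;   /* entity name without '&' and ';' */
    const char *utf8;   /* UTF-8 encoding of the character */
};

/* Only the two entities named in this report. */
static const struct extra_entity extra_entities[] = {
    { "star",  "\xE2\x98\x86" },   /* U+2606 (9734) */
    { "starf", "\xE2\x98\x85" },   /* U+2605 (9733) */
};

/*
 * Copy `in` to `out`, replacing each "&name;" listed in extra_entities
 * with its UTF-8 text.  Any other reference is copied unchanged.
 * Returns 0 on success, -1 if `out` (outsz bytes) is too small.
 */
int decode_extra_entities(const char *in, char *out, size_t outsz)
{
    size_t n = sizeof extra_entities / sizeof extra_entities[0];
    size_t o = 0;

    assert(in != NULL && out != NULL);
    if (outsz == 0)
        return -1;

    while (*in != '\0') {
        const char *piece = in;   /* bytes to copy in this step */
        size_t piece_len = 1;     /* default: one input byte */
        size_t consumed = 1;

        if (*in == '&') {
            const char *semi = strchr(in, ';');
            if (semi != NULL) {
                size_t name_len = (size_t)(semi - in) - 1;
                for (size_t i = 0; i < n; i++) {
                    const struct extra_entity *e = &extra_entities[i];
                    if (strlen(e->name) == name_len &&
                        strncmp(e->name, in + 1, name_len) == 0) {
                        piece = e->utf8;
                        piece_len = strlen(e->utf8);
                        consumed = name_len + 2;   /* '&' + name + ';' */
                        break;
                    }
                }
            }
        }
        if (o + piece_len >= outsz)
            return -1;            /* keep room for the terminating NUL */
        memcpy(out + o, piece, piece_len);
        o += piece_len;
        in += consumed;
    }
    out[o] = '\0';
    return 0;
}
```

A reference that is not in the table, such as "&foo;", passes through
untouched, so the sketch only ever adds to what the parser already
decoded.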




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40794; Package emacs. (Wed, 29 Jul 2020 05:27:01 GMT)

Message #8 received at 40794 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Tim Landscheidt <tim <at> tim-landscheidt.de>
Cc: 40794 <at> debbugs.gnu.org
Subject: Re: bug#40794: 26.3; HTML entities &star; and &starf; (inter alia)
 are not parsed by libxml-parse-html-region
Date: Wed, 29 Jul 2020 07:26:15 +0200

Tim Landscheidt <tim <at> tim-landscheidt.de> writes:

> (Prologue: This bug showed up in the "ALT" attribute of an
> "IMG" element of an HTML mail in Gnus.  I am reasonably certain
> that this stems from libxml-parse-html-region and should be
> fixed there, but there may be more prudent solutions.)

[...]

> These should instead yield "ä" (228), "☆" (9734) and
> "★" (9733).
>
> lisp/leim/quail/sgml-input.el seems to contain the necessary
> data for &star; and &starf; that could probably be fed to
> libxml.

As far as I can tell, libxml2 doesn't take a list of entities as an
input when parsing HTML?  I may have missed something...

Hm, a bit of googling shows http://xmlsoft.org/html/libxml-entities.html
and there is apparently a way to tell libxml2 about further entities?

But I think this all sounds more like a libxml2 than an Emacs bug,
really?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40794; Package emacs. (Wed, 29 Jul 2020 05:37:02 GMT)

Message #11 received at 40794 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Tim Landscheidt <tim <at> tim-landscheidt.de>
Cc: 40794 <at> debbugs.gnu.org
Subject: Re: bug#40794: 26.3; HTML entities &star; and &starf; (inter alia)
 are not parsed by libxml-parse-html-region
Date: Wed, 29 Jul 2020 07:35:51 +0200

I had a look at the libxml2 sources.  The logic isn't really explained,
but apparently they include all the <255-value entities, and then a
selected number of the other entities (about 160 of them).

I have no idea what the logic behind this is...  perhaps they've just
forgotten to add the new ones?  Which makes me think that this is really
a libxml2 bug, and you should report it there instead.

Excerpt:

/************************************************************************
 *									*
 *	The list of HTML predefined entities			*
 *									*
 ************************************************************************/

static const htmlEntityDesc  html40EntitiesTable[] = {
/*
 * the 4 absolute ones, plus apostrophe.
 */
{ 34,	"quot",	"quotation mark = APL quote, U+0022 ISOnum" },
{ 38,	"amp",	"ampersand, U+0026 ISOnum" },
{ 39,	"apos",	"single quote" },
{ 60,	"lt",	"less-than sign, U+003C ISOnum" },
{ 62,	"gt",	"greater-than sign, U+003E ISOnum" },

/*
 * A bunch still in the 128-255 range
 * Replacing them depend really on the charset used.
 */
{ 160,	"nbsp",	"no-break space = non-breaking space, U+00A0 ISOnum" },
{ 161,	"iexcl","inverted exclamation mark, U+00A1 ISOnum" },
{ 162,	"cent",	"cent sign, U+00A2 ISOnum" },

[...]

{ 376,	"Yuml",	"latin capital letter Y with diaeresis, U+0178 ISOlat2" },

/*
 * Anything below should really be kept as entities references
 */
{ 402,	"fnof",	"latin small f with hook = function = florin, U+0192 ISOtech" },

{ 710,	"circ",	"modifier letter circumflex accent, U+02C6 ISOpub" },
{ 732,	"tilde","small tilde, U+02DC ISOdia" },

{ 913,	"Alpha","greek capital letter alpha, U+0391" },
{ 914,	"Beta",	"greek capital letter beta, U+0392" },
{ 915,	"Gamma","greek capital letter gamma, U+0393 ISOgrk3" },
{ 916,	"Delta","greek capital letter delta, U+0394 ISOgrk3" },

[...]

{ 9824,	"spades","black spade suit, U+2660 ISOpub" },
{ 9827,	"clubs","black club suit = shamrock, U+2663 ISOpub" },
{ 9829,	"hearts","black heart suit = valentine, U+2665 ISOpub" },
{ 9830,	"diams","black diamond suit, U+2666 ISOpub" },
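
The effect of a fixed table like this can be sketched in a few lines of
C.  This is only an illustration, not libxml2's code: `struct
entity_desc`, `entity_table`, and `lookup_entity` are hypothetical
names, the rows are copied from the excerpt (descriptions dropped), and
the linear search is an assumption made for brevity.  A name that is
missing from the table, such as "star" or "starf", simply has no match,
which is consistent with the reference being left in the text as-is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a table row: code point plus entity name. */
struct entity_desc {
    unsigned int value;
    const char *name;
};

/* A few rows copied from the excerpt; "star" and "starf" are absent. */
static const struct entity_desc entity_table[] = {
    { 34,   "quot" },
    { 38,   "amp" },
    { 160,  "nbsp" },
    { 376,  "Yuml" },
    { 9824, "spades" },
};

/* Return the matching row, or NULL when the name is not in the table. */
const struct entity_desc *lookup_entity(const char *name)
{
    size_t n = sizeof entity_table / sizeof entity_table[0];

    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(entity_table[i].name, name) == 0)
            return &entity_table[i];
    }
    return NULL;
}
```

With these rows, lookup_entity("spades") returns the 9824 entry, while
lookup_entity("star") returns NULL.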


-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40794; Package emacs. (Wed, 09 Sep 2020 13:23:02 GMT)

Message #14 received at 40794 <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 40794 <at> debbugs.gnu.org, Tim Landscheidt <tim <at> tim-landscheidt.de>
Subject: Re: bug#40794: 26.3; HTML entities &star; and &starf; (inter alia)
 are not parsed by libxml-parse-html-region
Date: Wed, 9 Sep 2020 06:22:11 -0700

Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> I had a look at the libxml2 sources.  The logic isn't really explained,
> but apparently they include all the <255-value entities, and then a
> selected number of the other entities (about 160 of them).
>
> I have no idea what the logic behind this is...  perhaps they've just
> forgotten to add the new ones?  Which makes me think that this is really
> a libxml2 bug, and you should report it there instead.

Agreed.  Tim, could you please report this to the libxml2 developers?




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#40794; Package emacs. (Wed, 25 Nov 2020 10:04:02 GMT)

Message #17 received at 40794 <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 40794 <at> debbugs.gnu.org, Tim Landscheidt <tim <at> tim-landscheidt.de>
Subject: Re: bug#40794: 26.3; HTML entities &star; and &starf; (inter alia)
 are not parsed by libxml-parse-html-region
Date: Wed, 25 Nov 2020 02:03:39 -0800

tags 40794 notabug
close 40794
thanks

Stefan Kangas <stefan <at> marxist.se> writes:

> Lars Ingebrigtsen <larsi <at> gnus.org> writes:
>
>> I had a look at the libxml2 sources.  The logic isn't really explained,
>> but apparently they include all the <255-value entities, and then a
>> selected number of the other entities (about 160 of them).
>>
>> I have no idea what the logic behind this is...  perhaps they've just
>> forgotten to add the new ones?  Which makes me think that this is really
>> a libxml2 bug, and you should report it there instead.
>
> Agreed.  Tim, could you please report this to the libxml2 developers?

That was 10 weeks ago, and we seem to agree that this is not a bug in
Emacs.  I'm therefore closing this bug report.

Please report this issue to the libxml2 developers if it is still an
issue.




Added tag(s) notabug. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Wed, 25 Nov 2020 10:04:02 GMT)

Bug closed; send any further explanations to 40794 <at> debbugs.gnu.org and Tim Landscheidt <tim <at> tim-landscheidt.de>. Request was from Stefan Kangas <stefan <at> marxist.se> to control <at> debbugs.gnu.org. (Wed, 25 Nov 2020 10:04:02 GMT)

Bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Wed, 23 Dec 2020 12:24:11 GMT)

This bug report was last modified 3 years and 96 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.