GNU bug report logs - #48211
28.0.50; eww strips whitespace between <mark> elements

Package: emacs;

Reported by: Stefan Kangas <stefan <at> marxist.se>

Date: Mon, 3 May 2021 23:17:02 UTC

Severity: normal

Found in versions 24.1, 28.0.50

Fixed in version 29.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 48211 in the body.
You can then email your comments to 48211 AT debbugs.gnu.org in the normal way.


Report forwarded to larsi <at> gnus.org, bug-gnu-emacs <at> gnu.org:
bug#48211; Package emacs. (Mon, 03 May 2021 23:17:02 GMT)

Acknowledgement sent to Stefan Kangas <stefan <at> marxist.se>:
New bug report received and forwarded. Copy sent to larsi <at> gnus.org, bug-gnu-emacs <at> gnu.org. (Mon, 03 May 2021 23:17:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: bug-gnu-emacs <at> gnu.org
Subject: 28.0.50; eww strips whitespace between <mark> elements
Date: Mon, 3 May 2021 18:16:06 -0500

Opening a HTML file in eww with <mark> elements strips whitespace
between elements.

Steps to reproduce:

0. echo "<p><mark>foo</mark> <mark>bar</mark></p>" > /tmp/foo.html
1. emacs -Q
2. M-x eww RET file:///tmp/foo.html RET

Result is that I see, in the eww buffer:

    "foobar"

Expected result is:

    "foo bar"

For a real world example where this matters, see:

    https://dle.rae.es/palabra

In eww, I get:

  1. f. Unidadlingüística, dotadageneralmentedesignificado,
  queseseparadelasdemásmediantepausaspotencialesenlapronunciaciónyblancosenlaescritura.

In Firefox, I get:

  1. f. Unidad lingüística, dotada generalmente de significado, que se
  separa de las demás mediante pausas potenciales en la pronunciación y
  blancos en la escritura.




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48211; Package emacs. (Mon, 03 May 2021 23:56:01 GMT)

Message #8 received at 48211 <at> debbugs.gnu.org:

From: "Basil L. Contovounesios" <contovob <at> tcd.ie>
To: Stefan Kangas <stefan <at> marxist.se>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 48211 <at> debbugs.gnu.org
Subject: Re: bug#48211: 28.0.50; eww strips whitespace between <mark> elements
Date: Tue, 04 May 2021 00:55:03 +0100

found 48211 24.1
quit

Stefan Kangas <stefan <at> marxist.se> writes:

> Opening a HTML file in eww with <mark> elements strips whitespace
> between elements.

I think this is because libxml-parse-html-region specifies
HTML_PARSE_NOBLANKS:

Return CDATA sections (like <style>foo</style>) as text nodes.
3c2317e891 2010-12-06 17:59:52 +0100
https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33

-- 
Basil




bug Marked as found in versions 24.1. Request was from "Basil L. Contovounesios" <contovob <at> tcd.ie> to control <at> debbugs.gnu.org. (Mon, 03 May 2021 23:56:02 GMT)

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48211; Package emacs. (Tue, 04 May 2021 00:36:02 GMT)

Message #13 received at 48211 <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: "Basil L. Contovounesios" <contovob <at> tcd.ie>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 48211 <at> debbugs.gnu.org
Subject: Re: bug#48211: 28.0.50; eww strips whitespace between <mark> elements
Date: Mon, 3 May 2021 19:35:35 -0500

"Basil L. Contovounesios" <contovob <at> tcd.ie> writes:

> I think this is because libxml-parse-html-region specifies
> HTML_PARSE_NOBLANKS:
>
> Return CDATA sections (like <style>foo</style>) as text nodes.
> 3c2317e891 2010-12-06 17:59:52 +0100
> https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33

Hmm, okay.  For now, I'm seeing this issue with basically any tag that
libxml2 does not already know about, e.g. "<summary>" or "<bdi>".

This is what I came up with before reading Basil's reply:

(with-temp-buffer
  (insert "<p><tt>foo</tt> <tt>bar</tt></p>")
  (libxml-parse-html-region (point-min) (point-max)))

=> (html nil (body nil (p nil (tt nil "foo") " " (tt nil "bar"))))

(with-temp-buffer
  (insert "<p><mark>foo</mark> <mark>bar</mark></p>")
  (libxml-parse-html-region (point-min) (point-max)))

=> (html nil (body nil (p nil (mark nil "foo") (mark nil "bar"))))

I guess this is a bug in libxml2, so I reported it here:

    https://gitlab.gnome.org/GNOME/libxml2/-/issues/247

FWIW, the below diff works around this bug for me.

diff --git a/lisp/net/shr.el b/lisp/net/shr.el
index cbdeb65ba8..3eb3a5bc49 100644
--- a/lisp/net/shr.el
+++ b/lisp/net/shr.el
@@ -1485,6 +1485,12 @@ shr-tag-tt
   ;; The `tt' tag is deprecated in favor of `code'.
   (shr-tag-code dom))

+(defun shr-tag-mark (dom)
+  (shr-generic dom)
+  ;; Hack to work around bug in libxml2 (Bug#48211):
+  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
+  (insert " "))
+
 (defun shr-tag-ins (cont)
   (let* ((start (point))
          (color "green")




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48211; Package emacs. (Tue, 04 May 2021 00:52:02 GMT)

Message #16 received at 48211 <at> debbugs.gnu.org:

From: Stefan Kangas <stefan <at> marxist.se>
To: "Basil L. Contovounesios" <contovob <at> tcd.ie>
Cc: Lars Ingebrigtsen <larsi <at> gnus.org>, 48211 <at> debbugs.gnu.org
Subject: Re: bug#48211: 28.0.50; eww strips whitespace between <mark> elements
Date: Mon, 3 May 2021 19:51:06 -0500

Stefan Kangas <stefan <at> marxist.se> writes:

> FWIW, the below diff works around this bug for me.
>
> diff --git a/lisp/net/shr.el b/lisp/net/shr.el
> index cbdeb65ba8..3eb3a5bc49 100644
> --- a/lisp/net/shr.el
> +++ b/lisp/net/shr.el
> @@ -1485,6 +1485,12 @@ shr-tag-tt
>    ;; The `tt' tag is deprecated in favor of `code'.
>    (shr-tag-code dom))
>
> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))
> +
>  (defun shr-tag-ins (cont)
>    (let* ((start (point))
>           (color "green")

Well, I should moderate that statement.

It doesn't exactly fix the bug as I'm now getting this instead:

    1. f. Unidad lingüística , dotada generalmente de significado , que
    se separa de las demás mediante pausas potenciales en la
    pronunciación y blancos en la escritura .

    2. f. Representación gráfica de la palabra hablada .

    3. f. Facultad de hablar .

IOW, whitespace is added even if the following character is
punctuation...
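
One way to avoid the stray spaces before punctuation might be to repair the parsed DOM instead of the rendered text. The sketch below is only an illustration and is not the change that was later pushed for this bug. `my-dom-respace-marks` is a hypothetical helper, not part of shr.el, and it assumes that the dropped blank text node always sat between two adjacent <mark> elements, so it would also add a space where the original HTML had none.

```elisp
;; Hypothetical helper; not part of shr.el or of the fix for Bug#48211.
(defun my-dom-respace-marks (node)
  "Return a copy of NODE with \" \" inserted between adjacent <mark> children.
NODE is a DOM tree as returned by `libxml-parse-html-region'."
  (if (not (consp node))
      node                              ; text node: keep as is
    (let ((result nil)
          (prev-was-mark nil))
      (dolist (child (cddr node))
        (let ((is-mark (eq (car-safe child) 'mark)))
          ;; Two <mark> elements in a row: assume a blank text node
          ;; was dropped between them and put a single space back.
          (when (and is-mark prev-was-mark)
            (push " " result))
          (push (my-dom-respace-marks child) result)
          (setq prev-was-mark is-mark)))
      ;; Rebuild the node as (TAG ATTRIBUTES CHILDREN...).
      (append (list (car node) (cadr node)) (nreverse result)))))
```

Applied to the <mark> DOM shown earlier in this thread, (html nil (body nil (p nil (mark nil "foo") (mark nil "bar")))), this returns a tree with " " between the two mark nodes. A <mark> followed by a text node such as ", " is left alone, since the two marks are no longer adjacent.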




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#48211; Package emacs. (Fri, 01 Jul 2022 11:47:01 GMT)

Message #19 received at 48211 <at> debbugs.gnu.org:

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: Stefan Kangas <stefan <at> marxist.se>
Cc: "Basil L. Contovounesios" <contovob <at> tcd.ie>, 48211 <at> debbugs.gnu.org
Subject: Re: bug#48211: 28.0.50; eww strips whitespace between <mark> elements
Date: Fri, 01 Jul 2022 13:46:33 +0200

Stefan Kangas <stefan <at> marxist.se> writes:

> I guess this is a bug in libxml2, so I reported it here:
>
>     https://gitlab.gnome.org/GNOME/libxml2/-/issues/247

[...]

> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))

I've now pushed a variation of this to Emacs 29, and included a face and
stuff, as

https://www.w3schools.com/tags/tag_mark.asp

recommends.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




bug marked as fixed in version 29.1, send any further explanations to 48211 <at> debbugs.gnu.org and Stefan Kangas <stefan <at> marxist.se> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 01 Jul 2022 11:47:02 GMT)

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 30 Jul 2022 11:24:04 GMT)

This bug report was last modified 1 year and 271 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.