GNU bug report logs - #48211
28.0.50; eww strips whitespace between <mark> elements
Reported by: Stefan Kangas <stefan <at> marxist.se>
Date: Mon, 3 May 2021 23:17:02 UTC
Severity: normal
Found in versions 24.1, 28.0.50
Fixed in version 29.1
Done: Lars Ingebrigtsen <larsi <at> gnus.org>
Bug is archived. No further changes may be made.
To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 48211 in the body.
You can then email your comments to 48211 AT debbugs.gnu.org in the normal way.
Report forwarded to larsi <at> gnus.org, bug-gnu-emacs <at> gnu.org: bug#48211; Package emacs. (Mon, 03 May 2021 23:17:02 GMT)
Acknowledgement sent to Stefan Kangas <stefan <at> marxist.se>: New bug report received and forwarded. Copy sent to larsi <at> gnus.org, bug-gnu-emacs <at> gnu.org. (Mon, 03 May 2021 23:17:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Opening an HTML file with <mark> elements in eww strips the whitespace
between those elements.
Steps to reproduce:
0. echo "<p><mark>foo</mark> <mark>bar</mark></p>" > /tmp/foo.html
1. emacs -Q
2. M-x eww RET file:///tmp/foo.html RET
Result is that I see, in the eww buffer:
"foobar"
Expected result is:
"foo bar"
For a real world example where this matters, see:
https://dle.rae.es/palabra
In eww, I get:
1. f. Unidadlingüística, dotadageneralmentedesignificado,
queseseparadelasdemásmediantepausaspotencialesenlapronunciaciónyblancosenlaescritura.
In Firefox, I get:
1. f. Unidad lingüística, dotada generalmente de significado, que se
separa de las demás mediante pausas potenciales en la pronunciación y
blancos en la escritura.
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#48211; Package emacs. (Mon, 03 May 2021 23:56:01 GMT)
Message #8 received at 48211 <at> debbugs.gnu.org (full text, mbox):
found 48211 24.1
quit
Stefan Kangas <stefan <at> marxist.se> writes:
> Opening an HTML file with <mark> elements in eww strips the whitespace
> between those elements.
I think this is because libxml-parse-html-region specifies
HTML_PARSE_NOBLANKS:
Return CDATA sections (like <style>foo</style>) as text nodes.
3c2317e891 2010-12-06 17:59:52 +0100
https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33
--
Basil
bug Marked as found in versions 24.1. Request was from "Basil L. Contovounesios" <contovob <at> tcd.ie> to control <at> debbugs.gnu.org. (Mon, 03 May 2021 23:56:02 GMT)
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#48211; Package emacs. (Tue, 04 May 2021 00:36:02 GMT)
Message #13 received at 48211 <at> debbugs.gnu.org (full text, mbox):
"Basil L. Contovounesios" <contovob <at> tcd.ie> writes:
> I think this is because libxml-parse-html-region specifies
> HTML_PARSE_NOBLANKS:
>
> Return CDATA sections (like <style>foo</style>) as text nodes.
> 3c2317e891 2010-12-06 17:59:52 +0100
> https://git.sv.gnu.org/cgit/emacs.git/commit/?id=3c2317e89100833812a7194c0d9d39ae0f52cb33
Hmm, okay. For now, I'm seeing this issue with basically any tag that
libxml2 does not already know about, e.g. "<summary>" or "<bdi>".
This is what I came up with before reading Basil's reply:
(with-temp-buffer
  (insert "<p><tt>foo</tt> <tt>bar</tt></p>")
  (libxml-parse-html-region (point-min) (point-max)))
=> (html nil (body nil (p nil (tt nil "foo") " " (tt nil "bar"))))

(with-temp-buffer
  (insert "<p><mark>foo</mark> <mark>bar</mark></p>")
  (libxml-parse-html-region (point-min) (point-max)))
=> (html nil (body nil (p nil (mark nil "foo") (mark nil "bar"))))
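The difference between the two parse results above can be checked mechanically. What follows is only a minimal sketch: the helper name `my/space-between-children-p` is hypothetical and not part of Emacs. It relies on `dom-by-tag` and `dom-children` from dom.el and works on the DOM lists exactly as quoted, so it does not need libxml2 to run.

```elisp
(require 'dom)

(defun my/space-between-children-p (dom tag)
  "Return t if the first TAG element in DOM has a \" \" text child.
Hypothetical helper for illustration; not part of Emacs."
  (let ((node (car (dom-by-tag dom tag))))
    (and node (member " " (dom-children node)) t)))

;; The <tt> result quoted above keeps the separating space:
(my/space-between-children-p
 '(html nil (body nil (p nil (tt nil "foo") " " (tt nil "bar"))))
 'p)
;; => t

;; The <mark> result quoted above has lost it:
(my/space-between-children-p
 '(html nil (body nil (p nil (mark nil "foo") (mark nil "bar"))))
 'p)
;; => nil
```

Passing the value returned by `libxml-parse-html-region` instead of a quoted list would test a given libxml2 build directly; the outcome then depends on the installed library.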
I guess this is a bug in libxml2, so I reported it here:
https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
FWIW, the below diff works around this bug for me.
diff --git a/lisp/net/shr.el b/lisp/net/shr.el
index cbdeb65ba8..3eb3a5bc49 100644
--- a/lisp/net/shr.el
+++ b/lisp/net/shr.el
@@ -1485,6 +1485,12 @@ shr-tag-tt
   ;; The `tt' tag is deprecated in favor of `code'.
   (shr-tag-code dom))
 
+(defun shr-tag-mark (dom)
+  (shr-generic dom)
+  ;; Hack to work around bug in libxml2 (Bug#48211):
+  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
+  (insert " "))
+
 (defun shr-tag-ins (cont)
   (let* ((start (point))
          (color "green")
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#48211; Package emacs. (Tue, 04 May 2021 00:52:02 GMT)
Message #16 received at 48211 <at> debbugs.gnu.org (full text, mbox):
Stefan Kangas <stefan <at> marxist.se> writes:
> FWIW, the below diff works around this bug for me.
>
> diff --git a/lisp/net/shr.el b/lisp/net/shr.el
> index cbdeb65ba8..3eb3a5bc49 100644
> --- a/lisp/net/shr.el
> +++ b/lisp/net/shr.el
> @@ -1485,6 +1485,12 @@ shr-tag-tt
>    ;; The `tt' tag is deprecated in favor of `code'.
>    (shr-tag-code dom))
>
> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))
> +
>  (defun shr-tag-ins (cont)
>    (let* ((start (point))
>           (color "green")
Well, I should moderate that statement.
It doesn't exactly fix the bug as I'm now getting this instead:
1. f. Unidad lingüística , dotada generalmente de significado , que
se separa de las demás mediante pausas potenciales en la
pronunciación y blancos en la escritura .
2. f. Representación gráfica de la palabra hablada .
3. f. Facultad de hablar .
IOW, whitespace is added even if the following character is
punctuation...
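One conceivable way to tidy up the stray spaces is to post-process the rendered text. The sketch below is an illustration only, not the change that was eventually pushed; the function name `my/strip-space-before-punctuation` is hypothetical, and the set of punctuation characters is an assumption. It would also remove spaces that were intentional in the source (for example French-style spacing before ":" or "?").

```elisp
(defun my/strip-space-before-punctuation (string)
  "Return STRING with spaces and tabs before , . ; : ! ? removed.
Hypothetical helper for illustration; not part of shr.el."
  ;; FIXEDCASE is t so the replacement text is inserted unchanged.
  (replace-regexp-in-string "[ \t]+\\([,.;:!?]\\)" "\\1" string t))

(my/strip-space-before-punctuation
 "Unidad lingüística , dotada generalmente de significado , que")
;; => "Unidad lingüística, dotada generalmente de significado, que"

(my/strip-space-before-punctuation "Facultad de hablar .")
;; => "Facultad de hablar."
```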
Information forwarded to bug-gnu-emacs <at> gnu.org: bug#48211; Package emacs. (Fri, 01 Jul 2022 11:47:01 GMT)
Message #19 received at 48211 <at> debbugs.gnu.org (full text, mbox):
Stefan Kangas <stefan <at> marxist.se> writes:
> I guess this is a bug in libxml2, so I reported it here:
>
> https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
[...]
> +(defun shr-tag-mark (dom)
> +  (shr-generic dom)
> +  ;; Hack to work around bug in libxml2 (Bug#48211):
> +  ;; https://gitlab.gnome.org/GNOME/libxml2/-/issues/247
> +  (insert " "))
I've now pushed a variation of this to Emacs 29, and included a face and
stuff, as
https://www.w3schools.com/tags/tag_mark.asp
recommends.
--
(domestic pets only, the antidote for overdose, milk.)
bloggy blog: http://lars.ingebrigtsen.no
bug marked as fixed in version 29.1, send any further explanations to 48211 <at> debbugs.gnu.org and Stefan Kangas <stefan <at> marxist.se>. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Fri, 01 Jul 2022 11:47:02 GMT)
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sat, 30 Jul 2022 11:24:04 GMT)
This bug report was last modified 1 year and 271 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.