GNU bug report logs - #31665
`libxml-parse-html-region' doesn't extract text in tables


Package: emacs;

Reported by: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>

Date: Thu, 31 May 2018 09:56:02 UTC

Severity: minor

Tags: fixed, moreinfo

Fixed in version 27.1

Done: Lars Ingebrigtsen <larsi <at> gnus.org>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 31665 in the body.
You can then email your comments to 31665 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Thu, 31 May 2018 09:56:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>:
New bug report received and forwarded. Copy sent to bug-gnu-emacs <at> gnu.org. (Thu, 31 May 2018 09:56:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: bug-gnu-emacs <at> gnu.org
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>
Subject: libxml-parse-html-region' doesn't extract text in tables
Date: Thu, 31 May 2018 17:55:04 +0800
Dear bug-gnu-emacs, `libxml-parse-html-region' doesn't extract text in <table>s:

KY> I found that Emacs' built-in function `libxml-parse-html-region'
KY> doesn't extract text existing in the table clause.
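
To tell whether text is lost by the parser or only later by the renderer, one can inspect the parse tree directly. The following is a minimal sketch, assuming an Emacs built with libxml2 support; the helper name `my/html-tree-mentions-p' and the sample HTML are hypothetical and not taken from this report.

```elisp
;; Sketch only: `my/html-tree-mentions-p' is a hypothetical helper name.
(defun my/html-tree-mentions-p (html text)
  "Return non-nil if TEXT occurs in the printed parse tree of HTML.
HTML is parsed with `libxml-parse-html-region', which needs libxml2."
  (unless (fboundp 'libxml-parse-html-region)
    (error "This Emacs was built without libxml2 support"))
  (let ((tree (with-temp-buffer
                (insert html)
                (libxml-parse-html-region (point-min) (point-max)))))
    ;; Print the whole tree as a string and search it; crude, but
    ;; enough to see whether the parser kept the text at all.
    (and (string-match-p (regexp-quote text) (format "%S" tree)) t)))

;; Example call with made-up HTML:
(my/html-tree-mentions-p
 "<table><tr><td>hello</td></tr></table>" "hello")
```

If this returns non-nil for a message whose rendered buffer lacks the text, the loss happens after parsing rather than in `libxml-parse-html-region' itself.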




Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Thu, 31 May 2018 10:59:02 GMT) Full text and rfc822 format available.

Message #8 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 31665 <at> debbugs.gnu.org
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Thu, 31 May 2018 12:58:37 +0200
積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:

> Dear bug-gnu-emacs, libxml-parse-html-region' doesn't extract text in
> <table>s,

Do you have an example table that `libxml-parse-html-region' doesn't
"extract" text from?

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Added tag(s) moreinfo. Request was from Noam Postavsky <npostavs <at> gmail.com> to control <at> debbugs.gnu.org. (Sun, 03 Jun 2018 00:19:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Thu, 07 Jun 2018 07:41:01 GMT) Full text and rfc822 format available.

Message #13 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 31665 <at> debbugs.gnu.org
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Thu, 07 Jun 2018 04:50:15 +0800
[Message part 1 (text/plain, inline)]
>>>>> "LI" == Lars Ingebrigtsen <larsi <at> gnus.org> writes:

LI> Do you have an example table that `libxml-parse-html-region' doesn't
LI> "extract" text from?

OK, here is a mail from which I cleaned out my personal phone bill details:
[gg.gz (application/gzip, attachment)]

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Sun, 29 Sep 2019 08:36:02 GMT) Full text and rfc822 format available.

Message #16 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 31665 <at> debbugs.gnu.org
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Sun, 29 Sep 2019 10:34:56 +0200
積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:

>>>>>> "LI" == Lars Ingebrigtsen <larsi <at> gnus.org> writes:
>
> LI> Do you have an example table that `libxml-parse-html-region' doesn't
> LI> "extract" text from?
>
> OK here is a mail that I cleaned off my personal phone bill from:

What is it you think is missing from that table?  I don't read Chinese,
but there didn't seem to be any text in that table, just a bunch of
images.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Added tag(s) moreinfo. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Sun, 29 Sep 2019 08:36:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Sun, 29 Sep 2019 16:53:01 GMT) Full text and rfc822 format available.

Message #21 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 31665 <at> debbugs.gnu.org
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Mon, 30 Sep 2019 00:52:40 +0800
>>>>> "LI" == Lars Ingebrigtsen <larsi <at> gnus.org> writes:
LI> 積丹尼 Dan Jacobson <jidanni <at> jidanni.org> writes:

>>>>>>> "LI" == Lars Ingebrigtsen <larsi <at> gnus.org> writes:
>> 
LI> Do you have an example table that `libxml-parse-html-region' doesn't
LI> "extract" text from?
>> 
>> OK here is a mail that I cleaned off my personal phone bill from:

LI> What was it you think is missing from that table?  I don't read Chinese,
LI> but there didn't seem to be any text in that table, just a bunch of
LI> images.

It should look like:

+----------------------------------------------------------------------------------------------------------------------------------------------------+
|+---------------------------------------------------------------------------------------------------------------------+                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||[banner2]                                                                                                         | |                             |
|||------------------------------------------------------------------------------------------------------------------| |                             |
|||+---------------------------------------------------------------------------------------------------------------+ | |                             |
||||                                    |親愛的客戶,您好:                   |                                    | | |                             |
||||                                    |-------------------------------------|                                    | | |                             |
||||                                    |為保障您資料的安全,請輸入密碼開啟附 |                                    | | |                             |
||||                                    |加檔案瀏覽您本期的帳單,密碼為『身分 |                                    | | |                             |
||||               [IS1]                |證號碼』(英文字母須大寫),營業人客戶 |               [IS2]                | | |                             |
||||                                    |不需輸入密碼即可瀏覽。               |                                    | | |                             |
||||                                    |若無法開啟附加檔案,請先確認是否已下 |                                    | | |                             |
||||                                    |載Acrobat Reader軟體。               |                                    | | |                             |
||||                                    |-------------------------------------|                                    | | |                             |
|||+---------------------------------------------------------------------------------------------------------------+ | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+-------------------------------------------------------------------------------------------------------------------+|                             |
|||[new1]                                                                                                             ||                             |
|||+-----------------------------------------------------------------------------------------------------------------+||                             |
||||                                                        |                                                [enf201]|||                             |
||||                                                        |--------------------------------------------------------|||                             |
||||[end101]                                                |                                                [enl301]|||                             |
||||                                                        |--------------------------------------------------------|||                             |
||||                                                        |                                                [enl401]|||                             |
|||+-----------------------------------------------------------------------------------------------------------------+||                             |
||+-------------------------------------------------------------------------------------------------------------------+|                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||[hot1]                                                                                                            | |                             |
|||------------------------------------------------------------------------------------------------------------------| |                             |
|||+----------------------------------+                                                                              | |                             |
||||[hot1]|[hot2]|[hot3]|[hot4]|[hot5]|                                                                              | |                             |
|||+----------------------------------+                                                                              | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||[link1]                                                                                                           | |                             |
|||+-----------------------------------------------------------------+                                               | |                             |
||||||            |                |                |                |                                               | |                             |
||||++------------+----------------+----------------+----------------|                                               | |                             |
||||||電子帳單Q&A |    費率說明    |  客戶消費資訊  |    線上繳費    |                                               | |                             |
||||++------------+----------------+----------------+----------------|                                               | |                             |
||||||  服務專線  |    貼心提醒    |不可不知行動優惠| HiNet好康優惠  |                                               | |                             |
|||+-----------------------------------------------------------------+                                               | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
||++                                                                                                                   |                             |
||||                                                                                                                   |                             |
||++                                                                                                                   |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|||                                                      [cht]                                                       | |                             |
||+------------------------------------------------------------------------------------------------------------------+ |                             |
|+---------------------------------------------------------------------------------------------------------------------+                             |
+----------------------------------------------------------------------------------------------------------------------------------------------------+

But instead all we get is:

From: Phone Co. <p <at> cht.com.tw>
Subject: Phone Bill
To: "jidanni <at> jidanni.org" <jidanni <at> jidanni.org>
Date: Thu, 17 May 2018 12:12:06 +0800
Reply-To: x <at> cht.com.tw

[1. text/html]
中華電信電子帳單

*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Mon, 30 Sep 2019 05:06:01 GMT) Full text and rfc822 format available.

Message #24 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 31665 <at> debbugs.gnu.org
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Mon, 30 Sep 2019 07:05:23 +0200
The HTML in that email is invalid.  It's basically of the form

<table>
  <tbody>
    foo
  </tbody>
</table>

"foo" won't be rendered by shr.

shr does try to deal with invalid tables, though.  If the <tbody>
elements hadn't been there, then the "foo" would have been rendered,
so I guess some more work is required in that area.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no
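
The reduced case in the message above can be tried out with a short snippet. This is a sketch, assuming an Emacs built with libxml2 support; `my/shr-render-to-string' is a hypothetical helper name, and binding `shr-use-fonts' and `shr-width' is only there to make the result independent of the current frame. Whether "foo" shows up in the result depends on the Emacs version: going by this thread, it is dropped before the fix and rendered from 27.1 on.

```elisp
(require 'shr)

;; Sketch only: `my/shr-render-to-string' is a hypothetical helper name.
(defun my/shr-render-to-string (html)
  "Parse HTML with libxml2 and return the text that shr renders for it."
  (unless (fboundp 'libxml-parse-html-region)
    (error "This Emacs was built without libxml2 support"))
  (let ((dom (with-temp-buffer
               (insert html)
               (libxml-parse-html-region (point-min) (point-max))))
        ;; Fixed-width rendering at a fixed column count, so the
        ;; output does not depend on fonts or window size.
        (shr-use-fonts nil)
        (shr-width 70))
    (with-temp-buffer
      (shr-insert-document dom)
      (buffer-substring-no-properties (point-min) (point-max)))))

;; Text placed directly inside <tbody>, as in the reduced example:
(my/shr-render-to-string "<table><tbody>foo</tbody></table>")
```

Comparing this result with the one for the same markup without the <tbody> element shows whether a given Emacs build still treats the two cases differently.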





Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Mon, 30 Sep 2019 05:29:01 GMT) Full text and rfc822 format available.

Message #27 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: Lars Ingebrigtsen <larsi <at> gnus.org>
To: 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Cc: Katsumi Yamaoka <yamaoka <at> jpl.org>, 31665 <at> debbugs.gnu.org
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Mon, 30 Sep 2019 07:28:19 +0200
Lars Ingebrigtsen <larsi <at> gnus.org> writes:

> shr does try to deal with invalid tables, though.  If the <tbody>
> elements hadn't been there, then the "foo" would have been, so I guess
> some more work is required in that area.

I've now fixed this on the trunk.

-- 
(domestic pets only, the antidote for overdose, milk.)
   bloggy blog: http://lars.ingebrigtsen.no




Added tag(s) fixed. Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 30 Sep 2019 05:29:03 GMT) Full text and rfc822 format available.

bug marked as fixed in version 27.1, send any further explanations to 31665 <at> debbugs.gnu.org and 積丹尼 Dan Jacobson <jidanni <at> jidanni.org> Request was from Lars Ingebrigtsen <larsi <at> gnus.org> to control <at> debbugs.gnu.org. (Mon, 30 Sep 2019 05:29:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-gnu-emacs <at> gnu.org:
bug#31665; Package emacs. (Tue, 01 Oct 2019 02:44:01 GMT) Full text and rfc822 format available.

Message #34 received at 31665 <at> debbugs.gnu.org (full text, mbox):

From: Katsumi Yamaoka <yamaoka <at> jpl.org>
To: Lars Ingebrigtsen <larsi <at> gnus.org>
Cc: 31665 <at> debbugs.gnu.org,
 積丹尼 Dan Jacobson <jidanni <at> jidanni.org>
Subject: Re: bug#31665: libxml-parse-html-region' doesn't extract text in
 tables
Date: Tue, 01 Oct 2019 11:43:09 +0900
On Mon, 30 Sep 2019 07:28:19 +0200, Lars Ingebrigtsen wrote:
> I've now fixed this on the trunk.

Verified.  Thank you for improving it!




bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 29 Oct 2019 11:24:07 GMT) Full text and rfc822 format available.

This bug report was last modified 4 years and 179 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.