Subject: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
From: Norihiro Tanaka <noritnk@HIDDEN>
To: 24160 <at> debbugs.gnu.org
X-Debbugs-Original-To: <bug-sed@HIDDEN>
Date: Fri, 05 Aug 2016 22:51:16 +0900
Message-Id: <20160805225116.64FE.27F6AC2D@HIDDEN>
X-GNU-PR-Message: report 24160
X-GNU-PR-Package: sed
X-GNU-PR-Keywords: patch
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------_57A497D20000000064F2_MULTIPART_MIXED_"
Content-Transfer-Encoding: 7bit

--------_57A497D20000000064F2_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

Hi,

We can speed up sed by caching the results of mbrtowc() for single-byte
characters. This is especially effective in non-UTF-8 multibyte locales,
where the conversion is expensive.

$ yes $(printf %040d 0) | head -1000000 >k

Before:

$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
real 1.93
user 1.61
sys 0.27

After patching:

$ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
real 0.46
user 0.42
sys 0.03

Thanks,
Norihiro

--------_57A497D20000000064F2_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"; name="0001-sed-cache-results-of-mbrtowc-for-speed.patch"
Content-Disposition: attachment; filename="0001-sed-cache-results-of-mbrtowc-for-speed.patch"
Content-Transfer-Encoding: base64

RnJvbSBkYzI3NzM5NDQxNTRiMzA1Yzg5M2I3NDU5ODI5YmRlMjFjNWE2MTgyIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBGcmksIDUgQXVnIDIwMTYgMDg6Mjg6MjAgKzA5MDAKU3ViamVjdDogW1BBVENIIDEvMl0g c2VkOiBjYWNoZSByZXN1bHRzIG9mIG1icnRvd2MgZm9yIHNwZWVkCgoqIHNlZC9tYmNzLmMgKG1i cnRvd2NfY2FjaGUsIG1icmxlbl9jYWNoZSk6IE5ldyB2YXJzLgooaW5pdGlhbGl6ZV9tYmNzKTog SW5pdGlhbGl6ZSB0aGUgY2FjaGUuCiogc2VkL3NlZC5oOiBJbmNsdWRlIGxpbWl0cy5oCihNQlJU T1dDLCBNQlJMRU4pOiBVc2UgdGhlIGNhY2hlLgotLS0KIHNlZC9tYmNzLmMgfCAgIDE0ICsrKysr KysrKysrKysrCiBzZWQvc2VkLmggIHwgICAxMSArKysrKysrKy0tLQogMiBmaWxlcyBjaGFuZ2Vk LCAyMiBpbnNlcnRpb25zKCspLCAzIGRlbGV0aW9ucygtKQoKZGlmZiAtLWdpdCBhL3NlZC9tYmNz LmMgYi9zZWQvbWJjcy5jCmluZGV4IGJjZTM5ZmEuLjgxMDVlY2QgMTAwNjQ0Ci0tLSBhL3NlZC9t YmNzLmMKKysrIGIvc2VkL21iY3MuYwpAQCAtMjQsNiArMjQsOSBAQAogaW50IG1iX2N1cl9tYXg7 CiBib29sIGlzX3V0Zjg7CiAKK3NpemVfdCBtYnJsZW5fY2FjaGVbVUNIQVJfTUFYICsgMV07Cit3 aW50X3QgbWJydG93Y19jYWNoZVtVQ0hBUl9NQVggKyAxXTsKKwogLyogUmV0dXJuIG5vbi16ZXJv IGlmIENIIGlzIHBhcnQgb2YgYSB2YWxpZCBtdWx0aWJ5dGUgc2VxdWVuY2U6CiAgICBFaXRoZXIg aW5jb21wbGV0ZSB5ZXQgdmFsaWQgc2VxdWVuY2UgKGluIGNhc2Ugb2YgYSBsZWFkaW5nIGJ5dGUp LAogICAgb3IgdGhlIGxhc3QgYnl0ZSBvZiBhIHZhbGlkIG11bHRpYnl0ZSBzZXF1ZW5jZS4KQEAg LTczLDQgKzc2LDE1IEBAIGluaXRpYWxpemVfbWJjcyAodm9pZCkKICAgaXNfdXRmOCA9IChzdHJj bXAgKGNvZGVzZXRfbmFtZSwgIlVURi04IikgPT0gMCk7CiAKICAgbWJfY3VyX21heCA9IE1CX0NV Ul9NQVg7CisKKyAgZm9yIChpbnQgaSA9IENIQVJfTUlOOyBpIDw9IENIQVJfTUFYOyArK2kpCisg
ICAgeworICAgICAgY2hhciBjID0gaTsKKyAgICAgIHVuc2lnbmVkIGNoYXIgdWMgPSBpOworICAg ICAgbWJzdGF0ZV90IG1icyA9IHsgMCB9OworICAgICAgd2NoYXJfdCB3YzsKKyAgICAgIHNpemVf dCBsZW4gPSBtYnJ0b3djICgmd2MsICZjLCAxLCAmbWJzKTsKKyAgICAgIG1icmxlbl9jYWNoZVt1 Y10gPSBsZW4gPyBsZW4gOiAxOworICAgICAgbWJydG93Y19jYWNoZVt1Y10gPSBsZW4gPT0gMSA/ IHdjIDogV0VPRjsKKyAgICB9CiB9CmRpZmYgLS1naXQgYS9zZWQvc2VkLmggYi9zZWQvc2VkLmgK aW5kZXggYmJkZGQyNS4uMzcxNmJjYiAxMDA2NDQKLS0tIGEvc2VkL3NlZC5oCisrKyBiL3NlZC9z ZWQuaApAQCAtMTksNiArMTksNyBAQAogI2luY2x1ZGUgImJhc2ljZGVmcy5oIgogI2luY2x1ZGUg InJlZ2V4LmgiCiAjaW5jbHVkZSA8c3RkaW8uaD4KKyNpbmNsdWRlIDxsaW1pdHMuaD4KICNpbmNs dWRlICJ1bmxvY2tlZC1pby5oIgogCiAjaW5jbHVkZSAidXRpbHMuaCIKQEAgLTIzOCw5ICsyMzks MTIgQEAgZXh0ZXJuIGJvb2wgdXNlX2V4dGVuZGVkX3N5bnRheF9wOwogZXh0ZXJuIGludCBtYl9j dXJfbWF4OwogZXh0ZXJuIGJvb2wgaXNfdXRmODsKIAorZXh0ZXJuIHNpemVfdCBtYnJsZW5fY2Fj aGVbVUNIQVJfTUFYICsgMV07CitleHRlcm4gd2ludF90IG1icnRvd2NfY2FjaGVbVUNIQVJfTUFY ICsgMV07CisKICNkZWZpbmUgTUJSVE9XQyhwd2MsIHMsIG4sIHBzKSBcCi0gIChtYl9jdXJfbWF4 ID09IDEgPyBcCi0gICAoKihwd2MpID0gYnRvd2MgKCoodW5zaWduZWQgY2hhciAqKSAocykpLCAx KSA6IFwKKyAgKG1icmxlbl9jYWNoZVsqKHVuc2lnbmVkIGNoYXIgKikgKHMpXSA9PSAxID8gXAor ICAgKCoocHdjKSA9IG1icnRvd2NfY2FjaGVbKih1bnNpZ25lZCBjaGFyICopIChzKV0sIDEpIDog XAogICAgbWJydG93YyAoKHB3YyksIChzKSwgKG4pLCAocHMpKSkKIAogI2RlZmluZSBXQ1JUT01C KHMsIHdjLCBwcykgXApAQCAtMjUyLDcgKzI1Niw4IEBAIGV4dGVybiBib29sIGlzX3V0Zjg7CiAg IChtYl9jdXJfbWF4ID09IDEgPyAxIDogbWJzaW5pdCAoKHMpKSkKIAogI2RlZmluZSBNQlJMRU4o cywgbiwgcHMpIFwKLSAgKG1iX2N1cl9tYXggPT0gMSA/IDEgOiBtYnJ0b3djIChOVUxMLCBzLCBu LCBwcykpCisgIChtYnJsZW5fY2FjaGVbKih1bnNpZ25lZCBjaGFyICopIChzKV0gPT0gMSA/IFwK KyAgIDEgOiBtYnJ0b3djIChOVUxMLCBzLCBuLCBwcykpCiAKICNkZWZpbmUgSVNfTUJfQ0hBUihj aCwgcHMpICAgICAgICAgICAgICAgIFwKICAgKG1iX2N1cl9tYXggPT0gMSA/IDAgOiBpc19tYl9j aGFyIChjaCwgcHMpKQotLSAKMS43LjEKCg== --------_57A497D20000000064F2_MULTIPART_MIXED_--
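
The caching idea described in this report can be illustrated with a short C sketch. This is not the attached patch: the names sb_len, sb_wc, init_sb_cache and cached_mbrtowc are hypothetical, and the sketch assumes a stateless encoding and a caller that always passes at least one byte. It decodes each of the 256 possible bytes once at startup and afterwards answers single-byte characters from the table, falling back to mbrtowc() for everything else.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical tables, indexed by the first byte of a character.  */
static size_t sb_len[UCHAR_MAX + 1];   /* 1 if the byte is a complete character */
static wint_t sb_wc[UCHAR_MAX + 1];    /* its wide character, or WEOF otherwise */

/* Fill the tables for the current locale.  Call after setlocale().  */
static void
init_sb_cache (void)
{
  for (int i = 0; i <= UCHAR_MAX; i++)
    {
      char c = (char) i;
      wchar_t wc = 0;
      mbstate_t st;
      memset (&st, 0, sizeof st);          /* start from the initial state */
      size_t len = mbrtowc (&wc, &c, 1, &st);
      /* mbrtowc returns 0 for the NUL byte; it still occupies one byte.  */
      sb_len[i] = (len == 0) ? 1 : len;
      sb_wc[i] = (len <= 1) ? (wint_t) wc : WEOF;
    }
}

/* Like mbrtowc, but single-byte characters are served from the table.
   Unlike mbrtowc, the fast path returns 1 (not 0) for a NUL byte.  */
static size_t
cached_mbrtowc (wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  assert (s != NULL && n >= 1);
  unsigned char b = (unsigned char) *s;
  if (sb_len[b] == 1)
    {
      *pwc = (wchar_t) sb_wc[b];
      return 1;
    }
  return mbrtowc (pwc, s, n, ps);          /* lead byte, invalid byte, etc. */
}
```

Bytes that begin a multibyte character, or that are invalid on their own, get a table length other than 1 and therefore always take the slow path, so only the common single-byte case is accelerated.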
Subject: bug#24160: Acknowledgement ([PATCH 1/2] sed: cache results of mbrtowc for speed)
From: help-debbugs@HIDDEN (GNU bug Tracking System)
To: Norihiro Tanaka <noritnk@HIDDEN>
Date: Fri, 05 Aug 2016 13:52:02 +0000
Message-ID: <handler.24160.B.14704051016730.ack <at> debbugs.gnu.org>
References: <20160805225116.64FE.27F6AC2D@HIDDEN>
Reply-To: 24160 <at> debbugs.gnu.org
X-Gnu-PR-Message: ack 24160
X-Gnu-PR-Package: sed
X-Gnu-PR-Keywords: patch

Thank you for filing a new bug report with debbugs.gnu.org.

This is an automatically generated reply to let you know your message
has been received.

Your message is being forwarded to the package maintainers and other
interested parties for their attention; they will reply in due course.

Your message has been sent to the package maintainer(s):
 bug-sed@HIDDEN

If you wish to submit further information on this problem, please
send it to 24160 <at> debbugs.gnu.org.

Please do not send mail to help-debbugs@HIDDEN unless you wish
to report a problem with the Bug-tracking system.

-- 
24160: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=24160
GNU Bug Tracking System
Contact help-debbugs@HIDDEN with problems

Subject: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
From: Assaf Gordon <assafgordon@HIDDEN>
To: Norihiro Tanaka <noritnk@HIDDEN>, 24160 <at> debbugs.gnu.org
Date: Fri, 5 Aug 2016 10:45:59 -0400
Message-ID: <5fb8c9bf-2233-f782-f9a1-9d55ca33f083@HIDDEN>
In-Reply-To: <20160805225116.64FE.27F6AC2D@HIDDEN>
References: <20160805225116.64FE.27F6AC2D@HIDDEN>
X-GNU-PR-Message: followup 24160
X-GNU-PR-Package: sed
X-GNU-PR-Keywords: patch
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Hello Norihiro,

Thank you for the patch.

On 08/05/2016 09:51 AM, Norihiro Tanaka wrote:
> We can speed up sed by caching the results of mbrtowc() for single-byte
> characters. This is especially effective in non-UTF-8 multibyte locales,
> where the conversion is expensive.

Regarding this:

====
 #define MBRTOWC(pwc, s, n, ps) \
-  (mb_cur_max == 1 ? \
-   (*(pwc) = btowc (*(unsigned char *) (s)), 1) : \
+  (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+   (*(pwc) = mbrtowc_cache[*(unsigned char *) (s)], 1) : \
    mbrtowc ((pwc), (s), (n), (ps)))

 #define MBRLEN(s, n, ps) \
-  (mb_cur_max == 1 ? 1 : mbrtowc (NULL, s, n, ps))
+  (mbrlen_cache[*(unsigned char *) (s)] == 1 ? \
+   1 : mbrtowc (NULL, s, n, ps))
====

By using a cache table, isn't this code ignoring mbstate?
For example, in the Shift-JIS encoding, the byte '[' (0x5b) can be either
a standalone character or the second byte of a sequence such as '\x83\x5b'.
Wouldn't it also prevent detection of invalid sequences?

As a side note, GNU sed's current implementation has a special code path
for multibyte non-UTF-8 input, so this change is unlikely to affect UTF-8
or C locales.

regards,
 - assaf

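
One conservative way to address the mbstate question raised above would be to consult the table only when the conversion state is known to be initial. The sketch below is an illustration under assumptions, not code from sed or from the patch: guarded_mbrtowc, len_tab and wc_tab are hypothetical names, and the tables are assumed to be 256-entry arrays filled at startup with the length and wide character of each single-byte character.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper.  LEN_TAB[b] == 1 means byte B is a complete
   character whose wide form is WC_TAB[b].  PS must not be NULL, because
   mbsinit (NULL) says nothing about mbrtowc's internal state.  */
static size_t
guarded_mbrtowc (const size_t *len_tab, const wint_t *wc_tab,
                 wchar_t *pwc, const char *s, size_t n, mbstate_t *ps)
{
  assert (ps != NULL);

  /* Use the table only at a character boundary: at least one byte is
     available and no partial character is pending in *PS.  */
  if (n > 0 && mbsinit (ps))
    {
      unsigned char b = (unsigned char) *s;
      if (len_tab[b] == 1)
        {
          *pwc = (wchar_t) wc_tab[b];
          return 1;
        }
    }

  /* Otherwise let the C library decode, including error detection.  */
  return mbrtowc (pwc, s, n, ps);
}
```

The extra mbsinit() call costs a little on every lookup, so whether this guard is worthwhile depends on whether callers can ever reach the macro with a non-initial state.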
Subject: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
From: Norihiro Tanaka <noritnk@HIDDEN>
To: Assaf Gordon <assafgordon@HIDDEN>
Cc: 24160 <at> debbugs.gnu.org
Date: Sat, 06 Aug 2016 16:13:27 +0900
Message-Id: <20160806161326.E614.27F6AC2D@HIDDEN>
In-Reply-To: <5fb8c9bf-2233-f782-f9a1-9d55ca33f083@HIDDEN>
References: <20160805225116.64FE.27F6AC2D@HIDDEN> <5fb8c9bf-2233-f782-f9a1-9d55ca33f083@HIDDEN>
X-GNU-PR-Message: followup 24160
X-GNU-PR-Package: sed
X-GNU-PR-Keywords: patch
MIME-Version: 1.0
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

On Fri, 5 Aug 2016 10:45:59 -0400
Assaf Gordon <assafgordon@HIDDEN> wrote:

> Hello Norihiro,
>
> Thank you for the patch.
>
> By using a cache table, isn't this code ignoring mbstate?
> For example, in the Shift-JIS encoding, the byte '[' (0x5b) can be either
> a standalone character or the second byte of a sequence such as '\x83\x5b'.
> Wouldn't it also prevent detection of invalid sequences?
>
> As a side note, GNU sed's current implementation has a special code path
> for multibyte non-UTF-8 input, so this change is unlikely to affect UTF-8
> or C locales.
>
> regards,
>  - assaf

Hi Assaf,

Thanks for the review.

Shift-JIS is a stateless encoding, so whenever MBRTOWC() or MBRLEN() is
called in a Shift-JIS locale, mbstate is either the initial state or
equivalent to it, except after an invalid or incomplete sequence has been
found. Even when such a sequence is found, mbstate has to be reset to the
initial state manually before the following characters in the string can
be checked. So I think we can ignore mbstate in stateless encodings.

However, that assumption is wrong for stateful encodings such as ISO-2022
and UTF-7. Does sed support stateful encodings that have shift sequences?
At least regex does not seem to support stateful encodings.

Thanks,
Norihiro

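
The point about resetting mbstate after an invalid or incomplete sequence can be sketched as a small scanning loop. This is a minimal illustration assuming a stateless encoding; count_chars is a hypothetical helper, not a function in sed. It walks a buffer one character at a time, and whenever mbrtowc() reports an error it resets the state to initial and skips a single byte, so every call starts at a character boundary.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Count the characters in BUF[0..LEN), treating each undecodable byte
   as one character.  Assumes a stateless multibyte encoding.  */
static size_t
count_chars (const char *buf, size_t len)
{
  mbstate_t st;
  size_t count = 0;
  size_t i = 0;

  assert (buf != NULL || len == 0);
  memset (&st, 0, sizeof st);              /* initial conversion state */

  while (i < len)
    {
      wchar_t wc;
      size_t r = mbrtowc (&wc, buf + i, len - i, &st);

      if (r == (size_t) -1 || r == (size_t) -2)
        {
          /* Invalid or truncated sequence: the state can no longer be
             relied on, so reset it by hand and skip one byte.  */
          memset (&st, 0, sizeof st);
          r = 1;
        }
      else if (r == 0)
        r = 1;                             /* NUL byte occupies one byte */

      i += r;
      count++;
    }
  return count;
}
```

Because the state is initial at the top of every iteration, a lookup keyed on the first byte at that position would see the same input that mbrtowc() sees, which is the property the reply relies on for stateless encodings.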
Subject: bug#24160: [PATCH 1/2] sed: cache results of mbrtowc for speed
From: Norihiro Tanaka <noritnk@HIDDEN>
To: 24160 <at> debbugs.gnu.org
Date: Mon, 19 Sep 2016 11:32:20 +0900
Message-Id: <20160919113219.41D1.27F6AC2D@HIDDEN>
In-Reply-To: <20160805225116.64FE.27F6AC2D@HIDDEN>
References: <20160805225116.64FE.27F6AC2D@HIDDEN>
X-GNU-PR-Message: followup 24160
X-GNU-PR-Package: sed
X-GNU-PR-Keywords: patch
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="------_57DF4CEB0000000041C6_MULTIPART_MIXED_"
Content-Transfer-Encoding: 7bit

--------_57DF4CEB0000000041C6_MULTIPART_MIXED_
Content-Type: text/plain; charset="US-ASCII"
Content-Transfer-Encoding: 7bit

On Fri, 05 Aug 2016 22:51:16 +0900
Norihiro Tanaka <noritnk@HIDDEN> wrote:

> Hi,
>
> We can speed up sed by caching the results of mbrtowc() for single-byte
> characters. This is especially effective in non-UTF-8 multibyte locales,
> where the conversion is expensive.
>
> $ yes $(printf %040d 0) | head -1000000 >k
>
> Before:
>
> $ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
> real 1.93
> user 1.61
> sys 0.27
>
> After patching:
>
> $ time -p env LC_ALL=ja_JP.eucjp sed/sed -ne /a.b/p k
> real 0.46
> user 0.42
> sys 0.03
>
> Thanks,
> Norihiro

I rewrote the patch to use localeinfo from gnulib.

--------_57DF4CEB0000000041C6_MULTIPART_MIXED_ Content-Type: text/plain; charset="US-ASCII"; name="0001-sed-use-cache-provided-by-localeinfo-for-mbrtowc-and.patch" Content-Disposition: attachment; filename="0001-sed-use-cache-provided-by-localeinfo-for-mbrtowc-and.patch" Content-Transfer-Encoding: base64 RnJvbSBjMWE5ZDcwOTM2NzU2ODg3YzdjZGY1NWI1YjMyODI2ZGY3MmI5ZDUyIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBOb3JpaGlybyBUYW5ha2EgPG5vcml0bmtAa2NuLm5lLmpwPgpE YXRlOiBTdW4sIDE4IFNlcCAyMDE2IDE3OjQ2OjU3ICswOTAwClN1YmplY3Q6IFtQQVRDSF0gc2Vk OiB1c2UgY2FjaGUgcHJvdmlkZWQgYnkgbG9jYWxlaW5mbyBmb3IgbWJydG93YyBhbmQgbWJybGVu CgoqIHNlZC9zZWQuaCAoTUJSVE9XQywgTUJSTEVOKTogVXNlIGNhY2hlIHByb3ZpZGVkIGJ5IGxv Y2FsZWluZm8uCihNQlJUT1dDLCBNQlJMRU4pOiBVc2UgdGhlIGNhY2hlLgotLS0KIHNlZC9zZWQu aCB8ICAgIDcgKysrKy0tLQogMSBmaWxlcyBjaGFuZ2VkLCA0IGluc2VydGlvbnMoKyksIDMgZGVs ZXRpb25zKC0pCgpkaWZmIC0tZ2l0IGEvc2VkL3NlZC5oIGIvc2VkL3NlZC5oCmluZGV4IDA4M2Jh YWUuLjllYmM4MTUgMTAwNjQ0Ci0tLSBhL3NlZC9zZWQuaAorKysgYi9zZWQvc2VkLmgKQEAgLTI0 Nyw4ICsyNDcsOCBAQCBleHRlcm4gYm9vbCBpc191dGY4OwogZXh0ZXJuIGJvb2wgc2FuZGJveDsK IAogI2RlZmluZSBNQlJUT1dDKHB3YywgcywgbiwgcHMpIFwKLSAgKG1iX2N1cl9tYXggPT0gMSA/ IFwKLSAgICgqKHB3YykgPSBidG93YyAoKih1bnNpZ25lZCBjaGFyICopIChzKSksIDEpIDogXAor ICAobG9jYWxlaW5mby5zYmNsZW5bKih1bnNpZ25lZCBjaGFyICopIChzKV0gPT0gMSA/IFwKKyAg ICgqKHB3YykgPSBsb2NhbGVpbmZvLnNiY3Rvd2NbKih1bnNpZ25lZCBjaGFyICopIChzKV0sIDEp IDogXAogICAgbWJydG93YyAoKHB3YyksIChzKSwgKG4pLCAocHMpKSkKIAogI2RlZmluZSBXQ1JU T01CKHMsIHdjLCBwcykgXApAQCAtMjYwLDcgKzI2MCw4IEBAIGV4dGVybiBib29sIHNhbmRib3g7 CiAgIChtYl9jdXJfbWF4ID09IDEgPyAxIDogbWJzaW5pdCAoKHMpKSkKIAogI2RlZmluZSBNQlJM RU4ocywgbiwgcHMpIFwKLSAgKG1iX2N1cl9tYXggPT0gMSA/IDEgOiBtYnJ0b3djIChOVUxMLCBz LCBuLCBwcykpCisgIChsb2NhbGVpbmZvLnNiY2xlblsqKHVuc2lnbmVkIGNoYXIgKikgKHMpXSA9 PSAxID8gXAorICAgMSA6IG1icnRvd2MgKE5VTEwsIHMsIG4sIHBzKSkKIAogI2RlZmluZSBJU19N Ql9DSEFSKGNoLCBwcykgICAgICAgICAgICAgICAgXAogICAobWJfY3VyX21heCA9PSAxID8gMCA6 IGlzX21iX2NoYXIgKGNoLCBwcykpCi0tIAoxLjcuMQoK 
--------_57DF4CEB0000000041C6_MULTIPART_MIXED_--