Paul Eggert <eggert@HIDDEN>
to control <at> debbugs.gnu.org.
Received: (at 55331) by debbugs.gnu.org; 9 May 2022 18:50:01 +0000
Message-ID: <86421642-9579-a9bb-8ef0-61c9cfcbee8f@HIDDEN>
Date: Mon, 9 May 2022 21:44:17 +0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.2.0
Subject: Re: bug#55331: Improved support for combining diacritics
Content-Language: en-US
To: Paul Eggert <eggert@HIDDEN>
References: <55709462-5ea6-ff90-a0bc-5c919cb1af47@HIDDEN>
<85688b8d-04ff-bcfa-814a-a8415d9df291@HIDDEN>
From: Benson Muite <benson_muite@HIDDEN>
In-Reply-To: <85688b8d-04ff-bcfa-814a-a8415d9df291@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: 55331 <at> debbugs.gnu.org
On 5/9/22 21:30, Paul Eggert wrote:
> On 5/8/22 23:38, Benson Muite wrote:
>
> It might be nice for 'grep' to have ways to perform Unicode
> normalization before matching. In the meantime perhaps you can get what
> you want by normalizing the text before running it through 'grep'.
Thanks for the advice. uconv should work.
bug-grep@HIDDEN:bug#55331; Package grep.
Received: (at 55331) by debbugs.gnu.org; 9 May 2022 18:30:38 +0000
Message-ID: <85688b8d-04ff-bcfa-814a-a8415d9df291@HIDDEN>
Date: Mon, 9 May 2022 11:30:28 -0700
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.8.1
Subject: Re: bug#55331: Improved support for combining diacritics
Content-Language: en-US
To: Benson Muite <benson_muite@HIDDEN>
References: <55709462-5ea6-ff90-a0bc-5c919cb1af47@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
In-Reply-To: <55709462-5ea6-ff90-a0bc-5c919cb1af47@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 55331
Cc: 55331 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>,
<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>,
<mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -3.3 (---)
On 5/8/22 23:38, Benson Muite wrote:
> When using
>
> grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"
>
> to extract 4 letter Igbo words
The {4} means "4 characters", not "4 letters", and a combining character
counts as a character.
It might be nice for 'grep' to have ways to perform Unicode
normalization before matching. In the meantime perhaps you can get what
you want by normalizing the text before running it through 'grep'.
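
The difference between "characters" and "letters" described above can be checked outside grep. The following is a minimal Python sketch, not part of the original thread; the helper names code_points and letter_count are made up for illustration. It counts code points in the word akụ̀ under NFD and NFC, then counts only the non-combining characters. Note that NFC alone still yields 4 code points for this word, because Unicode has no single precomposed code point for u with dot below plus grave accent.

```python
import unicodedata


def code_points(text, form):
    """Number of code points in text after normalizing to the given form."""
    return len(unicodedata.normalize(form, text))


def letter_count(text):
    """Count base characters only, skipping combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch) == 0)


# "akụ̀" written as a, k, U+1EE5 (u with dot below), U+0300 (combining grave)
word = "ak\u1ee5\u0300"

# NFD splits U+1EE5 into u + U+0323, giving 5 code points; NFC keeps 4;
# only 3 of the characters are base letters.
print(code_points(word, "NFD"), code_points(word, "NFC"), letter_count(word))  # 5 4 3
```

So a fixed repetition count such as {4} over a bracket expression sees this word as 4 or 5 characters depending on how the input is encoded, never as 3.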
bug-grep@HIDDEN:bug#55331; Package grep.
Received: (at submit) by debbugs.gnu.org; 9 May 2022 07:03:39 +0000
Message-ID: <55709462-5ea6-ff90-a0bc-5c919cb1af47@HIDDEN>
Date: Mon, 9 May 2022 09:38:26 +0300
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
Thunderbird/91.2.0
Content-Language: en-US
From: Benson Muite <benson_muite@HIDDEN>
Subject: Improved support for combining diacritics
To: bug-grep@HIDDEN
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Mon, 09 May 2022 03:03:37 -0400
Hi,
Unicode allows for combining diacritics. When using
grep -E "\s[a-z\`\'āáàēéèīíìịị̄ị́ị̀ōóòọọ̄ọọ́ọ̀ūúùụ̄ụ́ụ̀n̄ńǹm̄ḿm̀]{4}$"
to extract 4 letter Igbo words from a text, akụ̀ is incorrectly
classified as a 4 letter word, when it is a three letter word. Would a
patch to fix this be accepted?
Regards,
Benson Muite
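
For readers who want the behaviour requested in this report without changing grep, one possible approach is to treat a letter as a base character followed by any number of combining marks. The Python sketch below is an illustration only and is not a proposed grep patch. The function name last_four_letter_word is hypothetical, the set of marks is limited to the four diacritics that appear in the pattern above, and the backtick and apostrophe from the original bracket expression are left out for brevity.

```python
import re
import unicodedata

# Combining grave, acute, macron and dot below (assumed sufficient for this sketch)
MARKS = "\u0300\u0301\u0304\u0323"
# One "letter": a base letter plus any combining marks attached to it
LETTER = f"[a-z][{MARKS}]*"
# Whitespace (or start of line), exactly four letters, then end of line
FOUR_LETTERS_AT_END = re.compile(rf"(?:^|\s)((?:{LETTER}){{4}})$")


def last_four_letter_word(line):
    """Return the final word of line if it has exactly 4 letters, else None."""
    # Decompose first so every diacritic is a separate combining mark
    decomposed = unicodedata.normalize("NFD", line)
    match = FOUR_LETTERS_AT_END.search(decomposed)
    if match is None:
        return None
    # Recompose the result so callers get the usual NFC spelling
    return unicodedata.normalize("NFC", match.group(1))


print(last_four_letter_word("x ak\u1ee5\u0300"))    # None: akụ̀ has 3 letters
print(last_four_letter_word("x a\u0300kwa\u0300"))  # àkwà
```

Decomposing the input before matching means the pattern does not need to list every precomposed variant, which is the main simplification compared with the bracket expression in the report.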
Benson Muite <benson_muite@HIDDEN>:bug-grep@HIDDEN.
bug-grep@HIDDEN:bug#55331; Package grep.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.