GNU bug report logs - #78439
Accent insensitive grep

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: grep; Reported by: "Avid Seeker" <avidseeker@HIDDEN>; dated Thu, 15 May 2025 07:47:02 UTC; Maintainer for grep is bug-grep@HIDDEN.

Message received at 78439 <at> debbugs.gnu.org:


Received: (at 78439) by debbugs.gnu.org; 15 May 2025 16:19:53 +0000
Message-ID: <36aaec4c-6a7e-4a7c-b9bf-e0ddf2efaa67@HIDDEN>
Date: Thu, 15 May 2025 09:19:14 -0700
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: bug#78439: Accent insensitive grep
To: Avid Seeker <avidseeker@HIDDEN>
References: <D9WHYA9BBOX7.394N0TBSJEIHJ@HIDDEN>
Content-Language: en-US
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
In-Reply-To: <D9WHYA9BBOX7.394N0TBSJEIHJ@HIDDEN>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Cc: 78439 <at> debbugs.gnu.org

On 2025-05-14 22:49, Avid Seeker via Bug reports for GNU grep wrote:

> are equivalence classes the
> right tool to approach this?

They're supposed to be, yes ...

> I see that they depend on LC_COLLATE, in
> which case it would be possible to setup a custom locale that matches
> digraphs.

... though you're venturing into uncharted territory here. Please let us 
know of any monsters you find.

> Is there a way to setup a locale without having to recompile glibc

Yes, use localedef.
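
A minimal sketch of what that could look like on a glibc system. The locale name en_XX, the scratch directory, and the source path /usr/share/i18n/locales/en_US are assumptions for illustration; a real experiment would edit the LC_COLLATE section of the copied source before compiling it.

```shell
#!/bin/sh
# Hypothetical names: "en_XX" for the copied locale source, and a scratch
# directory under /tmp for the compiled output.
src=/usr/share/i18n/locales/en_US   # usual glibc location; may differ per distro
out=${TMPDIR:-/tmp}/locales-demo.$$

if command -v localedef >/dev/null 2>&1 && [ -r "$src" ]; then
    mkdir -p "$out"
    # Start from the stock definition; edit LC_COLLATE in this copy as needed.
    cp "$src" "$out/en_XX"
    # -i: locale source file, -f: charmap. Because the output path contains a
    # slash, localedef writes the compiled locale into that directory.
    localedef -i "$out/en_XX" -f UTF-8 "$out/en_XX.UTF-8" \
        && status=built || status=failed
else
    # localedef or the locale sources are not installed on this system.
    status=skipped
fi
echo "localedef demo: $status"
```

If the build succeeds, pointing glibc's LOCPATH variable at the scratch directory and setting LC_ALL=en_XX.UTF-8 for a single grep invocation should make grep use the custom collation data, with no recompilation of glibc.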





Information forwarded to bug-grep@HIDDEN:
bug#78439; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 15 May 2025 07:46:18 +0000
Content-Type: text/plain; charset=UTF-8; format=Flowed
Date: Thu, 15 May 2025 05:49:00 +0000
Message-Id: <D9WHYA9BBOX7.394N0TBSJEIHJ@HIDDEN>
From: "Avid Seeker" <avidseeker@HIDDEN>
Subject: Accent insensitive grep
To: <bug-grep@HIDDEN>
Mime-Version: 1.0
Content-Transfer-Encoding: quoted-printable

Re-iterating the question on SO <https://stackoverflow.com/questions/20937864/> of applying an
accent-insensitive grep to text (e.g., all accents of a letter 'e' should be
regarded as an ASCII 'e').

The response by Adam Katz mentions:
> You should not expect equivalence classes to be portable as they are too
> arcane.

What's the stance of grep developers on this? Are equivalence classes the
right tool to approach this? I see that they depend on LC_COLLATE, in
which case it would be possible to set up a custom locale that matches
digraphs.
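
For illustration, a small sketch of an equivalence class in a grep pattern. It assumes GNU grep and picks en_US.UTF-8 only if that locale is installed; which accented letters actually match depends on the locale's LC_COLLATE data, so the result may differ between systems.

```shell
#!/bin/sh
# Sample input: "resume", "résumé" (UTF-8 bytes given as octal escapes), "rosa".
sample() { printf 'resume\nr\303\251sum\303\251\nrosa\n'; }

# Assumption: use en_US.UTF-8 when it is installed, otherwise fall back to C.
loc=$(locale -a 2>/dev/null | grep -i -m1 -E '^en_US\.utf-?8$')

# [[=e=]] is a POSIX equivalence class: it matches every character that the
# locale's LC_COLLATE data places in the same class as "e".
matches=$(sample | env LC_ALL="${loc:-C}" grep 'r[[=e=]]sum[[=e=]]')
printf '%s\n' "$matches"
```

In the C locale only the unaccented line is expected to match. With a glibc UTF-8 locale whose collation data groups accented forms with their base letter, the accented line should match as well.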

In the example he gave, he also mentions:
> This matches all words like aei... [but won't match] æi... it's quite
> likely that digraphs are beyond the reach of even the best equivalence
> class map.

Is there a way to set up a locale without having to recompile glibc, or
are these locale values hardcoded into programs using glibc?

Thanks,
Avid




Acknowledgement sent to "Avid Seeker" <avidseeker@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#78439; Package grep. Full text available.
Last modified: Thu, 15 May 2025 16:30:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.