GNU bug report logs - #16812
Eszett handling

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: grep; Severity: wishlist; Reported by: mathstuf@HIDDEN; dated Wed, 19 Feb 2014 19:04:01 UTC; Maintainer for grep is bug-grep@HIDDEN.
Severity set to 'wishlist' from 'normal'. Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 16812 <at> debbugs.gnu.org:


Received: (at 16812) by debbugs.gnu.org; 8 Mar 2014 18:52:53 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Sat Mar 08 13:52:53 2014
Received: from localhost ([127.0.0.1]:56910 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WMMMe-0006iz-Vf
	for submit <at> debbugs.gnu.org; Sat, 08 Mar 2014 13:52:53 -0500
Received: from smtp.cs.ucla.edu ([131.179.128.62]:40608)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eggert@HIDDEN>) id 1WMMMd-0006is-0e
 for 16812 <at> debbugs.gnu.org; Sat, 08 Mar 2014 13:52:51 -0500
Received: from localhost (localhost.localdomain [127.0.0.1])
 by smtp.cs.ucla.edu (Postfix) with ESMTP id 6D8D739E8013
 for <16812 <at> debbugs.gnu.org>; Sat,  8 Mar 2014 10:52:50 -0800 (PST)
X-Virus-Scanned: amavisd-new at smtp.cs.ucla.edu
Received: from smtp.cs.ucla.edu ([127.0.0.1])
 by localhost (smtp.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10024)
 with ESMTP id JmwC5-dD73+5 for <16812 <at> debbugs.gnu.org>;
 Sat,  8 Mar 2014 10:52:49 -0800 (PST)
Received: from [192.168.1.9] (pool-108-0-233-62.lsanca.fios.verizon.net
 [108.0.233.62])
 by smtp.cs.ucla.edu (Postfix) with ESMTPSA id BA6BC39E8008
 for <16812 <at> debbugs.gnu.org>; Sat,  8 Mar 2014 10:52:49 -0800 (PST)
Message-ID: <531B6701.5030802@HIDDEN>
Date: Sat, 08 Mar 2014 10:52:49 -0800
From: Paul Eggert <eggert@HIDDEN>
Organization: UCLA Computer Science Department
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: 16812 <at> debbugs.gnu.org
Subject: Re:  Eszett handling
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Spam-Score: -2.3 (--)
X-Debbugs-Envelope-To: 16812
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -2.3 (--)

'grep' is conforming to its specification, even though it's not as 
useful as it might be when searching German text.  The situation with 
'ß'/'SS' is different than the situation with 'lj'/'Lj'/'LJ' because in the 
latter case 'grep' is dealing only with individual characters.

There's a related issue with 'ß' versus the recently-introduced capital 
sharp-S 'ẞ'.  These do not match each other with 'grep --ignore-case' in 
the current savannah git master.  This is an unfortunate property of how 
the glibc regex code behaves: the regex code uppercases both pattern and 
data before comparing, but in the standard German locale 'ß' is 
unchanged by uppercasing.
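
One way to sidestep the case-conversion step, sketched here purely as an illustration and not as anything grep or glibc does internally, is to spell out both sharp-s code points in the pattern. The helper names eszett_alternation and grep_eszett are made up for this example. The sketch assumes a POSIX shell, sed and grep -E, and matches byte-wise in the C locale so that no locale's case tables are consulted for 'ß' or 'ẞ'.

```shell
#!/bin/sh
# Sketch only; eszett_alternation and grep_eszett are hypothetical helpers.
lower=$(printf '\303\237')       # U+00DF LATIN SMALL LETTER SHARP S (UTF-8)
upper=$(printf '\341\272\236')   # U+1E9E LATIN CAPITAL LETTER SHARP S (UTF-8)

# Rewrite every sharp s (either case) in an ERE as an alternation of both.
# First normalize the capital form to the small one, then expand each one.
eszett_alternation() {
  printf '%s\n' "$1" |
    LC_ALL=C sed -e "s/${upper}/${lower}/g" \
                 -e "s/${lower}/(${lower}|${upper})/g"
}

# Byte-wise match in the C locale; -i still folds the ASCII letters.
grep_eszett() {
  LC_ALL=C grep -E -i -e "$(eszett_alternation "$1")"
}

# Example: a lowercase pattern finds text written with the capital sharp S.
printf 'Stra%sE\n' "$upper" | grep_eszett "stra${lower}e"
```

The argument is still an ERE, so any other metacharacters in it keep their meaning. The 'SS' spelling is deliberately not covered here.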

I'll leave this bug open as it is an awkward situation.  Fixing it would 
require changing the glibc regex code, which is a big deal -- it would 
have some performance implications in a lot of programs.  So I'm not 
optimistic about fixing it any time soon.




Information forwarded to bug-grep@HIDDEN:
bug#16812; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 20 Feb 2014 16:54:50 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Feb 20 11:54:50 2014
Received: from localhost ([127.0.0.1]:33824 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WGWtd-0001QJ-Cg
	for submit <at> debbugs.gnu.org; Thu, 20 Feb 2014 11:54:50 -0500
Received: from eggs.gnu.org ([208.118.235.92]:41130)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <jsmeix@HIDDEN>) id 1WGQXw-0005gR-Du
 for submit <at> debbugs.gnu.org; Thu, 20 Feb 2014 05:08:00 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <jsmeix@HIDDEN>) id 1WGQXk-0007iq-Nt
 for submit <at> debbugs.gnu.org; Thu, 20 Feb 2014 05:07:55 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50 autolearn=disabled
 version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:55898)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <jsmeix@HIDDEN>) id 1WGQXk-0007im-Ll
 for submit <at> debbugs.gnu.org; Thu, 20 Feb 2014 05:07:48 -0500
Received: from eggs.gnu.org ([2001:4830:134:3::10]:42773)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <jsmeix@HIDDEN>) id 1WGQXe-0004am-N6
 for bug-grep@HIDDEN; Thu, 20 Feb 2014 05:07:48 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <jsmeix@HIDDEN>) id 1WGQXY-0007gN-Sn
 for bug-grep@HIDDEN; Thu, 20 Feb 2014 05:07:42 -0500
Received: from cantor2.suse.de ([195.135.220.15]:39138 helo=mx2.suse.de)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <jsmeix@HIDDEN>) id 1WGQXY-0007fF-KT
 for bug-grep@HIDDEN; Thu, 20 Feb 2014 05:07:36 -0500
Received: from relay1.suse.de (charybdis-ext.suse.de [195.135.220.254])
 by mx2.suse.de (Postfix) with ESMTP id 2C7B7AD11;
 Thu, 20 Feb 2014 10:07:34 +0000 (UTC)
Date: Thu, 20 Feb 2014 11:07:34 +0100 (CET)
From: Johannes Meixner <jsmeix@HIDDEN>
To: bug-grep@HIDDEN
Subject: Re: bug#16812: Eszett handling
In-Reply-To: <20140219185918.GA2438@HIDDEN>
Message-ID: <alpine.LNX.2.00.1402201051240.8941@HIDDEN>
References: <20140219185918.GA2438@HIDDEN>
User-Agent: Alpine 2.00 (LNX 1167 2008-08-23)
MIME-Version: 1.0
Content-Type: MULTIPART/MIXED;
 BOUNDARY="2013985540-1468786226-1392890854=:8941"
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 3.x
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -5.0 (-----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Thu, 20 Feb 2014 11:54:47 -0500
Cc: mathstuf@HIDDEN
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.0 (-----)

  This message is in MIME format.  The first part should be readable text,
  while the remaining parts are likely unreadable without MIME-aware tools.

--2013985540-1468786226-1392890854=:8941
Content-Type: TEXT/PLAIN; charset=utf-8; format=flowed
Content-Transfer-Encoding: quoted-printable


Hello,

On Feb 19 13:59 Ben Boeckel wrote (excerpt):
> [ I am not subscribed; please keep me on the CC. ]
...
> I had a thought about how the German eszett was handled
...
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context

As far as I understand it you are talking about
"Unicode case folding".

As far as I know grep does not support "Unicode case folding".

Currently grep works on a purely character-by-character basis,
where each character may be UTF-8 encoded (one possible
encoding for Unicode characters). So grep supports the UTF-8
encoding, which could be misread as grep supporting Unicode,
but the latter is not true.

For more details see the various (usually very long) mail threads
regarding "grep -i", in particular together with UTF-8.

For example on

http://lists.gnu.org/archive/html/bug-grep/2012-06/threads.html#00011

mail threads like
"Ignore case handling of special unicode characters (case folding)"
which is
http://savannah.gnu.org/bugs/?36682
or the mail thread
"grep -i (case-insensitive) is broken with UTF8"


Kind Regards
Johannes Meixner
-- 
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
--2013985540-1468786226-1392890854=:8941--




Information forwarded to bug-grep@HIDDEN:
bug#16812; Package grep. Full text available.

Message received at 16812 <at> debbugs.gnu.org:


Received: (at 16812) by debbugs.gnu.org; 19 Feb 2014 20:28:05 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Feb 19 15:28:05 2014
Received: from localhost ([127.0.0.1]:60662 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WGDkS-0007qx-Hw
	for submit <at> debbugs.gnu.org; Wed, 19 Feb 2014 15:28:04 -0500
Received: from mx1.redhat.com ([209.132.183.28]:39641)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <eblake@HIDDEN>) id 1WGDkO-0007qW-PP
 for 16812 <at> debbugs.gnu.org; Wed, 19 Feb 2014 15:28:02 -0500
Received: from int-mx09.intmail.prod.int.phx2.redhat.com
 (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
 by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id s1JKRxBh008257
 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
 Wed, 19 Feb 2014 15:27:59 -0500
Received: from [10.3.113.83] (ovpn-113-83.phx2.redhat.com [10.3.113.83])
 by int-mx09.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id
 s1JKRwvc027936; Wed, 19 Feb 2014 15:27:58 -0500
Message-ID: <530513CE.8000507@HIDDEN>
Date: Wed, 19 Feb 2014 13:27:58 -0700
From: Eric Blake <eblake@HIDDEN>
Organization: Red Hat, Inc.
User-Agent: Mozilla/5.0 (X11; Linux x86_64;
 rv:24.0) Gecko/20100101 Thunderbird/24.3.0
MIME-Version: 1.0
To: mathstuf@HIDDEN, 16812 <at> debbugs.gnu.org
Subject: Re: bug#16812: Eszett handling
References: <20140219185918.GA2438@HIDDEN>
In-Reply-To: <20140219185918.GA2438@HIDDEN>
X-Enigmail-Version: 1.6
OpenPGP: url=http://people.redhat.com/eblake/eblake.gpg
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="WtaembBnr02bTo7sIQVqagCHHB66qIFew"
X-Scanned-By: MIMEDefang 2.68 on 10.5.11.22
X-Spam-Score: -5.6 (-----)
X-Debbugs-Envelope-To: 16812
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -5.6 (-----)

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--WtaembBnr02bTo7sIQVqagCHHB66qIFew
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 02/19/2014 11:59 AM, Ben Boeckel wrote:
> [ I am not subscribed; please keep me on the CC. ]
>
> Hi,
>
> From the new grep announcement on LWN[1], I had a thought about how the
> German eszett was handled. It seems that it hasn't been handled at all.
> This may fall to the same resolution as the recent LJ/Lj thread[2]
> though.
>=20
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context[3].

Alas, in terms of POSIX functionality, we can only change case between
single-character entities.  Changing ß to SS is a
single->multi-character change; it is DIFFERENT than the Turkish i
situation (there, although we change between single-byte and multi-byte,
the changes are still always single character).  Similar problems apply
to Greek trailing sigma, which is also a context-sensitive change operation.

As long as we are stuck using the POSIX definition of case changes on a
character-by-character basis, where the input and output are 1:1
character mappings, we cannot handle the German eszett case specially.
For PROPER handling of locale-sensitive case rules, we'd need full
Unicode rules that operate on words, rather than characters, which
quickly gets out of scope of what we can do in POSIX regex.


-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


--WtaembBnr02bTo7sIQVqagCHHB66qIFew
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Public key at http://people.redhat.com/eblake/eblake.gpg
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iQEcBAEBCAAGBQJTBRPOAAoJEKeha0olJ0NqvBoH/Ajj45Eh9kCjUd9zRkmv2nGv
uWx+WtHH4ICbLSM9s+cTzGvqBn+U+n4K1IUpwgCsnGLFnjQhYxh2rxBktuxsbWd0
D0s0EAjNooB7drhah7uLT91qOcxOOkPqeed0LlkphMmCazwro/qgdp5HaBluxBPJ
NyC9EpzE/L0aOkrKtd0el9bcVOrcEhslPo3bpBFuINVgb3YRPSs0FQlHKG85tmyG
YyeoiB0/rBr5qI4oqPxabwsjeQkj0uA1GxB2t02BM4yWoN5w1yEPGjepDGiNOU1u
gdAVSXRkq1UJ3gkVc1vHV5qG4YplFrV/gsfCKsmxHIEufuEv44X6951C5t83XxM=
=Iay+
-----END PGP SIGNATURE-----

--WtaembBnr02bTo7sIQVqagCHHB66qIFew--




Information forwarded to bug-grep@HIDDEN:
bug#16812; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 19 Feb 2014 19:03:04 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Wed Feb 19 14:03:04 2014
Received: from localhost ([127.0.0.1]:60564 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1WGCQB-0005Ue-5i
	for submit <at> debbugs.gnu.org; Wed, 19 Feb 2014 14:03:03 -0500
Received: from eggs.gnu.org ([208.118.235.92]:57326)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <mathstuf@HIDDEN>) id 1WGCMx-0005NO-Kc
 for submit <at> debbugs.gnu.org; Wed, 19 Feb 2014 13:59:44 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <mathstuf@HIDDEN>) id 1WGCMo-0007CX-AL
 for submit <at> debbugs.gnu.org; Wed, 19 Feb 2014 13:59:38 -0500
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 T_DKIM_INVALID autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:59875)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <mathstuf@HIDDEN>) id 1WGCMo-0007CS-7H
 for submit <at> debbugs.gnu.org; Wed, 19 Feb 2014 13:59:34 -0500
Received: from eggs.gnu.org ([2001:4830:134:3::10]:59005)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <mathstuf@HIDDEN>) id 1WGCMk-0001TW-4N
 for bug-grep@HIDDEN; Wed, 19 Feb 2014 13:59:34 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <mathstuf@HIDDEN>) id 1WGCMd-0006vQ-9l
 for bug-grep@HIDDEN; Wed, 19 Feb 2014 13:59:30 -0500
Received: from mail-ie0-x22a.google.com ([2607:f8b0:4001:c03::22a]:65032)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <mathstuf@HIDDEN>) id 1WGCMd-0006tA-4X
 for bug-grep@HIDDEN; Wed, 19 Feb 2014 13:59:23 -0500
Received: by mail-ie0-f170.google.com with SMTP id rl12so550487iec.1
 for <bug-grep@HIDDEN>; Wed, 19 Feb 2014 10:59:21 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=date:from:to:subject:message-id:reply-to:mime-version:content-type
 :content-disposition:content-transfer-encoding:user-agent;
 bh=7rwUelcuoIcLsbj7/WCGd8jsx9JpY7rCdki13aA48Jg=;
 b=SzclpPVSUoUgNqRLDRSF6iBv/6y4meCkNiBA8X2Ri1fZxVsGrh6lDWLiwr7T1316Li
 aboUfSYVnicR/vl/dHgtDoV5QV+UlD37rR8v1xIR33X+BcNDyEmcYTaV5fl0iTPvoyZ/
 r+NE4jPw6Xayf1CMOUdL3vJF44FQW5hdng5qJPyjqsyNZtDtwUEt3k2dM+irs8r3n1vh
 BgSFeR3y0UYOdtOhRjWMlsCSgCXRecsBazqvu4Gw56xz7+qntTE0Anr51mK0/Al9CSwZ
 PSnLkMh81fWUhajWV3vxMbpG7qJhEayc+bxOjfKhpENKbN7e2YoL6F6p1M96Xelr6BII
 4YiQ==
X-Received: by 10.43.129.70 with SMTP id hh6mr1984902icc.68.1392836361742;
 Wed, 19 Feb 2014 10:59:21 -0800 (PST)
Received: from erythro (tripoint.kitware.com. [66.194.253.20])
 by mx.google.com with ESMTPSA id ai4sm52247382igd.3.2014.02.19.10.59.19
 for <bug-grep@HIDDEN>
 (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128);
 Wed, 19 Feb 2014 10:59:19 -0800 (PST)
Date: Wed, 19 Feb 2014 13:59:18 -0500
From: Ben Boeckel <mathstuf@HIDDEN>
To: bug-grep@HIDDEN
Subject: Eszett handling
Message-ID: <20140219185918.GA2438@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="r5Pyd7+fXNt84Ff3"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
User-Agent: Mutt/1.5.21 (2010-09-15)
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-Mailman-Approved-At: Wed, 19 Feb 2014 14:03:01 -0500
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
Reply-To: mathstuf@HIDDEN
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <http://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <http://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <http://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.0 (----)


--r5Pyd7+fXNt84Ff3
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

[ I am not subscribed; please keep me on the CC. ]

Hi,

From the new grep announcement on LWN[1], I had a thought about how the
German eszett was handled. It seems that it hasn't been handled at all.
This may fall to the same resolution as the recent LJ/Lj thread[2]
though.

Basically, it seems that grep doesn't support alternates when changing
case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
context[3]. From some poking, only the latter is supported. My
thought[4] was that the code would generate '[ßSS]' which would be wrong
when matching and would instead need to do '(ß|SS)'. It now seems that
'(ß|SS|ẞ)' or even '(ß|[sS][sS]|ẞ)' would need to be generated instead
using the new code.
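
The difference between the two generated forms can be checked without involving -i at all. The following is only a sketch under stated assumptions: matches_bracket and matches_alternation are hypothetical helper names, and matching is done byte-wise in the C locale with grep -E so the result does not depend on any locale's case tables. It shows that a bracket expression accepts a lone 'S', which is not an uppercase form of 'ß', while the alternation accepts 'SS' as a unit.

```shell
#!/bin/sh
# Sketch only; matches_bracket and matches_alternation are hypothetical.
eszett=$(printf '\303\237')   # U+00DF, lowercase eszett (UTF-8 bytes)

# Succeed when the whole input line matches the bracket form '[ßSS]'.
matches_bracket() {
  printf '%s\n' "$1" | LC_ALL=C grep -E -q "^[${eszett}SS]\$"
}

# Succeed when the whole input line matches the alternation '(ß|SS)'.
matches_alternation() {
  printf '%s\n' "$1" | LC_ALL=C grep -E -q "^(${eszett}|SS)\$"
}

# Report how each form treats a lone S, the pair SS, and the eszett itself.
for word in S SS "$eszett"; do
  if matches_bracket "$word"; then b=yes; else b=no; fi
  if matches_alternation "$word"; then a=yes; else a=no; fi
  printf '%s: bracket=%s alternation=%s\n' "$word" "$b" "$a"
done
```

A bracket expression always stands for exactly one element, so repeating 'S' inside it adds nothing. Only the alternation can express the two-character spelling.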

I've attached a test case I wrote based on 'turkish-eyes'. I release it
to the public domain.

Thanks,

--Ben

[1]https://lwn.net/Articles/586899/
[2]https://lists.gnu.org/archive/html/bug-grep/2014-02/msg00004.html
[3]https://en.wikipedia.org/wiki/Capital_%C3%9F
[4]https://lwn.net/Articles/587010/

--r5Pyd7+fXNt84Ff3
Content-Type: text/plain; charset=utf-8
Content-Disposition: attachment; filename=german-eszett
Content-Transfer-Encoding: 8bit

#!/bin/sh
# Ensure that case-insensitive matching works with German eszett

. "${srcdir=.}/init.sh"; path_prepend_ ../src

require_en_utf8_locale_
require_compiled_in_MB_support

fail=0

L=de_DE.UTF-8

ss=$(printf '\303\237')     # lowercase eszett
SS=$(printf '\341\272\236') # uppercase eszett

# Ensure that this matches:
# printf 'ß:SS ß:ẞ\n'|LC_ALL=de_DE.UTF-8 grep -i 'SS:ß ẞ:ß'

      data="$ss:SS $ss:$SS"
search_str="SS:$ss $SS:$ss"
printf '%s\n' "$data" > in || framework_failure_

for opt in -E -F -G; do
  LC_ALL=$L grep $opt -i "$search_str" in > out || fail=1
  compare out in || fail=1
done

Exit $fail

--r5Pyd7+fXNt84Ff3--




Acknowledgement sent to mathstuf@HIDDEN:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#16812; Package grep. Full text available.
Last modified: Mon, 25 Nov 2019 12:00:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.