It seems that it hasn't been handled at all. This may fall to the same
resolution as the recent LJ/Lj thread[2] though.

Basically, it seems that grep doesn't support alternates when changing
case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
context[3]. From some poking, only the latter is supported.

My thought[4] was that the code would generate '[ßSS]', which would be
wrong when matching; it would instead need to generate '(ß|SS)'. It now
seems that '(ß|SS|ẞ)' or even '(ß|[sS][sS]|ẞ)' would need to be
generated instead using the new code.

I've attached a test case I wrote based on 'turkish-eyes'. I release it
to the public domain.

Thanks,

--Ben

[1]https://lwn.net/Articles/586899/
[2]https://lists.gnu.org/archive/html/bug-grep/2014-02/msg00004.html
[3]https://en.wikipedia.org/wiki/Capital_%C3%9F
[4]https://lwn.net/Articles/587010/

Attachment: german-eszett

#!/bin/sh
# Ensure that case-insensitive matching works with German eszett
. "${srcdir=.}/init.sh"; path_prepend_ ../src

require_en_utf8_locale_
require_compiled_in_MB_support

fail=0

L=de_DE.UTF-8

ss=$(printf '\303\237') # lowercase eszett
SS=$(printf '\341\272\236') # uppercase eszett

# Ensure that this matches:
# printf 'ß:SS ß:ẞ\n'|LC_ALL=de_DE.UTF-8 grep -i 'SS:ß ẞ:ß'
data="$ss:SS $ss:$SS"
search_str="SS:$ss $SS:$ss"
printf "$data\n" > in || framework_failure_

for opt in -E -F -G; do
  LC_ALL=$L grep $opt -i "$search_str" in > out || fail=1
  compare out in || fail=1
done

Exit $fail
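As a rough illustration of the alternation described in the message, the sketch below spells out the eszett variants by hand instead of relying on -i. The variable names (lower, upper, eszett_alt, match_count) and the sample words are invented for this example. It runs under LC_ALL=C on the assumption that grep then compares raw UTF-8 bytes, so it shows only the shape of the pattern that would have to be generated, not grep's own case handling.

```shell
#!/bin/sh
# Sketch only: hand-written alternation for the eszett variants.
# All names below are hypothetical; nothing here is generated by grep itself.
lower=$(printf '\303\237')      # U+00DF, lowercase eszett (UTF-8 bytes)
upper=$(printf '\341\272\236')  # U+1E9E, capital eszett (UTF-8 bytes)

# An alternation is needed because 'SS' is two characters; a bracket
# expression such as '[ßSS]' would match only a single character.
eszett_alt="($lower|[sS][sS]|$upper)"

# Count the input lines that contain some spelling of "Straße".
# "Strase" (single s) is included as a line that must not match.
match_count=$(printf 'Stra%se\nSTRASSE\nSTRA%sE\nStrase\n' "$lower" "$upper" |
  LC_ALL=C grep -E -c "[Ss][Tt][Rr][Aa]${eszett_alt}[Ee]")
echo "$match_count"
```

With a grep that treats bytes as characters in the C locale, this is expected to count three matching lines; the single-s spelling is left out.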
> It seems that it hasn't been handled at all.
> This may fall to the same resolution as the recent LJ/Lj thread[2]
> though.
>
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context[3].

Alas, in terms of POSIX functionality, we can only change case between
single-character entities. Changing ß to SS is a single->multi-character
change; it is DIFFERENT than the Turkish i situation (there, although we
change between single-byte and multi-byte, the changes are still always
single character). Similar problems apply to Greek trailing sigma, which
is also a context-sensitive change operation.

As long as we are stuck using the POSIX definition of case changes on a
character-by-character basis, where the input and output are 1:1
character mappings, we cannot handle the German eszett case specially.
For PROPER handling of locale-sensitive case rules, we'd need full
Unicode rules that operate on words, rather than characters, which
quickly gets out of scope of what we can do in POSIX regex.
-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
Hello,

On Feb 19 13:59 Ben Boeckel wrote (excerpt):
> [ I am not subscribed; please keep me on the CC. ]
...
> I had a thought about how the German eszett was handled
...
> Basically, it seems that grep doesn't support alternates when changing
> case. The uppercase of 'ß' is either 'SS' or 'ẞ' depending on the
> context

As far as I understand it, you are talking about "Unicode case folding".

As far as I know, grep does not support "Unicode case folding".

Currently grep works on a purely character-by-character basis, where
each character may be encoded in UTF-8 (one possible encoding for
Unicode characters). So grep supports the UTF-8 encoding, which could be
misread as grep supporting Unicode, but that is not the case.

For more details, see the various (usually very long) mail threads
regarding "grep -i", in particular together with UTF-8.

For example, on
http://lists.gnu.org/archive/html/bug-grep/2012-06/threads.html#00011
there are mail threads like
"Ignore case handling of special unicode characters (case folding)"
which is
http://savannah.gnu.org/bugs/?36682
or the mail thread
"grep -i (case-insensitive) is broken with UTF8"

Kind Regards
Johannes Meixner
-- 
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
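To make the idea of case folding a little more concrete, here is a minimal sketch of a workaround that normalises both the pattern and the input before comparing them. The helper fold_eszett is hypothetical and covers only the two eszett forms plus ASCII letters; it is a tiny hand-written stand-in for real Unicode case folding, not something grep provides. It assumes that sed, tr and grep operate on raw bytes under LC_ALL=C.

```shell
#!/bin/sh
# Sketch only: fold_eszett is a hypothetical helper, not a grep feature.
lower=$(printf '\303\237')      # U+00DF, lowercase eszett (UTF-8 bytes)
upper=$(printf '\341\272\236')  # U+1E9E, capital eszett (UTF-8 bytes)

fold_eszett() {
  # Rewrite both eszett forms to "ss", then lowercase ASCII letters.
  # LC_ALL=C keeps sed and tr working on bytes, so no UTF-8 locale is needed.
  LC_ALL=C sed "s/$lower/ss/g; s/$upper/ss/g" | LC_ALL=C tr 'A-Z' 'a-z'
}

# Fold the search string once, then fold the input and compare whole lines.
pattern=$(printf 'STRASSE\n' | fold_eszett)
folded_matches=$(printf 'Stra%se\nSTRA%sE\nStrasse\nStrase\n' "$lower" "$upper" |
  fold_eszett | LC_ALL=C grep -F -c -x "$pattern")
echo "$folded_matches"
```

Because both sides are folded to the same multi-character form, the 1:1 mapping limit never comes into play. The cost is that the matched text is the folded text, so this suits counting or filtering more than printing the original lines.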
