GNU bug report logs - #27681
grep: Combining Mark-Nonspacing are classified as [:punct:]

Please note: This is a static page, with minimal formatting, updated once a day.
Click here to see this page with the latest information and nicer formatting.

Package: grep; Reported by: Santiago <santiagorr@HIDDEN>; Done: Paul Eggert <eggert@HIDDEN>; Maintainer for grep is bug-grep@HIDDEN.
bug closed, send any further explanations to 27681 <at> debbugs.gnu.org and Santiago <santiagorr@HIDDEN> Request was from Paul Eggert <eggert@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 27681 <at> debbugs.gnu.org:


Received: (at 27681) by debbugs.gnu.org; 17 Jul 2017 09:20:44 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Mon Jul 17 05:20:44 2017
Received: from localhost ([127.0.0.1]:43202 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1dX2Cp-00050y-RI
	for submit <at> debbugs.gnu.org; Mon, 17 Jul 2017 05:20:44 -0400
Received: from mx1.riseup.net ([198.252.153.129]:45322)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <santiagorr@HIDDEN>) id 1dX2Cn-00050p-Nc
 for 27681 <at> debbugs.gnu.org; Mon, 17 Jul 2017 05:20:42 -0400
Received: from piha.riseup.net (unknown [10.0.1.163])
 (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "*.riseup.net",
 Issuer "COMODO RSA Domain Validation Secure Server CA" (verified OK))
 by mx1.riseup.net (Postfix) with ESMTPS id ABF161A19ED;
 Mon, 17 Jul 2017 09:20:40 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=riseup.net; s=squak;
 t=1500283240; bh=ZaIsgCExxW1MKJsALrpvaaZ1vXSx4RDdS5UQwuknfkM=;
 h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
 b=CHO4peu1+/siIk/Wp94GIL/fqY0/782t4YnCxdbFach5bhw8gCYI9Imx8oDoIBX2p
 bPBlNIjwiLHWu4YHGDSgxLsx1T15gEJj0dyxMCPM+ZccZvenLSmBSvZR01jvqAnwxr
 B1ZPp0EPxAj7xn742vPGmgdjdZ50QZelym/7I99k=
Received: from [127.0.0.1] (localhost [127.0.0.1])
 (Authenticated sender: santiagorr@HIDDEN) by (piha) 
 with ESMTPSA id 009071D9300
Date: Mon, 17 Jul 2017 11:20:36 +0200
From: Santiago <santiagorr@HIDDEN>
To: Paul Eggert <eggert@HIDDEN>
Subject: Re: bug#27681: grep: Combining Mark-Nonspacing are classified as
 [:punct:]
Message-ID: <20170717092036.37hjugzzvuhuggu2@HIDDEN>
References: <20120305110843.27013.55764.reportbug@HIDDEN>
 <20170713132140.u6cnzfqbigp2xxzw@HIDDEN>
 <2bd35475-a5d1-717c-3fc6-01d4bbbb343c@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <2bd35475-a5d1-717c-3fc6-01d4bbbb343c@HIDDEN>
X-Spam-Score: -0.7 (/)
X-Debbugs-Envelope-To: 27681
Cc: 27681 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.7 (/)

On 13/07/17 at 12:03, Paul Eggert wrote:
> Surely this is a glibc bug, not a grep bug. Grep is just following the
> character classification of glibc. I can reproduce the problem by compiling
> and running the attached program, which uses only glibc (not grep). This
> program exits with status 1, whereas you want it to exit with status 0. So I
> suggest filing a glibc bug report.

Done. Thanks,

  -- Santiago




Information forwarded to bug-grep@HIDDEN:
bug#27681; Package grep. Full text available.

Message received at 27681 <at> debbugs.gnu.org:


Received: (at 27681) by debbugs.gnu.org; 13 Jul 2017 19:03:13 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jul 13 15:03:13 2017
Received: from localhost ([127.0.0.1]:37357 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1dVjOL-0001fW-CN
	for submit <at> debbugs.gnu.org; Thu, 13 Jul 2017 15:03:13 -0400
Received: from zimbra.cs.ucla.edu ([131.179.128.68]:55860)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <eggert@HIDDEN>) id 1dVjOI-0001fI-Lp
 for 27681 <at> debbugs.gnu.org; Thu, 13 Jul 2017 15:03:11 -0400
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id E18E5160189;
 Thu, 13 Jul 2017 12:03:03 -0700 (PDT)
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10032)
 with ESMTP id 6tIx391KNkGP; Thu, 13 Jul 2017 12:03:03 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
 by zimbra.cs.ucla.edu (Postfix) with ESMTP id 35B1E16019D;
 Thu, 13 Jul 2017 12:03:03 -0700 (PDT)
X-Virus-Scanned: amavisd-new at zimbra.cs.ucla.edu
Received: from zimbra.cs.ucla.edu ([127.0.0.1])
 by localhost (zimbra.cs.ucla.edu [127.0.0.1]) (amavisd-new, port 10026)
 with ESMTP id LC6VTFIsSTwG; Thu, 13 Jul 2017 12:03:03 -0700 (PDT)
Received: from [172.30.71.135] (wifi-natpool-131-179-61-183.host.ucla.edu
 [131.179.61.183])
 by zimbra.cs.ucla.edu (Postfix) with ESMTPSA id 1752B160168;
 Thu, 13 Jul 2017 12:03:03 -0700 (PDT)
Subject: Re: bug#27681: grep: Combining Mark-Nonspacing are classified as
 [:punct:]
To: Santiago <santiagorr@HIDDEN>, 27681 <at> debbugs.gnu.org
References: <20120305110843.27013.55764.reportbug@HIDDEN>
 <20170713132140.u6cnzfqbigp2xxzw@HIDDEN>
From: Paul Eggert <eggert@HIDDEN>
Message-ID: <2bd35475-a5d1-717c-3fc6-01d4bbbb343c@HIDDEN>
Date: Thu, 13 Jul 2017 12:03:02 -0700
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: <20170713132140.u6cnzfqbigp2xxzw@HIDDEN>
Content-Type: multipart/mixed; boundary="------------723F171E0BB0549B1963E2BD"
Content-Language: en-US
X-Spam-Score: -0.0 (/)
X-Debbugs-Envelope-To: 27681
Cc: 662629-submitter@HIDDEN
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -0.0 (/)

This is a multi-part message in MIME format.
--------------723F171E0BB0549B1963E2BD
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit

Surely this is a glibc bug, not a grep bug. Grep is just following the 
character classification of glibc. I can reproduce the problem by 
compiling and running the attached program, which uses only glibc (not 
grep). This program exits with status 1, whereas you want it to exit 
with status 0. So I suggest filing a glibc bug report.

--------------723F171E0BB0549B1963E2BD
Content-Type: text/x-csrc;
 name="combining.c"
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment;
 filename="combining.c"

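#include <locale.h>
#include <regex.h>

/* UTF-8 encoding of U+0301 COMBINING ACUTE ACCENT.  */
static char const combining_acute_accent[] = "\xcc\x81";

int
main (void)
{
  regex_t re;
  /* Exit 3: the en_US.UTF-8 locale is not available.  */
  if (! setlocale (LC_ALL, "en_US.UTF-8"))
    return 3;
  /* Exit 2: the pattern failed to compile.  */
  if (regcomp (&re, "[[:alpha:]]", 0) != 0)
    return 2;
  /* Exit 1: the C library does not treat U+0301 as [:alpha:].  */
  if (regexec (&re, combining_acute_accent, 0, 0, 0) != 0)
    return 1;
  return 0;
}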

--------------723F171E0BB0549B1963E2BD--




Information forwarded to bug-grep@HIDDEN:
bug#27681; Package grep. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 13 Jul 2017 13:21:59 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Jul 13 09:21:59 2017
Received: from localhost ([127.0.0.1]:36377 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.84_2)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1dVe47-0008PE-Ez
	for submit <at> debbugs.gnu.org; Thu, 13 Jul 2017 09:21:59 -0400
Received: from eggs.gnu.org ([208.118.235.92]:36014)
 by debbugs.gnu.org with esmtp (Exim 4.84_2)
 (envelope-from <santiagorr@HIDDEN>) id 1dVe45-0008P1-Gf
 for submit <at> debbugs.gnu.org; Thu, 13 Jul 2017 09:21:57 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <santiagorr@HIDDEN>) id 1dVe3z-0006fO-Io
 for submit <at> debbugs.gnu.org; Thu, 13 Jul 2017 09:21:52 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,T_DKIM_INVALID
 autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:48133)
 by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <santiagorr@HIDDEN>)
 id 1dVe3z-0006fI-Ed
 for submit <at> debbugs.gnu.org; Thu, 13 Jul 2017 09:21:51 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:55214)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <santiagorr@HIDDEN>) id 1dVe3y-00074Q-Ei
 for bug-grep@HIDDEN; Thu, 13 Jul 2017 09:21:51 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <santiagorr@HIDDEN>) id 1dVe3v-0006d6-9Q
 for bug-grep@HIDDEN; Thu, 13 Jul 2017 09:21:50 -0400
Received: from mx1.riseup.net ([198.252.153.129]:52400)
 by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
 (Exim 4.71) (envelope-from <santiagorr@HIDDEN>)
 id 1dVe3v-0006cW-15
 for bug-grep@HIDDEN; Thu, 13 Jul 2017 09:21:47 -0400
Received: from piha.riseup.net (unknown [10.0.1.163])
 (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits))
 (Client CN "*.riseup.net",
 Issuer "COMODO RSA Domain Validation Secure Server CA" (verified OK))
 by mx1.riseup.net (Postfix) with ESMTPS id 743641A1DCB;
 Thu, 13 Jul 2017 13:21:45 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=riseup.net; s=squak;
 t=1499952105; bh=388bC0MTodS6wk0qIPFs8vwlagZsZ2qJSrQtw/bN9j0=;
 h=Date:From:To:Cc:Subject:References:In-Reply-To:From;
 b=qcRGQWswrs72o4CFS5O50kpODMtp0+BdWTFeouYQzFFm8hScts7O5VTydtSx9QSCK
 OkQlBd2GX+5l/MnB8g4NgvVOB9y7Dl4g/mVKZhJlbx8tX/Gqtlb0WLux19DRnR9d83
 Jwm5LHQSIE5UueJGHYPG05OepfuSOx6c5KC8UF3o=
Received: from [127.0.0.1] (localhost [127.0.0.1])
 (Authenticated sender: santiagorr@HIDDEN) by (piha) 
 with ESMTPSA id E9D6F1D88EC
Date: Thu, 13 Jul 2017 15:21:40 +0200
From: Santiago <santiagorr@HIDDEN>
To: bug-grep@HIDDEN
Subject: grep: Combining Mark-Nonspacing are classified as [:punct:]
Message-ID: <20170713132140.u6cnzfqbigp2xxzw@HIDDEN>
References: <20120305110843.27013.55764.reportbug@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20120305110843.27013.55764.reportbug@HIDDEN>
Content-Transfer-Encoding: quoted-printable
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.2.x-3.x [generic]
 [fuzzy]
X-detected-operating-system: by eggs.gnu.org: GNU/Linux 2.6.x
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.1 (----)
X-Debbugs-Envelope-To: submit
Cc: 662629-submitter@HIDDEN
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.1 (----)

Hi,

I would like to forward the issue below, reported by Panu Kalliokoski
in 2012 (better late than never!). I think the correct category is
Mark-nonspacing, but I am not very familiar with Unicode.

It still occurs in grep 3.1. In this case, using the U+0301 acute accent:

 $ echo árbol | grep -o '[[:alpha:]]*'
 a
 rbol

Cheers,

 -- Santiago

On Mon, 05 Mar 2012 13:08:43 +0200 "Panu A. Kalliokoski" <atehwa@HIDDEN> wrote:
> Package: grep
> Version: 2.6.3-3
> Severity: normal
>
>
> It seems that grep misclassifies combining letters (unicode class Lm) as
> punctuation, when they should be letters.  For instance:
>
> $ echo d̪ʌ̀lì | grep -o '[[:alpha:]]*'
> d
> ʌ
> li
>
> As a consequence, combining accents are not seen as "word-constituent":
>
> $ echo d̪ʌ̀lì | grep -o '\w*'
> d
> ʌ
> li
>
> This causes also false positives on word-boundary conditions, such as
> the below:
>
> $ echo d̪ʌ̀lì | grep -w ʌ
> d̪ʌ̀lì
>
> I suggest that combining letters should be part of [:alpha:] instead of
> [:punct:].




Acknowledgement sent to Santiago <santiagorr@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-grep@HIDDEN. Full text available.
Report forwarded to bug-grep@HIDDEN:
bug#27681; Package grep. Full text available.
Last modified: Tue, 31 Dec 2019 19:30:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.