GNU bug report logs - #21251
sed: POSIX and the z command


Package: sed; Severity: wishlist; Reported by: Stephane Chazelas <stephane.chazelas@HIDDEN>; Keywords: moreinfo notabug; dated Thu, 13 Aug 2015 14:56:01 UTC; Maintainer for sed is bug-sed@HIDDEN.
Severity set to 'wishlist' from 'normal' Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 21251 <at> debbugs.gnu.org:


Date: Tue, 31 Jan 2017 21:49:55 +0000
From: Stephane Chazelas <stephane.chazelas@HIDDEN>
To: Assaf Gordon <assafgordon@HIDDEN>
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170131214955.GB11631@HIDDEN>
References: <20150813145520.GC4313@HIDDEN>
 <20170128014818.GA15326@HIDDEN>
 <20170128100155.GA5699@HIDDEN>
 <20170128210424.GB8951@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170128210424.GB8951@HIDDEN>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 21251
Cc: 21251 <at> debbugs.gnu.org
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.18
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: 0.5 (/)

2017-01-28 21:04:25 +0000, Assaf Gordon:
[...]
> >>I'd expect the behaviour to be unspecified if the input is not
> >>text (as would be the case if there are invalid multi-byte
> >>sequences).
> >
> >Exactly.
> 
> So the above somewhat confuses me (as in my previous email):
> 
> Let's say I was to write a new simple 'sed' for POSIX systems.
> If POSIX/OpenGroup encourages me (as a software writer for POSIX
> systems) to use the POSIX regexec API, then implicitly my 'sed'
> program wouldn't match invalid multibyte sequences.
> But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
> it means that in practical terms I shouldn't use the POSIX API and
> should implement my own regex engine...
[...]

Just to clear up what I think might be the source of the confusion:
this bug is not about GNU sed failing to be POSIX compliant in
this instance (it is compliant). It is a documentation bug: the
claim that POSIX mandates that s/.*// not empty the pattern space
when it contains invalid characters is wrong. POSIX doesn't
mandate that; it mandates nothing of sed when the input is not
text.

The current sed behaviour is compliant. When the input is not
text, *anything* is compliant as POSIX leaves the behaviour of
sed unspecified then. That's an area not covered by POSIX,
you're on your own. In particular, you're free to ensure that
s/.*// empties the pattern space if you like.

That "simple sed" can do fgets() on a statically allocated
buffer of LINE_MAX length and use POSIX regexec() on it and
still be conformant.

Now, though that would be the subject of another "feature
request" bug and as you say one that would cover all the text
utilities, not just "sed", I (not POSIX) argue that it would be
better if individual bytes that don't form part of valid
characters would be treated as a character of their own rather
than pretend they're not there.

That could be done by adding a (non-POSIX) flag to regcomp() and
fnmatch() to enable that behaviour.

Or, like Python does in some cases, work with APIs that operate on
wchar_t* instead of char*, but for the translation from char*
UTF-8 to wchar_t*, use a reserved range for byte values that
don't form part of valid characters. Python, for instance, uses
code points U+DC80 to U+DCFF for bytes 0x80 to 0xff that don't
form part of valid characters (U+D800 to U+DFFF are not
characters; they are code points otherwise reserved for UTF-16
encoding).
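
A minimal sketch of that mapping, assuming the mechanism meant here is
Python's "surrogateescape" error handler. The helper name
decode_lossless is made up for illustration; it is not part of sed or of
any POSIX API.

```python
import re


def decode_lossless(data: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte 0xNN to code point U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


# c3 a9 is a valid e-acute; the lone 0x80 is not part of any character.
text = decode_lossless(b"\xc3\xa9\x80")
assert text == "\u00e9\udc80"

# In the decoded string the stray byte is an ordinary code point, so "."
# matches it and ".*" spans the whole buffer.
assert re.fullmatch(r"..", text) is not None
assert re.sub(r".*", "", text, count=1) == ""

# The mapping is reversible: encoding with the same handler restores the bytes.
assert text.encode("utf-8", errors="surrogateescape") == b"\xc3\xa9\x80"
```

This only illustrates the idea on the Python side; a C implementation
would need the equivalent mapping in its own char* to wchar_t*
conversion.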

Without having to change the APIs, another approach (in UTF-8
locales) could be to preprocess the input, changing for instance
a standalone 0x80 into the would-be UTF-8 encoding of U+DC80
(which "." currently matches even though it's not a character)
before calling regexec(), and to do the reverse on output.
That would have some performance impact though.
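
The preprocessing idea can be sketched at the byte level as follows.
This is an illustration in Python, not what any GNU tool does; the
function names escape_invalid and unescape_invalid are hypothetical, and
the sketch relies on Python's "surrogateescape" and "surrogatepass"
error handlers.

```python
def escape_invalid(data: bytes) -> bytes:
    """Rewrite each byte that is not part of valid UTF-8 as the would-be
    UTF-8 encoding of U+DC80..U+DCFF (for example 0x80 becomes ed b2 80)."""
    return data.decode("utf-8", "surrogateescape").encode("utf-8", "surrogatepass")


def unescape_invalid(data: bytes) -> bytes:
    """Reverse escape_invalid() on output, restoring the original bytes."""
    return data.decode("utf-8", "surrogatepass").encode("utf-8", "surrogateescape")


# A standalone 0x80 between two ASCII letters.
escaped = escape_invalid(b"a\x80b")
assert escaped == b"a\xed\xb2\x80b"
assert unescape_invalid(escaped) == b"a\x80b"

# Valid UTF-8 passes through unchanged.
assert escape_invalid(b"\xc3\xa9") == b"\xc3\xa9"
```

Input that already contains ed b2 80 is itself invalid UTF-8 under
Python's strict decoder, so its three bytes are escaped individually and
the round trip still restores them.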

Note that at the moment there's some discrepancy between GNU
tools in the treatment of the would-be UTF-8 encoding of those
U+D800 to U+DFFF non-characters (the UTF-16 surrogates).

For instance, some treat "ed b2 80" (the would-be UTF-8 encoding
of U+DC80) as 0 characters, some as 1, some as 3, some as 1 and 3
at the same time:

$ export C=$'\xed\xb2\x80'
$ bash -c '[[ $C = ??? ]]' && echo yes
yes

For bash (and zsh and ksh93), those 3 bytes don't form part of a
valid character, so each is treated as a character of its own,
which IMO is the best thing to do.

$ printf %s "$C" | wc -m
0

That's not a character, so we print 0 (as required by POSIX I
believe, wc is _not_ a text utility).

$ touch "$C"; find "$C" -name '*'
$ touch "$C"; find "$C" -name '?'
$ touch "$C"; find "$C" -name '???'
$

That file can't be matched by name!

$ printf '%s\n' "$C" | grep -xl .
(standard input)
$ printf '%s\n' "$C" | sed 's/^.$/yes/'
yes

But:

$ printf '%s\n' "$C" | grep -xPl .
$ printf '%s\n' "$C" | ./grep -Plx '.*'
(standard input)
$ printf '@%s@\n' "$C" | ./grep -Plx '@.*@'
$


Worse: it can be one character and three at the same time:

$ expr "$C" : '^.$'
3
$ printf '%s\n' "$C" | awk '/^.$/ {print length}'
3


(Note that's on Linux Mint 18.1, so not with the latest versions
of those utilities; one would have to check with the latest
versions.)

(again, that's not a POSIX compliance issue for text utilities).
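
The three possible counts (0, 1 and 3) can be reproduced with three
decoding policies. The sketch below uses Python's UTF-8 decoder purely
as an analogy: it is not the logic of wc, grep, expr or awk, and the
helper name count_chars is made up for illustration.

```python
C = b"\xed\xb2\x80"  # the would-be UTF-8 encoding of U+DC80


def count_chars(data: bytes, errors: str) -> int:
    """Count the code points obtained under a given error-handling policy."""
    return len(data.decode("utf-8", errors=errors))


counts = {
    # Invalid bytes are dropped, as if they were not there.
    "ignore": count_chars(C, "ignore"),
    # The sequence is accepted as the single code point U+DC80.
    "surrogatepass": count_chars(C, "surrogatepass"),
    # Each invalid byte becomes a code point of its own.
    "surrogateescape": count_chars(C, "surrogateescape"),
}
print(counts)
```

The three policies give 0, 1 and 3 respectively, which matches the
spread seen across the tools above.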

-- 
Stephane




Information forwarded to bug-sed@HIDDEN:
bug#21251; Package sed. Full text available.
Added tag(s) notabug and moreinfo. Request was from Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org. Full text available.

Message received at 21251 <at> debbugs.gnu.org:


Date: Sat, 28 Jan 2017 21:04:25 +0000
From: Assaf Gordon <assafgordon@HIDDEN>
To: Stephane Chazelas <stephane.chazelas@HIDDEN>
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170128210424.GB8951@HIDDEN>
References: <20150813145520.GC4313@HIDDEN>
 <20170128014818.GA15326@HIDDEN>
 <20170128100155.GA5699@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline
In-Reply-To: <20170128100155.GA5699@HIDDEN>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 21251
Cc: 21251 <at> debbugs.gnu.org

Hello Stephane,

On Sat, Jan 28, 2017 at 10:01:55AM +0000, Stephane Chazelas wrote:
>It doesn't preclude the use of regexec. It just leaves the
>behaviour unspecified when the input is not text

Thanks for the clarification.

>I'd argue that for sequences of bytes that don't form valid
>characters, it would be nicer if "." or "[^anything]" matched
>each of the individual bytes.

Concretely, GNU sed uses several regex engines now (gnulib's dfa for
fast matching, then either glibc's or gnulib's RE for general matching 
and substitution).

To support this behaviour we'll need to ensure all of them behave in
the same reproducible and reliable manner (not impossible, just a TODO).

>You can still find the discussion using the NNTP interface. I
>attach the most relevant message (from Geoff Clare of the Austin
>group). I can send you the whole discussion as a mailbox file if
>you like.

I would appreciate it if you could send it to me; I'm interested
in multibyte processing for other GNU programs as well.


>From: Geoff Clare <gwc@HIDDEN>
>> GNU sed even went as far as defining a new command for emptying
>> the pattern space to work around that problem:
>> [...]
>> Is that claim (about it being a POSIX requirement) true?
>
>I think it's true for regexec(), but not for sed.
>
>(Perhaps we should add a REG_EILSEQ error return for regexec().)
>
>> I'd expect the behaviour to be unspecified if the input is not
>> text (as would be the case if there are invalid multi-byte
>> sequences).
>
>Exactly.

So the above somewhat confuses me (as in my previous email):

Let's say I was to write a new simple 'sed' for POSIX systems.
If POSIX/OpenGroup encourages me (as a software writer for POSIX
systems) to use the POSIX regexec API, then implicitly my 'sed'
program wouldn't match invalid multibyte sequences.
But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
it means that in practical terms I shouldn't use the POSIX API and
should implement my own regex engine...

You compared it with LINE_MAX, but realistically, implementing support
for lines longer than LINE_MAX is a very different scale of effort than
implementing a new regex engine...

What am I missing ?

Thanks!
 - assaf






Information forwarded to bug-sed@HIDDEN:
bug#21251; Package sed. Full text available.

Message received at 21251 <at> debbugs.gnu.org:


Date: Sat, 28 Jan 2017 10:01:55 +0000
From: Stephane Chazelas <stephane.chazelas@HIDDEN>
To: Assaf Gordon <assafgordon@HIDDEN>
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170128100155.GA5699@HIDDEN>
References: <20150813145520.GC4313@HIDDEN>
 <20170128014818.GA15326@HIDDEN>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="PEIAKu/WMn1b1Hv9"
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20170128014818.GA15326@HIDDEN>
User-Agent: Mutt/1.5.24 (2015-08-30)
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 21251
Cc: 21251 <at> debbugs.gnu.org


--PEIAKu/WMn1b1Hv9
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

2017-01-28 01:48:19 +0000, Assaf Gordon:
[...]
> On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> >[...] The behaviour
> >of sed on non-text input is unspecified, so it doesn't require
> >that . not match a byte that is not part of a valid character.
> >[...]
> >That POSIX requirement is true for regexec() but not for text
> >utilities.
> 
> I'm far from familiar with POSIX intricacies, but doesn't that sound a bit
> strange ?  I would naively think that POSIX would encourage POSIX-compliant
> text utilities to use the system's native regexec implementation, instead of
> supporting slightly different semantics...

Hi Assaf,

It doesn't preclude the use of regexec. It just leaves the
behaviour unspecified when the input is not text, like when
lines are longer than LINE_MAX or when they contain NUL bytes or
when they contain sequences of bytes not forming valid
characters or when there are characters after the last newline
character.

Upon sequences of bytes that don't form valid characters, you're
free to exit with an error, shut down the computer, or whatever
you like, POSIX doesn't care.

What POSIX tells the users of the POSIX API (that is, script
writers and sed users) is that they can't expect anything on
non-text input.

GNU sed already handles lines longer than LINE_MAX nicely, as
well as lines containing NUL bytes or an unterminated last line.

I'd argue that for sequences of bytes that don't form valid
characters, it would be nicer if "." or "[^anything]" matched
each of the individual bytes. It's what bash's * and ? and
[!anything] fnmatch() patterns do (even though in that case
POSIX seems to forbid it; that has been discussed on the Austin
Group mailing list as well).

> >See that discussion on the Austin Group mailing list:
> >http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098
> 
> This link seems broken. Would you know where to find this discussion online
> ?
[...]

Yes. They relied on gmane for the mailing list archive. The web
interface has been discontinued
(https://lars.ingebrigtsen.no/2016/07/28/the-end-of-gmane/),
then taken over by somebody else, but not everything is back.
https://lars.ingebrigtsen.no/2016/09/06/gmane-alive/comment-page-1/

You can still find the discussion using the NNTP interface. I
attach the most relevant message (from Geoff Clare of the Austin
group). I can send you the whole discussion as a mailbox file if
you like.

-- 
Stephane

--PEIAKu/WMn1b1Hv9
Content-Type: message/rfc822
Content-Disposition: inline
Content-Transfer-Encoding: 8bit

Date: Wed, 1 Jul 2015 10:55:14 +0100
From: Geoff Clare <gwc@HIDDEN>
To: austin-group-l@HIDDEN
Subject: Re: UTF-8 and non-characters
Message-ID: <20150701095514.GA22396@HIDDEN>
References: <af4a72d56907da9667d250c1f27b231e@HIDDEN>
 <20150624194723.GB4187@HIDDEN>
 <20150625085001.GB9050@HIDDEN>
 <20150630202359.GI5093@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
In-Reply-To: <20150630202359.GI5093@HIDDEN>
User-Agent: Mutt/1.5.21 (2010-09-15)

Stephane Chazelas <stephane.chazelas@HIDDEN> wrote, on 30 Jun 2015:
>
> Speaking of which, would a pseudo-UTF-8 locale where bytes that
> don't form valid characters are mapped to a character like
> U+FFFD (�) be POSIX compliant?
> 
> Like c3 a9 is é, but c3 41 a9 is �A�
> 
> or if not all mapped to a single character, mapped to dedicated
> unassigned code points (0x7fffff80 to 0x7fffffff for instance)? 
> 
> For instance, above c3 41 a9 being <U+7fffffc3>A<U+7fffffa9>
> 
> If allowed, would that not be desirable (I can see it
> potentially be a problem when processing partial input)?

I think this would cause inconsistency between btowc() and the various
multi-byte to wide-character conversion functions.

If btowc(0xc3) returns a wide character, then mbtowc() on c3 a9 ought
to convert the c3 to that wide character and return 1, instead of
converting c3 a9 to a wide é and returning 2.

Conversely, if btowc(0xc3) returns WEOF, then mbtowc() on c3 41 a9
ought not to convert the c3 to a wide character.

> A common source of bugs and security vulnerabilities with
> UTF-8 is the fact that not all sequences of bytes map to
> characters and in particular that they're not matched by RE's
> "." or ".*" or fnmatch()'s ? or *.
> 
> That's a common problem when you can't guarantee the input is
> valid text for instance for arbitrary file names from the file
> system. That's quite common when dealing with file names that
> were written in a single-byte character set in UTF-8 locales.
> 
> For instance,
> 
> find . -name '*'
> 
> With GNU find at least, that doesn't match on $'St\xe9phane.txt'
> (Stéphane.txt in the iso8859-1 charset).
> 
> An example of a more serious problem:
> 
> find . ! -name "* *" -exec cmd-that-would-break-with-spaces {} +

It looks like the pattern matching sections of the standard have
some problems with the use of the terms character and string.

2.13.1 says * matches "multiple characters", but 2.13.2 says it
matches "any string" in item 1 and then says it matches "a string
of zero or more characters" (i.e. any character string) in item 3.

> GNU sed even went as far as defining a new command for emptying
> the pattern space to work around that problem:
> 
> `z'
>      This command empties the content of pattern space.  It is usually
>      the same as `s/.*//', but is more efficient and works in the
>      presence of invalid multibyte sequences in the input stream.
>      POSIX mandates that such sequences are _not_ matched by `.', so
>      that there is no portable way to clear `sed''s buffers in the
>      middle of the script in most multibyte locales (including UTF-8
>      locales).
> 
> Is that claim (about it being a POSIX requirement) true?

I think it's true for regexec(), but not for sed.

(Perhaps we should add a REG_EILSEQ error return for regexec().)

> I'd expect the behaviour to be unspecified if the input is not
> text (as would be the case if there are invalid multi-byte
> sequences).

Exactly.

> See also
> http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8
> where we wondered whether grep -vx '.*' was required to report
> lines with invalid multi-byte sequences.

Unspecified, for the same reason as for sed.

> There was also a discussion earlier here about shells' ? and *
> on invalid byte sequences and most shells seem to match
> individual bytes from invalid multibyte sequences as one
> character (except for yash that won't deal with those at all)
> which seems to me like the safest thing to do.
> 
> What's the OpenGroup position on that?

2.13.1 is clear that ? matches a character.

The requirements for * are ambiguous because of the conflicting text
I pointed out above.

-- 
Geoff Clare <g.clare@HIDDEN>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England


--PEIAKu/WMn1b1Hv9--




Information forwarded to bug-sed@HIDDEN:
bug#21251; Package sed. Full text available.

Message received at 21251 <at> debbugs.gnu.org:


Date: Sat, 28 Jan 2017 01:48:19 +0000
From: Assaf Gordon <assafgordon@HIDDEN>
To: Stephane Chazelas <stephane.chazelas@HIDDEN>
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170128014818.GA15326@HIDDEN>
References: <20150813145520.GC4313@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii; format=flowed
Content-Disposition: inline
In-Reply-To: <20150813145520.GC4313@HIDDEN>
User-Agent: Mutt/1.5.23 (2014-03-12)
X-Spam-Score: 0.5 (/)
X-Debbugs-Envelope-To: 21251
Cc: 21251 <at> debbugs.gnu.org

Hello Stephane,

Sorry for the delayed response. I'm triaging old sed bugs.

On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> [...] The behaviour
> of sed on non-text input is unspecified, so it doesn't require
> that . not match a byte that is not part of a valid character.
> [...]
> That POSIX requirement is true for regexec() but not for text
> utilities.

I'm far from familiar with POSIX intricacies, but doesn't that sound a
bit strange ?  I would naively think that POSIX would encourage
POSIX-compliant text utilities to use the system's native regexec
implementation, instead of supporting slightly different semantics...

> See that discussion on the Austin Group mailing list:
> http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098

This link seems broken. Would you know where to find this discussion 
online?


thanks,
 - assaf




Information forwarded to bug-sed@HIDDEN:
bug#21251; Package sed. Full text available.

Message received at submit <at> debbugs.gnu.org:


Received: (at submit) by debbugs.gnu.org; 13 Aug 2015 14:55:41 +0000
From debbugs-submit-bounces <at> debbugs.gnu.org Thu Aug 13 10:55:41 2015
Received: from localhost ([127.0.0.1]:55142 helo=debbugs.gnu.org)
	by debbugs.gnu.org with esmtp (Exim 4.80)
	(envelope-from <debbugs-submit-bounces <at> debbugs.gnu.org>)
	id 1ZPtut-0006UX-1A
	for submit <at> debbugs.gnu.org; Thu, 13 Aug 2015 10:55:40 -0400
Received: from eggs.gnu.org ([208.118.235.92]:43177)
 by debbugs.gnu.org with esmtp (Exim 4.80)
 (envelope-from <stephane.chazelas@HIDDEN>) id 1ZPtur-0006UL-Iy
 for submit <at> debbugs.gnu.org; Thu, 13 Aug 2015 10:55:38 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <stephane.chazelas@HIDDEN>) id 1ZPtuq-0003ft-2Y
 for submit <at> debbugs.gnu.org; Thu, 13 Aug 2015 10:55:36 -0400
X-Spam-Checker-Version: SpamAssassin 3.3.2 (2011-06-06) on eggs.gnu.org
X-Spam-Level: 
X-Spam-Status: No, score=0.8 required=5.0 tests=BAYES_50,FREEMAIL_FROM,
 T_DKIM_INVALID autolearn=disabled version=3.3.2
Received: from lists.gnu.org ([2001:4830:134:3::11]:35478)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <stephane.chazelas@HIDDEN>) id 1ZPtup-0003fo-QV
 for submit <at> debbugs.gnu.org; Thu, 13 Aug 2015 10:55:35 -0400
Received: from eggs.gnu.org ([2001:4830:134:3::10]:53149)
 by lists.gnu.org with esmtp (Exim 4.71)
 (envelope-from <stephane.chazelas@HIDDEN>) id 1ZPtuj-0005fi-4U
 for bug-sed@HIDDEN; Thu, 13 Aug 2015 10:55:35 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from <stephane.chazelas@HIDDEN>) id 1ZPtud-0003Yc-88
 for bug-sed@HIDDEN; Thu, 13 Aug 2015 10:55:28 -0400
Received: from mail-wi0-x229.google.com ([2a00:1450:400c:c05::229]:36625)
 by eggs.gnu.org with esmtp (Exim 4.71)
 (envelope-from <stephane.chazelas@HIDDEN>) id 1ZPtud-0003YQ-17
 for bug-sed@HIDDEN; Thu, 13 Aug 2015 10:55:23 -0400
Received: by wicja10 with SMTP id ja10so154646206wic.1
 for <bug-sed@HIDDEN>; Thu, 13 Aug 2015 07:55:22 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113;
 h=date:from:to:subject:message-id:mime-version:content-type
 :content-disposition:user-agent;
 bh=9CKUa1so1Eh8s31CNAR1W+AZfWenscQ4zq9P1Q0SdWI=;
 b=GJHm7wfZH0ueb3UX7I3MBOubL/Rv2+wTcBfLUGs/vu76R/zZZtW7s5Z7n7o+OlUxBA
 858iw0PpCUTKE74YVMMUWriS0boR6a3m0GRHa7KacJz/3mQLUsF7ZrZ0BaY+RPR7q+Xt
 l6EhRMTXTLzoJCaaINzdnFcR2JpSWAVhzLqYEUuBNIoCqb4Ixk68CmoIVI8gMVB3LevH
 uXlF8LuAKGs9iBcREaaz8uPU8rcLwKtXKROkoz9mb9tEZuudjN4UbQHoWV2mlBhXTxrR
 M3BMU4xXniY9cwGCDLRHzfN4Ut9xyfXqEpfLy7wuauEFsHtXECoqt/4ZgLf9D6XoqQgI
 tGwg==
X-Received: by 10.180.211.11 with SMTP id my11mr54793412wic.51.1439477722228; 
 Thu, 13 Aug 2015 07:55:22 -0700 (PDT)
Received: from chaz.gmail.com (05448dab.skybroadband.com. [5.68.141.171])
 by smtp.gmail.com with ESMTPSA id by17sm3759280wib.18.2015.08.13.07.55.21
 for <bug-sed@HIDDEN> (version=TLSv1.2 cipher=RC4-SHA bits=128/128);
 Thu, 13 Aug 2015 07:55:21 -0700 (PDT)
Date: Thu, 13 Aug 2015 15:55:20 +0100
From: Stephane Chazelas <stephane.chazelas@HIDDEN>
To: bug-sed@HIDDEN
Subject: sed: POSIX and the z command
Message-ID: <20150813145520.GC4313@HIDDEN>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.5.21 (2010-09-15)
X-detected-operating-system: by eggs.gnu.org: Error: Malformed IPv6 address
 (bad octet value).
X-Received-From: 2001:4830:134:3::11
X-Spam-Score: -4.0 (----)
X-Debbugs-Envelope-To: submit
X-BeenThere: debbugs-submit <at> debbugs.gnu.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: <debbugs-submit.debbugs.gnu.org>
List-Unsubscribe: <https://debbugs.gnu.org/cgi-bin/mailman/options/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=unsubscribe>
List-Archive: <https://debbugs.gnu.org/cgi-bin/mailman/private/debbugs-submit/>
List-Post: <mailto:debbugs-submit <at> debbugs.gnu.org>
List-Help: <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=help>
List-Subscribe: <https://debbugs.gnu.org/cgi-bin/mailman/listinfo/debbugs-submit>, 
 <mailto:debbugs-submit-request <at> debbugs.gnu.org?subject=subscribe>
Errors-To: debbugs-submit-bounces <at> debbugs.gnu.org
Sender: "Debbugs-submit" <debbugs-submit-bounces <at> debbugs.gnu.org>
X-Spam-Score: -4.0 (----)

Last one for today ;)

The GNU sed documentation has:

`z'
     This command empties the content of pattern space.  It is usually
     the same as `s/.*//', but is more efficient and works in the
     presence of invalid multibyte sequences in the input stream.
     POSIX mandates that such sequences are _not_ matched by `.', so
     that there is no portable way to clear `sed''s buffers in the
     middle of the script in most multibyte locales (including UTF-8
     locales).

The part about the POSIX requirement is not true. The behaviour
of sed on non-text input is unspecified, so it doesn't require
that . not match a byte that is not part of a valid character.

GNU sed's (or grep's, for that matter) . (or [^[:alnum:]]...)
could just as well match every byte that doesn't otherwise form
part of a valid character (which would be a much better
behaviour IMO) and still be POSIX compliant.

That POSIX requirement is true for regexec() but not for text
utilities.
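
The difference under discussion can be observed with a small sketch
like the one below. It is only an illustration under stated
assumptions: the `z' command needs GNU sed, the helper name
clear_with is made up for this example, and the locale name
en_US.UTF-8 is a guess that may not be installed (in which case sed
typically falls back to the C locale). Byte 0x80 on its own is not a
valid character in UTF-8.

```shell
#!/bin/sh
# clear_with LOCALE SCRIPT  (hypothetical helper, not part of sed)
# Feeds the line "a<0x80>b" to sed under LOCALE, runs SCRIPT, and
# prints whatever is left as hex bytes on one line.
clear_with() {
    printf 'a\200b\n' | LC_ALL="$1" sed "$2" | od -An -tx1 | tr -d ' \n'
    echo
}

# UTF-8 locale (name assumed): `.' may stop at the invalid byte, so
# part of the line can survive the substitution.
clear_with en_US.UTF-8 's/.*//'

# GNU extension: empties the pattern space regardless of its content.
clear_with en_US.UTF-8 'z'

# C locale: every byte counts as a character, so `.*' covers the
# whole line; with GNU sed this prints 0a (only the newline is left).
clear_with C 's/.*//'
```

Forcing LC_ALL=C for the one sed invocation is a common workaround
where `z' is not available, at the cost of losing multibyte-aware
matching in the rest of the script.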

See that discussion on the Austin Group mailing list:
http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098

-- 
Stephane




Acknowledgement sent to Stephane Chazelas <stephane.chazelas@HIDDEN>:
New bug report received and forwarded. Copy sent to bug-sed@HIDDEN. Full text available.
Report forwarded to bug-sed@HIDDEN:
bug#21251; Package sed. Full text available.
Last modified: Tue, 9 Oct 2018 11:30:02 UTC

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997 nCipher Corporation Ltd, 1994-97 Ian Jackson.