Assaf Gordon <assafgordon@HIDDEN> to control <at> debbugs.gnu.org.
From: Stephane Chazelas <stephane.chazelas@HIDDEN>
To: Assaf Gordon <assafgordon@HIDDEN>
Cc: 21251 <at> debbugs.gnu.org
Date: Tue, 31 Jan 2017 21:49:55 +0000
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170131214955.GB11631@HIDDEN>
In-Reply-To: <20170128210424.GB8951@HIDDEN>

2017-01-28 21:04:25 +0000, Assaf Gordon:
[...]
> >>I'd expect the behaviour to be unspecified if the input is not
> >>text (as would be the case if there are invalid multi-byte
> >>sequences).
> >
> >Exactly.
>
> So the above somewhat confuses me (as my previous email):
>
> Let's say I was to write a new simple 'sed' for POSIX systems.
> If POSIX/OpenGroup encourages me (as a software writer for posix
> systems) to use the POSIX regexec API, then implicitly my 'sed'
> program wouldn't match invalid multibyte sequences.
> But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
> it means that in practical terms I shouldn't use POSIX API and
> implement my own regex engine...
[...]

Just to clear up what I think might be the source of the confusion:
this bug is not about GNU sed failing to be POSIX compliant in this
instance (it is compliant). It is a documentation bug: the claim that
POSIX mandates that s/.*// not empty the pattern space when it
contains invalid characters is wrong.

POSIX doesn't mandate that; it mandates nothing of sed when the input
is not text.

The current sed behaviour is compliant. When the input is not text,
*anything* is compliant, as POSIX leaves the behaviour of sed
unspecified then. That's an area not covered by POSIX; you're on your
own. In particular, you're free to ensure that s/.*// empties the
pattern space if you like.

That "simple sed" can do fgets() on a statically allocated buffer of
LINE_MAX length and use POSIX regexec() on it and still be conformant.

Now, though that would be the subject of another "feature request"
bug and, as you say, one that would cover all the text utilities, not
just "sed", I (not POSIX) argue that it would be better if individual
bytes that don't form part of valid characters were each treated as a
character of their own rather than pretending they're not there.

That could be done by adding a (non-POSIX) flag to regcomp() and
fnmatch() to enable that behaviour.
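As a rough illustration of the behaviour argued for above (each byte that does not form part of a valid character counts as a character of its own), here is a minimal Python sketch. It assumes UTF-8 as the locale encoding and leans on Python's strict UTF-8 decoder to decide what is valid; the function name `split_chars` is hypothetical and does not correspond to anything in sed, regcomp() or fnmatch().

```python
def split_chars(data: bytes) -> list:
    """Split data into UTF-8 characters.

    Every byte that is not part of a valid character is returned as a
    one-byte item of its own instead of being silently skipped.
    Assumption: the input is meant to be UTF-8.
    """
    chars = []
    i = 0
    while i < len(data):
        # A valid UTF-8 character is 1 to 4 bytes long; try the
        # shortest candidate first.
        for n in (1, 2, 3, 4):
            chunk = data[i:i + n]
            try:
                chunk.decode("utf-8")  # strict: rejects invalid sequences
            except UnicodeDecodeError:
                continue
            break
        else:
            # No prefix decodes: treat the stray byte as a character.
            chunk = data[i:i + 1]
        chars.append(chunk)
        i += len(chunk)
    return chars


if __name__ == "__main__":
    # "ed b2 80" is the would-be UTF-8 encoding of U+DC80, which a
    # strict decoder rejects, so it counts as three characters here.
    print(len(split_chars(b"\xed\xb2\x80")))
```

Under these assumptions a pattern such as "." would have something to match for every byte of the input, which is the property asked for; how a real regex engine would implement this is not covered by the sketch.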
Alternatively, as Python does in some cases, one could work with APIs
that operate on wchar_t* instead of char* and, in the translation from
UTF-8 char* to wchar_t*, use a reserved range for byte values that
don't form part of valid characters. Python uses code points U+DC80 to
U+DCFF for bytes 0x80 to 0xff that don't form part of valid characters
(U+D800 to U+DFFF are not characters; they are code points otherwise
reserved for the UTF-16 encoding).

Without having to change the APIs, another approach (in UTF-8 locales)
could be to preprocess the input to change, for instance, a standalone
0x80 into the would-be UTF-8 encoding of U+DC80 before calling
regexec() (for which at the moment "." matches on it even though it's
not a character) and do the reverse on output. That would have some
performance impact though.

Note that at the moment there's some discrepancy between GNU tools in
the treatment of the would-be UTF-8 encoding of those D800-DFFF
non-characters (the UTF-16 surrogates).

For instance, some treat "ed b2 80" (the would-be UTF-8 encoding of
U+DC80) as 0 characters, some as 1, some as 3, some as 1 and 3 at the
same time:

  $ export C=$'\xed\xb2\x80'
  $ bash -c '[[ $C = ??? ]]' && echo yes
  yes

For bash (and zsh and ksh93), those 3 bytes don't form part of a valid
character, so each is considered a character, which IMO is the best
thing to do.

  $ printf %s "$C" | wc -m
  0

That's not a character, so we print 0 (as required by POSIX I believe;
wc is _not_ a text utility).

  $ touch "$C"; find "$C" -name '*'
  $ touch "$C"; find "$C" -name '?'
  $ touch "$C"; find "$C" -name '???'
  $

That file can't be matched by name!

  $ printf '%s\n' "$C" | grep -xl .
  (standard input)
  $ printf '%s\n' "$C" | sed 's/^.$/yes/'
  yes

But:

  $ printf '%s\n' "$C" | grep -xPl .
  $ printf '%s\n' "$C" | ./grep -Plx '.*'
  (standard input)
  $ printf '@%s@\n' "$C" | ./grep -Plx '@.*@'
  $

Worse: it can be one character and three at the same time:

  $ expr "$C" : '^.$'
  3
  $ printf '%s\n' "$C" | awk '/^.$/ {print length}'
  3

(Note that this is on Linux Mint 18.1, so not with the latest versions
of those utilities; one would have to check with the latest versions.)

(Again, that's not a POSIX compliance issue for text utilities.)

-- 
Stephane
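The Python behaviour mentioned in the message above, and the suggested preprocessing step, can be sketched with the standard `surrogateescape` and `surrogatepass` error handlers. This is only an illustration of the idea, not how GNU sed or regexec() work; the four function names are hypothetical helpers introduced here.

```python
def bytes_to_text(raw: bytes) -> str:
    # Each byte 0x80..0xFF that is not part of a valid UTF-8 character
    # becomes a lone surrogate in the range U+DC80..U+DCFF.
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_bytes(text: str) -> bytes:
    # Reverse mapping: lone surrogates U+DC80..U+DCFF become the
    # original bytes 0x80..0xFF again.
    return text.encode("utf-8", errors="surrogateescape")


def preprocess(raw: bytes) -> bytes:
    """Rewrite stray bytes as the would-be UTF-8 encoding of U+DC80..U+DCFF."""
    return bytes_to_text(raw).encode("utf-8", errors="surrogatepass")


def postprocess(cooked: bytes) -> bytes:
    """Undo preprocess(); assumes cooked was produced by preprocess()."""
    text = cooked.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-8", errors="surrogateescape")


if __name__ == "__main__":
    # A standalone 0x80 becomes "ed b2 80", the would-be encoding of U+DC80.
    print(preprocess(b"\x80").hex(" "))
```

The round trip is lossless for arbitrary byte strings because an input that already contains "ed b2 80" is itself rejected by the strict decoder and escaped byte by byte. As the message notes, the extra pass over input and output has a cost.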
bug-sed@HIDDEN: bug#21251; Package sed.
From: Assaf Gordon <assafgordon@HIDDEN>
To: Stephane Chazelas <stephane.chazelas@HIDDEN>
Cc: 21251 <at> debbugs.gnu.org
Date: Sat, 28 Jan 2017 21:04:25 +0000
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170128210424.GB8951@HIDDEN>
In-Reply-To: <20170128100155.GA5699@HIDDEN>

Hello Stephane,

On Sat, Jan 28, 2017 at 10:01:55AM +0000, Stephane Chazelas wrote:
>It doesn't preclude the use of regexec. It just leaves the
>behaviour unspecified when the input is not text

Thanks for the clarification.

>I'd argue that for sequences of bytes that don't form valid
>characters, it would be nicer if "." or "[^anything]" matched
>each of the individual bytes.

Concretely, GNU sed uses several regex engines now (gnulib's dfa for
fast matching, then either glibc's or gnulib's RE for general matching
and substitution). To support this behaviour we'll need to ensure all
of them behave in the same reproducible and reliable manner (not
impossible, just a TODO).

>You can still find the discussion using the NNTP interface. I
>attach the most relevant message (from Geoff Clare of the Austin
>group). I can send you the whole discussion as a mailbox file if
>you like.

I would appreciate it if you could send it to me - I'm interested in
multibyte processing for other GNU programs as well.

>From: Geoff Clare <gwc@HIDDEN>
>> GNU sed even went as far as defining a new command for emptying
>> the pattern space to work around that problem:
>> [...]
>> Is that claim (about it being a POSIX requirement) true?
>
>I think it's true for regexec(), but not for sed.
>
>(Perhaps we should add a REG_EILSEQ error return for regexec().)
>
>> I'd expect the behaviour to be unspecified if the input is not
>> text (as would be the case if there are invalid multi-byte
>> sequences).
>
>Exactly.

So the above somewhat confuses me (as my previous email):

Let's say I was to write a new simple 'sed' for POSIX systems.
If POSIX/OpenGroup encourages me (as a software writer for posix
systems) to use the POSIX regexec API, then implicitly my 'sed'
program wouldn't match invalid multibyte sequences.
But if OpenGroup wants me to match invalid multibyte sequences in 'sed',
it means that in practical terms I shouldn't use POSIX API and
implement my own regex engine...

You compared it with LINE_MAX, but realistically, implementing support
for lines longer than LINE_MAX is a very different scale of effort than
implementing a new regex engine...

What am I missing?

Thanks!
 - assaf
From: Stephane Chazelas <stephane.chazelas@HIDDEN>
To: Assaf Gordon <assafgordon@HIDDEN>
Cc: 21251 <at> debbugs.gnu.org
Date: Sat, 28 Jan 2017 10:01:55 +0000
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170128100155.GA5699@HIDDEN>
In-Reply-To: <20170128014818.GA15326@HIDDEN>

2017-01-28 01:48:19 +0000, Assaf Gordon:
[...]
> On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> >[...] The behaviour
> >of sed on non-text input is unspecified, so it doesn't require
> >that . not match a byte that is not part of a valid character.
> >[...]
> >That POSIX requirement is true for regexec() but not for text
> >utilities.
>
> I'm far from familiar with POSIX intricacies, but doesn't that sound a bit
> strange? I would naively think that POSIX would encourage POSIX-compliant
> text utilities to use the system's native regexec implementation, instead of
> supporting slightly different semantics...

Hi Assaf,

It doesn't preclude the use of regexec. It just leaves the behaviour
unspecified when the input is not text: when lines are longer than
LINE_MAX, when they contain NUL bytes, when they contain sequences of
bytes not forming valid characters, or when there are characters after
the last newline character.

Upon sequences of bytes that don't form valid characters, you're free
to exit with an error, shut down the computer, or whatever you like;
POSIX doesn't care. What POSIX tells the users of the POSIX API (that
is, script writers and sed users) is that they can't expect anything
on non-text input.

GNU sed already handles lines longer than LINE_MAX nicely, as well as
lines containing NUL bytes or an unterminated last line.

I'd argue that for sequences of bytes that don't form valid
characters, it would be nicer if "." or "[^anything]" matched each of
the individual bytes.

It's what bash's *, ? and [!anything] fnmatch() patterns do (even
though in that case POSIX seems to forbid it; that has been discussed
on the Austin Group mailing list as well).

> >See that discussion on the Austin Group mailing list:
> >http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098
>
> This link seems broken. Would you know where to find this discussion online
> ?
[...]

Yes. They relied on gmane for the mailing list archive. The web
interface has been discontinued
(https://lars.ingebrigtsen.no/2016/07/28/the-end-of-gmane/), then
taken over by somebody else, but not everything is back:
https://lars.ingebrigtsen.no/2016/09/06/gmane-alive/comment-page-1/

You can still find the discussion using the NNTP interface. I attach
the most relevant message (from Geoff Clare of the Austin group). I
can send you the whole discussion as a mailbox file if you like.

-- 
Stephane

[Attached message (message/rfc822)]

From: Geoff Clare <gwc@HIDDEN>
To: austin-group-l@HIDDEN
Date: Wed, 1 Jul 2015 10:55:14 +0100
Subject: Re: UTF-8 and non-characters
Message-ID: <20150701095514.GA22396@HIDDEN>
In-Reply-To: <20150630202359.GI5093@HIDDEN>

Stephane Chazelas <stephane.chazelas@HIDDEN> wrote, on 30 Jun 2015:
>
> Speaking of which, would a pseudo-UTF-8 locale where bytes that
> don't form valid characters are mapped to a character like
> U+FFFD (�) be POSIX compliant?
>
> Like c3 a9 is é, but c3 41 a9 is �A�
>
> or if not all mapped to a single character, mapped to dedicated
> unassigned code points (0x7fffff80 to 0x7fffffff for instance)?
>
> For instance, above c3 41 a9 being <U+7fffffc3>A<U+7fffffa9>
>
> If allowed, would that not be desirable (I can see it
> potentially be a problem when processing partial input)?

I think this would cause inconsistency between btowc() and the
various multi-byte to wide-character conversion functions.
If btowc(0xc3) returns a wide character, then mbtowc() on c3 a9
ought to convert the c3 to that wide character and return 1,
instead of converting c3 a9 to a wide é and returning 2.
Conversely, if btowc(0xc3) returns WEOF, then mbtowc() on
c3 41 a9 ought not to convert the c3 to a wide character.

> A common source of bugs and security vulnerabilities with
> UTF-8 is the fact that not all sequences of bytes map to
> characters and in particular that they're not matched by RE's
> "." or ".*" or fnmatch()'s ? or *.
>
> That's a common problem when you can't guarantee the input is
> valid text for instance for arbitrary file names from the file
> system. That's quite common when dealing with file names that
> were written in a single-byte character set in UTF-8 locales.
>
> For instance,
>
> find . -name '*'
>
> With GNU find at least doesn't match on $'St\xe9phane.txt'
> (Stéphane.txt in the iso8859-1 charset).
>
> An example of a more serious problem:
>
> find . ! -name "* *" -exec cmd-that-would-break-with-spaces {} +

It looks like the pattern matching sections of the standard have
some problems with the use of the terms character and string.
2.13.1 says * matches "multiple characters", but 2.13.2 says it
matches "any string" in item 1 and then says it matches "a string
of zero or more characters" (i.e. any character string) in item 3.

> GNU sed even went as far as defining a new command for emptying
> the pattern space to work around that problem:
>
> `z'
>     This command empties the content of pattern space. It is usually
>     the same as `s/.*//', but is more efficient and works in the
>     presence of invalid multibyte sequences in the input stream.
>     POSIX mandates that such sequences are _not_ matched by `.', so
>     that there is no portable way to clear `sed''s buffers in the
>     middle of the script in most multibyte locales (including UTF-8
>     locales).
>
> Is that claim (about it being a POSIX requirement) true?

I think it's true for regexec(), but not for sed.

(Perhaps we should add a REG_EILSEQ error return for regexec().)

> I'd expect the behaviour to be unspecified if the input is not
> text (as would be the case if there are invalid multi-byte
> sequences).

Exactly.

> See also
> http://unix.stackexchange.com/questions/6516/filtering-invalid-utf8
> where we wondered whether grep -vx '.*' was required to report
> lines with invalid multi-byte sequences.

Unspecified, for the same reason as for sed.

> There was also a discussion earlier here about shells' ? and *
> on invalid byte sequences and most shells seem to match
> individual bytes from invalid multibyte sequences as one
> character (except for yash that won't deal with those at all)
> which seem to me like the safest thing to do.
>
> What's the OpenGroup position on that?

2.13.1 is clear that ? matches a character. The requirements
for * are ambiguous because of the conflicting text I pointed
out above.

-- 
Geoff Clare <g.clare@HIDDEN>
The Open Group, Apex Plaza, Forbury Road, Reading, RG1 1AX, England
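The two ideas quoted in the attached message (mapping undecodable bytes to U+FFFD, and the file-name matching hazard) can be tried out in Python. This is only a sketch: it does not address the btowc()/mbtowc() consistency objection raised in the reply, and the helper names `to_replacement` and `has_space`, as well as the second file name in the example, are hypothetical.

```python
import fnmatch


def to_replacement(raw: bytes) -> str:
    # Bytes that do not form valid UTF-8 characters are mapped to
    # U+FFFD, as in the "c3 41 a9" example from the quoted message.
    return raw.decode("utf-8", errors="replace")


def has_space(name: bytes) -> bool:
    # With bytes arguments, fnmatch matches byte by byte, so a name
    # that is not valid UTF-8 is still matched by "* *".
    return fnmatch.fnmatchcase(name, b"* *")


if __name__ == "__main__":
    latin1_name = b"St\xe9phane.txt"        # not valid UTF-8
    spaced_name = b"St\xe9phane notes.txt"  # hypothetical name with a space
    print(repr(to_replacement(b"\xc3\x41\xa9")))
    print(has_space(latin1_name), has_space(spaced_name))
```

Matching on raw bytes is one way to keep a filter such as `! -name "* *"` from letting through names that merely fail to decode; whether a given find or shell does the same is exactly the point under discussion in this thread.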
From: Assaf Gordon <assafgordon@HIDDEN>
To: Stephane Chazelas <stephane.chazelas@HIDDEN>
Cc: 21251 <at> debbugs.gnu.org
Date: Sat, 28 Jan 2017 01:48:19 +0000
Subject: Re: bug#21251: sed: POSIX and the z command
Message-ID: <20170128014818.GA15326@HIDDEN>
In-Reply-To: <20150813145520.GC4313@HIDDEN>

Hello Stephane,

Sorry for the delayed response. I'm triaging old sed bugs.

On Thu, Aug 13, 2015 at 03:55:20PM +0100, Stephane Chazelas wrote:
> [...] The behaviour
> of sed on non-text input is unspecified, so it doesn't require
> that . not match a byte that is not part of a valid character.
> [...]
> That POSIX requirement is true for regexec() but not for text
> utilities.

I'm far from familiar with POSIX intricacies, but doesn't that sound a bit
strange? I would naively think that POSIX would encourage POSIX-compliant
text utilities to use the system's native regexec implementation, instead of
supporting slightly different semantics...

> See that discussion on the Austin Group mailing list:
> http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098

This link seems broken. Would you know where to find this discussion online
?

thanks,
 - assaf
Full text available.
Date: Thu, 13 Aug 2015 15:55:20 +0100
From: Stephane Chazelas <stephane.chazelas@HIDDEN>
To: bug-sed@HIDDEN
Subject: sed: POSIX and the z command
Message-ID: <20150813145520.GC4313@HIDDEN>

Last one for today ;)

The GNU sed documentation has:

    `z'
         This command empties the content of pattern space.  It is
         usually the same as `s/.*//', but is more efficient and works
         in the presence of invalid multibyte sequences in the input
         stream.  POSIX mandates that such sequences are _not_ matched
         by `.', so that there is no portable way to clear `sed''s
         buffers in the middle of the script in most multibyte locales
         (including UTF-8 locales).

The part about the POSIX requirement is not true. The behaviour of
sed on non-text input is unspecified, so POSIX doesn't require that
`.' not match a byte that is not part of a valid character.

GNU sed's (or grep's, for that matter) `.' (or [^[:alnum:]]...) could
just as well match every byte that doesn't otherwise form part of a
valid character (which would be a much better behaviour IMO) and
still be POSIX compliant.

That POSIX requirement is true for regexec() but not for text
utilities. See that discussion on the Austin Group mailing list:

http://thread.gmane.org/gmane.comp.standards.posix.austin.general/11059/focus=11098

-- 
Stephane
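To make the difference discussed in the report concrete, here is a minimal sketch that feeds a line containing a stray 0x80 byte (not a valid UTF-8 character on its own) to both `z` and `s/.*//` and prints what remains as hex. It assumes a GNU sed recent enough to provide `z`, and an installed UTF-8 locale; the locale name C.UTF-8 and the helper names clear_with_z and clear_with_s are illustrative assumptions, not something taken from the report.

```shell
#!/bin/sh
# Input line used throughout: "a", a stray 0x80 byte, "b", newline.
# Both helpers take a locale name and print the resulting bytes as hex.

# Clear the pattern space with the GNU z command.
clear_with_z() {
    printf 'a\200b\n' | LC_ALL="$1" sed 'z' | od -An -tx1 | tr -d ' \n'
}

# Try to clear the pattern space with s/.*// instead.
clear_with_s() {
    printf 'a\200b\n' | LC_ALL="$1" sed 's/.*//' | od -An -tx1 | tr -d ' \n'
}

# z empties the pattern space whatever the locale: only the newline (0a) is left.
echo "z       in C.UTF-8: $(clear_with_z C.UTF-8)"

# In the C locale every byte is a character, so s/.*// also leaves only 0a.
echo "s/.*//  in C:       $(clear_with_s C)"

# In a UTF-8 locale the result depends on whether '.' matches the stray byte;
# with the behaviour the quoted documentation describes, bytes from 0x80
# onward would be left behind. No expected output is claimed here.
echo "s/.*//  in C.UTF-8: $(clear_with_s C.UTF-8)"
```

If the named UTF-8 locale is not installed, sed falls back to the C locale and the last line will simply match the second one, so check `locale -a` before drawing conclusions from it.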
Stephane Chazelas <stephane.chazelas@HIDDEN>
:bug-sed@HIDDEN
.
Full text available.bug-sed@HIDDEN
:bug#21251
; Package sed
.
Full text available.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997 nCipher Corporation Ltd,
1994-97 Ian Jackson.