GNU bug report logs - #40242
n as delimiter alias

Package: sed

Reported by: Oğuz <oguzismailuysal <at> gmail.com>

Date: Thu, 26 Mar 2020 15:31:02 UTC

Severity: normal

Tags: confirmed

Merged with 40239

Done: Jim Meyering <jim <at> meyering.net>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 40242 in the body.
You can then email your comments to 40242 AT debbugs.gnu.org in the normal way.

Report forwarded to bug-sed <at> gnu.org:
bug#40242; Package sed. (Thu, 26 Mar 2020 15:31:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Oğuz <oguzismailuysal <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Thu, 26 Mar 2020 15:31:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Oğuz <oguzismailuysal <at> gmail.com>
To: bug-sed <at> gnu.org
Subject: n as delimiter alias
Date: Thu, 26 Mar 2020 07:30:16 +0200
[Message part 1 (text/plain, inline)]
$ sed --version
sed (GNU sed) 4.7
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.
GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed <at> gnu.org>.

While '\t' matches a literal 't' when 't' is the delimiter, '\n' does not
match 'n' when 'n' is the delimiter. See:

$ echo t | sed 'st\ttt' | xxd
00000000: 0a                                       .
$
$ echo n | sed 'sn\nnn' | xxd
00000000: 6e0a

Is this a bug or is there a sound logic behind this?


-- 
Oğuz

Information forwarded to bug-sed <at> gnu.org:
bug#40242; Package sed. (Tue, 31 Mar 2020 04:43:01 GMT) Full text and rfc822 format available.

Message #8 received at 40242 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: Oğuz <oguzismailuysal <at> gmail.com>, 40242 <at> debbugs.gnu.org
Subject: Re: bug#40242: n as delimiter alias
Date: Mon, 30 Mar 2020 22:42:09 -0600
tags 40242 confirmed
stop

Hello,

On 2020-03-25 11:30 p.m., Oğuz wrote:
> While '\t' matches a literal 't' when 't' is the delimiter, '\n' does not
> match 'n' when 'n' is the delimiter. See:
> 
> $ echo t | sed 'st\ttt' | xxd
> 00000000: 0a                                       .
> $
> $ echo n | sed 'sn\nnn' | xxd
> 00000000: 6e0a
> 
> Is this a bug or is there a sound logic behind this?

Thank you for finding this interesting edge-case.

I think it is a (very old) bug. I'm not sure about its origin,
perhaps Jim or Paolo can comment.

First,
let's start with what's expected (slightly modifying your examples):

The canonical usage, here "\t" becomes a TAB, and "t" is not replaced:

   $ printf t | sed 's/\t//' | od -a -An
      t

Then, using a different character "q" instead of "/", works the same:

   $ printf t | sed 'sq\tqq' | od -a -An
      t

The sed manual says (in section "3.3 The s command"):
      "
      The / characters may be uniformly replaced by any other single
      character within any given s command.

      The / character (or whatever other character is used in its
      stead) can appear in the regexp or replacement only if it is
      preceded by a \ character.
      "

This is the reason "\t" represents a regular "t" (not TAB)
*if* the substitute command's delimiter is "t" as well:

      $ printf t | sed 'st\ttt' | od -a -An
      [no output, as expected]

And similarly for other characters:

      printf x | sed 'sx\xxx' | od -a -An
      printf a | sed 'sa\aaa' | od -a -An
      printf z | sed 'sz\zzz' | od -a -An
      [no output, as expected]
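
As a small illustration of the manual excerpt above (a sketch that is not part of the original thread; the variable names are made up for the example), the same rule can be seen with a delimiter whose escaped form has no special meaning. The second command repeats the "t" case. The report shows GNU sed 4.7 printing an empty line for it; other sed implementations may behave differently.

```shell
# Delimiter-escape rule, shown with ',' as the s-command delimiter:
# '\,' inside the regexp stands for a literal comma.
comma_result=$(printf 'a,b\n' | sed 's,\,,;,')
printf '%s\n' "$comma_result"

# Same rule with 't' as the delimiter: per the thread, GNU sed 4.7 reads
# '\t' here as a literal 't' (not a TAB) and deletes it.
t_result=$(printf 't\n' | sed 'st\ttt')
printf '[%s]\n' "$t_result"
```

The brackets around the second result only make an empty string visible.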

---

Second,
The "\n" case behaves differently, regardless of which
separator is used. It is always treated as "\n" (new line),
never literal "n", even if the separator is "n":

These are correct, as expected:
    $ printf n | sed 's/\n//' | od -a -An
       n
    $ printf n | sed 'sx\nxx' | od -a -An
       n

Here, we'd expect "\n" to be treated as a literal "n" character,
not "\n", but it is not (as you've found):

    $ printf n | sed 'sn\nnn' | od -a -An
       n
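
To check which of the two readings an installed sed uses, a probe along the following lines could be run. This is only a sketch: `probe_n_delimiter` and the labels it prints are hypothetical names chosen for the example, not anything provided by sed.

```shell
# Hypothetical helper (not part of sed): report how the installed sed
# reads '\n' when 'n' is also the s-command delimiter.
probe_n_delimiter() {
  out=$(printf 'n\n' | sed 'sn\nnn' 2>/dev/null) || { echo "error"; return; }
  case $out in
    '') echo "literal-n" ;;  # the 'n' was matched and deleted
    n)  echo "newline" ;;    # '\n' meant newline, so nothing matched
    *)  echo "other" ;;
  esac
}

n_behavior=$(probe_n_delimiter)
printf '%s\n' "$n_behavior"
```

With the GNU sed 4.7 output quoted in this report, the probe would take the "newline" branch.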

----

In the code, the "match_slash" function [1] is used to find
the delimiters of the "s" command (typically "slashes").
Special handling happens if a slash is found [2],
and in lines 557-558 there's this conditional:

              else if (ch == 'n' && regex)
                ch = '\n';

Which forces any "\n" to be a new-line, regardless of whether the
delimiter itself was an "n".

[1] https://git.savannah.gnu.org/cgit/sed.git/tree/sed/compile.c#n531
[2] https://git.savannah.gnu.org/cgit/sed.git/tree/sed/compile.c#n552

In older sed versions, these two lines were protected by
"#ifndef REG_PERL" [3] so perhaps it had something to do with regex 
variants. But the origin of this line predates the git history.
Jim/Paolo - any ideas what this relates to?

[3] https://git.savannah.gnu.org/cgit/sed.git/tree/sed/compile.c?id=41a169a9a14b5bdc736313eb411f02bcbe1c046d#n551

---

Interestingly, removing these two lines does not cause
any test failures, so this might be easy to fix without causing
any regressions.


For now I'm leaving this item open until we decide how to deal with it.

regards,
 - assaf








Merged 40239 40242. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 31 Mar 2020 04:48:02 GMT) Full text and rfc822 format available.

Added tag(s) confirmed. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 31 Mar 2020 04:48:03 GMT) Full text and rfc822 format available.

Information forwarded to bug-sed <at> gnu.org:
bug#40242; Package sed. (Tue, 31 Mar 2020 07:37:02 GMT) Full text and rfc822 format available.

Message #15 received at 40242 <at> debbugs.gnu.org (full text, mbox):

From: Oğuz <oguzismailuysal <at> gmail.com>
To: Assaf Gordon <assafgordon <at> gmail.com>
Cc: "40242 <at> debbugs.gnu.org" <40242 <at> debbugs.gnu.org>
Subject: Re: bug#40242: n as delimiter alias
Date: Tue, 31 Mar 2020 10:00:02 +0300
[Message part 1 (text/plain, inline)]
Thanks for the reply. This might not be a bug, though. I sent a similar mail
(https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05881.html)
to the Austin Group mailing list asking what the expected behavior is in this
case, and I was told (
https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05891.html)
that both behaviors (yielding n or an empty line) are correct and that the
standard should *probably* be amended to state explicitly that this is
unspecified. Apparently (
https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05893.html)
some other UNIXes adopted the same practice as GNU sed (or vice versa; I
don't know which one is older).

Regards

-- 
Oğuz

Information forwarded to bug-sed <at> gnu.org:
bug#40242; Package sed. (Tue, 31 Mar 2020 13:27:01 GMT) Full text and rfc822 format available.

Message #18 received at 40242 <at> debbugs.gnu.org (full text, mbox):

From: Eric Blake <eblake <at> redhat.com>
To: Oğuz <oguzismailuysal <at> gmail.com>,
 Assaf Gordon <assafgordon <at> gmail.com>
Cc: "40242 <at> debbugs.gnu.org" <40242 <at> debbugs.gnu.org>
Subject: Re: bug#40242: n as delimiter alias
Date: Tue, 31 Mar 2020 08:26:01 -0500
On 3/31/20 2:00 AM, Oğuz wrote:
> Thanks for the reply. This might not be a bug though; I sent a similar mail
> (https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05881.html)
> to Austin Group mailing list asking what's the expected behavior in this
> case, and I was told (
> https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05891.html)
> both behaviors -yielding n or empty line- are correct and standard should
> *probably* be amended to explicitly state that this is unspecified. And
> apparently (
> https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05893.html)
> some other UNIXes adopted the same practice as GNU sed (or vice versa, I
> don't know which one is older).

The POSIX folks will probably declare that use of a \X sequence (for 
arbitrary X; 'n', 't', '1', and probably others all fit this category) 
inside a regex delimited by X is unspecified behavior.  But that still 
doesn't stop us from fixing GNU sed to at least be consistent - we 
should either blindly declare that \X represents the special meaning of 
X when such a meaning is present regardless of X also being the regex 
delimiter (our current \n behavior - no way to represent the delimiter 
as a literal match), or that use of X as a delimiter renders the special 
meaning of \X useless for that regex (our \t behavior - no way to 
represent the special behavior as part of the match).  My personal 
preference is making things consistent to our \t behavior.
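
Whichever policy is chosen, a script can sidestep the ambiguity by picking a delimiter that does not occur in the regex. The following is a sketch added for illustration, with made-up variable names; it uses only the standard "N" and "s" commands.

```shell
# Match a literal 'n': with any delimiter other than 'n', no escape is needed.
literal_result=$(printf 'banana\n' | sed 's/n/N/g')
printf '%s\n' "$literal_result"

# Match a newline: N appends the next input line to the pattern space,
# so the pattern space holds "a<newline>b" and '\n' can match inside it.
newline_result=$(printf 'a\nb\n' | sed 'N;s/\n/ /')
printf '%s\n' "$newline_result"
```

The first command prints baNaNa and the second prints "a b". Neither uses "n" as the delimiter, so the result does not depend on how \n is read in that case.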

>> In the code, the "match_slash" function [1] is used to find
>> the delimiters of the "s" command (typically "slashes").
>> Special handling happens if a slash is found [2],
>> And in lines 557-8 there's this conditional:
>>
>>                else if (ch == 'n' && regex)
>>                  ch = '\n';
>>
>> Which forces any "\n" to be a new-line, regardless if the
>> delimiter itself was an "n".
>>

>> Interestingly, removing these two lines does not cause
>> any test failures, so this might be easy to fix without causing
>> any regressions.
>>
>>
>> For now I'm leaving this item open until we decide how to deal with it.

I'm thus in favor of removing that special-case of 'n'.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org





Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Mon, 24 Oct 2022 06:26:01 GMT) Full text and rfc822 format available.

Notification sent to Oğuz <oguzismailuysal <at> gmail.com>:
bug acknowledged by developer. (Mon, 24 Oct 2022 06:26:02 GMT) Full text and rfc822 format available.

Message #23 received at 40242-done <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Eric Blake <eblake <at> redhat.com>
Cc: 40242-done <at> debbugs.gnu.org, Assaf Gordon <assafgordon <at> gmail.com>,
 Oğuz <oguzismailuysal <at> gmail.com>
Subject: Re: bug#40242: n as delimiter alias
Date: Sun, 23 Oct 2022 23:25:05 -0700
[Message part 1 (text/plain, inline)]
On Tue, Mar 31, 2020 at 6:36 AM Eric Blake <eblake <at> redhat.com> wrote:
> On 3/31/20 2:00 AM, Oğuz wrote:
> > Thanks for the reply. This might not be a bug though; I sent a similar mail
> > (https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05881.html)
> > to Austin Group mailing list asking what's the expected behavior in this
> > case, and I was told (
> > https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05891.html)
> > both behaviors -yielding n or empty line- are correct and standard should
> > *probably* be amended to explicitly state that this is unspecified. And
> > apparently (
> > https://www.mail-archive.com/austin-group-l <at> opengroup.org/msg05893.html)
> > some other UNIXes adopted the same practice as GNU sed (or vice versa, I
> > don't know which one is older).
>
> The POSIX folks will probably declare that use of a \X sequence (for
> arbitrary X; 'n', 't', '1', and probably others all fit this category)
> inside a regex delimited by X is unspecified behavior.  But that still
> doesn't stop us from fixing GNU sed to at least be consistent - we
> should either blindly declare that \X represents the special meaning of
> X when such a meaning is present regardless of X also being the regex
> delimiter (our current \n behavior - no way to represent the delimiter
> as a literal match), or that use of X as a delimiter renders the special
> meaning of \X useless for that regex (our \t behavior - no way to
> represent the special behavior as part of the match).  My personal
> preference is making things consistent to our \t behavior.
>
> >> In the code, the "match_slash" function [1] is used to find
> >> the delimiters of the "s" command (typically "slashes").
> >> Special handling happens if a slash is found [2],
> >> And in lines 557-8 there's this conditional:
> >>
> >>                else if (ch == 'n' && regex)
> >>                  ch = '\n';
> >>
> >> Which forces any "\n" to be a new-line, regardless if the
> >> delimiter itself was an "n".
> >>
>
> >> Interestingly, removing these two lines does not cause
> >> any test failures, so this might be easy to fix without causing
> >> any regressions.
> >>
> >>
> >> For now I'm leaving this item open until we decide how to deal with it.
>
> I'm thus in favor of removing that special-case of 'n'.

Thank you all. Sorry it's taken so long.
I expect to push the following tomorrow.
[sed-tweak.diff (application/octet-stream, attachment)]

Reply sent to Jim Meyering <jim <at> meyering.net>:
You have taken responsibility. (Mon, 24 Oct 2022 06:26:02 GMT) Full text and rfc822 format available.

Notification sent to Enrico Maria De Angelis <enricomaria.dean6elis <at> gmail.com>:
bug acknowledged by developer. (Mon, 24 Oct 2022 06:26:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Mon, 21 Nov 2022 12:24:09 GMT) Full text and rfc822 format available.

This bug report was last modified 1 year and 154 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.