GNU bug report logs - #42764
csplit does not suppress the last match when not using {*}


Package: coreutils

Reported by: Emanuele Giacomelli <vpooldyn-linux <at> yahoo.it>

Date: Sat, 8 Aug 2020 14:52:02 UTC

Severity: normal

Done: Pádraig Brady <P <at> draigBrady.com>

Bug is archived. No further changes may be made.


Report forwarded to bug-coreutils <at> gnu.org:
bug#42764; Package coreutils. (Sat, 08 Aug 2020 14:52:03 GMT)

Acknowledgement sent to Emanuele Giacomelli <vpooldyn-linux <at> yahoo.it>:
New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 08 Aug 2020 14:52:03 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Emanuele Giacomelli <vpooldyn-linux <at> yahoo.it>
To: "bug-coreutils <at> gnu.org" <bug-coreutils <at> gnu.org>
Subject: csplit does not suppress the last match when not using {*}
Date: Sat, 8 Aug 2020 09:12:51 +0000 (UTC)
Good day,

I am experiencing an odd behaviour in csplit which may actually be a
bug.

I am testing this against the code cloned from
https://github.com/coreutils/coreutils.git, on the commit described by
git as v8.32-52-gc0e5f8c59.

Suppose I have the following YAML file:

==> test.yaml <==
value1: 123
---
value2: 456
---
value3: 789

and I want to split it at '---' lines. First I would try the following:

    csplit -z --suppress-matched test.yaml '/^---$/' '{1}'

which outputs:

    12
    12
    16

and creates the following files:

    ==> xx00 <==
    value1: 123

    ==> xx01 <==
    value2: 456

    ==> xx02 <==
    ---
    value3: 789

The last piece still contains a '---' line, even though the first '---'
was suppressed from the second piece.
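
For anyone wanting to reproduce this, the steps above can be collected into a short script. This is only a sketch: it assumes a GNU csplit that supports --suppress-matched, and the scratch directory created with mktemp is an addition for illustration, not part of the original report.

```shell
# Reproduction sketch for the '{1}' case described above.
# Assumption: csplit is GNU csplit with --suppress-matched support.
set -e

# Work in a scratch directory so the xx* output files do not clutter anything.
workdir=$(mktemp -d)
cd "$workdir"

# Recreate the sample input from the report.
printf '%s\n' 'value1: 123' '---' 'value2: 456' '---' 'value3: 789' > test.yaml

# Apply the pattern once, then repeat it one more time ('{1}').
csplit -z --suppress-matched test.yaml '/^---$/' '{1}'

# Print every piece with a "==> name <==" header, as in the listings above.
head xx*
```

Whether xx02 begins with a '---' line depends on the csplit build being tested; the listings in this report show what to compare against.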

Now, if I try again with:

    csplit -z --suppress-matched test.yaml '/^---$/' '{*}'

I get:

    12
    12
    12

and:

    ==> xx00 <==
    value1: 123

    ==> xx01 <==
    value2: 456

    ==> xx02 <==
    value3: 789

where the last part does not contain the matched line, as expected.

While trying to figure out the problem, I noticed that match suppression
is done at the beginning of process_regexp. For a match-twice scenario
like the first one, the function is called twice, then the rest of the
file is simply dumped by split_file.

This means that the two calls to process_regexp will:

* suppress nothing for call #1 because nothing has been matched yet;
* suppress the first match in call #2.

Then the rest of the file is dumped, but nothing has suppressed the
second match, so it appears in the last segment. When using asterisk
repetition, the file is instead dumped by process_regexp, which gets its
chance to suppress the matched line.
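
The control flow described above can be mimicked with a toy model in plain shell. This is not the csplit source: simulate is a hypothetical helper written for illustration. It takes the point at which suppression happens (start or end of each pattern call) and the number of times the pattern is applied, and labels each surviving line with the piece it would land in.

```shell
# Toy model of the behaviour described above; NOT the coreutils implementation.
# simulate MODE CALLS
#   MODE  "start": a matched line is dropped at the beginning of the NEXT call
#         "end":   a matched line is dropped at the end of the SAME call
#   CALLS how many times the pattern is applied (2 corresponds to '/^---$/' '{1}')
simulate() {
  mode=$1
  calls=$2
  done_calls=0
  printf '%s\n' 'value1: 123' '---' 'value2: 456' '---' 'value3: 789' |
  while IFS= read -r line; do
    if [ "$done_calls" -lt "$calls" ] && [ "$line" = '---' ]; then
      # This separator ends the current pattern call.
      done_calls=$((done_calls + 1))
      # "end" mode always drops it. "start" mode drops it only if another
      # call follows; otherwise it falls through into the remainder.
      if [ "$mode" = end ] || [ "$done_calls" -lt "$calls" ]; then
        continue
      fi
    fi
    # Label the line with the piece it belongs to.
    printf 'xx%02d: %s\n' "$done_calls" "$line"
  done
}

simulate start 2   # suppression at the start of each call
simulate end 2     # suppression at the end of each call
```

In this model, `simulate start 2` leaves the second '---' in the last piece, while `simulate end 2` drops it, mirroring the two outcomes discussed in this report.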

I came up with the attached patch, which simply moves match suppression
to the end of process_regexp. With this modification, the invocation:

    csplit -z --suppress-matched test.yaml '/^---$/' '{1}'

now produces:

    12
    12
    12

and:

    ==> xx00 <==
    value1: 123

    ==> xx01 <==
    value2: 456

    ==> xx02 <==
    value3: 789

which is what I would expect.

[patch.patch (text/x-patch, attachment)]

Reply sent to Pádraig Brady <P <at> draigBrady.com>:
You have taken responsibility. (Sat, 08 Aug 2020 20:58:02 GMT)

Notification sent to Emanuele Giacomelli <vpooldyn-linux <at> yahoo.it>:
bug acknowledged by developer. (Sat, 08 Aug 2020 20:58:02 GMT)

Message #10 received at 42764-done <at> debbugs.gnu.org:

From: Pádraig Brady <P <at> draigBrady.com>
To: Emanuele Giacomelli <vpooldyn-linux <at> yahoo.it>, 42764-done <at> debbugs.gnu.org
Subject: Re: bug#42764: csplit does not suppress the last match when not using {*}
Date: Sat, 8 Aug 2020 21:56:48 +0100

I agree with this analysis.
The usual manifestation would probably be
when there is only a single match,
i.e. when no repetition count is specified,
in which case the single match was not suppressed.
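
The single-match case mentioned above can be checked with a sketch like the following. It assumes a GNU csplit that supports --suppress-matched; the file name single.yaml and the scratch directory are made up for illustration.

```shell
# Sketch of the single-match case: one pattern, no repetition count.
# Assumption: csplit is GNU csplit with --suppress-matched support.
set -e

workdir=$(mktemp -d)
cd "$workdir"

# A file with exactly one separator line.
printf '%s\n' 'value1: 123' '---' 'value2: 456' > single.yaml

# Split at the only '---' line.
csplit --suppress-matched single.yaml '/^---$/'

# Show both pieces.
head xx*

# Report whether the separator leaked into the second piece.
if grep -qx -e '---' xx01; then
  echo "separator kept in xx01"
else
  echo "separator suppressed"
fi
```

The final message indicates which behaviour the installed csplit has, without assuming either one.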

I'll apply the attached in your name later today
(which also adds a test).

Marking this as done.

thanks!
Pádraig
[csplit--suppress-last.patch (text/x-patch, attachment)]

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Sun, 06 Sep 2020 11:24:07 GMT)