GNU bug report logs - #34316
sed misbehavior on BRE's


Package: sed;

Reported by: "Lange, Markus" <M.Lange <at> dnb.de>

Date: Mon, 4 Feb 2019 15:35:06 UTC

Severity: normal

Tags: moreinfo, notabug

Done: Assaf Gordon <assafgordon <at> gmail.com>

Bug is archived. No further changes may be made.

To add a comment to this bug, you must first unarchive it, by sending
a message to control AT debbugs.gnu.org, with unarchive 34316 in the body.
You can then email your comments to 34316 AT debbugs.gnu.org in the normal way.



Report forwarded to bug-sed <at> gnu.org:
bug#34316; Package sed. (Mon, 04 Feb 2019 15:35:06 GMT) Full text and rfc822 format available.

Acknowledgement sent to "Lange, Markus" <M.Lange <at> dnb.de>:
New bug report received and forwarded. Copy sent to bug-sed <at> gnu.org. (Mon, 04 Feb 2019 15:35:07 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Lange, Markus" <M.Lange <at> dnb.de>
To: "bug-sed <at> gnu.org" <bug-sed <at> gnu.org>
Subject: sed misbehavior on BRE's
Date: Mon, 4 Feb 2019 13:42:52 +0000
[Message part 1 (text/plain, inline)]
Hi,

I'm currently migrating processes from an old SuSE 9 Linux to a new
CentOS 7 Linux and observed some unexpected behavior changes in sed.

First, some information about the systems:

old:~ # cat /etc/SuSE-release 
SuSE Linux 9.0 (i586)
VERSION = 9.0
old:~ # uname -a
Linux biblix 2.4.21-303-smp4G #1 SMP Tue Dec 6 12:33:10 UTC 2005 i686
i686 i386 GNU/Linux
old:~ # sed --version
GNU sed version 4.0.6
...

new:~ # cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"
new:~ # uname -a
Linux userWS0.dnb.de 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14
21:49:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
new:~ # sed --version
sed (GNU sed) 4.2.2
...

Now let's see how the behavior has changed, which I think is a bug:

old:~ # sed -n 's/^.*004K...\([0-9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*$/\2 \1\3/p' Fehlerpica.dat
138742c156c1445f8bdc3a7845548c00 9783507435339020F a19.04.03
18290030a02544e6a451538b0e44f9e2 9783507435377020F a19.04.03
4c7ff6d790b34470852434f5ee41200b 9783034312189020F a12.12.11

while the new system does not output anything using this expression.

Removing the line end ($) from the expression solved the problem,
somehow:

old:~ # sed -n 's/^.*004K...\([0-9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*/\2 \1\3/p' Fehlerpica.dat
138742c156c1445f8bdc3a7845548c00 9783507435339020F a19.04.03
18290030a02544e6a451538b0e44f9e2 9783507435377020F a19.04.03
4c7ff6d790b34470852434f5ee41200b 9783034312189020F a12.12.11

new:~ # sed -n 's/^.*004K...\([0-9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*/\2 \1\3/p' Fehlerpica.dat
138742c156c1445f8bdc3a7845548c00 9783507435339020F a19.04.03�208@ a30-01-19bc
18290030a02544e6a451538b0e44f9e2 9783507435377020F a19.04.03�208@ a30-01-19bc
4c7ff6d790b34470852434f5ee41200b 9783034312189020F a12.12.11�208@ a30-01-19bc

To me this seems to be the first unexpected behavior. The second,
which I think is closely related, is that the first match group gets
text from the end of the line attached. Maybe the first match group
consumes the line end?

So I started breaking the expression down, using only the first match
group:

old:~ # sed -n 's/^.*004K...\([0-9xX]\{13\}\).*$/\1/p' Fehlerpica.dat 
9783507435339
9783507435377
9783034312189

The new system still doesn't output anything. Leaving the line end out
of the expression does produce output on the new system:

old:~ # sed -n 's/^.*004K...\([0-9xX]\{13\}\).*/\1/p' Fehlerpica.dat 
9783507435339
9783507435377
9783034312189
new:~ # sed -n 's/^.*004K...\([0-9xX]\{13\}\).*/\1/p' Fehlerpica.dat  
9783507435339�208@ a30-01-19bc
9783507435377�208@ a30-01-19bc
9783034312189�208@ a30-01-19bc

However the output differs and is wrong on the new system. The line end
is still appended to the match group.

If I try using only the second match group, the string is appended
there:
old:~ # sed -n 's/^.*006V...\(.\{1,32\}\).*/\1/p' Fehlerpica.dat
138742c156c1445f8bdc3a7845548c00
18290030a02544e6a451538b0e44f9e2
4c7ff6d790b34470852434f5ee41200b
new:~ # sed -n 's/^.*006V...\(.\{1,32\}\).*/\1/p' Fehlerpica.dat 
138742c156c1445f8bdc3a7845548c00�208@ a30-01-19bc
18290030a02544e6a451538b0e44f9e2�208@ a30-01-19bc
4c7ff6d790b34470852434f5ee41200b�208@ a30-01-19bc

So it seems like the first match group consumes far too much text in a
non-linear way, breaking the match of the line end.

I've attached the Fehlerpica.dat for you and hope you can reproduce the
misbehavior.

If I can provide further information please let me know.

Thank you and best regards,
Markus Lange
-- 
***Lesen. Hören. Wissen. Deutsche Nationalbibliothek***

Deutsche Nationalbibliothek               
Fachbereich IT, Informationsinfrastruktur
Adickesallee 1
60322 Frankfurt am Main
Tel: +49 69 1525 -1786
mailto:m.lange <at> dnb.de
http://www.dnb.de
[Fehlerpica.dat (application/octet-stream, attachment)]

Information forwarded to bug-sed <at> gnu.org:
bug#34316; Package sed. (Tue, 05 Feb 2019 23:14:01 GMT) Full text and rfc822 format available.

Message #8 received at 34316 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: "Lange, Markus" <M.Lange <at> dnb.de>, 34316 <at> debbugs.gnu.org
Subject: Re: bug#34316: sed misbehavior on BRE's
Date: Tue, 5 Feb 2019 16:12:58 -0700
tags 34316 moreinfo
stop

Hello,

On 2019-02-04 6:42 a.m., Lange, Markus wrote:
> I'm currently migrating processes from an old SuSE 9 Linux to an new
> CentOS 7 Linux and observed some unexpected behavior changes on sed.
[...]
> old:~ # sed --version
> GNU sed version 4.0.6
[...]
> new:~ # sed --version
> sed (GNU sed) 4.2.2

Please note that sed 4.2.2 is also very old (7 years old).
The latest sed is version 4.7, released in December 2018.

There's a limited amount of support we can offer for sed-4.2.2.


Before digging further, I notice that the file you're dealing with
has non-ASCII characters in it, as is evident from some of the example
text you pasted (and also from the attached file):

> 9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*/\2 \1\3/p'
> Fehlerpica.dat
> 138742c156c1445f8bdc3a7845548c00 9783507435339020F a19.04.03�208@
> a30-01-19bc
> 18290030a02544e6a451538b0e44f9e2 9783507435377020F a19.04.03�208@
> a30-01-19bc
> 4c7ff6d790b34470852434f5ee41200b 9783034312189020F a12.12.11�208@
> a30-01-19bc

And such characters can cause unexpected results, depending on the
active locale.

Can you please re-run the tests on the new machine with the same
locale as the old machine, and again with LC_ALL=C (forcing C/POSIX
locale), to ensure that locale and invalid characters are not the
problem?
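
The comparison requested above can be sketched roughly as follows. This is only an illustration, assuming a POSIX shell: the sample line is synthetic stand-in data (a `004K` field, three filler characters, 13 digits, then one raw 0xFF byte), not a record from Fehlerpica.dat, and the variable names are made up for the example. It reuses the single-group expression from the report.

```shell
#!/bin/sh
# Synthetic stand-in for one record (NOT real data from Fehlerpica.dat):
# the raw 0xFF byte can never occur in valid UTF-8.
sample=$(mktemp)
printf 'x004Kabc9783507435339\377tail\n' > "$sample"

# Show the locale variables the current shell would hand to sed.
echo "LANG=${LANG:-} LC_ALL=${LC_ALL:-} LC_CTYPE=${LC_CTYPE:-}"

# Run 1: force the C/POSIX locale, where every byte counts as a character.
c_result=$(LC_ALL=C sed -n 's/^.*004K...\([0-9xX]\{13\}\).*$/\1/p' "$sample")
echo "C locale:           ${c_result:-<no output>}"

# Run 2: whatever locale the environment currently selects.
env_result=$(sed -n 's/^.*004K...\([0-9xX]\{13\}\).*$/\1/p' "$sample")
echo "environment locale: ${env_result:-<no output>}"

rm -f "$sample"
```

If the two runs disagree, the difference comes from the locale rather than from the sed version.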

Also, even if you're 'stuck' with sed-4.2.2, can you try with
sed-4.7 (perhaps compiled from source code), to see if this is an
existing problem, or perhaps it was resolved in the meantime?


regards,
 - assaf






Added tag(s) moreinfo. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Tue, 05 Feb 2019 23:14:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-sed <at> gnu.org:
bug#34316; Package sed. (Wed, 06 Feb 2019 15:08:04 GMT) Full text and rfc822 format available.

Message #13 received at 34316 <at> debbugs.gnu.org (full text, mbox):

From: "Lange, Markus" <M.Lange <at> dnb.de>
To: "34316 <at> debbugs.gnu.org" <34316 <at> debbugs.gnu.org>, "assafgordon <at> gmail.com"
 <assafgordon <at> gmail.com>
Subject: Re: bug#34316: sed misbehavior on BRE's
Date: Wed, 6 Feb 2019 07:54:59 +0000
Hi,

thanks for your response.
Even if the sed version is quite old, it is probably very uncommon to
deal with pica-formatted files (a format common in library-oriented
environments), so it could be a problem not seen elsewhere.

Running the original command
( sed -n 's/^.*004K...\([0-9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*$/\2 \1\3/p' )
with LC_ALL=C on the new system resolves the problem.

On the old system LC_* and LANG are not set at all (which should default
to C, if I'm not wrong); on the new machine LANG and LC_CTYPE are set to
en_US.UTF-8, which I wrongly assumed to behave like C.

I'm going to check if the behavior still exists in the current sed
version soon and call back afterwards.

Thank you for your help and best regards
Markus Lange


Information forwarded to bug-sed <at> gnu.org:
bug#34316; Package sed. (Mon, 11 Feb 2019 15:55:02 GMT) Full text and rfc822 format available.

Message #16 received at 34316 <at> debbugs.gnu.org (full text, mbox):

From: "Lange, Markus" <M.Lange <at> dnb.de>
To: "34316 <at> debbugs.gnu.org" <34316 <at> debbugs.gnu.org>
Subject: Re: bug#34316: sed misbehavior on BRE's
Date: Mon, 11 Feb 2019 07:45:33 +0000
Hi,

As promised, I've tested using sed 4.7 on an Arch Linux system.

# sed --version
sed (GNU sed) 4.7
...

Using LANG=C (LC_* unset) works as expected:
# LANG=C sed -n 's/^.*004K...\([0-9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*$/\2 \1\3/p' Fehlerpica.dat
138742c156c1445f8bdc3a7845548c00 9783507435339020F a19.04.03
18290030a02544e6a451538b0e44f9e2 9783507435377020F a19.04.03
4c7ff6d790b34470852434f5ee41200b 9783034312189020F a12.12.11

Using LANG=en_us.utf8 produces no results.

Best regards,
Markus Lange



Information forwarded to bug-sed <at> gnu.org:
bug#34316; Package sed. (Wed, 13 Feb 2019 23:16:02 GMT) Full text and rfc822 format available.

Message #19 received at 34316 <at> debbugs.gnu.org (full text, mbox):

From: Assaf Gordon <assafgordon <at> gmail.com>
To: "Lange, Markus" <M.Lange <at> dnb.de>,
 "34316 <at> debbugs.gnu.org" <34316 <at> debbugs.gnu.org>
Subject: Re: bug#34316: sed misbehavior on BRE's
Date: Wed, 13 Feb 2019 16:15:08 -0700
tags 34316 notabug
close 34316
stop

Hello,

On 2019-02-11 12:45 a.m., Lange, Markus wrote:
> # sed --version
> sed (GNU sed) 4.7
> ...
> 
> Using LANG=C (LC_* unset) works as expected:
> # LANG=C sed -n 's/^.*004K...\([0-9xX]\{13\}\).*006V...\(.\{1,32\}\).*\(.020F.*\)021A.*$/\2 \1\3/p' Fehlerpica.dat
> 138742c156c1445f8bdc3a7845548c00 9783507435339020F a19.04.03
> 18290030a02544e6a451538b0e44f9e2 9783507435377020F a19.04.03
> 4c7ff6d790b34470852434f5ee41200b 9783034312189020F a12.12.11
> 
> Using LANG=en_us.utf8 produces no results.
> 

Thanks for confirming.

Since the file contains binary bytes, they would not match as valid
characters under a UTF-8 locale.

Using LC_ALL=C is indeed the solution.
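
The effect can be reproduced without the original file. The sketch below assumes GNU sed and uses a synthetic three-byte line ('A', one raw 0xFF byte, 'B'); the file and variable names are invented for the example. In a UTF-8 locale, GNU sed does not treat the 0xFF byte as a character, so `.` and `.*` stop in front of it: an expression anchored with `$` then fails to match, and an unanchored one leaves the rest of the line untouched, which is consistent with the `�208@ a30-01-19bc` tail in the report. The UTF-8 run needs the en_US.UTF-8 locale to be installed; where it is missing, sed typically falls back to C and both runs look alike.

```shell
#!/bin/sh
# Synthetic line: 'A', one raw 0xFF byte (invalid in UTF-8), 'B'.
line_file=$(mktemp)
printf 'A\377B\n' > "$line_file"

# Count bytes outside 7-bit ASCII; a non-zero count means the file needs
# care before it is processed in a multibyte locale.
non_ascii=$(LC_ALL=C tr -d '\000-\177' < "$line_file" | wc -c | tr -d ' ')
echo "non-ASCII bytes: $non_ascii"

# C locale: '.' matches any single byte, so both forms cover the whole line.
c_anchored=$(LC_ALL=C sed -n 's/^A.B$/matched/p' "$line_file")
c_open=$(LC_ALL=C sed 's/^A.*/X/' "$line_file")
echo "C anchored:   ${c_anchored:-<no output>}"
echo "C unanchored: $c_open"

# UTF-8 locale (assumed to be installed): the invalid byte is not a
# character, so the anchored form should print nothing and the unanchored
# form should keep the bytes after 'A'. od makes the raw bytes visible.
utf8_anchored=$(LC_ALL=en_US.UTF-8 sed -n 's/^A.B$/matched/p' "$line_file" 2>/dev/null)
echo "UTF-8 anchored:   ${utf8_anchored:-<no output>}"
echo "UTF-8 unanchored (bytes):"
LC_ALL=en_US.UTF-8 sed 's/^A.*/X/' "$line_file" 2>/dev/null | od -c

rm -f "$line_file"
```

Running the byte count on a data file first gives a quick hint about whether LC_ALL=C is needed for byte-oriented processing.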

As such, I'm closing this as "not a bug" but discussion can
continue by replying to this thread.

-assaf






Added tag(s) notabug. Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 13 Feb 2019 23:16:02 GMT) Full text and rfc822 format available.

bug closed, send any further explanations to 34316 <at> debbugs.gnu.org and "Lange, Markus" <M.Lange <at> dnb.de> Request was from Assaf Gordon <assafgordon <at> gmail.com> to control <at> debbugs.gnu.org. (Wed, 13 Feb 2019 23:16:02 GMT) Full text and rfc822 format available.

bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Thu, 14 Mar 2019 11:24:04 GMT) Full text and rfc822 format available.

This bug report was last modified 5 years and 45 days ago.



GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.