GNU bug report logs - #46048
split -n K/N loses data, sum of output files is smaller than input file.
Reported by: Paul Hirst <contact <at> phirst.org>
Date: Sat, 23 Jan 2021 08:26:02 UTC
Severity: normal
Done: Pádraig Brady <P <at> draigBrady.com>
Bug is archived. No further changes may be made.
Report forwarded to bug-coreutils <at> gnu.org: bug#46048; Package coreutils. (Sat, 23 Jan 2021 08:26:02 GMT)
Acknowledgement sent to Paul Hirst <contact <at> phirst.org>: New bug report received and forwarded. Copy sent to bug-coreutils <at> gnu.org. (Sat, 23 Jan 2021 08:26:02 GMT)
Message #5 received at submit <at> debbugs.gnu.org:
split --number K/N appears to lose data, with the sum of the sizes of
the output files being smaller than the original input file by 131072 bytes.
$ split --version
split (GNU coreutils) 8.30
...
$ head -c 1000000 < /dev/urandom > test.dat
$ split --number=1/4 test.dat > t1
$ split --number=2/4 test.dat > t2
$ split --number=3/4 test.dat > t3
$ split --number=4/4 test.dat > t4
$ ls -l
-rw-r--r-- 1 user user 250000 Jan 22 18:36 t1
-rw-r--r-- 1 user user 250000 Jan 22 18:36 t2
-rw-r--r-- 1 user user 250000 Jan 22 18:36 t3
-rw-r--r-- 1 user user 118928 Jan 22 18:36 t4
-rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat
Surely this should not be the case?
Paul
Information forwarded to bug-coreutils <at> gnu.org: bug#46048; Package coreutils. (Sun, 24 Jan 2021 16:54:01 GMT)
Message #8 received at 46048 <at> debbugs.gnu.org:
On 23/01/2021 04:58, Paul Hirst wrote:
> split --number K/N appears to lose data, with the sum of the sizes of
> the output files being smaller than the original input file by 131072 bytes.
>
> $ split --version
> split (GNU coreutils) 8.30
> ...
>
> $ head -c 1000000 < /dev/urandom > test.dat
> $ split --number=1/4 test.dat > t1
> $ split --number=2/4 test.dat > t2
> $ split --number=3/4 test.dat > t3
> $ split --number=4/4 test.dat > t4
>
> $ ls -l
> -rw-r--r-- 1 user user 250000 Jan 22 18:36 t1
> -rw-r--r-- 1 user user 250000 Jan 22 18:36 t2
> -rw-r--r-- 1 user user 250000 Jan 22 18:36 t3
> -rw-r--r-- 1 user user 118928 Jan 22 18:36 t4
> -rw-r--r-- 1 user user 1000000 Jan 22 18:33 test.dat
>
> Surely this should not be the case?
Ugh. This functionality was broken for all files > 128KiB
due to adjustments for handling /dev/zero.
$ truncate -s 1000000 test.dat
$ split --number=4/4 test.dat | wc -c
118928
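The 118928-byte figure is consistent with the seek landing one initial read (128 KiB = 131072 bytes) past the intended start of the chunk. A minimal sketch of that arithmetic follows. It assumes that chunk K of N starts at (K - 1) * (size / N) and that 131072 bytes had already been consumed before the seek; chunk_start, bytes_from and INITIAL_READ are illustrative names, not identifiers taken from split.c.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size of the read performed before seeking: 128 KiB, chosen
   to match the 131072-byte shortfall in the report.  */
enum { INITIAL_READ = 128 * 1024 };

/* Byte offset at which chunk K of N starts in a FILE_SIZE-byte file
   (assumed layout: equal chunks of FILE_SIZE / N bytes).  */
static uintmax_t
chunk_start (uintmax_t k, uintmax_t n, uintmax_t file_size)
{
  assert (1 <= k && k <= n);
  return (k - 1) * (file_size / n);
}

/* Bytes available when reading from offset POS to end of file.  */
static uintmax_t
bytes_from (uintmax_t pos, uintmax_t file_size)
{
  return pos < file_size ? file_size - pos : 0;
}
```

For the 1000000-byte file, chunk 4 should start at 750000. Seeking a further 131072 bytes lands at 881072, which leaves 1000000 - 881072 = 118928 bytes, matching the output above. Under the same assumption, chunks 2 and 3 would have the expected size but would likely contain data shifted by 131072 bytes.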
The following patch fixes it here.
I need to do some more testing, before committing.
thanks!
diff --git a/src/split.c b/src/split.c
index 0660da13f..6aa8d50e9 100644
--- a/src/split.c
+++ b/src/split.c
@@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
     }
   else
     {
-      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
+      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
         die (EXIT_FAILURE, errno, "%s", quotef (infile));
       initial_read = SIZE_MAX;
     }
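The effect of the whence argument can be seen in isolation. The sketch below is not code from split.c: it reads some bytes from a scratch file (standing in for the initial read), then seeks by the same amount with either SEEK_CUR or SEEK_SET and reports where the descriptor ends up. The helper names offset_after_read_and_seek and make_test_file are hypothetical, and a temporary file is used in place of stdin.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Rewind FD, read PRE bytes from it, then seek by START using WHENCE.
   Return the resulting absolute offset, or -1 on failure.
   FD must be seekable and hold at least PRE bytes.  */
static off_t
offset_after_read_and_seek (int fd, size_t pre, off_t start, int whence)
{
  char buf[4096];
  assert (pre <= sizeof buf);
  if (lseek (fd, 0, SEEK_SET) < 0)
    return -1;
  if (read (fd, buf, pre) != (ssize_t) pre)
    return -1;
  return lseek (fd, start, whence);
}

/* Create an anonymous temporary file of SIZE zero bytes and return a
   descriptor for it, or -1 on failure.  */
static int
make_test_file (off_t size)
{
  FILE *fp = tmpfile ();
  if (!fp)
    return -1;
  int fd = dup (fileno (fp));   /* stays open after fclose */
  fclose (fp);
  if (fd < 0 || ftruncate (fd, size) < 0)
    return -1;
  return fd;
}
```

After a 1000-byte read, seeking by 5000 with SEEK_CUR ends at offset 6000, while SEEK_SET ends at 5000 regardless of how much was read first.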
Information forwarded to bug-coreutils <at> gnu.org: bug#46048; Package coreutils. (Sun, 24 Jan 2021 16:59:01 GMT)
Message #11 received at submit <at> debbugs.gnu.org:
On 24/01/2021 16:52, Pádraig Brady wrote:
> diff --git a/src/split.c b/src/split.c
> index 0660da13f..6aa8d50e9 100644
> --- a/src/split.c
> +++ b/src/split.c
> @@ -1001,7 +1001,7 @@ bytes_chunk_extract (uintmax_t k, uintmax_t n, char *buf, size_t bufsize,
>      }
>    else
>      {
> -      if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> +      if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>          die (EXIT_FAILURE, errno, "%s", quotef (infile));
>        initial_read = SIZE_MAX;
>      }
The same adjustment is needed in lines_chunk_split().
I'll add a test also.
cheers,
Pádraig
Information forwarded to bug-coreutils <at> gnu.org: bug#46048; Package coreutils. (Sun, 24 Jan 2021 19:56:01 GMT)
Message #14 received at 46048 <at> debbugs.gnu.org:
On 1/24/21 8:52 AM, Pádraig Brady wrote:
> - if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
> + if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
Dumb question: will this handle the case where you're splitting from
stdin and stdin is a seekable file and its initial file offset is nonzero?
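To make the concern concrete: if stdin starts at a nonzero offset, a plain SEEK_SET to start ignores that offset. Two possible ways around it are sketched below, under assumptions and not necessarily matching what was eventually committed: seek to an absolute position computed from the offset recorded before any read, or seek forward by only the part of start not already consumed. All function names here are hypothetical, and a scratch file stands in for stdin.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Option A: seek to an absolute position computed from BASE, the
   offset FD had before anything was read from it.  */
static off_t
seek_from_base (int fd, off_t base, off_t start)
{
  return lseek (fd, base + start, SEEK_SET);
}

/* Option B: FD has already advanced CONSUMED bytes past its starting
   offset, so move forward only by the remainder.  */
static off_t
seek_past_consumed (int fd, off_t consumed, off_t start)
{
  assert (consumed <= start);
  return lseek (fd, start - consumed, SEEK_CUR);
}

/* Illustration helper: anonymous temporary file of SIZE zero bytes.
   Returns a descriptor, or -1 on failure.  */
static int
open_scratch (off_t size)
{
  FILE *fp = tmpfile ();
  if (!fp)
    return -1;
  int fd = dup (fileno (fp));   /* stays open after fclose */
  fclose (fp);
  if (fd < 0 || ftruncate (fd, size) < 0)
    return -1;
  return fd;
}
```

With a descriptor that starts at offset 2000 and has had 1000 bytes read from it, both options place a start of 5000 at absolute offset 7000, whereas an unadjusted SEEK_SET would land at 5000.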
Reply sent to Pádraig Brady <P <at> draigBrady.com>: You have taken responsibility. (Mon, 25 Jan 2021 14:22:02 GMT)
Notification sent to Paul Hirst <contact <at> phirst.org>: bug acknowledged by developer. (Mon, 25 Jan 2021 14:22:02 GMT)
Message #19 received at 46048-done <at> debbugs.gnu.org:
On 24/01/2021 19:55, Paul Eggert wrote:
> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>> - if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>> + if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>
> Dumb question: will this handle the case where you're splitting from
> stdin and stdin is a seekable file and its initial file offset is nonzero?
Right. Following on the logic from input_file_size(),
I'm going with the attached, which I'll push later.
Marking this as done.
thanks,
Pádraig
[split-k_of_n.patch (text/x-patch, attachment)]
Information forwarded to bug-coreutils <at> gnu.org: bug#46048; Package coreutils. (Mon, 08 Feb 2021 13:55:01 GMT)
Message #22 received at 46048 <at> debbugs.gnu.org:
On 25/01/2021 14:21, Pádraig Brady wrote:
> On 24/01/2021 19:55, Paul Eggert wrote:
>> On 1/24/21 8:52 AM, Pádraig Brady wrote:
>>> - if (lseek (STDIN_FILENO, start, SEEK_CUR) < 0)
>>> + if (lseek (STDIN_FILENO, start, SEEK_SET) < 0)
>>
>> Dumb question: will this handle the case where you're splitting from
>> stdin and stdin is a seekable file and its initial file offset is nonzero?
>
> Right. Following on the logic from input_file_size(),
> I'm going with the attached, which I'll push later.
> Marking this as done.
Note this fix has now propagated to Fedora builds,
and is in the process of propagating to RHEL/CentOS.
I've just logged a Debian bug also:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=982300
cheers,
Pádraig
bug archived. Request was from Debbugs Internal Request <help-debbugs <at> gnu.org> to internal_control <at> debbugs.gnu.org. (Tue, 09 Mar 2021 12:24:06 GMT)
This bug report was last modified 3 years and 20 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.