#69982 - Setting inodes to 0 leads to incorrect output when extracting with GNU cpio

GNU bug report logs - #69982
Setting inodes to 0 leads to incorrect output when extracting with GNU cpio

Package: guix

Reported by: Skyler Ferris <skyvine <at> protonmail.com>

Date: Sun, 24 Mar 2024 16:19:01 UTC

Severity: normal

To reply to this bug, email your comments to 69982 AT debbugs.gnu.org.

Report forwarded to bug-guix <at> gnu.org:
bug#69982; Package guix. (Sun, 24 Mar 2024 16:19:02 GMT)

Acknowledgement sent to Skyler Ferris <skyvine <at> protonmail.com>:
New bug report received and forwarded. Copy sent to bug-guix <at> gnu.org. (Sun, 24 Mar 2024 16:19:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Skyler Ferris <skyvine <at> protonmail.com>
To: bug-guix <at> gnu.org
Subject: Setting inodes to 0 leads to incorrect output when extracting with GNU cpio
Date: Sun, 24 Mar 2024 16:17:22 +0000

Hello,

I have encountered a bug caused by the interaction between `write-cpio-archive` from (gnu build linux-initrd), which writes all inodes as 0, and the way that GNU cpio processes file headers. I observed this bug while creating a custom initramfs where init is based on a bash script used by another distribution (but I will provide a minimal reproducer below). This bug only appears when the input directory contains more than one distinct set of hard-linked files.

This email contains a short set of reproduction steps, an explanation of what I understand the cause of the bug to be, some possible paths forward, and a disclaimer about my limitations due to my background.

To reproduce this bug, run the following commands:

```shell
$ mkdir /tmp/source
$ cd /tmp/source
$ echo contents1 > file1.txt
$ ln file1.txt link1.txt
$ echo contents2 > file2.txt
$ echo contents3 > file3.txt
$ ln file3.txt link3.txt
$ guix repl
> (use-modules (gnu build linux-initrd))
> ; disable compression so we don't waste time on it while debugging, it does not impact reproduction
> (write-cpio-archive "." "../archive.cpio" #:compress? #f)
> ,q
$ cd ..
$ mkdir out
$ cd out
$ cat ../archive.cpio | cpio -i
$ cat *
```

After running the final step you will see that file1.txt, link1.txt, file3.txt, and link3.txt all have the contents "contents1": the files which should contain "contents3" have been created incorrectly.

Now I will list the steps the relevant programs performed which caused this error, followed by a more verbose explanation with references to source code:

1. Guix creates the archive with the inode and major & minor device numbers set to 0. The number of hard links is reported accurately.
2. CPIO reads the archive and hard links files when the header indicates that there are multiple links. It uses the inode and major & minor device numbers to find the correct file to hard link to.
3. As file3.txt and link3.txt both have multiple links and share their inode and major & minor device numbers with file1.txt, they are all linked to file1.txt.

This error occurs when the cpio utility processes files with hard links. In `copyin_regular_file`, there is a code block which only runs if the file has multiple hard links and the newascii (or checksummed newascii) format is in use (1). Within that code block there is a conditional to check if the file size is 0, with a comment explaining that the newascii format only records the data for the final file pointing to the relevant inode rather than repeating the data each time. The code in guix/cpio.scm does not do this (it writes the data for every link), so this code block never executes. Instead, the other code block runs, which simply calls `link_to_maj_min_ino` (and checks for an error code) (2). This uses `find_inode_file`, which references a hash table that associates the inode/major device/minor device with a file path; if it finds a match, it creates a hard link on the target file system. However, Guix's `file->cpio-header*` sets all of the inode and device numbers to 0 for reproducibility. This causes cpio to hard link every file with multiple links to the first file that has multiple links.

I see 3 possible paths forward to address this issue:

1. Provide spoofed inode numbers, tracking hard link data. In (gnu build linux-initrd), the `write-cpio-archive` procedure sorts the files by name, so we can provide inode numbers that increase sequentially. However, to make sure that the correct hard links are findable by the cpio utility, we would need to track the real inode numbers as well and use the correct pseudonym in each place. This would noticeably increase the complexity of the code.
2. Provide spoofed inode numbers and spoofed hard link data. To avoid tracking the real hard link numbers, we can report every file as having only a single link and still provide sequential inode numbers as above. This will not increase the size of the cpio archives we generate compared to current output, because we are storing the data for each link anyway. This will add some complexity to the cpio code, but less than option 1.
3. Don't support inputs with multiple hard links and require callers to work around this issue. This avoids any changes to the cpio code.

I am in favor of option 2 because I think it strikes a good balance between keeping the cpio code stable and supporting reasonable use cases. The cpio code is used to build the initramfs in Guix systems, so a bug here could make some systems unbootable. Guix does provide transactional rollbacks, which is helpful, but it is still a frustrating experience to reboot and immediately see a crash; debugging issues in this early environment is significantly more difficult than debugging post-boot issues. Hard links are not common on many systems because they add complexity to filesystem analysis, but Guix makes good use of them to save space in the store, where it is common for many files to share data and creating symlinks would prevent the garbage collector from deleting otherwise unused outputs.

The limitations I referred to at the beginning of the email are that I am inexperienced in this domain. I have only recently (over the past month or so) started looking at building a custom initramfs, and I have never worked with CPIO archives before. I think that my analysis makes sense based on the code I have read and the behavior I have observed, but take everything I say with a grain of salt.

I would appreciate any thoughts that anyone has on this matter.

Regards,
Skyler

(1) https://git.savannah.gnu.org/cgit/cpio.git/tree/src/copyin.c?id=900bab656ff24db5e3099941fb909c79c07962ed#n400
(2) https://git.savannah.gnu.org/cgit/cpio.git/tree/src/copypass.c?id=900bab656ff24db5e3099941fb909c79c07962ed#n341
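
The failure mode described in the message above, and the effect of option 2, can be modelled in a few lines of Python. This is only a toy model of the lookup behaviour the report describes, not GNU cpio's or Guix's actual code. `Entry`, `extract`, `current_headers`, and `option2_headers` are hypothetical names introduced for illustration, and starting the sequential inode numbers at 1 is an assumption.

```python
from dataclasses import dataclass

# Files from the reproducer, keyed by name, with their real link counts.
FILES = {
    "file1.txt": "contents1\n",
    "link1.txt": "contents1\n",
    "file2.txt": "contents2\n",
    "file3.txt": "contents3\n",
    "link3.txt": "contents3\n",
}
NLINKS = {"file1.txt": 2, "link1.txt": 2, "file2.txt": 1,
          "file3.txt": 2, "link3.txt": 2}


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for the header fields that matter here."""
    name: str
    data: str
    ino: int
    nlink: int
    major: int = 0
    minor: int = 0


def current_headers(files, nlinks):
    """Behaviour described in the report: inode 0, accurate link counts."""
    return [Entry(name, data, ino=0, nlink=nlinks[name])
            for name, data in sorted(files.items())]


def option2_headers(files):
    """Option 2: sequential inode numbers, every file reported with one link.

    Assumption: numbering starts at 1 and follows the name-sorted order.
    """
    return [Entry(name, data, ino=i, nlink=1)
            for i, (name, data) in enumerate(sorted(files.items()), start=1)]


def extract(entries):
    """Toy extraction: a multiply-linked entry is linked to the first path
    already seen with the same (inode, major, minor) key."""
    seen = {}  # (ino, major, minor) -> first extracted path with that key
    out = {}   # path -> contents as they would appear after extraction
    for e in entries:
        key = (e.ino, e.major, e.minor)
        if e.nlink > 1 and key in seen:
            out[e.name] = out[seen[key]]  # becomes a link to the earlier file
        else:
            out[e.name] = e.data
            if e.nlink > 1:
                seen[key] = e.name
    return out


if __name__ == "__main__":
    print(extract(current_headers(FILES, NLINKS)))
    print(extract(option2_headers(FILES)))
```

In this model, the current headers leave file3.txt and link3.txt holding the data of file1.txt, because all three share the key (0, 0, 0). The option 2 headers give every name its own contents. The trade-off visible in the model is that under option 2 no hard links are recreated on extraction: each name becomes an independent file.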

This bug report was last modified 1 year and 123 days ago.

GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.
