Bashcpio - pure bash (almost) cpio archive extraction

Sun 05 June 2016 by Fred Clift

Ok - let's get this out of the way. Is this important? No. Is it groundbreaking? No. Can I even explain how cool it is to non-technical friends? No. Could it ever possibly be useful? Maybe! See below.

I just spent a few evenings writing a pure-bash cpio extraction implementation. My target was centos/fedora/redhat rpms, which means gzip/xz compression and archives written with the so-called 'New-Ascii' header format. The other formats wouldn't be that hard to support, but I would guess that 98% of any use of cpio in the year 2016 is related to some kind of rpm package.

And so is born bashcpio - check it out on github:

https://github.com/minektur/bashcpio

pure bash?

It's not quite 'pure' bash - but no bash script ever is. I worked hard to ensure that the external dependencies were kept to a bare minimum.

In addition to an actual bash binary, you need a few external dependencies:

dd, mkdir, dirname, chown, chmod, date, touch, ln

There are a couple that might be considered optional. I could probably write some tortured, complicated code to replace dirname that has no dependencies on external binaries, though I have not (yet?) done so. If you don't care about time-stamps on the files and directories that you extract, date and touch could both go away. The rest, though, are pretty difficult to avoid using if you want it to work at all.
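A dirname replacement could plausibly be built from parameter expansion alone. The sketch below is hypothetical: `bash_dirname` is a made-up name, it is not part of bashcpio, and it only aims to match the external dirname for ordinary paths.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not taken from bashcpio: print the directory part of
# a path using only bash builtins.
bash_dirname() {
    local path=$1
    # Drop trailing slashes, but leave a lone "/" alone.
    while [[ $path == */ && $path != / ]]; do
        path=${path%/}
    done
    case $path in
        /)   printf '/\n' ;;
        */*)
            path=${path%/*}            # remove the last path component
            # Removing the component can expose more trailing slashes.
            while [[ $path == */ && $path != / ]]; do
                path=${path%/}
            done
            printf '%s\n' "${path:-/}" # empty means the parent is the root
            ;;
        *)   printf '.\n' ;;           # no slash at all: current directory
    esac
}

bash_dirname /usr/lib/rpm/macros
bash_dirname README
```

Whether this is worth it depends on how much you trust the edge cases; the external binary has had a lot more testing.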

Bash (and some other shell utilities that you might consider, like awk and sed) has this nasty habit of "eating" null bytes in anything it reads. So, for instance, you can't place the content of a binary file in a bash variable. You'll get everything up to the first null value. This makes dealing with arbitrary binary data in bash problematic at best. If, however, you can be reasonably sure that some portions of your file will be ascii, then you can carefully operate on those parts with bash, using the assistance of dd.
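A quick way to see the problem is to compare a file's size on disk with the length of the variable it was read into. This is only a sketch: the scratch file and variable names are made up, and exactly what happens to the null byte (dropped, or cutting the string short) may vary between bash versions. Either way, the variable no longer matches the file.

```bash
#!/usr/bin/env bash
# Write 7 bytes to a scratch file: "abc", one null byte, then "def".
tmp=$(mktemp)
printf 'abc\0def' > "$tmp"

# Size on disk as counted by wc (the arithmetic expansion trims any
# whitespace that some wc implementations add).
on_disk=$(( $(wc -c < "$tmp") ))

# Slurp the file into a variable. Newer bash versions complain about the
# null byte on stderr, so that is silenced for the demo.
{ data=$(< "$tmp"); } 2>/dev/null

echo "bytes on disk:     $on_disk"
echo "bytes in variable: ${#data}"

rm -f "$tmp"
```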

In this case, the header format I'm trying to read is guaranteed to be 110 bytes of ascii (a 6-character magic number plus thirteen 8-digit hex fields), followed by the file name as a C-style null-terminated string, some padding, the file's arbitrary data, and some more padding. The saving grace is that you know where the ascii header starts and how long it is, and inside it there is data about how long the file name is and how long the file is. With that, you can read a header and then safely read the filename, the sizes, and all the other header info.
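To make the header layout concrete, here is a rough sketch of pulling fields out of a header that is already sitting in a bash variable. It assumes the usual new-ascii layout: the magic 070701 followed by thirteen 8-digit hex fields in the order ino, mode, uid, gid, nlink, mtime, filesize, devmajor, devminor, rdevmajor, rdevminor, namesize, check. The function names `parse_newc_header` and `pad4` and the sample values are invented for illustration and are not taken from bashcpio.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; offsets assume the new-ascii header layout.

# Split a 110-character header string into a few global variables.
parse_newc_header() {
    local hdr=$1
    # Refuse anything that is the wrong length or has the wrong magic.
    [[ ${#hdr} -eq 110 && ${hdr:0:6} == 070701 ]] || return 1
    mode=$((16#${hdr:14:8}))
    uid=$((16#${hdr:22:8}))
    gid=$((16#${hdr:30:8}))
    mtime=$((16#${hdr:46:8}))
    filesize=$((16#${hdr:54:8}))
    namesize=$((16#${hdr:94:8}))   # counts the trailing null byte
}

# Bytes of padding needed to reach the next multiple of 4.
pad4() { echo $(( (4 - $1 % 4) % 4 )); }

# Build a made-up sample header: a regular file with mode 0644, a 5-byte
# body, and a 6-byte name field (five characters plus the null).
hdr=070701
for field in 1 $((8#100644)) 0 0 1 1465084800 5 0 0 0 0 6 0; do
    printf -v hex '%08X' "$field"
    hdr+=$hex
done

parse_newc_header "$hdr" || echo "not a new-ascii header" >&2
printf 'mode=%o filesize=%d namesize=%d\n' "$mode" "$filesize" "$namesize"
echo "padding after name: $(pad4 $((110 + namesize)))"
echo "padding after data: $(pad4 "$filesize")"
```

The padding is measured from the start of the header, which is why the name padding is computed on 110 plus the name size rather than on the name size alone.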

How do I read those pieces without tripping over the binary parts? I can pick apart the file with dd, carefully using the count= and skip= flags.
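As an illustration of the dd approach, the sketch below builds a tiny one-entry archive in a scratch file and then reads it back piece by piece. Everything here is an assumption made for the example: the `read_bytes` helper, the file name hi.txt, and the field values are invented, and bs=1 is used for simplicity rather than speed. It is not how bashcpio itself is written.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy LENGTH bytes starting at OFFSET to stdout.
# Usage: read_bytes FILE OFFSET LENGTH
read_bytes() {
    dd if="$1" bs=1 skip="$2" count="$3" 2>/dev/null
}

# Build a made-up archive entry: header, name, padding, data, padding.
archive=$(mktemp)
{
    printf '070701'
    for field in 1 $((8#100644)) 0 0 1 1465084800 5 0 0 0 0 7 0; do
        printf '%08X' "$field"
    done
    printf 'hi.txt\0\0\0\0'   # 6 name bytes, the null, 3 bytes of padding
    printf 'hello\0\0\0'      # 5 data bytes, 3 bytes of padding
} > "$archive"

# The header is plain ascii, so it is safe to hold in a variable.
hdr=$(read_bytes "$archive" 0 110)
filesize=$((16#${hdr:54:8}))
namesize=$((16#${hdr:94:8}))

# Read the name, leaving off the trailing null so bash never sees it.
name=$(read_bytes "$archive" 110 $((namesize - 1)))

# The data starts at the next multiple of 4 after the header and name.
data_start=$(( (110 + namesize + 3) / 4 * 4 ))

# This payload happens to be text, so a variable works for the demo. Real
# file contents should go straight to a file: read_bytes ... > "$name"
payload=$(read_bytes "$archive" "$data_start" "$filesize")

echo "name=$name data_start=$data_start payload=$payload"
rm -f "$archive"
```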

All the rest of the external dependencies are for use in creating and modifying the extracted files, setting ownership, group, permissions, mtime, etc.

Motivation?

I'm toying with the idea of adding a centos target to crouton, created and maintained by David Schneider, with many, many other contributors.

Crouton is a tool for making (debian-derived) linux chroots that will run under chromeos in dev mode. (I'll talk more about this in some later blog post.) Crouton is an awesome tool and it's quite useful, but I use a lot of yum and rpm and not a lot of apt and deb.

On chromeos devices there is a very minimal linux install that does little besides running the chrome browser full time. Many common things you'd expect to be at your fingertips at a bash prompt are missing. Crouton fixes this. And it bootstraps itself using pretty much only bash and an 'ar' extractor written in bash, since that is close to all that is available to start with.

Seeing the clever ar.sh implementation, I got thinking about how one would do the same with rpm-based distros... and bashcpio was born. I started with a careful inventory of what shell binaries I had access to, and bash + dependencies does the trick. This of course does not give you a centos chroot - it is only one of several necessary tools. But I had fun making it work, except when I was trying to get the padding right. Ugh, how irritating. I'm no expert bash programmer, though I know a lot more now than I used to.

Performance?

My initial implementation only took about 80x as long as the x86_64 debian trusty binary that I got out of a crouton chroot. The current implementation is much faster and now takes only about 19 times as long as the C binary. To get a feel for its performance, I grabbed the 350ish packages that make up a current centos 7 minimal install. It takes less than 4 minutes to unpack them.

Please feel free to send me feedback on how crazy you think I am, or on what glaring deficiency my code has. I'm eager to talk about the ship I built in a bottle.