Emulating 'named' process substitutions

Let's say I have a big gzipped file data.txt.gz, but a program often needs to be given the uncompressed version. Of course, instead of creating a standalone unpacked data.txt, one could use the process substitution syntax:

./program <(zcat data.txt.gz)

However, depending on the situation, this can be tiresome and error-prone.

Is there a way to emulate a named process substitution? That is, to create a pseudo-file data.txt that would 'unfold' into the process substitution zcat data.txt.gz whenever it is accessed. Much like a symbolic link forwards a read operation to another file, except that here it would have to act as a temporary named pipe.

Thanks.

PS. Somewhat similar question


Edit (from comments): The actual use case is a large gzipped corpus that, besides being used in its raw form, sometimes also needs to be run through a series of lightweight operations (tokenizing, lowercasing, etc.) and then fed to some "heavier" code. Storing a preprocessed copy wastes disk space, and repeatedly retyping the full preprocessing pipeline invites errors. At the same time, running the pipeline on the fly incurs only a tiny computational overhead; hence the idea of a long-lived pseudo-file that hides the details under the hood.
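
For concreteness, the repeated pattern looks roughly like this (a sketch; corpus.txt.gz, tokenizer.sh and heavy_program are hypothetical placeholders):

# The pipeline that keeps getting retyped: decompress, tokenize, lowercase
./heavy_program <(zcat corpus.txt.gz | ./tokenizer.sh | tr '[:upper:]' '[:lower:]')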

There are 2 best solutions below

Solution 1

As far as I know, what you are describing does not exist, although it's an intriguing idea. It would require kernel support, so that opening the file would actually run an arbitrary command or script instead of reading stored data.

Your best bet is to save the long command in a shell function or script, so that the process substitution is less of a hassle to invoke.
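
For example, a minimal sketch (the function name and the pipeline inside it are placeholders):

# Define once, e.g. in ~/.bashrc
preprocessed() {
    zcat data.txt.gz | ./tokenizer.sh | tr '[:upper:]' '[:lower:]'
}

# Every later invocation stays short and identical
./program <(preprocessed)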

Solution 2

There's a spectrum of options, depending on what you need and how much effort you're willing to put in.

If you need a single-use file, you can just use mkfifo to create the file, start a command that decompresses your archive into the FIFO, and pass the FIFO's filename to whoever needs to read from it.
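
A minimal sketch of that (the file names are placeholders):

mkfifo data.txt                  # create the named pipe
zcat data.txt.gz > data.txt &    # the writer blocks until a reader opens the pipe
./program data.txt               # the reader consumes the decompressed stream
rm data.txt                      # the FIFO is drained after one pass, so clean up

Note that the FIFO is drained by a single pass; to read it again you would have to restart the zcat writer.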

If you need to repeatedly access the file (perhaps simultaneously), you can set up a socket using netcat that serves the decompressed file over and over.

With "traditional netcat" this is as simple as while true; do nc -l -p 1234 -c "zcat myfile.tar.gz"; done. With BSD netcat it's a little more annoying:

# Make a dummy FIFO to route the client's input back into the pipeline
mkfifo foo

# Serve the decompressed file; the loop restarts the listener after each client
while true; do cat foo | zcat myfile.tar.gz | nc -l 127.0.0.1 1234 > foo; done

Anyway, once the server (or a file-based Unix domain socket) is up, you just run nc localhost 1234 to read the decompressed file. You can of course use nc localhost 1234 as part of a process substitution somewhere else.
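
For instance:

# Consume the served stream just like the original process substitution
./program <(nc localhost 1234)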

It looks like this in action:

[screenshot: netcat server demo]

Depending on your needs, you may want to make the shell script more sophisticated (caching, etc.), or drop this approach altogether and write a regular web server in some scripting language you're comfortable with.

Finally, and this is probably the most "exotic" solution, you can write a FUSE filesystem that presents virtual files backed by whatever logic your heart desires. At that point, though, you should have a good hard think about whether the maintainability and complexity costs of where you're going are really worth it, compared to someone having to call zcat a few extra times.