Pipeline library

When I took over man-db in 2001, one of the major problems that became evident after maintaining it for a while was the way it handled subprocesses. The nature of man and friends means that it spends a lot of time calling sequences of programs such as zsoelim < input-file | tbl | nroff -mandoc -Tutf8. Back then, it was using C library facilities such as system and popen for all this, and I had to deal with several bugs where those functions were being called with untrusted input as arguments without properly escaping metacharacters. Of course it was possible to chase around every such call inserting appropriate escaping functions, but this was always bound to be error-prone and one of the tasks that rapidly became important to me was arranging to start subprocesses in a way that was fundamentally immune to this kind of bug.

In higher-level languages, there are usually standard constructs which are safer than just passing a command line to the shell. For example, in Perl you can use system([$command, $arg1, $arg2, ...]) to invoke a program with arguments without the interference of the shell, and perlipc(1) describes various facilities for connecting them together. In Python, the subprocess module allows you to create pipelines easily and safely (as long as you remember the SIGPIPE gotcha). C has the fork and execve primitives, but assembling these to construct full-blown pipelines correctly is difficult and error-prone, so many programmers don’t bother and use the simple but unsafe library facilities instead.

I wrote a couple of thousand lines of library code in man-db to address this problem, loosely and now quite distantly based on code in groff. In the following examples, function names starting with command_, pipeline_, or decompress_ are real functions in the library, while any other function names are pseudocode.

Constructing the simplified example pipeline from my first paragraph using this library looks like this:

pipeline *p;
int status;

p = pipeline_new ();
p->want_infile = "input-file";
pipeline_command_args (p, "zsoelim", NULL);
pipeline_command_args (p, "tbl", NULL);
pipeline_command_args (p, "nroff", "-mandoc", "-Tutf8", NULL);
pipeline_start (p);
status = pipeline_wait (p);
pipeline_free (p);

You might want to construct a command more dynamically:

command *manconv = command_new_args ("manconv", "-f", from_code,
                                     "-t", "UTF-8", NULL);
if (quiet)
        command_arg (manconv, "-q");
pipeline_command (p, manconv);

Perhaps you want an environment variable set only while running a certain command:

command *less = command_new ("less");
command_setenv (less, "LESSCHARSET", lesscharset);

You might find yourself needing to pass the output of one pipeline to several other pipelines, in a “tee” arrangement:

pipeline *source, *sink1, *sink2;

source = make_source ();
sink1 = make_sink1 ();
sink2 = make_sink2 ();
pipeline_connect (source, sink1, sink2, NULL);
/* Pump data among these pipelines until there's nothing left. */
pipeline_pump (source, sink1, sink2, NULL);
pipeline_free (sink2);
pipeline_free (sink1);
pipeline_free (source);

Maybe one of your commands is actually an in-process function, rather than an external program:

command *inproc = command_new_function ("in-process", &func, NULL, NULL);
pipeline_command (p, inproc);

Sometimes your program needs to consume the output of a pipeline, rather than sending it all to some other subprocess:

pipeline *p = make_pipeline ();
const char *line;

line = pipeline_peekline (p);
if (!strstr (line, "coding: UTF-8"))
        printf ("Unicode text follows:\n");
while (line = pipeline_readline (p))
        printf ("  %s", line);
pipeline_free (p);

man-db deals with compressed files a lot, so I wrote an add-on library for opening compressed files (which is somewhat man-db-specific, but the implementation wasn’t difficult given the underlying library):

pipeline *decomp_file = decompress_open (compressed_filename);
pipeline *decomp_stdin = decompress_fdopen (fileno (stdin));

This library has been in production in man-db for over five years now. The very careful signal handling code has been reviewed independently and the whole thing has been run through multiple static analysis tools, although I would always welcome more review; in particular I have no idea what it would take to make it safe for use in threaded programs since I generally avoid threading wherever possible. There have been a handful of bugs, which I’ve fixed promptly, and I’ve added various new features to support particular requirements of man-db (though in as general a way as possible). Every so often I see somebody asking about subprocess handling in C, and I wonder if I should split this library out into a standalone package so that it can be used elsewhere. Web searches for things like “pipeline library” and “libpipeline” don’t reveal anything that’s a particularly close match for what I have. The licensing would be GPLv2 or later; this isn’t likely to be negotiable since some of the original code wasn’t mine and in any case I don’t feel particularly bad about giving an advantage to GPLed programs. For more details on the interface, the header file is well-commented.

Is there enough interest in this to make the effort of producing a separate library package worthwhile? As well as the general effort of creating a new package, I’d need to do some work to disentangle it from a few bits and pieces specific to man-db. If you maintain a specific package that could use this and you’re interested, please contact me with details, mentioning any extensions you think you’d need. I intentionally haven’t enabled comments on my blog for various reasons, but you can e-mail me at cjwatson at debian.org or man-db-devel at nongnu.org.