Pipeline library
When I took over man-db in 2001, one of the
major problems that became evident after maintaining it for a while was the
way it handled subprocesses. The nature of man and friends means that it
spends a lot of time calling sequences of programs such as
zsoelim < input-file | tbl | nroff -mandoc -Tutf8. Back then, it was using
C library facilities such as system and popen for all this, and I had to deal with
several bugs where those functions were being called with untrusted input as
arguments without properly escaping metacharacters. Of course it was
possible to chase around every such call inserting appropriate escaping
functions, but this was always bound to be error-prone. One of the tasks
that rapidly became important to me was arranging to start subprocesses in a
way that was fundamentally immune to this kind of bug.
In higher-level languages, there are usually standard constructs which are
safer than just passing a command line to the shell. For example, in Perl
you can use system($command, $arg1, $arg2, ...) to invoke a program with
arguments without the interference of the shell, and perlipc(1) describes
various facilities for connecting programs together. In Python, the
subprocess module allows
you to create pipelines easily and safely (as long as you remember the
SIGPIPE gotcha). C has the fork and execve primitives, but assembling these
to construct full-blown pipelines
correctly is difficult and error-prone, so many programmers don’t bother and
use the simple but unsafe library facilities instead.
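To give a sense of what assembling these primitives by hand involves, here is a minimal sketch of an n-stage pipeline built from pipe, fork, and execvp. It is not code from man-db: the name run_pipeline_sketch and its interface are hypothetical, and it leaves out signal handling, error reporting, and anything else a production library would need. Because each command is passed as an argument vector, no shell ever parses the arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not part of man-db): run n commands as a pipeline.
 * The first command reads from infd and the last writes to outfd.
 * Returns the wait status of the last command, or -1 on failure. */
int
run_pipeline_sketch (char *const *const cmds[], size_t n, int infd, int outfd)
{
	pid_t *pids;
	int prev = infd, status, last_status = -1;
	size_t i, started = 0;

	assert (cmds != NULL && n > 0);
	pids = malloc (n * sizeof *pids);
	if (pids == NULL)
		return -1;

	for (i = 0; i < n; i++) {
		int fds[2] = { -1, -1 };
		int out = outfd;
		pid_t pid;

		/* Every stage except the last writes into a fresh pipe. */
		if (i + 1 < n) {
			if (pipe (fds) < 0)
				break;
			out = fds[1];
		}
		pid = fork ();
		if (pid < 0) {
			if (fds[0] >= 0) {
				close (fds[0]);
				close (fds[1]);
			}
			break;
		}
		if (pid == 0) {
			/* Child: wire up stdin and stdout, then exec. */
			if (fds[0] >= 0)
				close (fds[0]);
			if (prev != STDIN_FILENO) {
				if (dup2 (prev, STDIN_FILENO) < 0)
					_exit (126);
				close (prev);
			}
			if (out != STDOUT_FILENO) {
				if (dup2 (out, STDOUT_FILENO) < 0)
					_exit (126);
				close (out);
			}
			execvp (cmds[i][0], cmds[i]);
			_exit (127);
		}
		pids[started++] = pid;

		/* Parent: drop the ends that now belong to the children. */
		if (prev != infd)
			close (prev);
		if (fds[1] >= 0)
			close (fds[1]);
		prev = fds[0];
	}
	if (prev >= 0 && prev != infd)
		close (prev);

	/* Reap every child; only the last stage's status is reported. */
	for (i = 0; i < started; i++)
		if (waitpid (pids[i], &status, 0) == pids[i] && i + 1 == n)
			last_status = status;
	free (pids);
	return last_status;
}
```

Even this stripped-down version has to track which descriptors to close in which process, and forgetting one close can leave a reader waiting forever for end-of-file. That bookkeeping is a large part of why the unsafe shortcuts are tempting.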
I wrote a couple of thousand lines of library code in man-db to address this
problem, loosely and now quite distantly based on code in
groff. In the following examples, function names starting with command_,
pipeline_, or decompress_ are
real functions in the library, while any other function names are pseudocode.
Constructing the simplified example pipeline from my first paragraph using this library looks like this:
pipeline *p;
int status;
p = pipeline_new ();
p->want_infile = "input-file";
pipeline_command_args (p, "zsoelim", NULL);
pipeline_command_args (p, "tbl", NULL);
pipeline_command_args (p, "nroff", "-mandoc", "-Tutf8", NULL);
pipeline_start (p);
status = pipeline_wait (p);
pipeline_free (p);
You might want to construct a command more dynamically:
command *manconv = command_new_args ("manconv", "-f", from_code,
                                     "-t", "UTF-8", NULL);
if (quiet)
	command_arg (manconv, "-q");
pipeline_command (p, manconv);
Perhaps you want an environment variable set only while running a certain command:
command *less = command_new ("less");
command_setenv (less, "LESSCHARSET", lesscharset);
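One plausible way to get this effect with plain POSIX calls is to change the environment only in the forked child, just before exec. The sketch below assumes that approach; run_with_env_sketch is a hypothetical name for illustration and is not how command_setenv is necessarily implemented.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv with name=value set only for that command.
 * Returns the child's wait status, or -1 on failure. */
int
run_with_env_sketch (char *const argv[], const char *name, const char *value)
{
	pid_t pid;
	int status;

	assert (argv != NULL && argv[0] != NULL);
	pid = fork ();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Only the child's copy of the environment is modified. */
		if (setenv (name, value, 1) < 0)
			_exit (126);
		execvp (argv[0], argv);
		_exit (127);
	}
	if (waitpid (pid, &status, 0) != pid)
		return -1;
	return status;
}
```

The parent's environment is untouched afterwards, so nothing needs to be saved and restored around the call.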
You might find yourself needing to pass the output of one pipeline to several other pipelines, in a “tee” arrangement:
pipeline *source, *sink1, *sink2;
source = make_source ();
sink1 = make_sink1 ();
sink2 = make_sink2 ();
pipeline_connect (source, sink1, sink2, NULL);
/* Pump data among these pipelines until there's nothing left. */
pipeline_pump (source, sink1, sink2, NULL);
pipeline_free (sink2);
pipeline_free (sink1);
pipeline_free (source);
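The core of a "tee" is a loop that reads from one descriptor and writes each chunk to every sink. The following is a deliberately simple blocking sketch of that idea; tee_pump_sketch and write_all_sketch are hypothetical names, and a real pump would probably need to multiplex (for example with select) so that one slow sink cannot stall the others.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes and EINTR. */
int
write_all_sketch (int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write (fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t) n;
	}
	return 0;
}

/* Hypothetical helper: copy everything from src to each of the sinks.
 * Returns 0 once src reaches end-of-file, or -1 on error. */
int
tee_pump_sketch (int src, const int *sinks, size_t nsinks)
{
	char buf[4096];

	for (;;) {
		ssize_t n = read (src, buf, sizeof buf);
		size_t i;

		if (n == 0)
			return 0;
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		for (i = 0; i < nsinks; i++)
			if (write_all_sketch (sinks[i], buf, (size_t) n) < 0)
				return -1;
	}
}
```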
Maybe one of your commands is actually an in-process function, rather than an external program:
command *inproc = command_new_function ("in-process", &func, NULL, NULL);
pipeline_command (p, inproc);
Sometimes your program needs to consume the output of a pipeline, rather than sending it all to some other subprocess:
pipeline *p = make_pipeline ();
const char *line;
line = pipeline_peekline (p);
/* Guard against an empty pipeline: there may be no first line at all. */
if (line && !strstr (line, "coding: UTF-8"))
	printf ("Unicode text follows:\n");
while ((line = pipeline_readline (p)) != NULL)
	printf (" %s", line);
pipeline_free (p);
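The peek-then-read behaviour can be illustrated with a one-line lookahead buffer over an ordinary stdio stream. This is only a sketch of the idea using POSIX getline; the struct and function names are hypothetical and the library's own buffering may well work differently.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical line reader with one line of lookahead. */
struct line_reader_sketch {
	FILE *fp;	/* stream to read from */
	char *line;	/* buffer managed by getline; free it when done */
	size_t cap;	/* allocated size of line */
	int peeked;	/* nonzero if line holds a line not yet consumed */
};

/* Return the next line without consuming it, or NULL at end-of-file. */
const char *
reader_peekline_sketch (struct line_reader_sketch *r)
{
	if (!r->peeked) {
		if (getline (&r->line, &r->cap, r->fp) < 0)
			return NULL;
		r->peeked = 1;
	}
	return r->line;
}

/* Return the next line and consume it, or NULL at end-of-file. */
const char *
reader_readline_sketch (struct line_reader_sketch *r)
{
	const char *line = reader_peekline_sketch (r);

	r->peeked = 0;
	return line;
}
```

A returned line stays valid only until the next call that reads from the stream, so callers that need to keep it should copy it.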
man-db deals with compressed files a lot, so I wrote an add-on library for opening compressed files (which is somewhat man-db-specific, but the implementation wasn’t difficult given the underlying library):
pipeline *decomp_file = decompress_open (compressed_filename);
pipeline *decomp_stdin = decompress_fdopen (fileno (stdin));
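One piece such an add-on needs is a way to pick a decompressor for a given file. A minimal sketch, assuming the choice is made purely from the filename suffix: the table below and the name decompressor_for_sketch are illustrative assumptions, not man-db's actual list, and the returned argument vector would be handed to a pipeline stage with the compressed file as its input.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative commands that decompress stdin to stdout. */
static const char *const gzip_argv_sketch[] = { "gzip", "-dc", NULL };
static const char *const bzip2_argv_sketch[] = { "bzip2", "-dc", NULL };
static const char *const xz_argv_sketch[] = { "xz", "-dc", NULL };

/* Hypothetical helper: return the argument vector of a decompressor for
 * filename based on its suffix, or NULL if it does not look compressed. */
const char *const *
decompressor_for_sketch (const char *filename)
{
	static const struct {
		const char *ext;
		const char *const *argv;
	} table[] = {
		{ ".gz", gzip_argv_sketch },
		{ ".bz2", bzip2_argv_sketch },
		{ ".xz", xz_argv_sketch },
	};
	size_t len = strlen (filename), i;

	for (i = 0; i < sizeof table / sizeof table[0]; i++) {
		size_t extlen = strlen (table[i].ext);

		/* Require a non-empty base name before the suffix. */
		if (len > extlen
		    && strcmp (filename + len - extlen, table[i].ext) == 0)
			return table[i].argv;
	}
	return NULL;
}
```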
This library has been in production in man-db for over five years now. The very careful signal handling code has been reviewed independently and the whole thing has been run through multiple static analysis tools, although I would always welcome more review; in particular I have no idea what it would take to make it safe for use in threaded programs since I generally avoid threading wherever possible. There have been a handful of bugs, which I’ve fixed promptly, and I’ve added various new features to support particular requirements of man-db (though in as general a way as possible).
Every so often I see somebody asking about subprocess handling in C, and I wonder if I should split this library out into a standalone package so that it can be used elsewhere. Web searches for things like “pipeline library” and “libpipeline” don’t reveal anything that’s a particularly close match for what I have. The licensing would be GPLv2 or later; this isn’t likely to be negotiable since some of the original code wasn’t mine and in any case I don’t feel particularly bad about giving an advantage to GPLed programs. For more details on the interface, the header file is well-commented.
Is there enough interest in this to make the effort of producing a separate library package worthwhile? As well as the general effort of creating a new package, I’d need to do some work to disentangle it from a few bits and pieces specific to man-db. If you maintain a specific package that could use this and you’re interested, please contact me with details, mentioning any extensions you think you’d need. I intentionally haven’t enabled comments on my blog for various reasons, but you can e-mail me at cjwatson at debian.org or man-db-devel at nongnu.org.