doc/pod/uwildmat.pod

   1 =head1 NAME
   2
   3 uwildmat, uwildmat_simple, uwildmat_poison - Perform wildmat matching
   4
   5 =head1 SYNOPSIS
   6
   7 B<#include E<lt>libinn.hE<gt>>
   8
   9 B<bool uwildmat(const char *>I<text>B<, const char *>I<pattern>B<);>
  10
  11 B<bool uwildmat_simple(const char *>I<text>B<, const char *>I<pattern>B<);>
  12
  13 B<enum uwildmat uwildmat_poison(const char *>I<text>B<,
  14 const char *>I<pattern>B<);>
  15
  16 =head1 DESCRIPTION
  17
  18 B<uwildmat> compares I<text> against the wildmat expression I<pattern>,
  19 returning true if and only if the expression matches the text.  C<@> has
  20 no special meaning in I<pattern> when passed to B<uwildmat>.  Both I<text>
  21 and I<pattern> are assumed to be in the UTF-8 character encoding, although
  22 malformed UTF-8 sequences are treated in a way that attempts to be mostly
  23 compatible with single-octet character sets like ISO 8859-1.  (In other
  24 words, if you try to match ISO 8859-1 text with these routines everything
  25 should work as expected unless the ISO 8859-1 text contains valid UTF-8
  26 sequences, which thankfully is somewhat rare.)
  27
  28 B<uwildmat_simple> is identical to B<uwildmat> except that neither C<!>
  29 nor C<,> has any special meaning and I<pattern> is always treated as a
  30 single pattern.  This function exists solely to support legacy interfaces
  31 like NNTP's XPAT command, and should be avoided when implementing new
  32 features.
  33
  34 B<uwildmat_poison> works similarly to B<uwildmat>, except that C<@> as the
  35 first character of one of the patterns in the expression (see below)
  36 "poisons" the match if it matches.  B<uwildmat_poison> returns
  37 B<UWILDMAT_MATCH> if the expression matches the text, B<UWILDMAT_FAIL> if
  38 it doesn't, and B<UWILDMAT_POISON> if the expression doesn't match because
  39 a poisoned pattern matched the text.  These enumeration constants are
  40 defined in the B<libinn.h> header.
  41
  42 =head1 WILDMAT EXPRESSIONS
  43
  44 A wildmat expression follows rules similar to those of shell filename
  45 wildcards but with some additions and changes.  A wildmat I<expression> is
  46 composed of one or more wildmat I<patterns> separated by commas.  Each
  47 character in the wildmat pattern matches a literal occurrence of that same
  48 character in the text, with the exception of the following metacharacters:
  49
  50 =over 8
  51
  52 =item ?
  53
  54 Matches any single character (including a single UTF-8 multibyte
  55 character, so C<?> can match more than one byte).
  56
  57 =item *Z<>
  58
  59 Matches any sequence of zero or more characters.
  60
  61 =item \
  62
  63 Turns off any special meaning of the following character; the following
  64 character will match itself in the text.  C<\> will escape any character,
  65 including another backslash or a comma that otherwise would separate a
  66 pattern from the next pattern in an expression.  Note that C<\> is not
  67 special inside a character set (no metacharacters are).
  68
  69 =item [...]
  70
  71 A character set, which matches any single character that falls within that
  72 set.  The presence of a character between the brackets adds that character
  73 to the set; for example, C<[amv]> specifies the set containing the
  74 characters C<a>, C<m>, and C<v>.  A range of characters may be specified
  75 using C<->; for example, C<[0-5abc]> is equivalent to C<[012345abc]>.  The
  76 order of characters is as defined in the UTF-8 character set, and if the
  77 start character of such a range falls after the ending character of the
  78 range in that ranking, the results of attempting a match with that pattern
  79 are undefined.
  80
  81 In order to include a literal C<]> character in the set, it must be the
  82 first character of the set (possibly following C<^>); for example, C<[]a]>
  83 matches either C<]> or C<a>.  To include a literal C<-> character in the
  84 set, it must be either the first or the last character of the set.
  85 Backslashes have no special meaning inside a character set, nor do any
  86 other of the wildmat metacharacters.
  87
  88 =item [^...]
  89
  90 A negated character set.  Follows the same rules as a character set above,
  91 but matches any character B<not> contained in the set.  So, for example,
  92 C<[^]-]> matches any character except C<]> and C<->.
  93
  94 =back
  95
  96 In addition, C<!> (and possibly C<@>) have special meaning as the first
  97 character of a pattern; see below.
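
The metacharacter rules above can be illustrated with a small matcher for
a single pattern.  This is only a sketch under stated assumptions, not the
code in libinn: the names C<decode_char>, C<match_set>, C<match_here>, and
C<sketch_match> are hypothetical, malformed UTF-8 is simply treated as
single octets, overlong encodings are not rejected, and the naive
recursion for C<*> can backtrack heavily on patterns with many asterisks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decode the character at s into *cp and return its length in bytes.  A
   byte that does not begin a well-formed sequence is returned as itself
   (a simplification; overlong forms are not rejected). */
static int decode_char(const unsigned char *s, uint32_t *cp)
{
    int len = 1, i;
    uint32_t v;

    if (*s >= 0xC0 && *s <= 0xDF) len = 2;
    else if (*s >= 0xE0 && *s <= 0xEF) len = 3;
    else if (*s >= 0xF0 && *s <= 0xF7) len = 4;
    v = *s & (0xFF >> (len + 1));
    *cp = *s;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 1;   /* malformed: one octet */
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (len > 1) *cp = v;
    return len;
}

/* Test ch against the set whose body starts at *pp (just after '[').  On
   success or mismatch *pp is advanced past the closing ']'.  An
   unterminated set never matches. */
static bool match_set(const unsigned char **pp, uint32_t ch)
{
    const unsigned char *p = *pp;
    bool negate = (*p == '^');
    bool found = false, first = true;
    uint32_t lo, hi;

    if (negate) p++;
    /* A ']' in the first position is a literal member of the set. */
    while (*p != '\0' && (first || *p != ']')) {
        p += decode_char(p, &lo);
        hi = lo;
        /* '-' forms a range unless it is the last character of the set. */
        if (p[0] == '-' && p[1] != ']' && p[1] != '\0') {
            p++;
            p += decode_char(p, &hi);
        }
        if (ch >= lo && ch <= hi) found = true;
        first = false;
    }
    if (*p != ']') return false;
    *pp = p + 1;
    return found != negate;
}

/* Match text t against pattern p, anchored at both ends. */
static bool match_here(const unsigned char *t, const unsigned char *p)
{
    uint32_t tc, pc;

    while (*p != '\0') {
        if (*p == '*') {
            while (*p == '*') p++;             /* collapse runs of '*' */
            if (*p == '\0') return true;
            for (; *t != '\0'; t += decode_char(t, &tc))
                if (match_here(t, p)) return true;
            return false;
        }
        if (*t == '\0') return false;
        t += decode_char(t, &tc);              /* consume one character */
        if (*p == '?') {
            p++;
        } else if (*p == '[') {
            p++;
            if (!match_set(&p, tc)) return false;
        } else {
            if (*p == '\\' && p[1] != '\0') p++;   /* escaped literal */
            p += decode_char(p, &pc);
            if (tc != pc) return false;
        }
    }
    return *t == '\0';
}

bool sketch_match(const char *text, const char *pattern)
{
    return match_here((const unsigned char *) text,
                      (const unsigned char *) pattern);
}
```

With this sketch, C<sketch_match("\xC3\xA9", "?")> is true because C<?>
consumes the whole two-byte character, while the same text does not match
C<??>.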
  98
  99 When matching a wildmat expression against some text, each comma-separated
 100 pattern is matched in order from left to right.  In order to match, the
 101 pattern must match the whole text; in regular expression terminology, it's
 102 implicitly anchored at both the beginning and the end.  For example, the
 103 pattern C<a> matches only the text C<a>; it doesn't match C<ab> or C<ba>
 104 or even C<aa>.  If none of the patterns match, the whole expression
 105 doesn't match.  Otherwise, whether the expression matches is determined
 106 entirely by the rightmost matching pattern; the expression matches the
 107 text if and only if the rightmost matching pattern is not negated.
 108
 109 For example, consider the text C<news.misc>.  The expression C<*> matches
 110 this text, of course, as does C<comp.*,news.*> (because the second pattern
 111 matches).  C<news.*,!news.misc> does not match this text because both
 112 patterns match, meaning that the rightmost takes precedence, and the
 113 rightmost matching pattern is negated.  C<news.*,!news.misc,*.misc> does
 114 match this text, since the rightmost matching pattern is not negated.
 115
 116 Note that the expression C<!news.misc> can't match anything.  Either the
 117 pattern doesn't match, in which case no patterns match and the expression
 118 doesn't match, or the pattern does match, in which case, because it's
 119 negated, the expression doesn't match.  C<*,!news.misc>, on the other hand,
 120 is a useful expression that matches anything except C<news.misc>.
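
The rightmost-matching-pattern rule can be sketched as a left-to-right
scan in which each matching pattern overwrites the result of the previous
one.  This is an illustration only, not the libinn implementation: the
names C<glob_n> and C<expr_match> are hypothetical, and the helper
understands only C<*>, C<?>, and C<\> on single octets.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in matcher for the pattern slice p[0..n): supports only
   '*', '?', and '\' and works on single octets. */
static bool glob_n(const char *t, const char *p, size_t n)
{
    if (n == 0) return *t == '\0';
    if (*p == '*') {
        for (;; t++) {
            if (glob_n(t, p + 1, n - 1)) return true;
            if (*t == '\0') return false;
        }
    }
    if (*t == '\0') return false;
    if (*p == '?') return glob_n(t + 1, p + 1, n - 1);
    if (*p == '\\' && n > 1) { p++; n--; }     /* escaped literal */
    return *p == *t && glob_n(t + 1, p + 1, n - 1);
}

/* Evaluate a comma-separated expression.  Every pattern that matches
   overwrites the result, so the rightmost matching pattern decides. */
bool expr_match(const char *text, const char *expr)
{
    bool matched = false;                      /* no pattern matched yet */
    const char *start = expr;
    const char *p = expr;

    for (;;) {
        if (*p == '\\' && p[1] != '\0') {      /* skip escaped character */
            p += 2;
            continue;
        }
        if (*p == ',' || *p == '\0') {
            const char *pat = start;
            size_t len = (size_t) (p - start);
            bool negated = (len > 0 && *pat == '!');

            if (negated) { pat++; len--; }
            if (glob_n(text, pat, len)) matched = !negated;
            if (*p == '\0') break;
            start = p + 1;
        }
        p++;
    }
    return matched;
}
```

Under these assumptions, C<expr_match("news.misc", "news.*,!news.misc")>
is false and C<expr_match("news.misc", "news.*,!news.misc,*.misc")> is
true, in line with the examples above.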
 121
 122 C<!> has significance only as the first character of a pattern; anywhere
 123 else in the pattern, it matches a literal C<!> in the text like any other
 124 non-metacharacter.
 125
 126 If the B<uwildmat_poison> interface is used, then C<@> behaves the same as
 127 C<!> except that if an expression fails to match because the rightmost
 128 matching pattern began with C<@>, B<UWILDMAT_POISON> is returned instead of
 129 B<UWILDMAT_FAIL>.
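
The three possible outcomes can be sketched by recording, for each pattern
that matches, which result it would produce, and returning the one from
the rightmost match.  The enumeration C<sketch_result> and the functions
C<prefix_glob> and C<expr_poison> are hypothetical names used for
illustration; they are not the constants or code from B<libinn.h>.  The
stand-in matcher understands only a trailing C<*>, and escaped commas are
not handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum sketch_result { SKETCH_FAIL, SKETCH_MATCH, SKETCH_POISON };

/* Stand-in matcher for the slice p[0..n): a literal comparison, with an
   optional trailing '*' that turns it into a prefix match. */
static bool prefix_glob(const char *t, const char *p, size_t n)
{
    if (n > 0 && p[n - 1] == '*')
        return strncmp(t, p, n - 1) == 0;
    return strlen(t) == n && strncmp(t, p, n) == 0;
}

/* Evaluate a comma-separated expression in which a leading '!' negates a
   pattern and a leading '@' poisons it.  Assumes no escaped commas. */
enum sketch_result expr_poison(const char *text, const char *expr)
{
    enum sketch_result result = SKETCH_FAIL;   /* nothing matched yet */
    const char *start = expr;

    for (;;) {
        const char *end = strchr(start, ',');
        size_t len = end ? (size_t) (end - start) : strlen(start);
        const char *pat = start;
        enum sketch_result on_match = SKETCH_MATCH;

        if (len > 0 && (*pat == '!' || *pat == '@')) {
            on_match = (*pat == '@') ? SKETCH_POISON : SKETCH_FAIL;
            pat++;
            len--;
        }
        /* A later match always replaces an earlier one. */
        if (prefix_glob(text, pat, len)) result = on_match;
        if (end == NULL) break;
        start = end + 1;
    }
    return result;
}
```

A poisoned pattern only decides the outcome when it is the rightmost one
that matches: in this sketch C<@alt.binaries.*,alt.binaries.x> still
yields a match for the text C<alt.binaries.x>.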
 130
 131 If the B<uwildmat_simple> interface is used, the matching rules are the
 132 same as above except that none of C<!>, C<@>, or C<,> have any special
 133 meaning at all and only match those literal characters.
 134
 135 =head1 BUGS
 136
 137 All of these functions internally convert the passed arguments to const
 138 unsigned char pointers.  The only reason why they take regular char
 139 pointers instead of unsigned char is for the convenience of INN and other
 140 callers that may not be using unsigned char everywhere they should.  In a
 141 future revision, the public interface should be changed to just take
 142 unsigned char pointers.
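
One way to see why the conversion matters: on platforms where plain
C<char> is signed, an octet of 0x80 or above read through a C<char>
pointer is negative, so ordering comparisons against range endpoints give
the wrong answer unless the octet is first converted to C<unsigned char>.
The helper below is a hypothetical illustration of that cast, not code
from libinn.

```c
#include <stdbool.h>

/* Return true if the first octet of s lies in [lo, hi].  The cast makes
   octets 0x80-0xFF compare as 128-255 regardless of whether plain char
   is signed on this platform. */
bool octet_in_range(const char *s, unsigned char lo, unsigned char hi)
{
    unsigned char c = (unsigned char) *s;

    return c >= lo && c <= hi;
}
```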
 143
 144 =head1 HISTORY
 145
 146 Written by Rich $alz <rsalz@uunet.uu.net> in 1986, and posted to Usenet
 147 several times since then, most notably in comp.sources.misc in
 148 March, 1991.
 149
 150 Lars Mathiesen <thorinn@diku.dk> enhanced the multi-asterisk failure
 151 mode in early 1991.
 152
 153 Rich and Lars increased the efficiency of star patterns and reposted it to
 154 comp.sources.misc in April, 1991.
 155
 156 Robert Elz <kre@munnari.oz.au> added minus sign and close bracket handling
 157 in June, 1991.
 158
 159 Russ Allbery <rra@stanford.edu> added support for comma-separated patterns
 160 and the C<!> and C<@> metacharacters to the core wildmat routines in July,
 161 2000.  He also added support for UTF-8 characters, changed the default
 162 behavior to assume that both the text and the pattern are in UTF-8, and
 163 largely rewrote this documentation to expand and clarify the description
 164 of how a wildmat expression matches.
 165
 166 Please note that the interfaces to these functions are named B<uwildmat>
 167 and the like rather than B<wildmat> to distinguish them from the
 168 B<wildmat> function provided by Rich $alz's original implementation.
 169 While this code is heavily based on Rich's original code, it has
 170 substantial differences, including the extension to support UTF-8
 171 characters, and has noticeable functionality changes.  Any bugs present in
 172 it aren't Rich's fault.
 173
 174 $Id: uwildmat.pod 5533 2002-08-10 18:51:37Z rra $
 175
 176 =head1 SEE ALSO
 177
 178 grep(1), fnmatch(3), regex(3), regexp(3).
 179
 180 =cut