doc/pcresyntax.3

   1 .TH PCRESYNTAX 3 "08 January 2014" "PCRE 8.35"
   2 .SH NAME
   3 PCRE - Perl-compatible regular expressions
   4 .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
   5 .rs
   6 .sp
   7 The full syntax and semantics of the regular expressions that are supported by
   8 PCRE are described in the
   9 .\" HREF
  10 \fBpcrepattern\fP
  11 .\"
  12 documentation. This document contains a quick-reference summary of the syntax.
  13 .
  14 .
  15 .SH "QUOTING"
  16 .rs
  17 .sp
  18   \ex         where x is non-alphanumeric is a literal x
  19   \eQ...\eE    treat enclosed characters as literal
  20 .
  21 .
  22 .SH "CHARACTERS"
  23 .rs
  24 .sp
  25   \ea         alarm, that is, the BEL character (hex 07)
  26   \ecx        "control-x", where x is any ASCII character
  27   \ee         escape (hex 1B)
  28   \ef         form feed (hex 0C)
  29   \en         newline (hex 0A)
  30   \er         carriage return (hex 0D)
  31   \et         tab (hex 09)
  32   \e0dd       character with octal code 0dd
  33   \eddd       character with octal code ddd, or backreference
  34   \eo{ddd..}  character with octal code ddd..
  35   \exhh       character with hex code hh
  36   \ex{hhh..}  character with hex code hhh..
  37 .sp
  38 Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
  39 characters "8" and "9".
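.sp
As a rough illustration of the hex and octal escapes above, the following
sketch uses Python's re module as a stand-in for PCRE. That substitution is an
assumption made for convenience: the two engines are different, but they agree
on the meaning of \exhh, three-digit octal escapes, and octal escapes that
start with 0.

```python
import re

# Assumption: Python's re is used here only as a stand-in for PCRE.
# \x41 is hex 41 ("A"), \101 is octal 101 ("A"), \011 is octal 011 (tab).
pattern = re.compile(r"\x41\101\011")

subject = "AA\t"
matched = pattern.fullmatch(subject) is not None
print(matched)  # True

# \x0A is the newline character (hex 0A), as listed for \n above.
newline_matched = re.fullmatch(r"\x0A", "\n") is not None
print(newline_matched)  # True
```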
  40 .
  41 .
  42 .SH "CHARACTER TYPES"
  43 .rs
  44 .sp
  45   .          any character except newline;
  46                in dotall mode, any character whatsoever
  47   \eC         one data unit, even in UTF mode (best avoided)
  48   \ed         a decimal digit
  49   \eD         a character that is not a decimal digit
  50   \eh         a horizontal white space character
  51   \eH         a character that is not a horizontal white space character
  52   \eN         a character that is not a newline
  53   \ep{\fIxx\fP}     a character with the \fIxx\fP property
  54   \eP{\fIxx\fP}     a character without the \fIxx\fP property
  55   \eR         a newline sequence
  56   \es         a white space character
  57   \eS         a character that is not a white space character
  58   \ev         a vertical white space character
  59   \eV         a character that is not a vertical white space character
  60   \ew         a "word" character
  61   \eW         a "non-word" character
  62   \eX         a Unicode extended grapheme cluster
  63 .sp
  64 By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
  65 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
  66 happening, \es and \ew may also match characters with code points in the range
  67 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
  68 is changed to use Unicode properties and they match many more characters.
  69 .
  70 .
  71 .SH "GENERAL CATEGORY PROPERTIES FOR \ep AND \eP"
  72 .rs
  73 .sp
  74   C          Other
  75   Cc         Control
  76   Cf         Format
  77   Cn         Unassigned
  78   Co         Private use
  79   Cs         Surrogate
  80 .sp
  81   L          Letter
  82   Ll         Lower case letter
  83   Lm         Modifier letter
  84   Lo         Other letter
  85   Lt         Title case letter
  86   Lu         Upper case letter
  87   L&         Ll, Lu, or Lt
  88 .sp
  89   M          Mark
  90   Mc         Spacing mark
  91   Me         Enclosing mark
  92   Mn         Non-spacing mark
  93 .sp
  94   N          Number
  95   Nd         Decimal number
  96   Nl         Letter number
  97   No         Other number
  98 .sp
  99   P          Punctuation
 100   Pc         Connector punctuation
 101   Pd         Dash punctuation
 102   Pe         Close punctuation
 103   Pf         Final punctuation
 104   Pi         Initial punctuation
 105   Po         Other punctuation
 106   Ps         Open punctuation
 107 .sp
 108   S          Symbol
 109   Sc         Currency symbol
 110   Sk         Modifier symbol
 111   Sm         Mathematical symbol
 112   So         Other symbol
 113 .sp
 114   Z          Separator
 115   Zl         Line separator
 116   Zp         Paragraph separator
 117   Zs         Space separator
 118 .
 119 .
 120 .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep AND \eP"
 121 .rs
 122 .sp
 123   Xan        Alphanumeric: union of properties L and N
 124   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
 125   Xsp        Perl space: property Z or tab, NL, VT, FF, CR
 126   Xuc        Universally-named character: one that can be
 127                represented by a Universal Character Name
 128   Xwd        Perl word: property Xan or underscore
 129 .sp
 130 Perl and POSIX space are now the same. Perl added VT to its space character set
 131 at release 5.18 and PCRE changed at release 8.34.
 132 .
 133 .
 134 .SH "SCRIPT NAMES FOR \ep AND \eP"
 135 .rs
 136 .sp
 137 Arabic,
 138 Armenian,
 139 Avestan,
 140 Balinese,
 141 Bamum,
 142 Batak,
 143 Bengali,
 144 Bopomofo,
 145 Brahmi,
 146 Braille,
 147 Buginese,
 148 Buhid,
 149 Canadian_Aboriginal,
 150 Carian,
 151 Chakma,
 152 Cham,
 153 Cherokee,
 154 Common,
 155 Coptic,
 156 Cuneiform,
 157 Cypriot,
 158 Cyrillic,
 159 Deseret,
 160 Devanagari,
 161 Egyptian_Hieroglyphs,
 162 Ethiopic,
 163 Georgian,
 164 Glagolitic,
 165 Gothic,
 166 Greek,
 167 Gujarati,
 168 Gurmukhi,
 169 Han,
 170 Hangul,
 171 Hanunoo,
 172 Hebrew,
 173 Hiragana,
 174 Imperial_Aramaic,
 175 Inherited,
 176 Inscriptional_Pahlavi,
 177 Inscriptional_Parthian,
 178 Javanese,
 179 Kaithi,
 180 Kannada,
 181 Katakana,
 182 Kayah_Li,
 183 Kharoshthi,
 184 Khmer,
 185 Lao,
 186 Latin,
 187 Lepcha,
 188 Limbu,
 189 Linear_B,
 190 Lisu,
 191 Lycian,
 192 Lydian,
 193 Malayalam,
 194 Mandaic,
 195 Meetei_Mayek,
 196 Meroitic_Cursive,
 197 Meroitic_Hieroglyphs,
 198 Miao,
 199 Mongolian,
 200 Myanmar,
 201 New_Tai_Lue,
 202 Nko,
 203 Ogham,
 204 Old_Italic,
 205 Old_Persian,
 206 Old_South_Arabian,
 207 Old_Turkic,
 208 Ol_Chiki,
 209 Oriya,
 210 Osmanya,
 211 Phags_Pa,
 212 Phoenician,
 213 Rejang,
 214 Runic,
 215 Samaritan,
 216 Saurashtra,
 217 Sharada,
 218 Shavian,
 219 Sinhala,
 220 Sora_Sompeng,
 221 Sundanese,
 222 Syloti_Nagri,
 223 Syriac,
 224 Tagalog,
 225 Tagbanwa,
 226 Tai_Le,
 227 Tai_Tham,
 228 Tai_Viet,
 229 Takri,
 230 Tamil,
 231 Telugu,
 232 Thaana,
 233 Thai,
 234 Tibetan,
 235 Tifinagh,
 236 Ugaritic,
 237 Vai,
 238 Yi.
 239 .
 240 .
 241 .SH "CHARACTER CLASSES"
 242 .rs
 243 .sp
 244   [...]       positive character class
 245   [^...]      negative character class
 246   [x-y]       range (can be used for hex characters)
 247   [[:xxx:]]   positive POSIX named set
 248   [[:^xxx:]]  negative POSIX named set
 249 .sp
 250   alnum       alphanumeric
 251   alpha       alphabetic
 252   ascii       0-127
 253   blank       space or tab
 254   cntrl       control character
 255   digit       decimal digit
 256   graph       printing, excluding space
 257   lower       lower case letter
 258   print       printing, including space
 259   punct       printing, excluding alphanumeric
 260   space       white space
 261   upper       upper case letter
 262   word        same as \ew
 263   xdigit      hexadecimal digit
 264 .sp
 265 In PCRE, POSIX character set names recognize only ASCII characters by default,
 266 but some of them use Unicode properties if PCRE_UCP is set. You can use
 267 \eQ...\eE inside a character class.
 268 .
 269 .
 270 .SH "QUANTIFIERS"
 271 .rs
 272 .sp
 273   ?           0 or 1, greedy
 274   ?+          0 or 1, possessive
 275   ??          0 or 1, lazy
 276   *           0 or more, greedy
 277   *+          0 or more, possessive
 278   *?          0 or more, lazy
 279   +           1 or more, greedy
 280   ++          1 or more, possessive
 281   +?          1 or more, lazy
 282   {n}         exactly n
 283   {n,m}       at least n, no more than m, greedy
 284   {n,m}+      at least n, no more than m, possessive
 285   {n,m}?      at least n, no more than m, lazy
 286   {n,}        n or more, greedy
 287   {n,}+       n or more, possessive
 288   {n,}?       n or more, lazy
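.sp
The difference between greedy and lazy quantifiers is easiest to see on a
small subject string. The sketch below uses Python's re module as a stand-in
for PCRE (an assumption; the engines differ, but greedy and lazy quantifiers
behave the same way in both). Possessive quantifiers are left out because
older Python versions do not support them.

```python
import re

# Assumption: Python's re stands in for PCRE for these constructs.
subject = "<b>bold</b>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.search(r"<.+>", subject).group(0)
print(greedy)  # <b>bold</b>

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)
print(lazy)  # <b>

# {n,m} prefers the upper bound when greedy and the lower bound when lazy.
bounded_greedy = re.search(r"\d{2,3}", "12345").group(0)
bounded_lazy = re.search(r"\d{2,3}?", "12345").group(0)
print(bounded_greedy, bounded_lazy)  # 123 12
```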
 289 .
 290 .
 291 .SH "ANCHORS AND SIMPLE ASSERTIONS"
 292 .rs
 293 .sp
 294   \eb          word boundary
 295   \eB          not a word boundary
 296   ^           start of subject
 297                also after internal newline in multiline mode
 298   \eA          start of subject
 299   $           end of subject
 300                also before newline at end of subject
 301                also before internal newline in multiline mode
 302   \eZ          end of subject
 303                also before newline at end of subject
 304   \ez          end of subject
 305   \eG          first matching position in subject
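.sp
The following sketch shows how ^ and $ change in multiline mode, and how \eA
stays tied to the start of the subject. It uses Python's re module as a
stand-in for PCRE, which is an assumption: the anchors shown behave alike in
both, but \eZ and \ez are deliberately omitted because Python's \eZ does not
have the same meaning as PCRE's.

```python
import re

# Assumption: Python's re stands in for PCRE for these anchors.
subject = "one\ntwo\n"

# Default mode: $ matches at the very end and before a final newline.
dollar_positions = [m.start() for m in re.finditer(r"$", subject)]
print(dollar_positions)  # [7, 8]

# Default mode: ^ matches only at the start of the subject.
start_default = re.findall(r"^\w+", subject)
print(start_default)  # ['one']

# Multiline mode: ^ also matches after each internal newline.
start_multiline = re.findall(r"(?m)^\w+", subject)
print(start_multiline)  # ['one', 'two']

# \A is not affected by multiline mode.
start_absolute = re.findall(r"(?m)\A\w+", subject)
print(start_absolute)  # ['one']

# \b marks a word boundary, so "concat" does not produce a match.
words = re.findall(r"\bcat\b", "cat concat cat.")
print(words)  # ['cat', 'cat']
```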
 306 .
 307 .
 308 .SH "MATCH POINT RESET"
 309 .rs
 310 .sp
 311   \eK          reset start of match
 312 .sp
 313 \eK is honoured in positive assertions, but ignored in negative ones.
 314 .
 315 .
 316 .SH "ALTERNATION"
 317 .rs
 318 .sp
 319   expr|expr|expr...
 320 .
 321 .
 322 .SH "CAPTURING"
 323 .rs
 324 .sp
 325   (...)           capturing group
 326   (?<name>...)    named capturing group (Perl)
 327   (?'name'...)    named capturing group (Perl)
 328   (?P<name>...)   named capturing group (Python)
 329   (?:...)         non-capturing group
 330   (?|...)         non-capturing group; reset group numbers for
 331                    capturing groups in each alternative
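.sp
A short sketch of how named, numbered, and non-capturing groups interact. It
uses the Python-style (?P<name>...) syntax from the table and runs under
Python's re module, used here as a stand-in for PCRE (an assumption; the
sample pattern and date string are illustrative only).

```python
import re

# Assumption: Python's re stands in for PCRE; the pattern is illustrative.
pattern = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})")
m = pattern.match("2014-01-08")

# A named group also receives a number, so "year" is group 1.
year_by_name = m.group("year")
year_by_number = m.group(1)

# The (?:...) group is skipped when numbering, so the day is group 2.
day = m.group(2)
all_groups = m.groups()

print(year_by_name, year_by_number, day)  # 2014 2014 08
print(all_groups)  # ('2014', '08')
```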
 332 .
 333 .
 334 .SH "ATOMIC GROUPS"
 335 .rs
 336 .sp
 337   (?>...)         atomic, non-capturing group
 338 .
 339 .
 340 .
 341 .
 342 .SH "COMMENT"
 343 .rs
 344 .sp
 345   (?#....)        comment (not nestable)
 346 .
 347 .
 348 .SH "OPTION SETTING"
 349 .rs
 350 .sp
 351   (?i)            caseless
 352   (?J)            allow duplicate names
 353   (?m)            multiline
 354   (?s)            single line (dotall)
 355   (?U)            default ungreedy (lazy)
 356   (?x)            extended (ignore white space)
 357   (?-...)         unset option(s)
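.sp
The sketch below exercises three of the inline options above at the start of a
pattern. Python's re module is used as a stand-in for PCRE, which is an
assumption; it accepts (?i), (?s), and (?x) with the same meaning, but it does
not accept every form listed here (for example, a bare (?-...) to unset
options), so only the common subset is shown.

```python
import re

# Assumption: Python's re stands in for PCRE for these three options.

# (?i) caseless: "pcre" matches "PCRE".
caseless = re.search(r"(?i)pcre", "Uses PCRE 8.35").group(0)
print(caseless)  # PCRE

# (?s) dotall: "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb") is not None
dot_dotall = re.search(r"(?s)a.b", "a\nb") is not None
print(dot_default, dot_dotall)  # False True

# (?x) extended: white space in the pattern is ignored.
extended = re.fullmatch(r"(?x) \d+  \s*  [a-z]+ ", "42 abc") is not None
print(extended)  # True
```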
 358 .sp
 359 The following are recognized only at the very start of a pattern or after one
 360 of the newline or \eR options with similar syntax. More than one of them may
 361 appear.
 362 .sp
 363   (*LIMIT_MATCH=d) set the match limit to d (decimal number)
 364   (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
 365   (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
 366   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
 367   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
 368   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
 369   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
 370   (*UTF)          set appropriate UTF mode for the library in use
 371   (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
 372 .sp
 373 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
 374 limits set by the caller of pcre_exec(), not increase them.
 375 .
 376 .
 377 .SH "NEWLINE CONVENTION"
 378 .rs
 379 .sp
 380 These are recognized only at the very start of the pattern or after option
 381 settings with a similar syntax.
 382 .sp
 383   (*CR)           carriage return only
 384   (*LF)           linefeed only
 385   (*CRLF)         carriage return followed by linefeed
 386   (*ANYCRLF)      all three of the above
 387   (*ANY)          any Unicode newline sequence
 388 .
 389 .
 390 .SH "WHAT \eR MATCHES"
 391 .rs
 392 .sp
 393 These are recognized only at the very start of the pattern or after option
 394 settings with a similar syntax.
 395 .sp
 396   (*BSR_ANYCRLF)  CR, LF, or CRLF
 397   (*BSR_UNICODE)  any Unicode newline sequence
 398 .
 399 .
 400 .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
 401 .rs
 402 .sp
 403   (?=...)         positive look ahead
 404   (?!...)         negative look ahead
 405   (?<=...)        positive look behind
 406   (?<!...)        negative look behind
 407 .sp
 408 Each top-level branch of a look behind must be of a fixed length.
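.sp
The four assertion forms can be compared on small subject strings. This sketch
runs under Python's re module as a stand-in for PCRE (an assumption). Note
that Python is stricter than the rule stated above: it requires the whole look
behind to have a fixed width, so the last part only demonstrates that a
variable-length look behind is rejected.

```python
import re

# Assumption: Python's re stands in for PCRE for these assertions.
prices = "10 USD, 20 EUR, 30 USD"

# Positive look ahead: numbers followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)
print(usd)  # ['10', '30']

# Negative look ahead: whole numbers not followed by " USD".
not_usd = re.findall(r"\b\d+\b(?! USD)", prices)
print(not_usd)  # ['20']

text = "cost $15, qty 3"

# Positive look behind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)
print(after_dollar)  # ['15']

# Negative look behind: numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", text)
print(no_dollar)  # ['3']

# A look behind of variable length is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False
print(variable_ok)  # False
```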
 409 .
 410 .
 411 .SH "BACKREFERENCES"
 412 .rs
 413 .sp
 414   \en              reference by number (can be ambiguous)
 415   \egn             reference by number
 416   \eg{n}           reference by number
 417   \eg{-n}          relative reference by number
 418   \ek<name>        reference by name (Perl)
 419   \ek'name'        reference by name (Perl)
 420   \eg{name}        reference by name (Perl)
 421   \ek{name}        reference by name (.NET)
 422   (?P=name)       reference by name (Python)
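.sp
Two of the backreference forms above, by number and by Python-style name, are
sketched here. Python's re module is used as a stand-in for PCRE (an
assumption); it does not accept the \eg and \ek forms inside a pattern, so
those are not shown.

```python
import re

# Assumption: Python's re stands in for PCRE for these two forms.

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(0)
print(doubled)  # the the

# (?P=q) refers back to the named group "q": the closing quote must be
# the same character as the opening quote.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", 'say "hi" now').group(0)
print(quoted)  # "hi"

# Mismatched quotes do not satisfy the backreference.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say \"hi' now")
print(mismatched is None)  # True
```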
 423 .
 424 .
 425 .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
 426 .rs
 427 .sp
 428   (?R)            recurse whole pattern
 429   (?n)            call subpattern by absolute number
 430   (?+n)           call subpattern by relative number
 431   (?-n)           call subpattern by relative number
 432   (?&name)        call subpattern by name (Perl)
 433   (?P>name)       call subpattern by name (Python)
 434   \eg<name>        call subpattern by name (Oniguruma)
 435   \eg'name'        call subpattern by name (Oniguruma)
 436   \eg<n>           call subpattern by absolute number (Oniguruma)
 437   \eg'n'           call subpattern by absolute number (Oniguruma)
 438   \eg<+n>          call subpattern by relative number (PCRE extension)
 439   \eg'+n'          call subpattern by relative number (PCRE extension)
 440   \eg<-n>          call subpattern by relative number (PCRE extension)
 441   \eg'-n'          call subpattern by relative number (PCRE extension)
 442 .
 443 .
 444 .SH "CONDITIONAL PATTERNS"
 445 .rs
 446 .sp
 447   (?(condition)yes-pattern)
 448   (?(condition)yes-pattern|no-pattern)
 449 .sp
 450   (?(n)...        absolute reference condition
 451   (?(+n)...       relative reference condition
 452   (?(-n)...       relative reference condition
 453   (?(<name>)...   named reference condition (Perl)
 454   (?('name')...   named reference condition (Perl)
 455   (?(name)...     named reference condition (PCRE)
 456   (?(R)...        overall recursion condition
 457   (?(Rn)...       specific group recursion condition
 458   (?(R&name)...   specific recursion condition
 459   (?(DEFINE)...   define subpattern for reference
 460   (?(assert)...   assertion condition
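.sp
An absolute reference condition, (?(n)...), can be sketched with a pattern
that requires a closing parenthesis only if an opening one was matched. The
example runs under Python's re module as a stand-in for PCRE (an assumption;
Python supports only the numbered and named reference conditions, not the
recursion, DEFINE, or assertion conditions).

```python
import re

# Assumption: Python's re stands in for PCRE for this condition type.
# Group 1 optionally captures "("; the condition (?(1)\)) then demands
# ")" only when group 1 took part in the match.
pattern = re.compile(r"(\()?\d+(?(1)\))")

results = {
    s: pattern.fullmatch(s) is not None
    for s in ["42", "(42)", "(42", "42)"]
}
print(results)
# {'42': True, '(42)': True, '(42': False, '42)': False}
```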
 461 .
 462 .
 463 .SH "BACKTRACKING CONTROL"
 464 .rs
 465 .sp
 466 The following act as soon as they are reached:
 467 .sp
 468   (*ACCEPT)       force successful match
 469   (*FAIL)         force backtrack; synonym (*F)
 470   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
 471 .sp
 472 The following act only when a subsequent match failure causes a backtrack to
 473 reach them. They all force a match failure, but they differ in what happens
 474 afterwards. Those that advance the start-of-match point do so only if the
 475 pattern is not anchored.
 476 .sp
 477   (*COMMIT)       overall failure, no advance of starting point
 478   (*PRUNE)        advance to next starting character
 479   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
 480   (*SKIP)         advance to current matching position
 481   (*SKIP:NAME)    advance to position corresponding to an earlier
 482                   (*MARK:NAME); if not found, the (*SKIP) is ignored
 483   (*THEN)         local failure, backtrack to next alternation
 484   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
 485 .
 486 .
 487 .SH "CALLOUTS"
 488 .rs
 489 .sp
 490   (?C)      callout
 491   (?Cn)     callout with data n
 492 .
 493 .
 494 .SH "SEE ALSO"
 495 .rs
 496 .sp
 497 \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
 498 \fBpcrematching\fP(3), \fBpcre\fP(3).
 499 .
 500 .
 501 .SH AUTHOR
 502 .rs
 503 .sp
 504 .nf
 505 Philip Hazel
 506 University Computing Service
 507 Cambridge CB2 3QH, England.
 508 .fi
 509 .
 510 .
 511 .SH REVISION
 512 .rs
 513 .sp
 514 .nf
 515 Last updated: 08 January 2014
 516 Copyright (c) 1997-2014 University of Cambridge.
 517 .fi