3 <title>pcresyntax specification</title>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcresyntax man page</h1>
8 Return to the <a href="index.html">PCRE index page</a>.
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
18 <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21 <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22 <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23 <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24 <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25 <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26 <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
27 <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28 <li><a name="TOC13" href="#SEC13">CAPTURING</a>
29 <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30 <li><a name="TOC15" href="#SEC15">COMMENT</a>
31 <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32 <li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
33 <li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
34 <li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
35 <li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
36 <li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
37 <li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
38 <li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
39 <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40 <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41 <li><a name="TOC26" href="#SEC26">AUTHOR</a>
42 <li><a name="TOC27" href="#SEC27">REVISION</a>
44 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
46 The full syntax and semantics of the regular expressions that are supported by
47 PCRE are described in the
48 <a href="pcrepattern.html"><b>pcrepattern</b></a>
49 documentation. This document contains a quick-reference summary of the syntax.
51 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
54 \x where x is non-alphanumeric is a literal x
55 \Q...\E treat enclosed characters as literal
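<p>
The quoting rules above can be tried out with a short sketch. It uses Python's <b>re</b> module, which is a separate engine and not PCRE; it is chosen here only as an assumption of convenience. Python accepts the backslash form but has no \Q...\E, so <b>re.escape()</b> stands in for it.
</p>

```python
import re

# A backslash before a non-alphanumeric character makes it literal.
dot_literal = re.fullmatch(r"a\.b\*", "a.b*")
dot_mismatch = re.fullmatch(r"a\.b\*", "aXb*")

# Stand-in for \Q...\E: re.escape() quotes every special character.
literal = "1+1=2 (sum)"
quoted_match = re.fullmatch(re.escape(literal), literal)

# Without quoting, "+" and "(...)" keep their special meaning.
unquoted_match = re.fullmatch(literal, literal)
```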
58 <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
61 \a alarm, that is, the BEL character (hex 07)
62 \cx "control-x", where x is any ASCII character
66 \r carriage return (hex 0D)
68 \0dd character with octal code 0dd
69 \ddd character with octal code ddd, or backreference
70 \o{ddd..} character with octal code ddd..
71 \xhh character with hex code hh
72 \x{hhh..} character with hex code hhh..
74 Note that \0dd is always an octal code, and that \8 and \9 are the literal
75 characters "8" and "9".
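<p>
A few of the escapes above, sketched with Python's <b>re</b> module (an assumption of convenience; it is not PCRE, and it does not accept \cx, \o{ddd..} or \x{hhh..}). The hex and octal forms shown behave the same way in both.
</p>

```python
import re

hex_match = re.fullmatch(r"\x41", "A")        # \xhh: hex 41 is "A"
octal_match = re.fullmatch(r"\101", "A")      # \ddd: octal 101 is "A"
zero_octal = re.fullmatch(r"\012", "\n")      # \0dd: always an octal code
bel_match = re.fullmatch(r"\a", "\x07")       # BEL character
cr_match = re.fullmatch(r"\r", "\x0d")        # carriage return

# A single digit after a backslash refers back to an existing group.
backref_match = re.fullmatch(r"(a)\1", "aa")
```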
77 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
80 . any character except newline;
81 in dotall mode, any character whatsoever
82 \C one data unit, even in UTF mode (best avoided)
84 \D a character that is not a decimal digit
85 \h a horizontal white space character
86 \H a character that is not a horizontal white space character
87 \N a character that is not a newline
88 \p{<i>xx</i>} a character with the <i>xx</i> property
89 \P{<i>xx</i>} a character without the <i>xx</i> property
91 \s a white space character
92 \S a character that is not a white space character
93 \v a vertical white space character
94 \V a character that is not a vertical white space character
96 \W a "non-word" character
97 \X a Unicode extended grapheme cluster
99 By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
100 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
101 happening, \s and \w may also match characters with code points in the range
102 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
103 is changed to use Unicode properties and they match many more characters.
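<p>
The difference between ASCII-only and Unicode-aware matching of \d and \w can be illustrated with Python's <b>re</b> module. This is only an analogy, not PCRE itself: Python's defaults are the other way round, so its <b>re.ASCII</b> flag roughly corresponds to the PCRE default, and its default behaviour roughly corresponds to PCRE_UCP.
</p>

```python
import re

arabic_three = "\u0663"   # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit
cafe = "caf\u00e9"        # ends in a non-ASCII letter

# ASCII-only matching (comparable to the PCRE default)
ascii_digit = re.fullmatch(r"\d", arabic_three, re.ASCII)
ascii_word = re.fullmatch(r"\w+", cafe, re.ASCII)

# Unicode-aware matching (comparable to PCRE_UCP)
unicode_digit = re.fullmatch(r"\d", arabic_three)
unicode_word = re.fullmatch(r"\w+", cafe)
```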
105 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
134 Pc Connector punctuation
138 Pi Initial punctuation
145 Sm Mathematical symbol
150 Zp Paragraph separator
154 <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
157 Xan Alphanumeric: union of properties L and N
158 Xps POSIX space: property Z or tab, NL, VT, FF, CR
159 Xsp Perl space: property Z or tab, NL, VT, FF, CR
160 Xuc Universally-named character: one that can be
161 represented by a Universal Character Name
162 Xwd Perl word: property Xan or underscore
164 Perl and POSIX space are now the same. Perl added VT to its space character set
165 at release 5.18 and PCRE changed at release 8.34.
167 <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
193 Egyptian_Hieroglyphs,
208 Inscriptional_Pahlavi,
209 Inscriptional_Parthian,
229 Meroitic_Hieroglyphs,
272 <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
275 [...] positive character class
276 [^...] negative character class
277 [x-y] range (can be used for hex characters)
278 [[:xxx:]] positive POSIX named set
279 [[:^xxx:]] negative POSIX named set
285 cntrl control character
287 graph printing, excluding space
288 lower lower case letter
289 print printing, including space
290 punct printing, excluding alphanumeric
292 upper upper case letter
294 xdigit hexadecimal digit
296 In PCRE, POSIX character set names recognize only ASCII characters by default,
297 but some of them use Unicode properties if PCRE_UCP is set. You can use
298 \Q...\E inside a character class.
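<p>
Positive and negative classes and ranges can be sketched as follows, using Python's <b>re</b> module as a stand-in (it is not PCRE, and it does not accept the POSIX named sets or \Q...\E, so those are left out).
</p>

```python
import re

# A range whose end points are written as hex codes (0x30-0x39 are the digits).
digits = re.findall(r"[\x30-\x39]", "a1b2")

# Negative class: any character other than a, b or c.
not_abc = re.findall(r"[^a-c]", "abcd")

# Positive class listing individual characters.
vowels = re.findall(r"[aeiou]", "regex")
```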
300 <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
304 ?+ 0 or 1, possessive
307 *+ 0 or more, possessive
310 ++ 1 or more, possessive
313 {n,m} at least n, no more than m, greedy
314 {n,m}+ at least n, no more than m, possessive
315 {n,m}? at least n, no more than m, lazy
316 {n,} n or more, greedy
317 {n,}+ n or more, possessive
318 {n,}? n or more, lazy
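<p>
The greedy and lazy forms above differ only in how much they try to consume first. A small sketch with Python's <b>re</b> module (not PCRE, but it treats these quantifiers the same way; the possessive forms are omitted because not every Python version accepts them):
</p>

```python
import re

greedy = re.match(r"<.+>", "<a><b>").group()      # takes as much as possible
lazy = re.match(r"<.+?>", "<a><b>").group()       # takes as little as possible

bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

open_greedy = re.match(r"a{2,}", "aaaa").group()
open_lazy = re.match(r"a{2,}?", "aaaa").group()
```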
321 <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
325 \B not a word boundary
327 also after internal newline in multiline mode
330 also before newline at end of subject
331 also before internal newline in multiline mode
333 also before newline at end of subject
335 \G first matching position in subject
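<p>
The effect of multiline mode on the line anchors, and the word-boundary assertions, can be sketched with Python's <b>re</b> module. It is not PCRE and has no \G, but for the cases shown its anchors behave in the way described above.
</p>

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
default_starts = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline at the very end of the subject.
end_before_newline = re.search(r"two$", subject)

# \b is a word boundary, \B is a position that is not one.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
inside_word = [m.start() for m in re.finditer(r"\Bcat", "cat concat")]
```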
338 <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
341 \K reset start of match
343 \K is honoured in positive assertions, but ignored in negative ones.
345 <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
351 <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
354 (...) capturing group
355 (?<name>...) named capturing group (Perl)
356 (?'name'...) named capturing group (Perl)
357 (?P<name>...) named capturing group (Python)
358 (?:...) non-capturing group
359 (?|...) non-capturing group; reset group numbers for
360 capturing groups in each alternative
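<p>
Numbered, named and non-capturing groups can be combined in one pattern. The sketch below uses Python's <b>re</b> module rather than PCRE, so only the Python-style named group is shown; the Perl-style names and the group-number reset form are not accepted there.
</p>

```python
import re

# Group 1 is named "year", the month is non-capturing, the day is group 2.
pattern = r"(?P<year>\d{4})-(?:\d{2})-(\d{2})"
m = re.match(pattern, "2014-01-08")

year = m.group("year")   # access by name
day = m.group(2)         # access by number
all_groups = m.groups()  # the non-capturing group does not appear
```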
363 <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
366 (?>...) atomic, non-capturing group
369 <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
372 (?#....) comment (not nestable)
375 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
379 (?J) allow duplicate names
381 (?s) single line (dotall)
382 (?U) default ungreedy (lazy)
383 (?x) extended (ignore white space)
384 (?-...) unset option(s)
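<p>
Inline option settings at the start of a pattern can be sketched with Python's <b>re</b> module, which is not PCRE but accepts (?i), (?s) and (?x) with the same meaning. The (?J) and (?U) options are not available there and are left out.
</p>

```python
import re

# (?i): caseless matching
caseless = re.fullmatch(r"(?i)abc", "ABC")

# (?s): dotall, so "." also matches a newline
dotall = re.fullmatch(r"(?s)a.b", "a\nb")
no_dotall = re.fullmatch(r"a.b", "a\nb")

# (?x): extended, white space and "#" comments in the pattern are ignored
extended = re.fullmatch(r"(?x) a b c  # this text is not part of the pattern", "abc")
```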
386 The following are recognized only at the very start of a pattern or after one
387 of the newline or \R options with similar syntax. More than one of them may
390 (*LIMIT_MATCH=d) set the match limit to d (decimal number)
391 (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
392 (*NO_AUTO_POSSESS) no auto-possessification (PCRE_NO_AUTO_POSSESS)
393 (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
394 (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
395 (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
396 (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
397 (*UTF) set appropriate UTF mode for the library in use
398 (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
400 Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
401 limits set by the caller of pcre_exec(), not increase them.
403 <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
405 These are recognized only at the very start of the pattern or after option
406 settings with a similar syntax.
408 (*CR) carriage return only
410 (*CRLF) carriage return followed by linefeed
411 (*ANYCRLF) all three of the above
412 (*ANY) any Unicode newline sequence
415 <br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
417 These are recognized only at the very start of the pattern or after option
418 setting with a similar syntax.
420 (*BSR_ANYCRLF) CR, LF, or CRLF
421 (*BSR_UNICODE) any Unicode newline sequence
424 <br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
427 (?=...) positive look ahead
428 (?!...) negative look ahead
429 (?<=...) positive look behind
430 (?<!...) negative look behind
432 Each top-level branch of a look behind must be of a fixed length.
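<p>
The four assertions can be sketched with Python's <b>re</b> module, used here as a stand-in for PCRE. Its fixed-length rule for look behind is stricter than the one stated above (the whole assertion must have one fixed width), so the last check is only an approximate illustration of the restriction.
</p>

```python
import re

# Positive look ahead: digits that are followed by " USD".
positive_ahead = re.findall(r"\d+(?= USD)", "10 USD, 20 EUR")

# Negative look ahead: a "q" that is not followed by "u".
negative_ahead = [m.start() for m in re.finditer(r"q(?!u)", "qatar quit")]

# Positive look behind: digits that are preceded by "$".
positive_behind = re.findall(r"(?<=\$)\d+", "$15 and 20")

# Negative look behind: a whole number that is not preceded by "$".
negative_behind = re.findall(r"(?<!\$)\b\d+", "$15 and 20")

# A look behind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_rejected = False
except re.error:
    variable_rejected = True
```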
434 <br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
437 \n reference by number (can be ambiguous)
438 \gn reference by number
439 \g{n} reference by number
440 \g{-n} relative reference by number
441 \k<name> reference by name (Perl)
442 \k'name' reference by name (Perl)
443 \g{name} reference by name (Perl)
444 \k{name} reference by name (.NET)
445 (?P=name) reference by name (Python)
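<p>
A backreference matches the same text that the group matched earlier, not the group's pattern. The sketch below uses Python's <b>re</b> module (not PCRE), which accepts the plain numbered form and the Python-style named form from the list above.
</p>

```python
import re

# \1 must repeat the word captured by group 1.
doubled = re.search(r"\b(\w+) \1\b", "it was was here").group(1)

# (?P=q) must repeat whichever quote character group "q" captured.
quote_pattern = r"(?P<q>['\"]).*?(?P=q)"
quoted = re.search(quote_pattern, "say 'hi' now").group()
mismatched = re.search(quote_pattern, "say 'hi\" now")
```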
448 <br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
451 (?R) recurse whole pattern
452 (?n) call subpattern by absolute number
453 (?+n) call subpattern by relative number
454 (?-n) call subpattern by relative number
455 (?&name) call subpattern by name (Perl)
456 (?P>name) call subpattern by name (Python)
457 \g<name> call subpattern by name (Oniguruma)
458 \g'name' call subpattern by name (Oniguruma)
459 \g<n> call subpattern by absolute number (Oniguruma)
460 \g'n' call subpattern by absolute number (Oniguruma)
461 \g<+n> call subpattern by relative number (PCRE extension)
462 \g'+n' call subpattern by relative number (PCRE extension)
463 \g<-n> call subpattern by relative number (PCRE extension)
464 \g'-n' call subpattern by relative number (PCRE extension)
467 <br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
470 (?(condition)yes-pattern)
471 (?(condition)yes-pattern|no-pattern)
473 (?(n)... absolute reference condition
474 (?(+n)... relative reference condition
475 (?(-n)... relative reference condition
476 (?(<name>)... named reference condition (Perl)
477 (?('name')... named reference condition (Perl)
478 (?(name)... named reference condition (PCRE)
479 (?(R)... overall recursion condition
480 (?(Rn)... specific group recursion condition
481 (?(R&name)... specific recursion condition
482 (?(DEFINE)... define subpattern for reference
483 (?(assert)... assertion condition
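<p>
A reference condition tests whether a group has taken part in the match so far. The sketch below uses Python's <b>re</b> module, which is not PCRE and supports only the absolute-number and plain-name conditions; the recursion, DEFINE and assertion conditions are not available there.
</p>

```python
import re

# If group 1 matched an opening "<", a closing ">" is required.
by_number = r"(<)?\w+(?(1)>)"
bracketed = re.fullmatch(by_number, "<a>")
bare = re.fullmatch(by_number, "a")
unclosed = re.fullmatch(by_number, "<a")
stray_close = re.fullmatch(by_number, "a>")

# The same condition with a named group, plus a no-pattern alternative.
by_name = r"(?P<open><)?\w+(?(open)>|!)"
named_yes = re.fullmatch(by_name, "<a>")
named_no = re.fullmatch(by_name, "a!")
```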
486 <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
488 The following act as soon as they are reached:
490 (*ACCEPT) force successful match
491 (*FAIL) force backtrack; synonym (*F)
492 (*MARK:NAME) set name to be passed back; synonym (*:NAME)
494 The following act only when a subsequent match failure causes a backtrack to
495 reach them. They all force a match failure, but they differ in what happens
496 afterwards. Those that advance the start-of-match point do so only if the
497 pattern is not anchored.
499 (*COMMIT) overall failure, no advance of starting point
500 (*PRUNE) advance to next starting character
501 (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
502 (*SKIP) advance to current matching position
503 (*SKIP:NAME) advance to position corresponding to an earlier
504 (*MARK:NAME); if not found, the (*SKIP) is ignored
505 (*THEN) local failure, backtrack to next alternation
506 (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
509 <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
513 (?Cn) callout with data n
516 <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
518 <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
519 <b>pcrematching</b>(3), <b>pcre</b>(3).
521 <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
525 University Computing Service
527 Cambridge CB2 3QH, England.
530 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
532 Last updated: 08 January 2014
534 Copyright © 1997-2014 University of Cambridge.