X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~ian/git?a=blobdiff_plain;f=doc%2Fhtml%2Fpcrepattern.html;h=55034a7edf6ad83f0084a412f07d3de611dfa333;hb=HEAD;hp=c06d1e03f11665068820e9c4232bd702e4f88963;hpb=e5f50570097752e9d8d68df700473362e385bda6;p=pcre3.git

diff --git a/doc/html/pcrepattern.html b/doc/html/pcrepattern.html
index c06d1e0..55034a7 100644
--- a/doc/html/pcrepattern.html
+++ b/doc/html/pcrepattern.html
@@ -329,7 +329,8 @@ A second use of backslash provides a way of encoding non-printing characters in
 patterns in a visible manner. There is no restriction on the appearance of
 non-printing characters, apart from the binary zero that terminates a pattern,
 but when a pattern is being prepared by text editing, it is often easier to use
-one of the following escape sequences than the binary character it represents:
+one of the following escape sequences than the binary character it represents.
+In an ASCII or Unicode environment, these escapes are as follows:
   \a        alarm, that is, the BEL character (hex 07)
   \cx       "control-x", where x is any ASCII character
@@ -353,19 +354,33 @@ data item (byte or 16-bit value) following \c has a value greater than 127, a
 compile-time error occurs. This locks out non-ASCII characters in all modes.
 

-The \c facility was designed for use with ASCII characters, but with the
-extension to Unicode it is even less useful than it once was. It is, however,
-recognized when PCRE is compiled in EBCDIC mode, where data items are always
-bytes. In this mode, all values are valid after \c. If the next character is a
-lower case letter, it is converted to upper case. Then the 0xc0 bits of the
-byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because
-the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
-characters also generate different values.
+When PCRE is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t
+generate the appropriate EBCDIC code values. The \c escape is processed
+as specified for Perl in the perlebcdic document. The only characters
+that are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^, _, or ?. Any
+other character provokes a compile-time error. The sequence \c@ encodes
+character code 0; the letters (in either case) encode characters 1-26 (hex 01
+to hex 1A); [, \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and
+\c? becomes either 255 (hex FF) or 95 (hex 5F).
+

+

+Thus, apart from \c?, these escapes generate the same character code values as
+they do in an ASCII environment, though the meanings of the values mostly
+differ. For example, \cG always generates code value 7, which is BEL in ASCII
+but DEL in EBCDIC.
+

+

+The sequence \c? generates DEL (127, hex 7F) in an ASCII environment, but
+because 127 is not a control character in EBCDIC, Perl makes it generate the
+APC character. Unfortunately, there are several variants of EBCDIC. In most of
+them the APC character has the value 255 (hex FF), but in the one Perl calls
+POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
+values, PCRE makes \c? generate 95; otherwise it generates 255.
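
The EBCDIC-mode rules for the \c escape added by this hunk can be sketched as a small lookup. This is an illustrative Python sketch of the documented mapping only, not PCRE's implementation: the function name `ebcdic_control_escape` and the `posix_bc` flag are hypothetical, and the flag merely stands in for whatever check PCRE performs to detect the POSIX-BC variant.

```python
def ebcdic_control_escape(ch, posix_bc=False):
    """Code value produced by \\c<ch> under the EBCDIC-mode rules quoted above.

    posix_bc is a hypothetical flag standing in for PCRE's own check of
    whether the EBCDIC variant in use is POSIX-BC.
    """
    if len(ch) != 1:
        raise ValueError("exactly one character must follow \\c")
    if ch == "@":
        return 0
    if ch == "?":
        # APC: 95 (hex 5F) in POSIX-BC, 255 (hex FF) in other variants.
        return 0x5F if posix_bc else 0xFF
    if ch.isascii() and ch.isalpha():
        # Letters in either case encode 1-26 (hex 01 to hex 1A).
        return ord(ch.upper()) - ord("A") + 1
    specials = "[\\]^_"
    if ch in specials:
        # [, \, ], ^ and _ encode 27-31 (hex 1B to hex 1F).
        return 27 + specials.index(ch)
    # Anything else is a compile-time error in the documented behaviour.
    raise ValueError("character not allowed after \\c: %r" % ch)
```

For instance, `ebcdic_control_escape("G")` returns 7, the value the text describes as BEL in ASCII but DEL in EBCDIC.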

 After \0 up to two further octal digits are read. If there are fewer than two
-digits, just those that are present are used. Thus the sequence \0\x\07
-specifies two binary zeros followed by a BEL character (code value 7). Make
+digits, just those that are present are used. Thus the sequence \0\x\015
+specifies two binary zeros followed by a CR character (code value 13). Make
 sure you supply two digits after the initial zero if the pattern character that
 follows is itself an octal digit.
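
The \0 rule in this hunk (an initial zero followed by at most two further octal digits) can be illustrated with a short reader. This is a minimal Python sketch under that stated rule, not PCRE's parser; the names `read_zero_escape` and `OCTAL_DIGITS` are hypothetical.

```python
OCTAL_DIGITS = "01234567"


def read_zero_escape(pattern, pos):
    """Read a \\0 escape starting at pattern[pos].

    Returns (code value, index of the first character after the escape).
    """
    if pattern[pos:pos + 2] != "\\0":
        raise ValueError("expected \\0 at position %d" % pos)
    value = 0
    end = pos + 2
    # Consume at most two further octal digits after the initial zero.
    while end < min(pos + 4, len(pattern)) and pattern[end] in OCTAL_DIGITS:
        value = value * 8 + int(pattern[end])
        end += 1
    return value, end
```

The sketch also shows why the text asks for two digits after the initial zero: `\07` is read as a single character with code value 7, whereas `\0007` is read as a binary zero followed by a literal `7`.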

@@ -703,6 +718,7 @@ Armenian,
 Avestan,
 Balinese,
 Bamum,
+Bassa_Vah,
 Batak,
 Bengali,
 Bopomofo,
@@ -712,6 +728,7 @@ Buginese,
 Buhid,
 Canadian_Aboriginal,
 Carian,
+Caucasian_Albanian,
 Chakma,
 Cham,
 Cherokee,
@@ -722,11 +739,14 @@ Cypriot,
 Cyrillic,
 Deseret,
 Devanagari,
+Duployan,
 Egyptian_Hieroglyphs,
+Elbasan,
 Ethiopic,
 Georgian,
 Glagolitic,
 Gothic,
+Grantha,
 Greek,
 Gujarati,
 Gurmukhi,
@@ -746,40 +766,56 @@ Katakana,
 Kayah_Li,
 Kharoshthi,
 Khmer,
+Khojki,
+Khudawadi,
 Lao,
 Latin,
 Lepcha,
 Limbu,
+Linear_A,
 Linear_B,
 Lisu,
 Lycian,
 Lydian,
+Mahajani,
 Malayalam,
 Mandaic,
+Manichaean,
 Meetei_Mayek,
+Mende_Kikakui,
 Meroitic_Cursive,
 Meroitic_Hieroglyphs,
 Miao,
+Modi,
 Mongolian,
+Mro,
 Myanmar,
+Nabataean,
 New_Tai_Lue,
 Nko,
 Ogham,
+Ol_Chiki,
 Old_Italic,
+Old_North_Arabian,
+Old_Permic,
 Old_Persian,
 Old_South_Arabian,
 Old_Turkic,
-Ol_Chiki,
 Oriya,
 Osmanya,
+Pahawh_Hmong,
+Palmyrene,
+Pau_Cin_Hau,
 Phags_Pa,
 Phoenician,
+Psalter_Pahlavi,
 Rejang,
 Runic,
 Samaritan,
 Saurashtra,
 Sharada,
 Shavian,
+Siddham,
 Sinhala,
 Sora_Sompeng,
 Sundanese,
@@ -797,8 +833,10 @@ Thaana,
 Thai,
 Tibetan,
 Tifinagh,
+Tirhuta,
 Ugaritic,
 Vai,
+Warang_Citi,
 Yi.

@@ -3226,9 +3264,9 @@ Cambridge CB2 3QH, England.


REVISION

-Last updated: 08 January 2014
+Last updated: 14 June 2015
-Copyright © 1997-2014 University of Cambridge.
+Copyright © 1997-2015 University of Cambridge.

Return to the PCRE index page.