Changelog for 2:8.38-2

diff --git a/ChangeLog b/ChangeLog
index 7801ef841179c7ca8030646d62af4ea85e6e50db..5e5bf188cea00c055c4e3216fbb512025214b683 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,6 +1,447 @@
  ChangeLog for PCRE
  ------------------
  
+Note that the PCRE 8.xx series (PCRE1) is now in a bugfix-only state. All
+development is happening in the PCRE2 10.xx series.
+
+Version 8.38 23-November-2015
+-----------------------------
+
+1.  If a group that contained a recursive back reference also contained a
+    forward reference subroutine call followed by a non-forward-reference
+    subroutine call, for example /.((?2)(?R)\1)()/, pcre_compile() failed to
+    compile correct code, leading to undefined behaviour or an internally
+    detected error. This bug was discovered by the LLVM fuzzer.
+
+2.  Quantification of certain items (e.g. atomic back references) could cause
+    incorrect code to be compiled when recursive forward references were
+    involved. For example, in this pattern: /(?1)()((((((\1++))\x85)+)|))/.
+    This bug was discovered by the LLVM fuzzer.
+
+3.  A repeated conditional group whose condition was a reference by name caused
+    a buffer overflow if there was more than one group with the given name.
+    This bug was discovered by the LLVM fuzzer.
+
+4.  A recursive back reference by name within a group that had the same name as
+    another group caused a buffer overflow. For example:
+    /(?J)(?'d'(?'d'\g{d}))/. This bug was discovered by the LLVM fuzzer.
+
+5.  A forward reference by name to a group whose number is the same as the
+    current group, for example in this pattern: /(?|(\k'Pm')|(?'Pm'))/, caused
+    a buffer overflow at compile time. This bug was discovered by the LLVM
+    fuzzer.
+
+6.  A lookbehind assertion within a set of mutually recursive subpatterns could
+    provoke a buffer overflow. This bug was discovered by the LLVM fuzzer.
+
+7.  Another buffer overflow bug involved duplicate named groups with a
+    reference between their definition, with a group that reset capture
+    numbers, for example: /(?J:(?|(?'R')(\k'R')|((?'R'))))/. This has been
+    fixed by always allowing for more memory, even if not needed. (A proper fix
+    is implemented in PCRE2, but it involves more refactoring.)
+
+8.  There was no check for integer overflow in subroutine calls such as (?123).
+
+9.  The table entry for \l in EBCDIC environments was incorrect, leading to its
+    being treated as a literal 'l' instead of causing an error.
+
+10. There was a buffer overflow if pcre_exec() was called with an ovector of
+    size 1. This bug was found by american fuzzy lop.
+
+11. If a non-capturing group containing a conditional group that could match
+    an empty string was repeated, it was not identified as matching an empty
+    string itself. For example: /^(?:(?(1)x|)+)+$()/.
+
+12. In an EBCDIC environment, pcretest was mishandling the escape sequences
+    \a and \e in test subject lines.
+
+13. In an EBCDIC environment, \a in a pattern was converted to the ASCII
+    instead of the EBCDIC value.
+
+14. The handling of \c in an EBCDIC environment has been revised so that it is
+    now compatible with the specification in Perl's perlebcdic page.
+
+15. The EBCDIC character 0x41 is a non-breaking space, equivalent to 0xa0 in
+    ASCII/Unicode. This has now been added to the list of characters that are
+    recognized as white space in EBCDIC.
+
+16. When PCRE was compiled without UCP support, the use of \p and \P gave an
+    error (correctly) when used outside a class, but did not give an error
+    within a class.
+
+17. \h within a class was incorrectly compiled in EBCDIC environments.
+
+18. A pattern with an unmatched closing parenthesis that contained a backward
+    assertion which itself contained a forward reference caused buffer
+    overflow. An example pattern is: /(?=di(?<=(?1))|(?=(.))))/.
+
+19. JIT should return with error when the compiled pattern requires more stack
+    space than the maximum.
+
+20. A possessively repeated conditional group that could match an empty string,
+    for example, /(?(R))*+/, was incorrectly compiled.
+
+21. Fix infinite recursion in the JIT compiler when certain patterns such as
+    /(?:|a|){100}x/ are analysed.
+
+22. Some patterns with character classes involving [: and \\ were incorrectly
+    compiled and could cause reading from uninitialized memory or an incorrect
+    error diagnosis.
+
+23. Pathological patterns containing many nested occurrences of [: caused
+    pcre_compile() to run for a very long time.
+
+24. A conditional group with only one branch has an implicit empty alternative
+    branch and must therefore be treated as potentially matching an empty
+    string.
+
+25. If (?R was followed by - or +, incorrect behaviour happened instead of a
+    diagnostic.
+
+26. Arrange to give up on finding the minimum matching length for overly
+    complex patterns.
+
+27. Similar to (4) above: in a pattern with duplicated named groups and an
+    occurrence of (?| it is possible for an apparently non-recursive back
+    reference to become recursive if a later named group with the relevant
+    number is encountered. This could lead to a buffer overflow. Wen Guanxing
+    from Venustech ADLAB discovered this bug.
+
+28. If pcregrep was given the -q option with -c or -l, or when handling a
+    binary file, it incorrectly wrote output to stdout.
+
+29. The JIT compiler did not restore the control verb head in case of *THEN
+    control verbs. This issue was found by Karl Skomski with a custom LLVM
+    fuzzer.
+
+30. Error messages for syntax errors following \g and \k were giving inaccurate
+    offsets in the pattern.
+
+31. Added a check for integer overflow in conditions (?(<digits>) and
+    (?(R<digits>). This omission was discovered by Karl Skomski with the LLVM
+    fuzzer.
+
+32. Handling recursive references such as (?2) when the reference is to a group
+    later in the pattern uses code that is very hacked about and error-prone.
+    It has been re-written for PCRE2. Here in PCRE1, a check has been added to
+    give an internal error if it is obvious that compiling has gone wrong.
+
+33. The JIT compiler should not check repeats after a {0,1} repeat byte code.
+    This issue was found by Karl Skomski with a custom LLVM fuzzer.
+
+34. The JIT compiler should restore the control chain for empty possessive
+    repeats. This issue was found by Karl Skomski with a custom LLVM fuzzer.
+
+35. Match limit check added to JIT recursion. This issue was found by Karl
+    Skomski with a custom LLVM fuzzer.
+
+36. Yet another case similar to 27 above has been circumvented by an
+    unconditional allocation of extra memory. This issue is fixed "properly" in
+    PCRE2 by refactoring the way references are handled. Wen Guanxing
+    from Venustech ADLAB discovered this bug.
+
+37. Fix two assertion fails in JIT. These issues were found by Karl Skomski
+    with a custom LLVM fuzzer.
+
+38. Fixed a corner case of range optimization in JIT.
+
+39. An incorrect error "overran compiling workspace" was given if there were
+    exactly enough group forward references such that the last one extended
+    into the workspace safety margin. The next one would have expanded the
+    workspace. The test for overflow was not including the safety margin.
+
+40. A match limit issue is fixed in JIT which was found by Karl Skomski
+    with a custom LLVM fuzzer.
+
+41. Remove the use of /dev/null in testdata/testinput2, because it doesn't
+    work under Windows. (Why has it taken so long for anyone to notice?)
+
+42. In a character class such as [\W\p{Any}] where both a negative-type escape
+    ("not a word character") and a property escape were present, the property
+    escape was being ignored.
+
+43. Fix crash caused by very long (*MARK) or (*THEN) names.
+
+44. A sequence such as [[:punct:]b], that is, a POSIX character class followed
+    by a single ASCII character in a class item, was incorrectly compiled in
+    UCP mode. The POSIX class got lost, but only if the single character
+    followed it.
+
+45. [:punct:] in UCP mode was matching some characters in the range 128-255
+    that should not have been matched.
+
+46. If [:^ascii:] or [:^xdigit:] or [:^cntrl:] are present in a non-negated
+    class, all characters with code points greater than 255 are in the class.
+    When a Unicode property was also in the class (if PCRE_UCP is set, escapes
+    such as \w are turned into Unicode properties), wide characters were not
+    correctly handled, and could fail to match.
+
+
+Version 8.37 28-April-2015
+--------------------------
+
+1.  When an (*ACCEPT) is triggered inside capturing parentheses, it arranges
+    for those parentheses to be closed with whatever has been captured so far.
+    However, it was failing to mark any other groups between the highest
+    capture so far and the current group as "unset". Thus, the ovector for
+    those groups contained whatever was previously there. An example is the
+    pattern /(x)|((*ACCEPT))/ when matched against "abcd".
+
+2.  If an assertion condition was quantified with a minimum of zero (an odd
+    thing to do, but it happened), SIGSEGV or other misbehaviour could occur.
+
+3.  If a pattern in pcretest input had the P (POSIX) modifier followed by an
+    unrecognized modifier, a crash could occur.
+
+4.  An attempt to do global matching in pcretest with a zero-length ovector
+    caused a crash.
+
+5.  Fixed a memory leak during matching that could occur for a subpattern
+    subroutine call (recursive or otherwise) if the number of captured groups
+    that had to be saved was greater than ten.
+
+6.  Catch a bad opcode during auto-possessification after compiling a bad UTF
+    string with NO_UTF_CHECK. This is a tidyup, not a bug fix, as passing bad
+    UTF with NO_UTF_CHECK is documented as having an undefined outcome.
+
+7.  A UTF pattern containing a "not" match of a non-ASCII character and a
+    subroutine reference could loop at compile time. Example: /[^\xff]((?1))/.
+
+8.  When a pattern is compiled, it remembers the highest back reference so
+    that when matching, if the ovector is too small, extra memory can be
+    obtained to use instead. A conditional subpattern whose condition is a
+    check on a capture having happened, as in the pattern
+    /^(?:(a)|b)(?(1)A|B)/, is another kind of back reference, but it was not
+    setting the highest back reference number. This mattered only if
+    pcre_exec() was called with an ovector that was too small to hold the
+    capture, and there was no other kind of back reference (a situation which
+    is probably quite rare). The effect of the bug was that the condition was
+    always treated as FALSE when the capture could not be consulted, leading
+    to incorrect behaviour by pcre_exec(). This bug has been fixed.
+
+9.  A reference to a duplicated named group (either a back reference or a
+    test for being set in a conditional) that occurred in a part of the
+    pattern where PCRE_DUPNAMES was not set caused the amount of memory needed
+    for the pattern to be incorrectly calculated, leading to overwriting.
+
+10. A mutually recursive set of back references such as (\2)(\1) caused a
+    segfault at study time (while trying to find the minimum matching length).
+    The infinite loop is now broken (with the minimum length unset, that is,
+    zero).
+
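As a rough illustration of item 10, the sketch below computes a minimum length over a toy reference graph and breaks cycles by treating a reference that is already being evaluated as length zero. The `groups` layout, the item kinds `"lit"` and `"ref"`, and the function name `min_length` are hypothetical and are not taken from PCRE's study code.

```python
def min_length(groups, number, active=frozenset()):
    """Minimum match length of group `number` in a toy pattern model.

    `groups` maps a group number to a list of items, each either
    ("lit", n) for n literal characters or ("ref", k) for a back
    reference to group k. This model is an assumption for illustration.
    """
    if number in active:
        # The group is already being evaluated higher up the call chain,
        # so this is a cycle such as (\2)(\1). Give up and report zero,
        # which corresponds to "minimum length unset".
        return 0
    active = active | {number}
    total = 0
    for kind, value in groups[number]:
        if kind == "lit":
            total += value
        elif kind == "ref":
            total += min_length(groups, value, active)
        else:
            raise ValueError("unknown item kind: %r" % (kind,))
    return total


# Toy stand-in for (\2)(\1): each group consists only of a reference to the other.
cyclic = {1: [("ref", 2)], 2: [("ref", 1)]}
# Toy stand-in for (abc)(\1x): group 2 refers back to group 1.
acyclic = {1: [("lit", 3)], 2: [("ref", 1), ("lit", 1)]}

print(min_length(cyclic, 1))
print(min_length(acyclic, 2))
```

Without the `active` set, the first call would recurse until the stack is exhausted, which is the kind of failure the entry describes.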
+11. If an assertion that was used as a condition was quantified with a minimum
+    of zero, matching went wrong. In particular, if the whole group had
+    unlimited repetition and could match an empty string, a segfault was
+    likely. The pattern (?(?=0)?)+ is an example that caused this. Perl allows
+    assertions to be quantified, but not if they are being used as conditions,
+    so the above pattern is faulted by Perl. PCRE has now been changed so that
+    it also rejects such patterns.
+
+12. A possessive capturing group such as (a)*+ with a minimum repeat of zero
+    failed to allow the zero-repeat case if pcre_exec() was called with an
+    ovector too small to capture the group.
+
+13. Fixed two bugs in pcretest that were discovered by fuzzing and reported by
+    Red Hat Product Security:
+
+    (a) A crash if /K and /F were both set with the option to save the compiled
+    pattern.
+
+    (b) Another crash if the option to print captured substrings in a callout
+    was combined with setting a null ovector, for example \O\C+ as a subject
+    string.
+
+14. A pattern such as "((?2){0,1999}())?", which has a group containing a
+    forward reference repeated a large (but limited) number of times within a
+    repeated outer group that has a zero minimum quantifier, caused incorrect
+    code to be compiled, leading to the error "internal error:
+    previously-checked referenced subpattern not found" when an incorrect
+    memory address was read. This bug was reported as "heap overflow",
+    discovered by Kai Lu of Fortinet's FortiGuard Labs and given the CVE number
+    CVE-2015-2325.
+
+23. A pattern such as "((?+1)(\1))/" containing a forward reference subroutine
+    call within a group that also contained a recursive back reference caused
+    incorrect code to be compiled. This bug was reported as "heap overflow",
+    discovered by Kai Lu of Fortinet's FortiGuard Labs, and given the CVE
+    number CVE-2015-2326.
+
+24. Computing the size of the JIT read-only data in advance has been a source
+    of various issues, and new ones still appear, unfortunately. To fix
+    existing and future issues, size computation is eliminated from the code,
+    and replaced by on-demand memory allocation.
+
+25. A pattern such as /(?i)[A-`]/, where characters in the other case are
+    adjacent to the end of the range, and the range contained characters with
+    more than one other case, caused incorrect behaviour when compiled in UTF
+    mode. In that example, the range a-j was left out of the class.
+
+26. Fix JIT compilation of conditional blocks whose assertion
+    is converted to (*FAIL). E.g.: /(?(?!))/.
+
+27. The pattern /(?(?!)^)/ caused references to random memory. This bug was
+    discovered by the LLVM fuzzer.
+
+28. The assertion (?!) is optimized to (*FAIL). This was not handled correctly
+    when this assertion was used as a condition, for example (?(?!)a|b). In
+    pcre_exec() it worked by luck; in pcre_dfa_exec() it gave an incorrect
+    error about an unsupported item.
+
+29. For some types of pattern, for example /Z*(|d*){216}/, the auto-
+    possessification code could take exponential time to complete. A recursion
+    depth limit of 1000 has been imposed to limit the resources used by this
+    optimization.
+
+30. In a pattern such as /(*UTF)[\S\V\H]/, which contains a negated special
+    class such as \S in non-UCP mode, explicit wide characters (> 255) can be
+    ignored because \S ensures they are all in the class. The code for doing
+    this was interacting badly with the code for computing the amount of space
+    needed to compile the pattern, leading to a buffer overflow. This bug was
+    discovered by the LLVM fuzzer.
+
+31. A pattern such as /((?2)+)((?1))/ which has mutual recursion nested inside
+    other kinds of group caused stack overflow at compile time. This bug was
+    discovered by the LLVM fuzzer.
+
+32. A pattern such as /(?1)(?#?'){8}(a)/ which had a parenthesized comment
+    between a subroutine call and its quantifier was incorrectly compiled,
+    leading to buffer overflow or other errors. This bug was discovered by the
+    LLVM fuzzer.
+
+33. The illegal pattern /(?(?<E>.*!.*)?)/ was not being diagnosed as missing an
+    assertion after (?(. The code was failing to check the character after
+    (?(?< for the ! or = that would indicate a lookbehind assertion. This bug
+    was discovered by the LLVM fuzzer.
+
+34. A pattern such as /X((?2)()*+){2}+/ which has a possessive quantifier with
+    a fixed maximum following a group that contains a subroutine reference was
+    incorrectly compiled and could trigger buffer overflow. This bug was
+    discovered by the LLVM fuzzer.
+
+35. A mutual recursion within a lookbehind assertion such as (?<=((?2))((?1)))
+    caused a stack overflow instead of the diagnosis of a non-fixed length
+    lookbehind assertion. This bug was discovered by the LLVM fuzzer.
+
+36. The use of \K in a positive lookbehind assertion in a non-anchored pattern
+    (e.g. /(?<=\Ka)/) could make pcregrep loop.
+
+37. There was a similar problem to 36 in pcretest for global matches.
+
+38. If a greedy quantified \X was preceded by \C in UTF mode (e.g. \C\X*),
+    and a subsequent item in the pattern caused a non-match, backtracking over
+    the repeated \X did not stop, but carried on past the start of the subject,
+    causing reference to random memory and/or a segfault. There were also some
+    other cases where backtracking after \C could crash. This set of bugs was
+    discovered by the LLVM fuzzer.
+
+39. The function for finding the minimum length of a matching string could take
+    a very long time if mutual recursion was present many times in a pattern,
+    for example, /((?2){73}(?2))((?1))/. A better mutual recursion detection
+    method has been implemented. This infelicity was discovered by the LLVM
+    fuzzer.
+
+40. Static linking against the PCRE library using the pkg-config module was
+    failing on missing pthread symbols.
+
+
+Version 8.36 26-September-2014
+------------------------------
+
+1.  Got rid of some compiler warnings in the C++ modules that were shown up by
+    -Wmissing-field-initializers and -Wunused-parameter.
+
+2.  The tests for quantifiers being too big (greater than 65535) were being
+    applied after reading the number, and stupidly assuming that integer
+    overflow would give a negative number. The tests are now applied as the
+    numbers are read.
+
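The difference described in item 2 can be sketched in Python. Because Python integers do not overflow, `wrap_int32` simulates 32-bit signed wraparound purely for illustration (in C, signed overflow is undefined behaviour, so even a negative result cannot be relied on). All function names here are hypothetical and are not PCRE's.

```python
MAX_REPEAT = 65535  # largest permitted quantifier value, per the entry above


def wrap_int32(value):
    """Simulate signed 32-bit wraparound (an assumption for illustration)."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def read_repeat_check_after(digits):
    """Flawed: read the whole number first, then test the (wrapped) result."""
    n = 0
    for ch in digits:
        n = wrap_int32(n * 10 + int(ch))
    if n < 0 or n > MAX_REPEAT:
        raise ValueError("number too big in quantifier")
    return n


def read_repeat_check_during(digits):
    """Safer: test the limit after every digit, so n never grows large."""
    n = 0
    for ch in digits:
        n = n * 10 + int(ch)
        if n > MAX_REPEAT:
            raise ValueError("number too big in quantifier")
    return n


# 4294967297 is 2**32 + 1, so the wrapped value is a small positive number
# and the late check wrongly accepts it.
print(read_repeat_check_after("4294967297"))

try:
    read_repeat_check_during("4294967297")
except ValueError as exc:
    print("rejected:", exc)
```

Checking inside the loop also bounds the intermediate value: it can never exceed ten times the limit plus nine, so it cannot overflow in the first place.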
+3.  Tidy code in pcre_exec.c where two branches that used to be different are
+    now the same.
+
+4.  The JIT compiler did not generate match limit checks for certain
+    bracketed expressions with quantifiers. This may lead to exponential
+    backtracking, instead of returning with PCRE_ERROR_MATCHLIMIT. This
+    issue should be resolved now.
+
+5.  Fixed an issue that occurs when nested alternatives are optimized
+    with table jumps.
+
+6.  Inserted two casts and changed some ints to size_t in the light of some
+    reported 64-bit compiler warnings (Bugzilla 1477).
+
+7.  Fixed a bug concerned with zero-minimum possessive groups that could match
+    an empty string, which sometimes were behaving incorrectly in the
+    interpreter (though correctly in the JIT matcher). This pcretest input is
+    an example:
+
+      '\A(?:[^"]++|"(?:[^"]*+|"")*+")++'
+      NON QUOTED "QUOT""ED" AFTER "NOT MATCHED
+
+    the interpreter was reporting a match of 'NON QUOTED ' only, whereas the
+    JIT matcher and Perl both matched 'NON QUOTED "QUOT""ED" AFTER '. The test
+    for an empty string was breaking the inner loop and carrying on at a lower
+    level, when possessive repeated groups should always return to a higher
+    level as they have no backtrack points in them. The empty string test now
+    occurs at the outer level.
+
+8.  Fixed a bug that was incorrectly auto-possessifying \w+ in the pattern
+    ^\w+(?>\s*)(?<=\w) which caused it not to match "test test".
+
+9.  Give a compile-time error for \o{} (as Perl does) and for \x{} (which Perl
+    doesn't).
+
+10. Change 8.34/15 introduced a bug that caused the amount of memory needed
+    to hold a pattern to be incorrectly computed (too small) when there were
+    named back references to duplicated names. This could cause "internal
+    error: code overflow" or "double free or corruption" or other memory
+    handling errors.
+
+11. When named subpatterns had the same prefixes, back references could be
+    confused. For example, in this pattern:
+
+      /(?P<Name>a)?(?P<Name2>b)?(?(<Name>)c|d)*l/
+
+    the reference to 'Name' was incorrectly treated as a reference to a
+    duplicate name.
+
+12. A pattern such as /^s?c/mi8 where the optional character has more than
+    one "other case" was incorrectly compiled such that it would only try to
+    match starting at "c".
+
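To see which characters have more than one "other case", the sketch below groups characters by Python's `str.casefold()`. This is only an approximation chosen for illustration: PCRE uses its own Unicode tables, and `case_variants` is a hypothetical helper.

```python
import sys


def case_variants(ch):
    """Return every code point whose case-folded form equals that of `ch`.

    Uses Python's str.casefold() as a stand-in for a regex engine's
    caseless tables (an assumption, not PCRE's actual data).
    """
    target = ch.casefold()
    return {
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if chr(cp).casefold() == target
    }


for letter in "sc":
    variants = sorted(case_variants(letter))
    print(letter, ["U+%04X" % ord(v) for v in variants])
```

Under this approximation, `case_variants("s")` contains `S`, `s` and U+017F (LATIN SMALL LETTER LONG S), whereas `case_variants("c")` contains only `C` and `c`. A caseless `s?` therefore has three possible forms to account for, not two.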
+13. When a pattern starting with \s was studied, VT was not included in the
+    list of possible starting characters; this should have been part of the
+    8.34/18 patch.
+
+14. If a character class started [\Qx]... where x is any character, the class
+    was incorrectly terminated at the ].
+
+15. If a pattern that started with a caseless match for a character with more
+    than one "other case" was studied, PCRE did not set up the starting code
+    unit bit map for the list of possible characters. Now it does. This is an
+    optimization improvement, not a bug fix.
+
+16. The Unicode data tables have been updated to Unicode 7.0.0.
+
+17. Fixed a number of memory leaks in pcregrep.
+
+18. Avoid a compiler warning (from some compilers) for a function call with
+    a cast that removes "const" from an lvalue by using an intermediate
+    variable (to which the compiler does not object).
+
+19. Incorrect code was compiled if a group that contained an internal recursive
+    back reference was optional (had quantifier with a minimum of zero). This
+    example compiled incorrect code: /(((a\2)|(a*)\g<-1>))*/ and other examples
+    caused segmentation faults because of stack overflows at compile time.
+
+20. A pattern such as /((?(R)a|(?1)))+/, which contains a recursion within a
+    group that is quantified with an indefinite repeat, caused a compile-time
+    loop which used up all the system stack and provoked a segmentation fault.
+    This was not the same bug as 19 above.
+
+21. Add PCRECPP_EXP_DECL declaration to operator<< in pcre_stringpiece.h.
+    Patch by Mike Frysinger.
+
+
  Version 8.35 04-April-2014
  --------------------------
  
@@ -27,9 +468,9 @@ Version 8.35 04-April-2014
  
  6.  Improve character range checks in JIT. Characters are read by an inprecise
      function now, which returns with an unknown value if the character code is
-    above a certain treshold (e.g: 256). The only limitation is that the value
-    must be bigger than the treshold as well. This function is useful, when
-    the characters above the treshold are handled in the same way.
+    above a certain threshold (e.g: 256). The only limitation is that the value
+    must be bigger than the threshold as well. This function is useful when
+    the characters above the threshold are handled in the same way.
  
  7.  The macros whose names start with RAWUCHAR are placeholders for a future
      mode in which only the bottom 21 bits of 32-bit data items are used. To