utf32_word_split() and utf8_word_split() split a string into words
author Richard Kettlewell <rjk@greenend.org.uk>
Tue, 20 Nov 2007 18:13:56 +0000 (18:13 +0000)
committer Richard Kettlewell <rjk@greenend.org.uk>
Tue, 20 Nov 2007 18:13:56 +0000 (18:13 +0000)
commit 8818b7fca12456e62410ef914a7bef250a0633c9
tree df1992a190796971fb46f160babd454010808851
parent 7bbe944b70a8a904dd15905fbf351b5e906224ff
utf32_word_split() and utf8_word_split() split a string into words
using the UAX #29 word boundary algorithm.  words() is therefore now a
wrapper around them.  There is scope for improvement in how these
functions are used, as we currently do some needless conversion back
and forth between encoding forms.
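
To give a rough idea of what word splitting over a UTF-32 string looks like, here is a minimal sketch.  It is not the UAX #29 algorithm and not the code from this commit: it recognises only ASCII letters and digits, and keeps an apostrophe inside a word when it sits between two letters (loosely modelled on the "mid-letter" rules of UAX #29).  The real algorithm consults the Unicode Word_Break property instead.  All sketch_* names and the offset/length interface are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ASCII letters only. */
static int sketch_is_letter(uint32_t c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Hypothetical helper: ASCII letters and digits only. */
static int sketch_is_word_char(uint32_t c) {
  return sketch_is_letter(c) || (c >= '0' && c <= '9');
}

/* Split the UTF-32 string s[0..ns) into words.  For each of the first
 * max words, store its start offset in starts[] and its length in
 * lengths[].  Returns the total number of words found, which may be
 * larger than max. */
size_t sketch_word_split(const uint32_t *s, size_t ns,
                         size_t *starts, size_t *lengths, size_t max) {
  size_t i = 0, nwords = 0;

  assert(s != NULL || ns == 0);
  while (i < ns) {
    size_t start;

    /* Skip anything that cannot start a word. */
    if (!sketch_is_word_char(s[i])) {
      ++i;
      continue;
    }
    start = i;
    while (i < ns) {
      if (sketch_is_word_char(s[i])) {
        ++i;
      } else if (s[i] == '\'' && i + 1 < ns
                 && sketch_is_letter(s[i - 1])
                 && sketch_is_letter(s[i + 1])) {
        /* Apostrophe between two letters: stay inside the word and
         * consume the following letter as well. */
        i += 2;
      } else {
        break;
      }
    }
    if (nwords < max) {
      starts[nwords] = start;
      lengths[nwords] = i - start;
    }
    ++nwords;
  }
  return nwords;
}
```

Under these assumptions the input "don't stop!" yields two words, "don't" and "stop".  Reporting offsets into the caller's UTF-32 buffer avoids copying, which is one way a caller could avoid converting between encoding forms more often than necessary.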

casefold() now uses the compatibility case-folding algorithm, which
seems more appropriate for searching.
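
As an illustration of why compatibility case-folding suits searching, the sketch below folds a handful of code points so that compatibility variants and case variants compare equal.  It is a toy, not the casefold() from this commit: a real implementation is driven by the Unicode case-folding and decomposition data, whereas this one hard-codes ASCII capitals, the fullwidth Latin letters (U+FF21..U+FF3A and U+FF41..U+FF5A) and U+00DF.  The sketch_* names and the buffer interface are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold one code point, writing 1 or 2 code points to out[].  Returns
 * the number written.  Only a few illustrative cases are handled;
 * everything else is copied through unchanged. */
static size_t sketch_fold_char(uint32_t c, uint32_t out[2]) {
  if (c >= 'A' && c <= 'Z') {           /* ASCII capitals */
    out[0] = c - 'A' + 'a';
    return 1;
  }
  if (c >= 0xFF21 && c <= 0xFF3A) {     /* fullwidth capitals */
    out[0] = c - 0xFF21 + 'a';
    return 1;
  }
  if (c >= 0xFF41 && c <= 0xFF5A) {     /* fullwidth small letters */
    out[0] = c - 0xFF41 + 'a';
    return 1;
  }
  if (c == 0x00DF) {                    /* sharp s folds to "ss" */
    out[0] = 's';
    out[1] = 's';
    return 2;
  }
  out[0] = c;
  return 1;
}

/* Fold s[0..ns) into out[], writing at most max code points.  Returns
 * the length the full result needs, which may be larger than max. */
size_t sketch_casefold(const uint32_t *s, size_t ns,
                       uint32_t *out, size_t max) {
  size_t i, j, n = 0;

  assert(s != NULL || ns == 0);
  for (i = 0; i < ns; ++i) {
    uint32_t buf[2];
    size_t k = sketch_fold_char(s[i], buf);

    for (j = 0; j < k; ++j) {
      if (n < max)
        out[n] = buf[j];
      ++n;
    }
  }
  return n;
}
```

With this sketch a search key typed with a fullwidth "F" (U+FF26) folds to the same code point as an ordinary "f", which a plain lowercase conversion would not achieve.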

dbversions are now integers not strings.  Some dbversion=2
functionality can be selectively disabled for testing purposes.
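
One benefit of integer dbversions is that features can be gated with a numeric comparison rather than a string match.  The following sketch shows one possible shape for this, including a bitmask for switching individual features off during testing.  Nothing here is taken from the commit: the function names, the feature flags, the accepted range of 1 to 2 and the bitmask mechanism are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical feature flags that a test could disable one at a time. */
enum {
  SKETCH_FEATURE_WORDSPLIT = 1,
  SKETCH_FEATURE_CASEFOLD = 2
};

/* Parse a dbversion from configuration text.  Returns the version, or
 * -1 if the text is not an integer in the assumed range 1..2. */
int sketch_parse_dbversion(const char *s) {
  char *end;
  long n;

  assert(s != NULL);
  errno = 0;
  n = strtol(s, &end, 10);
  if (errno || end == s || *end != '\0' || n < 1 || n > 2)
    return -1;
  return (int)n;
}

/* A feature is active when the database is at version 2 or later and
 * the feature has not been disabled for testing. */
int sketch_feature_enabled(int dbversion, unsigned disabled,
                           unsigned feature) {
  return dbversion >= 2 && !(disabled & feature);
}
```

A test could then pass SKETCH_FEATURE_CASEFOLD as the disabled mask to exercise the version 2 code paths with only the new case-folding switched off.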

README.dbversions documents the differences between the dbversions.

12 files changed:
lib/configuration.c
lib/configuration.h
lib/test.c
lib/unicode.c
lib/unicode.h
lib/vector.h
lib/words.c
server/Makefile.am
server/README.dbversions [new file with mode: 0644]
server/rescan.c
server/trackdb.c
server/trackdb.h