X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?p=ypp-sc-tools.db-test.git;a=blobdiff_plain;f=pctb%2FREADME.charset;h=c57f2f5187a537b85c77f793493eefd9f7f0a3e3;hp=e1fd3ff5f86ba8f41e9843d2255e11edefe120b2;hb=ac65228e40fa375c829b46607fb4941ff11376e9;hpb=6a3c0962283d32bc6e5f6c47c929baf37ddc642f

diff --git a/pctb/README.charset b/pctb/README.charset
index e1fd3ff..c57f2f5 100644
--- a/pctb/README.charset
+++ b/pctb/README.charset
@@ -38,7 +38,8 @@ the entry you have just made.
 Upper vs lower case - important note regarding `l' and `I'
 ----------------------------------------------------------
 
-We maintain separate dictionaries for upper and lower case. At the
+We maintain separate dictionaries for upper case (Upper), lower case
+(Lower), and (initial portions of) mid-phrase words (Word). At the
 beginning of each cell in the table, we expect uppercase; in the
 middle of a word we expect lowercase; and, unfortunately, after an
 inter-word gap, we are not sure.
@@ -47,10 +48,16 @@ This is troublesome because `l' and `I' look identical on the
 screen. So any time we see a word starting with `l' or `I', the
 program has to ask about it.
 
+After an interword gap, we first search for a Word entry in the
+dictionary. If there is a match we use it. Otherwise we search both
+the uppercase and lowercase dictionaries; if one matches and the other
+doesn't, or one matches a wider character than the other, we use it.
+If that fails to resolve the ambiguity we must ask.
+
 *Do not* make an entry in the character set dictionary mapping `vertical
 stick' to `l' or `I'. Instead, select enough of the whole word in
 question that no word would start with the other letter, and enter the
-whole word or part of it as a new glyph.
+whole word or part of it as a new glyph as a new Word.
 
 For example, in the supplied dictionary there is already a glyph for
 `Iron'; this is OK because there are no words which start `lron'.
@@ -72,12 +79,11 @@ for the uppercase letter in the lowercase dictionary.
 
 Instead, make a new glyph for the last letter of the previous word
 plus the (unusually narrow) inter-word space, and end that entry with
-\x20 (yes, type \ x 20).
+a literal space ` '.
 
 For example, you might find that `yG' is treated as `y' and the G
 doesn't get matched. Select the `y'
-region of the bitmap and type `y\x20' into the string box.
-Sorry for this rather poor UI!
+region of the bitmap and type `y ' into the string box.
 
 
 Overlapping characters - ligatures
@@ -101,9 +107,10 @@ However since you are not guaranteed to actually get an OCR query at
 all if the dictionary contains errors, you shouldn't rely on this.
 
 If you think you have made mistakes answering OCR queries (for
-example, the recognised data is wrong), you should download a fresh
-copy of charset-15.txt from
- http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
+example, the recognised data is wrong), you should delete the file
+#local-char*#.txt, which contains your local updates. It will then
+only use the centrally provided (and vetted) master file (which is
+automatically updated when you run the PCTB client, by default).
 
 
 Enabling interactive character set update
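
Reviewer's note: the lookup order that the new README text describes for a
glyph after an inter-word gap (Word first, then Upper and Lower, preferring
the wider match, otherwise asking) can be sketched roughly as below. This is
an illustration only, not taken from the pctb sources. The representation of
a glyph as a tuple of pixel-column values, and the names longest_match and
resolve_after_gap, are assumptions made up for this sketch.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical representation: a glyph is keyed by its pixel columns.
Columns = Tuple[int, ...]
Glyphs = Dict[Columns, str]
Match = Tuple[str, int]  # (recognised text, width in pixel columns)


def longest_match(glyphs: Glyphs, columns: Sequence[int]) -> Optional[Match]:
    """Return the widest entry in `glyphs` that is a prefix of `columns`."""
    for width in range(len(columns), 0, -1):
        text = glyphs.get(tuple(columns[:width]))
        if text is not None:
            return text, width
    return None


def resolve_after_gap(word: Glyphs, upper: Glyphs, lower: Glyphs,
                      columns: Sequence[int]) -> Optional[Match]:
    """Pick a match after an inter-word gap; None means "ask the user"."""
    # 1. A Word entry wins outright.
    hit = longest_match(word, columns)
    if hit is not None:
        return hit
    # 2. Otherwise consult both the Upper and Lower dictionaries.
    up = longest_match(upper, columns)
    lo = longest_match(lower, columns)
    if up is None:
        return lo  # only Lower matched (or neither did)
    if lo is None:
        return up  # only Upper matched
    # 3. Both matched: prefer the wider one.
    if up[1] != lo[1]:
        return up if up[1] > lo[1] else lo
    # 4. Same width in both dictionaries: still ambiguous.
    return None
```

With a vertical stick present in both Upper (as `I') and Lower (as `l'), this
sketch returns None, which corresponds to the README's "we must ask" case. A
Word entry covering the stick plus the columns that follow it resolves the
ambiguity without a query.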