X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?a=blobdiff_plain;f=pctb%2FREADME.charset;h=4301a0a495a7a6c6f44e32bd360ef9d7cf5d7a34;hb=af1d80a949cfb4d17084f7b2fe52ac0b75c71473;hp=e1fd3ff5f86ba8f41e9843d2255e11edefe120b2;hpb=6a3c0962283d32bc6e5f6c47c929baf37ddc642f;p=ypp-sc-tools.main.git

diff --git a/pctb/README.charset b/pctb/README.charset
index e1fd3ff..4301a0a 100644
--- a/pctb/README.charset
+++ b/pctb/README.charset
@@ -38,7 +38,8 @@ the entry you have just made.
 Upper vs lower case - important note regarding `l' and `I'
 ----------------------------------------------------------
 
-We maintain separate dictionaries for upper and lower case. At the
+We maintain separate dictionaries for upper case (Upper), lower case
+(Lower), and (initial portions of) mid-phrase words (Word). At the
 beginning of each cell in the table, we expect uppercase; in the
 middle of a word we expect lowercase; and, unfortunately, after an
 inter-word gap, we are not sure.
@@ -47,10 +48,16 @@ This is troublesome because `l' and `I' look identical on the screen.
 So any time we see a word starting with `l' or `I', the program has to
 ask about it.
 
+After an interword gap, we first search for a Word entry in the
+dictionary. If there is a match we use it. Otherwise we search both
+the uppercase and lowercase dictionaries; if one matches and the other
+doesn't, or one matches a wider character than the other, we use it.
+If that fails to resolve the ambiguity we must ask.
+
 *Do not* make an entry in the character set dictionary mapping `vertical
 stick' to `l' or `I'. Instead, select enough of the whole word in
 question that no word would start with the other letter, and enter the
-whole word or part of it as a new glyph.
+whole word or part of it as a new glyph as a new Word.
 
 For example, in the supplied dictionary there is already a glyph for
 `Iron'; this is OK because there are no words which start `lron'.
@@ -72,12 +79,11 @@ for the uppercase letter in the lowercase dictionary.
 
 Instead, make a new glyph for the last letter of the previous word
 plus the (unusually narrow) inter-word space, and end that entry with
-\x20 (yes, type \ x 20).
+a literal space ` '.
 
 For example, you might find that `yG' is treated as `y' and the G
 doesn't get matched. Select the `y'
-region of the bitmap and type `y\x20' into the string box.
-Sorry for this rather poor UI!
+region of the bitmap and type `y ' into the string box.
 
 
 Overlapping characters - ligatures
@@ -96,14 +102,18 @@ containing `yw' and define it as `yw'.
 Fixing mistakes
 ---------------
 
-The OCR query UI allows you to delete things from the glyph dictionary.
-However since you are not guaranteed to actually get an OCR query at
-all if the dictionary contains errors, you shouldn't rely on this.
+The OCR query UI allows you to delete things from the glyph
+dictionary. However since you are not guaranteed to actually get an
+OCR query at all (and since it is not possible to override the
+presence of an entry in the master database with the absence of one in
+the local database), if the dictionary contains errors, you shouldn't
+rely on this.
 
 If you think you have made mistakes answering OCR queries (for
-example, the recognised data is wrong), you should download a fresh
-copy of charset-15.txt from
- http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
+example, the recognised data is wrong), you should delete the file
+#local-char*#.txt, which contains your local updates. It will then
+only use the centrally provided (and vetted) master file (which is
+automatically updated when you run the PCTB client, by default).
 
 
 Enabling interactive character set update
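
Note on the lookup order added in the second hunk: the paragraph beginning
"After an interword gap" describes a three-way search (Word first, then Upper
and Lower, then ask). The following is only a rough sketch of that order and
not the actual pctb code. It assumes a toy model in which a glyph is a string
key that must prefix the remaining pixel data and a glyph's width is the
length of its key; `widest_match`, `resolve_after_gap` and the use of `|` for
the vertical stick are all hypothetical names chosen for illustration.

```python
from typing import Dict, Optional, Tuple

# Toy model (an assumption): a dictionary maps a glyph key to the text it
# stands for, and a key "matches" when it is a prefix of the pixel data.
Glyphs = Dict[str, str]


def widest_match(glyphs: Glyphs, pixels: str) -> Optional[Tuple[int, str]]:
    """Return (width, text) for the widest matching glyph, or None."""
    best: Optional[Tuple[int, str]] = None
    for key, text in glyphs.items():
        if pixels.startswith(key) and (best is None or len(key) > best[0]):
            best = (len(key), text)
    return best


def resolve_after_gap(pixels: str, word: Glyphs, upper: Glyphs,
                      lower: Glyphs) -> Optional[str]:
    """Pick the text for the glyph following an interword gap.

    Returns None when the ambiguity cannot be resolved, i.e. when the
    user would have to be asked.
    """
    # 1. A Word entry wins outright.
    hit = widest_match(word, pixels)
    if hit is not None:
        return hit[1]
    # 2. Otherwise consult both the Upper and Lower dictionaries.
    up = widest_match(upper, pixels)
    lo = widest_match(lower, pixels)
    if up is not None and lo is None:
        return up[1]
    if lo is not None and up is None:
        return lo[1]
    # 3. Both matched: prefer whichever matched the wider glyph.
    if up is not None and lo is not None and up[0] != lo[0]:
        return up[1] if up[0] > lo[0] else lo[1]
    # 4. Same width, or no match at all: must ask.
    return None
```

In this model a Word entry such as `{"|ron": "Iron"}` resolves `|ron...`
without a query, while a bare `|` present in both Upper and Lower at the same
width still yields None, which is why the README asks for Word entries instead
of a `vertical stick' mapping.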