X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?a=blobdiff_plain;f=pctb%2FREADME.charset;h=65aa51aee70e1cf85be1be3229ed8d3dd54df4cb;hb=b958771fa67513ba09630953ec91b9d21b3f42f9;hp=cd9d1c9fa7c55db7a612f2fd82f14469adb90144;hpb=6a47e3724aa9178d1e3d6fb6cda96c5baa1df5d0;p=ypp-sc-tools.web-live.git

diff --git a/pctb/README.charset b/pctb/README.charset
index cd9d1c9..65aa51a 100644
--- a/pctb/README.charset
+++ b/pctb/README.charset
@@ -15,7 +15,7 @@ Recognition algorithm
 ---------------------
 
 We recognise the text in the commodity screen by doing exact matching
-of `glyph' bitmaps, against the bitmap in each cell in the commodity
+of `glyph' images, against the image in each cell in the commodity
 table. We match from left to right.
 
 We do not insist that each glyph is followed by whitespace, and nor do
@@ -35,18 +35,42 @@ correct character. The longest match rule will mean it will prefer
 the entry you have just made.
 
 
-Upper vs lower case - important note regarding `l' and `I'
-----------------------------------------------------------
+Matching context - Upper/Lower/Digit/Word dictionaries
+------------------------------------------------------
 
-We maintain separate dictionaries for upper case (Upper), lower case
-(Lower), and (initial portions of) mid-phrase words (Word). At the
-beginning of each cell in the table, we expect uppercase; in the
-middle of a word we expect lowercase; and, unfortunately, after an
+We maintain separate dictionaries for the following types of glyph
+
+  Upper:
+       Upper case letters and ligatures starting with an
+       uppercase letter. Punctuation excluding `>'.
+  Lower:
+       Lower case letters and ligatures starting with a
+       lowercase letter.
+  Digit:
+       Digits and the greater than sign `>' (which can also
+       appear in the quantity field in the commodity display)
+  Word:
+       Words (or unambigous initial chunks of words) starting with
+       `l' or `I' - see the note, below.
+
+When you add an entry, you should add it to the appropriate dictionary
+for its matching context. You can do this by selecting the
+appropriate radiobutton or by pressing one of letters U D L W (the
+initial letters of the contexts) after moving the cursor to the
+appropriate spot but before hitting `Return' to enter the text for the
+new entry.
+
+
+Note regarding `l' and `I'
+--------------------------
+
+At the beginning of each cell in the table, we expect uppercase; in
+the middle of a word we expect lowercase; and, unfortunately, after an
 inter-word gap, we are not sure. This is troublesome because `l' and
 `I' look identical on the screen.
 
-So any time we see a word starting with `l' or `I', the program has to
-ask about it.
+So any time we see an unfamiliar word starting with `l' or `I', the
+program has to ask about it.
 
 After an interword gap, we first search for a Word entry in the
 dictionary. If there is a match we use it. Otherwise we search both
@@ -54,11 +78,11 @@ the uppercase and lowercase dictionaries; if one matches and the other
 doesn't, or one matches a wider character than the other, we use it.
 If that fails to resolve the ambiguity we must ask.
 
-*Do not* make an entry in the character set dictionary mapping
-`vertical stick' to `l' or `I'. Instead, select enough of the whole
-word in question that no word would start with the other letter, and
-enter the whole word or part of it as a new glyph as a new entry in
-the Word dictionary.
+Don't try to make an entry in the character set dictionary mapping
+`vertical stick' to `l' or `I'. Instead, select the whole word (or
+enough of it that no different word would start with the other
+letter), and enter the whole thing as a new glyph in the Word
+dictionary.
 
 For example, in the supplied dictionary there is already a glyph for
 `Iron'; this is OK because there are no words which start `lron'.
@@ -76,11 +100,11 @@ recognised as.
 
 *You should check the alleged context before entering a character*.
 If it is wrong, you should fix it, rather that just making an entry
-for the uppercase letter in the lowercase dictionary.
+in the wrong dictionary.
 
-Instead, make a new glyph for the last letter of the previous word
-plus the (unusually narrow) inter-word space, and end that entry with
-a literal space ` '.
+When this happens, instead, make a new glyph for the last letter of
+the previous word plus the (unusually narrow) inter-word space, and
+end that entry with a literal space ` '.
 
 For example, you might find that `yG' is treated as `y' and the G
 doesn't get matched. Select the `y'
@@ -100,33 +124,45 @@ the top left corner of the `w' overlapping. This is dealt with by
 matching the whole merged thing - select the region of the screen
 containing `yw' and define it as `yw'.
 
+Such a combined entry - a ligature - is only needed if the letters
+cannot be separated at all. It's not needed if they merely abut.
+
 
 Fixing mistakes
 ---------------
 
-The OCR query UI allows you to delete things from the glyph
-dictionary. However since you are not guaranteed to actually get an
-OCR query at all (and since it is not possible to override the
-presence of an entry in the master database with the absence of one in
-the local database), if the dictionary contains errors, you shouldn't
-rely on this.
+The OCR query UI allows you to delete things from the local glyph
+dictionary. However you are not guaranteed to actually get an OCR
+query at all (and since it is not possible to override the presence of
+an entry in the master database with the absence of one in the local
+database). So this is not a reliable feature for being able to fix
+errors.
 
 If you think you have made mistakes answering OCR queries (for
 example, the recognised data is wrong), you should delete the file
-#local-char*#.txt, which contains your local updates. It will then
+_local-char*.txt, which contains your local updates. It will then
 only use the centrally provided (and vetted) master file (which is
 automatically updated when you run the PCTB client, by default).
 
+It is also possible to have the OCR system reject particular strings.
+If you put a regexp in _local-reject.txt, any OCR result which
+matches this string will instead cause an OCR failure, invoking the
+OCR dictionary editor if appropriate. _master-reject.txt is the
+centrally maintained version of this file.
+
+Alternatively you can edit _local-char*.txt with a text editor. The
+format is not documented at the moment.
+
 
 Enabling interactive character set update
 -----------------------------------------
 
 Now that you have read this document, you should rerun your OCR job
-with the --edit-charset option. You probably want to supply --same as
-well, to avoid having to wait for it to page through and recapture all
-the screenshots. So, this time,
-  ./ypp-commodities --edit-charset --same
-and in future, just always run it with the --edit-charset option.
+with the --edit-charset option. So run
+  ./ypp-commodities --edit-charset
+In future, this option is not usually needed, because it is the
+default if there is a local character set dictionary _local-.txt
+for the relevant character height.
 
 With --edit-charset, when the OCR finds characters it does not
 understand, it will put up an OCR resolution query window. This will
@@ -139,15 +175,9 @@ machinery works, and the possible mistakes you can make, before
 answering the program. So *Please read this documentation*, which
 explains the meaning of the entries you make.
 
-Be sure to check or specify the dictionary to which the new glyph
-should be added. Normally the default will be the Word dictionary
-which is right if the match failure is a new word starting with l or I
-(see above). You will need to change this to the Upper or Lower
-dictionary for new ligatures. You should not need to add new Digits.
-
-Also, the character set updates you make will by default be submitted
-to my server so that they can be checked by me and shared with other
-users. See README.privacy.
+The character set updates you make will by default be submitted to my
+server so that they can be checked by me and shared with other users.
+See README.privacy.
 
 If you need help please ask me (ijackson@chiark.greenend.org.uk, or
 Aristarchus on Midnight in game if I'm on line, or ask any pirate of