X-Git-Url: https://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?p=ypp-sc-tools.db-test.git;a=blobdiff_plain;f=pctb%2FREADME.charset;fp=pctb%2FREADME.charset;h=e1fd3ff5f86ba8f41e9843d2255e11edefe120b2;hp=0000000000000000000000000000000000000000;hb=6a3c0962283d32bc6e5f6c47c929baf37ddc642f;hpb=cde017ed6b76840ce2ae1aa5fc740a6e06352f92

diff --git a/pctb/README.charset b/pctb/README.charset
new file mode 100644
index 0000000..e1fd3ff
--- /dev/null
+++ b/pctb/README.charset
@@ -0,0 +1,137 @@
+Handling OCR failures
+---------------------
+
+Sometimes the OCR will not be able to recognise some text. By
+default, when this happens, the program will stop with a fatal error
+and refer you to this document.
+
+It is possible to fix this by editing the character set dictionary used
+by the OCR algorithm. But it is important to get these inputs right
+or your client may misrecognise text in future. You *must* read the
+documentation here first.
+
+
+Recognition algorithm
+---------------------
+
+We recognise the text in the commodity screen by doing exact matching
+of `glyph' bitmaps against the bitmap in each cell in the commodity
+table. We match from left to right.
+
+We do not insist that each glyph is followed by whitespace, nor do
+we insist that glyphs do not contain whitespace. Our glyph dictionary
+can contain entries which are strict prefixes of other entries - that
+is, a glyph for (say) `v' which is the leftmost part of another glyph
+for (say) `w'. We resolve these ambiguities by taking the longest
+(widest) glyph which matches.
+
+So you should not be surprised if the program has matched the
+left-hand half of some letter and thinks it is a different letter. If
+the part that it did recognise does look like the letter in question,
+that isn't wrong. All you need to do is insert the whole of the
+actual letter in the dictionary - move the LH cursor to the start of the
+letter and the RH cursor to its end, hit `return', and enter the
+correct character.
+The longest-match rule means it will prefer the entry you have just made.
+
+
+Upper vs lower case - important note regarding `l' and `I'
+----------------------------------------------------------
+
+We maintain separate dictionaries for upper and lower case. At the
+beginning of each cell in the table, we expect uppercase; in the
+middle of a word we expect lowercase; and, unfortunately, after an
+inter-word gap, we are not sure.
+
+This is troublesome because `l' and `I' look identical on the screen.
+So any time we see a word starting with `l' or `I', the program has to
+ask about it.
+
+*Do not* make an entry in the character set dictionary mapping `vertical
+stick' to `l' or `I'. Instead, select enough of the whole word in
+question that no word would start with the other letter, and enter the
+whole word or part of it as a new glyph.
+
+For example, in the supplied dictionary there is already a glyph for
+`Iron'; this is OK because there are no words which start `lron'.
+
+
+Short inter-word gaps
+---------------------
+
+It can happen that the problem you are being asked about is caused by
+the program failing to spot an inter-word gap, mistakenly assuming
+that the next word must be in lowercase, and so failing to recognise
+an uppercase letter. The context in which each glyph was recognised
+is shown on the screen, underneath the text which shows what it was
+recognised as.
+
+*You should check the alleged context before entering a character*.
+If it is wrong, you should fix it, rather than just making an entry
+for the uppercase letter in the lowercase dictionary.
+
+To fix it, make a new glyph for the last letter of the previous word
+plus the (unusually narrow) inter-word space, and end that entry with
+\x20 (yes, type \ x 20).
+
+For example, you might find that `yG' is treated as
+`y' and the G doesn't get matched. Select the `y'
+region of the bitmap and type `y\x20' into the string box.
+Sorry for this rather poor UI!
+
+
+Overlapping characters - ligatures
+----------------------------------
+
+Some of the characters in the font used overlap with the next
+character. When this happens, select both the characters and enter
+them together as one glyph with a multi-character definition.
+
+For example `yw' is rendered with the top right corner of the `y' and
+the top left corner of the `w' overlapping. This is dealt with by
+matching the whole merged thing - select the region of the screen
+containing `yw' and define it as `yw'.
+
+
+Fixing mistakes
+---------------
+
+The OCR query UI allows you to delete things from the glyph dictionary.
+However, since you are not guaranteed to actually get an OCR query at
+all if the dictionary contains errors, you shouldn't rely on this.
+
+If you think you have made mistakes answering OCR queries (for
+example, the recognised data is wrong), you should download a fresh
+copy of charset-15.txt from
+  http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
+
+
+Enabling interactive character set update
+-----------------------------------------
+
+Now that you have read this document, you should rerun your OCR job
+with the --edit-charset option. You probably want to supply --same as
+well, to avoid having to wait for it to page through and recapture all
+the screenshots. So, this time,
+  ./ypp-commodities --edit-charset --same
+and in future, just always run it with the --edit-charset option.
+
+With --edit-charset, when the OCR finds characters it does not
+understand, it will put up an OCR resolution query window. This will
+display the part of the text it is having trouble with, showing where
+it has got to, and allow you to edit the character set dictionary it
+uses for recognising the text.
+
+*This is subtle* and it is important to understand the way the
+machinery works, and the possible mistakes you can make, before
+answering the program.
+*Please read this documentation*, which explains what your entries mean.
+
+Also, the character set updates you make will by default be submitted
+to my server so that they can be checked by me and shared with other
+users. See README.privacy.
+
+If you need help please ask me (ijackson@chiark.greenend.org.uk, or
+Aristarchus on Midnight in game if I'm online, or ask any pirate of
+the crew Special Circumstances if they happen to know where I am
+and/or can get in touch).
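
The longest-match rule described under `Recognition algorithm' in the
README above can be illustrated with a small sketch. This is not the
program's actual code or its dictionary format: the function names, the
use of one character per pixel column, and the sample dictionary are all
assumptions made purely for illustration.

```python
# Hypothetical sketch of left-to-right, longest-match glyph recognition.
# Assumption: a bitmap is modelled as a string with one character per
# pixel column, and the dictionary maps such strings to the text they
# stand for. The real charset file format is NOT reproduced here.


def longest_match(columns, pos, glyphs):
    """Return (text, width) for the widest glyph matching at pos, or None.

    Glyph keys are assumed to be non-empty strings.
    """
    best = None
    for bitmap, text in glyphs.items():
        if columns.startswith(bitmap, pos):
            if best is None or len(bitmap) > best[1]:
                best = (text, len(bitmap))
    return best


def recognise(columns, glyphs):
    """Match glyphs from left to right; fail if nothing matches."""
    out = []
    pos = 0
    while pos < len(columns):
        hit = longest_match(columns, pos, glyphs)
        if hit is None:
            # Roughly where the program would stop, or ask you when run
            # with --edit-charset.
            raise ValueError(f"unrecognised glyph at column {pos}")
        text, width = hit
        out.append(text)
        pos += width
    return "".join(out)


# Made-up column codes: "ab" stands for the bitmap of `v' and "abab" for
# `w', whose left-hand half looks exactly like a `v'.
glyphs = {"ab": "v", "abab": "w", "c": "o"}

print(recognise("ababc", glyphs))
```

With the `w' entry present, the wider glyph wins and the input is read
as `wo'. If that entry were missing, the same columns would be read as
`vvo': the left-hand half of the `w' matches `v', which is why the fix
is to add the whole letter rather than to delete the `v' entry.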