X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?a=blobdiff_plain;f=pctb%2FREADME.charset;h=e1fd3ff5f86ba8f41e9843d2255e11edefe120b2;hb=3063e05a93fb97a5eca7f26c38da94fa4000406e;hp=bbabb057e6bedd1016f92dd1d2fc63acc41ac24f;hpb=2337ae5465a29659b44037dcbdaf6fa03eb46d84;p=ypp-sc-tools.web-live.git

diff --git a/pctb/README.charset b/pctb/README.charset
index bbabb05..e1fd3ff 100644
--- a/pctb/README.charset
+++ b/pctb/README.charset
@@ -1,19 +1,14 @@
-Character set query tool, and semantics of the glyphs
------------------------------------------------------
+Handing OCR failures
+--------------------
 
-Sometimes the OCR will not be able to recognise some text and you will
-have to help it out. It will display the part it is having trouble
-with, showing where it has got to, and allow you to edit the character
-set database it uses for recognising the text.
+Sometimes the OCR will not be able to recognise some text. By
+default, when this happens, the program will stop with a fatal error
+and refer you to this document.
 
-*This is subtle* and it is important to understand the way the
-machinery works, and the possible mistakes you can make, before
-answering the program. *Please read this documentation*
-
-If you need help please ask me (ijackson@chiark.greenend.org.uk, or
-Aristarchus on Midnight in game if I'm on line, or ask any pirate of
-the crew Special Circumstances if they happen to know where I am
-and/or can get in touch).
+It is possible to fix this by editing the character set dictionary used
+by the OCR algorithm. But, it is important to get these inputs right
+or your client may misrecognise text in future. You *must* read the
+documentation here first.
 
 
 Recognition algorithm
@@ -24,7 +19,7 @@ of `glyph' bitmaps, against the bitmap in each cell in the commodity
 table. We match from left to right.
 
 We do not insist that each glyph is followed by whitespace, and nor do
-we insist that glyphs do not contain whitespace. Our glyph database
+we insist that glyphs do not contain whitespace. Our glyph dictionary
 can contain entries which are strict prefixes of other entries - that
 is, a glyph for (say) `v' which is the leftmost part of another glyph
 for (say) `w'. We resolve these ambiguities by taking the longest
@@ -34,7 +29,7 @@ So you should not be surprised if the program has matched the
 left-hand half of some letter and thinks it is a different letter. If
 the part that it did recognise does look like the letter in question,
 that isn't wrong. All you need to do is insert the whole of the
-actual letter in the database - move the LH cursor to the start of the
+actual letter in the dictionary - move the LH cursor to the start of the
 letter, and the RH cursor to its end, and hit `return' and enter the
 correct character. The longest match rule will mean it will prefer
 the entry you have just made.
@@ -43,7 +38,7 @@ the entry you have just made.
 Upper vs lower case - important note regarding `l' and `I'
 ----------------------------------------------------------
 
-We maintain separate databases for upper and lower case. At the
+We maintain separate dictionaries for upper and lower case. At the
 beginning of each cell in the table, we expect uppercase; in the
 middle of a word we expect lowercase; and, unfortunately, after an
 inter-word gap, we are not sure.
@@ -52,18 +47,14 @@ This is troublesome because `l' and `I' look identical on the screen.
 So any time we see a word starting with `l' or `I', the program has to
 ask about it.
 
-*Do not* make an entry in the character set database mapping `vertical
+*Do not* make an entry in the character set dictionary mapping `vertical
 stick' to `l' or `I'. Instead, select enough of the whole word in
 question that no word would start with the other letter, and enter the
 whole word or part of it as a new glyph.
 
-For example, in the supplied database there is already a glyph for
+For example, in the supplied dictionary there is already a glyph for
 `Iron'; this is OK because there are no words which start `lron'.
 
-Do not make an entry for a string more than 7 characters long;
-currently we cannot cope (and you'll have to remove it manually from
-the charset-15.txt file).
-
 
 Short inter-word gaps
 ---------------------
@@ -77,7 +68,7 @@ recognised as.
 
 *You should check the alleged context before entering a character*.
 If it is wrong, you should fix it, rather that just making an entry
-for the uppercase letter in the lowercase database.
+for the uppercase letter in the lowercase dictionary.
 
 Instead, make a new glyph for the last letter of the previous word
 plus the (unusually narrow) inter-word space, and end that entry with
@@ -105,9 +96,9 @@ containing `yw' and define it as `yw'.
 Fixing mistakes
 ---------------
 
-The OCR query UI allows you to delete things from the glyph database.
+The OCR query UI allows you to delete things from the glyph dictionary.
 However since you are not guaranteed to actually get an OCR query at
-all if the database contains errors, you shouldn't rely on this.
+all if the dictionary contains errors, you shouldn't rely on this.
 
 If you think you have made mistakes answering OCR queries (for
 example, the recognised data is wrong), you should download a fresh
@@ -115,11 +106,32 @@ copy of charset-15.txt from
 http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
 
 
-Send me your updates
---------------------
+Enabling interactive character set update
+-----------------------------------------
 
-The character set is in the file `charset-15.txt'. When you enter new
-characters, they are added there. If you do this, please email me
-your charset file (ijackson@chiark.greenend.org.uk) so that I can
-include your contributions in future versions. This will also let me
-check that they seem right :-).
+Now that you have read this document, you should rerun your OCR job
+with the --edit-charset option. You probably want to supply --same as
+well, to avoid having to wait for it to page through and recapture all
+the screenshots. So, this time,
+  ./ypp-commodities --edit-charset --same
+and in future, just always run it with the --edit-charset option.
+
+With --edit-charset, when the OCR finds characters it does not
+understand, it will put up an OCR resolution query window. This will
+display the part of the text it is having trouble with, showing where
+it has got to, and allow you to edit the character set dictionary it
+uses for recognising the text.
+
+*This is subtle* and it is important to understand the way the
+machinery works, and the possible mistakes you can make, before
+answering the program. *Please read this documentation*, which
+explains the meaning of the entries you make.
+
+Also, the character set updates you make will by default be submitted
+to my server so that they can be checked by me and shared with other
+users. See README.privacy.
+
+If you need help please ask me (ijackson@chiark.greenend.org.uk, or
+Aristarchus on Midnight in game if I'm on line, or ask any pirate of
+the crew Special Circumstances if they happen to know where I am
+and/or can get in touch).
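
Illustration (not part of the patch): the longest-match rule described
under `Recognition algorithm' can be sketched roughly as below. This is
only a toy model, not the program's actual code. It assumes each glyph
is stored as a tuple of integers, one per pixel column, and that an
all-zero column is spacing; the names recognise, V and W and the bitmap
values are made up for the example.

```python
def recognise(columns, glyphs):
    """Greedy left-to-right recognition, preferring the longest glyph.

    columns: list of ints, one per pixel column (hypothetical encoding).
    glyphs:  dict mapping tuples of column values to the text they mean.
    Raises LookupError at the first position where no glyph matches.
    """
    out = []
    pos = 0
    max_len = max((len(key) for key in glyphs), default=0)
    while pos < len(columns):
        if columns[pos] == 0:
            # Assumption: a blank column between glyphs is just spacing.
            pos += 1
            continue
        # Try the widest candidate first so that a longer entry wins
        # over any entry that is a strict prefix of it.
        for length in range(min(max_len, len(columns) - pos), 0, -1):
            key = tuple(columns[pos:pos + length])
            if key in glyphs:
                out.append(glyphs[key])
                pos += length
                break
        else:
            raise LookupError(f"unrecognised glyph at column {pos}")
    return "".join(out)


# Made-up bitmaps: V is the leftmost part of W, as in the `v'/`w' case.
V = (0b100, 0b011, 0b100)
W = V + (0b011, 0b100)
```

With only V in the dictionary, recognise(list(W), {V: "v"}) matches the
left half as `v' and then fails on the remainder, which is the situation
where the real program would ask for help. Once W is added as well, the
same input yields `w', because the longer entry is preferred.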