X-Git-Url: https://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?p=ypp-sc-tools.db-test.git;a=blobdiff_plain;f=pctb%2FREADME.charset;h=0d7f1623d00dbb735e43a56e4e5a293008cb4e4f;hp=bbabb057e6bedd1016f92dd1d2fc63acc41ac24f;hb=21b1420b1f35ea2ae9440f9db9009093a8b6eae2;hpb=2337ae5465a29659b44037dcbdaf6fa03eb46d84

diff --git a/pctb/README.charset b/pctb/README.charset
index bbabb05..0d7f162 100644
--- a/pctb/README.charset
+++ b/pctb/README.charset
@@ -1,19 +1,14 @@
-Character set query tool, and semantics of the glyphs
------------------------------------------------------
-
-Sometimes the OCR will not be able to recognise some text and you will
-have to help it out. It will display the part it is having trouble
-with, showing where it has got to, and allow you to edit the character
-set database it uses for recognising the text.
+Handling OCR failures
+---------------------
 
-*This is subtle* and it is important to understand the way the
-machinery works, and the possible mistakes you can make, before
-answering the program. *Please read this documentation*
+Sometimes the OCR will not be able to recognise some text. By
+default, when this happens, the program will stop with a fatal error
+and refer you to this document.
 
-If you need help please ask me (ijackson@chiark.greenend.org.uk, or
-Aristarchus on Midnight in game if I'm on line, or ask any pirate of
-the crew Special Circumstances if they happen to know where I am
-and/or can get in touch).
+It is possible to fix this by editing the character set dictionary used
+by the OCR algorithm. But it is important to get these inputs right
+or your client may misrecognise text in future. You *must* read the
+documentation here first.
 
 
 Recognition algorithm
@@ -24,7 +19,7 @@ of `glyph' bitmaps, against the bitmap in each cell in the commodity
 table. We match from left to right.
 
 We do not insist that each glyph is followed by whitespace, and nor do
-we insist that glyphs do not contain whitespace. Our glyph database
+we insist that glyphs do not contain whitespace. Our glyph dictionary
 can contain entries which are strict prefixes of other entries - that
 is, a glyph for (say) `v' which is the leftmost part of another glyph
 for (say) `w'. We resolve these ambiguities by taking the longest
@@ -34,7 +29,7 @@ So you should not be surprised if the program has matched the
 left-hand half of some letter and thinks it is a different letter. If
 the part that it did recognise does look like the letter in question,
 that isn't wrong. All you need to do is insert the whole of the
-actual letter in the database - move the LH cursor to the start of the
+actual letter in the dictionary - move the LH cursor to the start of the
 letter, and the RH cursor to its end, and hit `return' and enter the
 correct character. The longest match rule will mean it will prefer
 the entry you have just made.
@@ -43,7 +38,8 @@ the entry you have just made.
 Upper vs lower case - important note regarding `l' and `I'
 ----------------------------------------------------------
 
-We maintain separate databases for upper and lower case. At the
+We maintain separate dictionaries for upper case (Upper), lower case
+(Lower), and (initial portions of) mid-phrase words (Word). At the
 beginning of each cell in the table, we expect uppercase; in the
 middle of a word we expect lowercase; and, unfortunately, after an
 inter-word gap, we are not sure.
@@ -52,17 +48,20 @@ This is troublesome because `l' and `I' look identical on the screen.
 So any time we see a word starting with `l' or `I', the program has to
 ask about it.
 
-*Do not* make an entry in the character set database mapping `vertical
-stick' to `l' or `I'. Instead, select enough of the whole word in
-question that no word would start with the other letter, and enter the
-whole word or part of it as a new glyph.
+After an interword gap, we first search for a Word entry in the
+dictionary. If there is a match we use it. Otherwise we search both
+the uppercase and lowercase dictionaries; if one matches and the other
+doesn't, or one matches a wider character than the other, we use it.
+If that fails to resolve the ambiguity we must ask.
 
-For example, in the supplied database there is already a glyph for
-`Iron'; this is OK because there are no words which start `lron'.
+*Do not* make an entry in the character set dictionary mapping
+`vertical stick' to `l' or `I'. Instead, select enough of the whole
+word in question that no word would start with the other letter, and
+enter the whole word or part of it as a new glyph in the Word
+dictionary.
 
-Do not make an entry for a string more than 7 characters long;
-currently we cannot cope (and you'll have to remove it manually from
-the charset-15.txt file).
+For example, in the supplied dictionary there is already a glyph for
+`Iron'; this is OK because there are no words which start `lron'.
 
 
 Short inter-word gaps
@@ -77,16 +76,15 @@ recognised as.
 
 *You should check the alleged context before entering a character*.
 If it is wrong, you should fix it, rather that just making an entry
-for the uppercase letter in the lowercase database.
+for the uppercase letter in the lowercase dictionary.
 
 Instead, make a new glyph for the last letter of the previous word
 plus the (unusually narrow) inter-word space, and end that entry with
-\x20 (yes, type \ x 20).
+a literal space ` '.
 
 For example, you might find that `yG'
 is treated as `y' and the G doesn't get matched. Select the `y'
-region of the bitmap and type `y\x20' into the string box.
-Sorry for this rather poor UI!
+region of the bitmap and type `y ' into the string box.
 
 
 Overlapping characters - ligatures
@@ -94,7 +92,8 @@ Overlapping characters - ligatures
 
 Some of the characters in the font used overlap with the next
 character. When this happens, select both the characters and enter
-them together as one glyph with a multi-character definition.
+them together as one glyph with a multi-character definition, as a new
+entry in the Lower or Upper dictionary.
 
 For example `yw' is rendered with the top right corner of the `y' and
 the top left corner of the `w' overlapping. This is dealt with by
@@ -105,21 +104,54 @@ containing `yw' and define it as `yw'.
 Fixing mistakes
 ---------------
 
-The OCR query UI allows you to delete things from the glyph database.
-However since you are not guaranteed to actually get an OCR query at
-all if the database contains errors, you shouldn't rely on this.
+The OCR query UI allows you to delete things from the glyph
+dictionary. However, if the dictionary contains errors you are not
+guaranteed to actually get an OCR query at all, and it is not
+possible to override the presence of an entry in the master database
+with the absence of one in the local database, so you shouldn't
+rely on this.
 
 If you think you have made mistakes answering OCR queries (for
-example, the recognised data is wrong), you should download a fresh
-copy of charset-15.txt from
-  http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
+example, the recognised data is wrong), you should delete the file
+#local-char*#.txt, which contains your local updates. The program will
+then only use the centrally provided (and vetted) master file (which is
+automatically updated when you run the PCTB client, by default).
 
 
-Send me your updates
---------------------
+Enabling interactive character set update
+-----------------------------------------
 
-The character set is in the file `charset-15.txt'. When you enter new
-characters, they are added there. If you do this, please email me
-your charset file (ijackson@chiark.greenend.org.uk) so that I can
-include your contributions in future versions. This will also let me
-check that they seem right :-).
+Now that you have read this document, you should rerun your OCR job
+with the --edit-charset option. You probably want to supply --same as
+well, to avoid having to wait for it to page through and recapture all
+the screenshots. So, this time,
+  ./ypp-commodities --edit-charset --same
+and in future, just always run it with the --edit-charset option.
+
+With --edit-charset, when the OCR finds characters it does not
+understand, it will put up an OCR resolution query window. This will
+display the part of the text it is having trouble with, showing where
+it has got to, and allow you to edit the character set dictionary it
+uses for recognising the text.
+
+The process is subtle and it is important to understand the way the
+machinery works, and the possible mistakes you can make, before
+answering the program. So *please read this documentation*, which
+explains the meaning of the entries you make.
+
+You must specify the dictionary to which the new glyph should be
+added, by selecting the appropriate radiobutton or by pressing one
+of U D L W for Upper, Digit, Lower, Word. Word is only correct if
+the match failure is a new word starting with l or I (see above).
+Upper or Lower is correct for single letters and for new
+ligatures. Use Upper for punctuation and Digit for `>' and
+digits.
+
+The character set updates you make will by default be submitted to my
+server so that they can be checked by me and shared with other users.
+See README.privacy.
+
+If you need help please ask me (ijackson@chiark.greenend.org.uk, or
+Aristarchus on Midnight in game if I'm on line, or ask any pirate of
+the crew Special Circumstances if they happen to know where I am
+and/or can get in touch).
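
Illustrative note (not part of the patch): the longest-match rule and
the lookup order after an inter-word gap, as described in the README
text above, can be sketched roughly as follows. This is a toy Python
model, not code from ypp-sc-tools. The representation of a glyph as a
tuple of per-column integers, and the names Glyph, longest_match and
match_after_gap, are assumptions made purely for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple

# Hypothetical representation: a glyph is a tuple with one integer per
# pixel column; a dictionary maps a glyph to the string it stands for.
Glyph = Tuple[int, ...]
Dictionary = Dict[Glyph, str]
Match = Tuple[int, str]  # (width in columns, recognised text)


def longest_match(columns: Sequence[int], pos: int,
                  dictionary: Dictionary) -> Optional[Match]:
    """Return the widest dictionary entry matching at pos, or None."""
    best: Optional[Match] = None
    for glyph, text in dictionary.items():
        width = len(glyph)
        if width == 0:
            continue
        # A slice that runs off the end is shorter than the glyph,
        # so it simply fails to compare equal.
        if tuple(columns[pos:pos + width]) == glyph:
            if best is None or width > best[0]:
                best = (width, text)
    return best


def match_after_gap(columns: Sequence[int], pos: int,
                    word: Dictionary, upper: Dictionary,
                    lower: Dictionary) -> Optional[Match]:
    """Resolve upper/lower ambiguity after an inter-word gap.

    Returns None when the user would have to be asked.
    """
    hit = longest_match(columns, pos, word)
    if hit is not None:
        return hit  # a Word entry wins outright
    up = longest_match(columns, pos, upper)
    lo = longest_match(columns, pos, lower)
    if up is None or lo is None:
        return up or lo  # only one matched (or neither: None)
    if up[0] != lo[0]:
        return up if up[0] > lo[0] else lo  # the wider match wins
    return None  # both match with equal width: ambiguous
```

In this model a bare `vertical stick' present in both Upper and Lower
gives two matches of equal width, so match_after_gap returns None and
the user is asked, whereas a wider Word entry such as `Iron' is found
first and settles the question.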