X-Git-Url: https://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?a=blobdiff_plain;f=pctb%2FREADME.charset;h=e1fd3ff5f86ba8f41e9843d2255e11edefe120b2;hb=cde017ed6b76840ce2ae1aa5fc740a6e06352f92;hp=bbabb057e6bedd1016f92dd1d2fc63acc41ac24f;hpb=2337ae5465a29659b44037dcbdaf6fa03eb46d84;p=ypp-sc-tools.web-live.git

diff --git a/pctb/README.charset b/pctb/README.charset
deleted file mode 100644
index bbabb05..0000000
--- a/pctb/README.charset
+++ /dev/null
@@ -1,125 +0,0 @@
-Character set query tool, and semantics of the glyphs
------------------------------------------------------
-
-Sometimes the OCR will not be able to recognise some text and you will
-have to help it out. It will display the part it is having trouble
-with, showing where it has got to, and allow you to edit the character
-set database it uses for recognising the text.
-
-*This is subtle* and it is important to understand the way the
-machinery works, and the possible mistakes you can make, before
-answering the program. *Please read this documentation*
-
-If you need help please ask me (ijackson@chiark.greenend.org.uk, or
-Aristarchus on Midnight in game if I'm on line, or ask any pirate of
-the crew Special Circumstances if they happen to know where I am
-and/or can get in touch).
-
-
-Recognition algorithm
----------------------
-
-We recognise the text in the commodity screen by doing exact matching
-of `glyph' bitmaps, against the bitmap in each cell in the commodity
-table. We match from left to right.
-
-We do not insist that each glyph is followed by whitespace, and nor do
-we insist that glyphs do not contain whitespace. Our glyph database
-can contain entries which are strict prefixes of other entries - that
-is, a glyph for (say) `v' which is the leftmost part of another glyph
-for (say) `w'. We resolve these ambiguities by taking the longest
-(widest) glyph which matches.
-
-So you should not be surprised if the program has matched the
-left-hand half of some letter and thinks it is a different letter. If
-the part that it did recognise does look like the letter in question,
-that isn't wrong. All you need to do is insert the whole of the
-actual letter in the database - move the LH cursor to the start of the
-letter, and the RH cursor to its end, and hit `return' and enter the
-correct character. The longest match rule will mean it will prefer
-the entry you have just made.
-
-
-Upper vs lower case - important note regarding `l' and `I'
-----------------------------------------------------------
-
-We maintain separate databases for upper and lower case. At the
-beginning of each cell in the table, we expect uppercase; in the
-middle of a word we expect lowercase; and, unfortunately, after an
-inter-word gap, we are not sure.
-
-This is troublesome because `l' and `I' look identical on the screen.
-So any time we see a word starting with `l' or `I', the program has to
-ask about it.
-
-*Do not* make an entry in the character set database mapping `vertical
-stick' to `l' or `I'. Instead, select enough of the whole word in
-question that no word would start with the other letter, and enter the
-whole word or part of it as a new glyph.
-
-For example, in the supplied database there is already a glyph for
-`Iron'; this is OK because there are no words which start `lron'.
-
-Do not make an entry for a string more than 7 characters long;
-currently we cannot cope (and you'll have to remove it manually from
-the charset-15.txt file).
-
-
-Short inter-word gaps
----------------------
-
-It can happen that the problem you are being asked about is caused by
-the program failing to spot an inter-word gap: it mistakenly thinks
-that the next word is necessarily in lowercase, so fails to recognise
-an uppercase letter. The context in which each glyph was recognised
-is shown on the screen, underneath the text which shows what it was
-recognised as.
-
-*You should check the alleged context before entering a character*.
-If it is wrong, you should fix it, rather than just making an entry
-for the uppercase letter in the lowercase database.
-
-Instead, make a new glyph for the last letter of the previous word
-plus the (unusually narrow) inter-word space, and end that entry with
-\x20 (yes, type \ x 20).
-
-For example, you might find that `yG' is treated as
-`y' and the G doesn't get matched. Select the `y'
-region of the bitmap and type `y\x20' into the string box.
-Sorry for this rather poor UI!
-
-
-Overlapping characters - ligatures
-----------------------------------
-
-Some of the characters in the font used overlap with the next
-character. When this happens, select both the characters and enter
-them together as one glyph with a multi-character definition.
-
-For example `yw' is rendered with the top right corner of the `y' and
-the top left corner of the `w' overlapping. This is dealt with by
-matching the whole merged thing - select the region of the screen
-containing `yw' and define it as `yw'.
-
-
-Fixing mistakes
----------------
-
-The OCR query UI allows you to delete things from the glyph database.
-However, since you are not guaranteed to actually get an OCR query at
-all if the database contains errors, you shouldn't rely on this.
-
-If you think you have made mistakes answering OCR queries (for
-example, the recognised data is wrong), you should download a fresh
-copy of charset-15.txt from
-  http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
-
-
-Send me your updates
---------------------
-
-The character set is in the file `charset-15.txt'. When you enter new
-characters, they are added there. If you do this, please email me
-your charset file (ijackson@chiark.greenend.org.uk) so that I can
-include your contributions in future versions. This will also let me
-check that they seem right :-).
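
Note on the longest-match rule described under `Recognition algorithm'
in the removed text. The following is a minimal Python sketch of that
rule, not the code pctb actually uses. It assumes a bitmap can be
modelled as a tuple of pixel columns, each column an integer bitmask,
and that an all-zero column is blank. The names GLYPHS, CELL,
longest_match and recognise, and the bit patterns themselves, are
invented for illustration.

```python
# Illustrative sketch only: the data model and all names are assumptions.

# Hypothetical glyph database: tuple of columns -> recognised text.
# The `v' entry is a strict prefix of the `w' entry.
GLYPHS = {
    (0b110, 0b001): "v",
    (0b110, 0b001, 0b010, 0b001, 0b110): "w",
}

# Hypothetical cell bitmap: a `w', one blank column, then a `v'.
CELL = (0b110, 0b001, 0b010, 0b001, 0b110, 0b000, 0b110, 0b001)


def longest_match(db, columns, start):
    """Return (text, width) for the widest glyph matching at start, or None."""
    best = None
    for glyph, text in db.items():
        width = len(glyph)
        # Exact matching: every column of the glyph must be identical.
        if tuple(columns[start:start + width]) == glyph:
            if best is None or width > best[1]:
                best = (text, width)
    return best


def recognise(db, columns):
    """Read a cell from left to right, taking the widest glyph each time."""
    out = []
    x = 0
    while x < len(columns):
        if columns[x] == 0:
            # Blank column between glyphs; this sketch just skips it and
            # does not model inter-word gaps or upper/lower case context.
            x += 1
            continue
        hit = longest_match(db, columns, x)
        if hit is None:
            raise LookupError(f"no glyph matches at column {x}")
        text, width = hit
        out.append(text)
        x += width
    return "".join(out)
```

With both entries present, recognise(GLYPHS, CELL) reads the cell as
`wv'. With only the `v' entry, the sketch matches the left-hand part of
the `w' as `v' and then raises LookupError at column 2, which is the
situation where the real program would ask for a wider glyph.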