X-Git-Url: https://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?a=blobdiff_plain;f=pctb%2FREADME.charset;fp=pctb%2FREADME.charset;h=0000000000000000000000000000000000000000;hb=c68fb80a6bbf7acbcac4b2cb2143f5fea745cd2b;hp=65aa51aee70e1cf85be1be3229ed8d3dd54df4cb;hpb=b9cce976550d000f15e5a8f2b690740bdae1e468;p=ypp-sc-tools.db-test.git

diff --git a/pctb/README.charset b/pctb/README.charset
deleted file mode 100644
index 65aa51a..0000000
--- a/pctb/README.charset
+++ /dev/null
@@ -1,185 +0,0 @@
-Handling OCR failures
----------------------
-
-Sometimes the OCR will not be able to recognise some text. By
-default, when this happens, the program will stop with a fatal error
-and refer you to this document.
-
-It is possible to fix this by editing the character set dictionary used
-by the OCR algorithm. But, it is important to get these inputs right
-or your client may misrecognise text in future. You *must* read the
-documentation here first.
-
-
-Recognition algorithm
----------------------
-
-We recognise the text in the commodity screen by doing exact matching
-of `glyph' images, against the image in each cell in the commodity
-table. We match from left to right.
-
-We do not insist that each glyph is followed by whitespace, and nor do
-we insist that glyphs do not contain whitespace. Our glyph dictionary
-can contain entries which are strict prefixes of other entries - that
-is, a glyph for (say) `v' which is the leftmost part of another glyph
-for (say) `w'. We resolve these ambiguities by taking the longest
-(widest) glyph which matches.
-
-So you should not be surprised if the program has matched the
-left-hand half of some letter and thinks it is a different letter. If
-the part that it did recognise does look like the letter in question,
-that isn't wrong. All you need to do is insert the whole of the
-actual letter in the dictionary - move the LH cursor to the start of the
-letter, and the RH cursor to its end, and hit `return' and enter the
-correct character.
-The longest-match rule means it will prefer the entry you have just made.
-
-
-Matching context - Upper/Lower/Digit/Word dictionaries
-------------------------------------------------------
-
-We maintain separate dictionaries for the following types of glyph:
-
-  Upper:
-      Upper case letters and ligatures starting with an
-      uppercase letter. Punctuation excluding `>'.
-  Lower:
-      Lower case letters and ligatures starting with a
-      lowercase letter.
-  Digit:
-      Digits and the greater than sign `>' (which can also
-      appear in the quantity field in the commodity display).
-  Word:
-      Words (or unambiguous initial chunks of words) starting with
-      `l' or `I' - see the note, below.
-
-When you add an entry, you should add it to the appropriate dictionary
-for its matching context. You can do this by selecting the
-appropriate radiobutton or by pressing one of the letters U D L W (the
-initial letters of the contexts) after moving the cursor to the
-appropriate spot but before hitting `Return' to enter the text for the
-new entry.
-
-
-Note regarding `l' and `I'
---------------------------
-
-At the beginning of each cell in the table, we expect uppercase; in
-the middle of a word we expect lowercase; and, unfortunately, after an
-inter-word gap, we are not sure.
-
-This is troublesome because `l' and `I' look identical on the screen.
-So any time we see an unfamiliar word starting with `l' or `I', the
-program has to ask about it.
-
-After an inter-word gap, we first search for a Word entry in the
-dictionary. If there is a match we use it. Otherwise we search both
-the uppercase and lowercase dictionaries; if one matches and the other
-doesn't, or one matches a wider character than the other, we use it.
-If that fails to resolve the ambiguity we must ask.
-
-Don't try to make an entry in the character set dictionary mapping
-`vertical stick' to `l' or `I'.
-Instead, select the whole word (or enough of it that no different word
-would start with the other letter), and enter the whole thing as a new
-glyph in the Word dictionary.
-
-For example, in the supplied dictionary there is already a glyph for
-`Iron'; this is OK because there are no words which start `lron'.
-
-
-Short inter-word gaps
----------------------
-
-It can happen that the problem you are being asked about is caused by
-the program failing to spot an inter-word gap and mistakenly thinking
-that the next word is necessarily in lowercase, so it fails to recognise
-an uppercase letter. The context in which each glyph was recognised
-is shown on the screen, underneath the text which shows what it was
-recognised as.
-
-*You should check the alleged context before entering a character*.
-If it is wrong, you should fix it, rather than just making an entry
-in the wrong dictionary.
-
-When this happens, instead, make a new glyph for the last letter of
-the previous word plus the (unusually narrow) inter-word space, and
-end that entry with a literal space ` '.
-
-For example, you might find that `yG' is treated as
-`y' and the G doesn't get matched. Select the `y'
-region of the bitmap and type `y ' into the string box.
-
-
-Overlapping characters - ligatures
-----------------------------------
-
-Some of the characters in the font used overlap with the next
-character. When this happens, select both the characters and enter
-them together as one glyph with a multi-character definition, as a new
-entry in the Lower or Upper dictionary.
-
-For example, `yw' is rendered with the top right corner of the `y' and
-the top left corner of the `w' overlapping. This is dealt with by
-matching the whole merged thing - select the region of the screen
-containing `yw' and define it as `yw'.
-
-Such a combined entry - a ligature - is only needed if the letters
-cannot be separated at all. It's not needed if they merely abut.
-
-
-Fixing mistakes
----------------
-
-The OCR query UI allows you to delete things from the local glyph
-dictionary. However, you are not guaranteed to actually get an OCR
-query at all, and it is not possible to override the presence of
-an entry in the master database with the absence of one in the local
-database. So deleting entries in this way is not a reliable means of
-fixing errors.
-
-If you think you have made mistakes answering OCR queries (for
-example, the recognised data is wrong), you should delete the file
-_local-char*.txt, which contains your local updates. The program will
-then use only the centrally provided (and vetted) master file (which is
-automatically updated when you run the PCTB client, by default).
-
-It is also possible to have the OCR system reject particular strings.
-If you put a regexp in _local-reject.txt, any OCR result which
-matches this regexp will instead cause an OCR failure, invoking the
-OCR dictionary editor if appropriate. _master-reject.txt is the
-centrally maintained version of this file.
-
-Alternatively you can edit _local-char*.txt with a text editor. The
-format is not documented at the moment.
-
-
-Enabling interactive character set update
------------------------------------------
-
-Now that you have read this document, you should rerun your OCR job
-with the --edit-charset option. So run
-  ./ypp-commodities --edit-charset
-In future, this option is not usually needed, because it is the
-default if there is a local character set dictionary _local-.txt
-for the relevant character height.
-
-With --edit-charset, when the OCR finds characters it does not
-understand, it will put up an OCR resolution query window. This will
-display the part of the text it is having trouble with, showing where
-it has got to, and allow you to edit the character set dictionary it
-uses for recognising the text.
-
-The process is subtle and it is important to understand the way the
-machinery works, and the possible mistakes you can make, before
-answering the program. So *Please read this documentation*, which
-explains the meaning of the entries you make.
-
-The character set updates you make will by default be submitted to my
-server so that they can be checked by me and shared with other users.
-See README.privacy.
-
-If you need help please ask me (ijackson@chiark.greenend.org.uk, or
-Aristarchus on Midnight in game if I'm on line, or ask any pirate of
-the crew Special Circumstances if they happen to know where I am
-and/or can get in touch).
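
The matching rules described in the deleted README (longest match wins;
after an inter-word gap, try Word first, then compare Upper and Lower)
can be sketched as follows. This is only an illustration in Python and
is not taken from the program itself: the function names, and the
encoding of a cell as a string with one character per pixel column,
are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical encoding: a table cell is a string with one character per
# pixel column, and a dictionary maps a column pattern to recognised text.
Glyphs = Dict[str, str]
Match = Optional[Tuple[str, int]]  # (recognised text, width in columns)


def longest_match(glyphs: Glyphs, columns: str, pos: int) -> Match:
    """Return the widest glyph that matches exactly at pos, or None."""
    best: Match = None
    for pattern, text in glyphs.items():
        if columns.startswith(pattern, pos):
            if best is None or len(pattern) > best[1]:
                best = (text, len(pattern))
    return best


def match_after_gap(word: Glyphs, upper: Glyphs, lower: Glyphs,
                    columns: str, pos: int) -> Match:
    """Resolve the context after an inter-word gap.

    A Word entry is used if one matches.  Otherwise Upper and Lower are
    both tried; the result is used only if exactly one matches or one is
    wider.  None means the ambiguity remains and the user must be asked.
    """
    hit = longest_match(word, columns, pos)
    if hit is not None:
        return hit
    up = longest_match(upper, columns, pos)
    lo = longest_match(lower, columns, pos)
    if up is None or lo is None:
        return up if lo is None else lo
    if up[1] == lo[1]:
        return None  # same width in both contexts: still ambiguous
    return up if up[1] > lo[1] else lo


if __name__ == "__main__":
    # `v' (pattern "AB") is a strict prefix of `w' (pattern "ABAB").
    lower = {"AB": "v", "ABAB": "w"}
    print(longest_match(lower, "ABAB", 0))
    # A vertical stick "S" could be `I' or `l'; a Word entry settles it.
    print(match_after_gap({"SRON": "Iron"}, {"S": "I"}, {"S": "l"}, "SRON", 0))
```

In this sketch a tie between the Upper and Lower dictionaries returns
None, which corresponds to the case where the README says the program
must ask. Adding a Word entry such as `Iron' removes the tie without
ever mapping the bare vertical stick to `l' or `I'.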