Handling OCR failures
---------------------

Sometimes the OCR will not be able to recognise some text.  By
default, when this happens, the program will stop with a fatal error
and refer you to this document.

It is possible to fix this by editing the character set dictionary
used by the OCR algorithm.  But it is important to get these inputs
right, or your client may misrecognise text in future.  You *must*
read the documentation here first.


Recognition algorithm
---------------------

We recognise the text in the commodity screen by doing exact matching
of `glyph' images against the image in each cell in the commodity
table.  We match from left to right.  We do not insist that each
glyph is followed by whitespace, nor do we insist that glyphs do not
contain whitespace.

Our glyph dictionary can contain entries which are strict prefixes of
other entries - that is, a glyph for (say) `v' which is the leftmost
part of another glyph for (say) `w'.  We resolve these ambiguities by
taking the longest (widest) glyph which matches.

So you should not be surprised if the program has matched the
left-hand half of some letter and thinks it is a different letter.
If the part that it did recognise does look like the letter in
question, that isn't wrong.  All you need to do is insert the whole
of the actual letter in the dictionary: move the LH cursor to the
start of the letter and the RH cursor to its end, hit `Return', and
enter the correct character.  The longest match rule means the
program will then prefer the entry you have just made.


Matching context - Upper/Lower/Digit/Word dictionaries
------------------------------------------------------

We maintain separate dictionaries for the following types of glyph:

  Upper:   Upper case letters and ligatures starting with an
           uppercase letter.  Punctuation excluding `>'.

  Lower:   Lower case letters and ligatures starting with a
           lowercase letter.
  Digit:   Digits and the greater than sign `>' (which can also
           appear in the quantity field in the commodity display).

  Word:    Words (or unambiguous initial chunks of words) starting
           with `l' or `I' - see the note below.

When you add an entry, you should add it to the appropriate
dictionary for its matching context.  You can do this by selecting
the appropriate radiobutton, or by pressing one of the letters
U D L W (the initial letters of the contexts), after moving the
cursor to the appropriate spot but before hitting `Return' to enter
the text for the new entry.


Note regarding `l' and `I'
--------------------------

At the beginning of each cell in the table, we expect uppercase; in
the middle of a word we expect lowercase; and, unfortunately, after
an inter-word gap, we are not sure.

This is troublesome because `l' and `I' look identical on the screen.
So any time we see an unfamiliar word starting with `l' or `I', the
program has to ask about it.

After an inter-word gap, we first search for a Word entry in the
dictionary.  If there is a match we use it.  Otherwise we search both
the uppercase and lowercase dictionaries; if one matches and the
other doesn't, or one matches a wider character than the other, we
use it.  If that fails to resolve the ambiguity we must ask.

Don't try to make an entry in the character set dictionary mapping
`vertical stick' to `l' or `I'.  Instead, select the whole word (or
enough of it that no different word would start with the other
letter), and enter the whole thing as a new glyph in the Word
dictionary.

For example, in the supplied dictionary there is already a glyph for
`Iron'; this is OK because there are no words which start `lron'.


Short inter-word gaps
---------------------

It can happen that the problem you are being asked about is caused by
the program failing to spot an inter-word gap and mistakenly
concluding that the next word is necessarily in lowercase, so that it
fails to recognise an uppercase letter.
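
As a rough illustration of the lookup order used after an inter-word
gap (Word first, then Upper against Lower, then ask), here is a
minimal Python sketch.  It is not the program's actual code.  The
function names, and the representation of a glyph as a string with
one character per pixel column, are assumptions made up for this
example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical layout: column pattern -> recognised text.
Glyphs = Dict[str, str]
Match = Optional[Tuple[int, str]]  # (width in columns, text), or None


def longest_match(glyphs: Glyphs, columns: str) -> Match:
    """Return the widest glyph that matches the start of `columns`."""
    best: Match = None
    for pattern, text in glyphs.items():
        if columns.startswith(pattern):
            if best is None or len(pattern) > best[0]:
                best = (len(pattern), text)
    return best


def match_after_gap(word: Glyphs, upper: Glyphs, lower: Glyphs,
                    columns: str) -> Match:
    """Resolve one glyph after an inter-word gap.

    Returning None stands for "cannot decide, ask the user".
    """
    hit = longest_match(word, columns)
    if hit is not None:
        return hit                      # a Word entry always wins
    up = longest_match(upper, columns)
    lo = longest_match(lower, columns)
    if up is None or lo is None:
        return up if up is not None else lo   # only one side matched
    if up[0] != lo[0]:
        return up if up[0] > lo[0] else lo    # the wider match wins
    return None                         # same width: still ambiguous


# `|' stands for the vertical stick that could be `l' or `I'.
WORD = {"|ron": "Iron"}
UPPER = {"|": "I"}
LOWER = {"|": "l"}

print(match_after_gap(WORD, UPPER, LOWER, "|ron"))  # (4, 'Iron')
print(match_after_gap(WORD, UPPER, LOWER, "|xyz"))  # None
```

The second call returns None because Upper and Lower both match the
stick at the same width, which is the case where the program has to
ask.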

The context in which each glyph was recognised is shown on the
screen, underneath the text which shows what it was recognised as.
*You should check the alleged context before entering a character*.
If it is wrong, you should fix it, rather than just making an entry
in the wrong dictionary.

When this happens, instead make a new glyph for the last letter of
the previous word plus the (unusually narrow) inter-word space, and
end that entry with a literal space ` '.

For example, you might find that `yG' is treated as `y' and the `G'
doesn't get matched.  Select the `y' region of the bitmap, including
the narrow gap after it, and type `y ' into the string box.


Overlapping characters - ligatures
----------------------------------

Some of the characters in the font used overlap with the next
character.  When this happens, select both the characters and enter
them together as one glyph with a multi-character definition, as a
new entry in the Lower or Upper dictionary.

For example, `yw' is rendered with the top right corner of the `y'
and the top left corner of the `w' overlapping.  This is dealt with
by matching the whole merged thing: select the region of the screen
containing `yw' and define it as `yw'.

Such a combined entry - a ligature - is only needed if the letters
cannot be separated at all.  It's not needed if they merely abut.


Fixing mistakes
---------------

The OCR query UI allows you to delete things from the local glyph
dictionary.  However, you are not guaranteed to get an OCR query at
all, and it is not possible to override the presence of an entry in
the master database with the absence of one in the local database.
So this is not a reliable way to fix errors.

If you think you have made mistakes answering OCR queries (for
example, the recognised data is wrong), you should delete the file
_local-char*.txt, which contains your local updates.
The client will then use only the centrally provided (and vetted)
master file (which is, by default, automatically updated when you
run the yarrg client).

It is also possible to have the OCR system reject particular strings.
If you put a regexp in _local-reject.txt, any OCR result which
matches that regexp will instead cause an OCR failure, invoking the
OCR dictionary editor if appropriate.  _master-reject.txt is the
centrally maintained version of this file.

Alternatively, rather than deleting _local-char*.txt, you can edit it
with a text editor.  The format is not documented at the moment.


Enabling interactive character set update
-----------------------------------------

Now that you have read this document, you should rerun your OCR job
with the --edit-charset option.  So run
  ./yarrg --edit-charset

In future, this option is not usually needed, because it is the
default if there is a local character set dictionary _local-.txt for
the relevant character height.

With --edit-charset, when the OCR finds characters it does not
understand, it will put up an OCR resolution query window.  This will
display the part of the text it is having trouble with, showing where
it has got to, and allow you to edit the character set dictionary it
uses for recognising the text.

The process is subtle and it is important to understand the way the
machinery works, and the possible mistakes you can make, before
answering the program.  So *please read this documentation*, which
explains the meaning of the entries you make.

The character set updates you make will by default be submitted to my
server so that they can be checked by me and shared with other users.
See README.privacy.

If you need help please ask me (ijackson@chiark.greenend.org.uk, or
Aristarchus on Midnight in game if I'm on line, or ask any pirate of
the crew Special Circumstances if they happen to know where I am
and/or can get in touch).
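

Appendix - sketch of the longest match rule
-------------------------------------------

The longest match rule from `Recognition algorithm' is easier to see
in a small example.  The Python sketch below is only an illustration
under simplifying assumptions, not the program's real code: the
function name `recognise', the use of LookupError for an OCR failure,
and the representation of a glyph as a string with one character per
pixel column are all invented here.

```python
from typing import Dict


def recognise(glyphs: Dict[str, str], columns: str) -> str:
    """Scan left to right, always taking the widest matching glyph.

    `glyphs' maps a (non-empty) column pattern to its text.
    """
    out = []
    pos = 0
    while pos < len(columns):
        candidates = [p for p in glyphs if columns.startswith(p, pos)]
        if not candidates:
            # Stands in for the OCR failure described in this document.
            raise LookupError(f"no glyph matches at column {pos}")
        widest = max(candidates, key=len)
        out.append(glyphs[widest])
        pos += len(widest)
    return "".join(out)


# The pattern for `v' is the left-hand half of the pattern for `w'.
before = {"ab": "v", "c": "i"}
after = {"ab": "v", "abab": "w", "c": "i"}

print(recognise(before, "ababc"))  # vvi
print(recognise(after, "ababc"))   # wi
```

With only the `v' entry, the `w' on screen is read as two `v's.
Adding the wider `w' entry fixes this without touching the `v'
entry, because the wider glyph is preferred wherever it matches.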