Sometimes the OCR will not be able to recognise some text. By
default, when this happens, the program will stop with a fatal error
and refer you to this document.

It is possible to fix this by editing the character set dictionary used
by the OCR algorithm. However, it is important to get these entries
right, or your client may misrecognise text in future. You *must* read
the documentation here first.

We recognise the text in the commodity screen by exact matching of
`glyph' images against the image in each cell of the commodity
table. We match from left to right.

We do not insist that each glyph is followed by whitespace, nor do we
insist that glyphs contain no whitespace. Our glyph dictionary can
contain entries which are strict prefixes of other entries - that is,
a glyph for (say) `v' which is the leftmost part of another glyph for
(say) `w'. We resolve these ambiguities by taking the longest
(widest) glyph which matches.

So you should not be surprised if the program has matched the
left-hand half of some letter and thinks it is a different letter. If
the part that it did recognise does look like the letter in question,
that isn't wrong. All you need to do is insert the whole of the
actual letter in the dictionary: move the LH cursor to the start of the
letter and the RH cursor to its end, hit `return', and enter the
correct character. The longest-match rule means the program will
prefer the entry you have just made.
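
The longest-match rule can be illustrated with a small Python sketch.
This is not the program's actual code: the function names, and the
idea of representing a glyph as a tuple of pixel columns (here just
small integers), are assumptions made for the example.

```python
def longest_match(columns, pos, dictionary):
    """Return (text, width) for the widest glyph matching at pos, or None."""
    best = None
    for glyph, text in dictionary.items():
        width = len(glyph)
        # Exact comparison of the glyph against the columns starting at pos.
        if tuple(columns[pos:pos + width]) == glyph:
            if best is None or width > best[1]:
                best = (text, width)
    return best


def recognise(columns, dictionary):
    """Match glyphs from left to right; LookupError where nothing matches."""
    out = []
    pos = 0
    while pos < len(columns):
        match = longest_match(columns, pos, dictionary)
        if match is None:
            raise LookupError(f"unrecognised glyph at column {pos}")
        text, width = match
        out.append(text)
        pos += width
    return "".join(out)


# Hypothetical toy dictionary: the `v' glyph is a strict prefix of `w'.
GLYPHS = {
    (1, 2, 1): "v",
    (1, 2, 1, 2, 1): "w",
    (7,): "i",
}

print(recognise([1, 2, 1, 2, 1, 7], GLYPHS))  # the wider `w' entry wins
print(recognise([1, 2, 1, 7], GLYPHS))        # only `v' fits here
```

If the `w' entry were missing, the sketch would match `v' and then
fail on the remaining columns, which is the situation described
above: the fix is to add the whole of the wider glyph.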


Matching context - Upper/Lower/Digit/Word dictionaries
------------------------------------------------------

We maintain separate dictionaries for the following types of glyph:

 Upper     Upper case letters and ligatures starting with an
           uppercase letter. Punctuation excluding `>'.

 Lower     Lower case letters and ligatures starting with a
           lowercase letter.

 Digit     Digits and the greater than sign `>' (which can also
           appear in the quantity field in the commodity display)

 Word      Words (or unambiguous initial chunks of words) starting with
           `l' or `I' - see the note, below.

When you add an entry, you should add it to the appropriate dictionary
for its matching context. You can do this by selecting the
appropriate radiobutton or by pressing one of the letters U, D, L or W
(the initial letters of the contexts) after moving the cursor to the
appropriate spot but before hitting `Return' to enter the text for the
new glyph.


Note regarding `l' and `I'
--------------------------

At the beginning of each cell in the table, we expect uppercase; in
the middle of a word we expect lowercase; and, unfortunately, after an
inter-word gap, we are not sure.

This is troublesome because `l' and `I' look identical on the screen.
So any time we see an unfamiliar word starting with `l' or `I', the
program has to ask about it.

After an inter-word gap, we first search for a Word entry in the
dictionary. If there is a match we use it. Otherwise we search both
the uppercase and lowercase dictionaries; if one matches and the other
doesn't, or one matches a wider glyph than the other, we use that
match. If that fails to resolve the ambiguity we must ask.
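
The lookup order after an inter-word gap can be sketched in Python as
follows. This is only an illustration of the rule just described: the
names, the tuple-of-columns glyph representation and the toy
dictionaries are all assumptions, not the program's real data
structures.

```python
def widest(columns, pos, dictionary):
    """Return (text, width) for the widest glyph matching at pos, or None."""
    best = None
    for glyph, text in dictionary.items():
        if tuple(columns[pos:pos + len(glyph)]) == glyph:
            if best is None or len(glyph) > best[1]:
                best = (text, len(glyph))
    return best


def match_after_gap(columns, pos, word, upper, lower):
    """Return (text, width), or None if the user has to be asked."""
    # 1. A Word entry takes priority.
    hit = widest(columns, pos, word)
    if hit is not None:
        return hit
    # 2. Otherwise consult both the Upper and Lower dictionaries.
    up = widest(columns, pos, upper)
    lo = widest(columns, pos, lower)
    if up is None or lo is None:
        return up or lo          # only one matched (or neither did)
    if up[1] != lo[1]:
        return up if up[1] > lo[1] else lo   # the wider match wins
    # 3. Both matched with the same width: still ambiguous.
    return None


# Hypothetical toy data; (9,) stands for the `vertical stick'.
WORD = {(9, 4, 5, 6): "Iron"}
UPPER = {(9,): "I", (3,): "G"}
LOWER = {(9,): "l", (8,): "a"}

print(match_after_gap([9, 4, 5, 6], 0, WORD, UPPER, LOWER))  # Word entry
print(match_after_gap([9, 8], 0, WORD, UPPER, LOWER))        # None: must ask
```

In the second call the stick matches both `I' and `l' with the same
width, so the sketch gives up, which corresponds to the program asking
you about the word.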

Don't try to make an entry in the character set dictionary mapping
`vertical stick' to `l' or `I'. Instead, select the whole word (or
enough of it that no different word would start with the other
letter), and enter the whole thing as a new glyph in the Word
dictionary.

For example, in the supplied dictionary there is already a glyph for
`Iron'; this is OK because there are no words which start `lron'.

It can happen that the problem you are being asked about is caused by
the program failing to spot an inter-word gap and mistakenly thinking
that the next word is necessarily in lowercase, so that it fails to
recognise an uppercase letter. The context in which each glyph was
recognised is shown on the screen, underneath the text which shows
what it was recognised as.

*You should check the alleged context before entering a character*.
If it is wrong, you should fix it, rather than just making an entry
in the wrong dictionary.

When the context is wrong because of a missed gap, make a new glyph
for the last letter of the previous word plus the (unusually narrow)
inter-word space, and end that entry with a literal space ` '.

For example, you might find that `y<space>G' is treated as
`y<??lowercase>' and the G doesn't get matched. Select the `y<space>'
region of the bitmap and type `y ' into the string box.
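
The effect of such an entry can be sketched in Python. The sketch
assumes that a glyph whose definition ends in a space puts the matcher
back into inter-word context; that assumption, the names, the toy
column values and the simplified handling of inter-word context (try
Upper, then Lower) are all illustrative rather than taken from the
program.

```python
def widest(columns, pos, dictionary):
    """Return (text, width) for the widest glyph matching at pos, or None."""
    best = None
    for glyph, text in dictionary.items():
        if tuple(columns[pos:pos + len(glyph)]) == glyph:
            if best is None or len(glyph) > best[1]:
                best = (text, len(glyph))
    return best


def recognise_cell(columns, upper, lower):
    """Recognise one table cell, tracking the matching context."""
    out, pos, context = [], 0, "upper"   # a cell starts in Upper context
    while pos < len(columns):
        if context == "upper":
            hit = widest(columns, pos, upper)
        elif context == "lower":
            hit = widest(columns, pos, lower)
        else:
            # Inter-word context, simplified: try Upper, then Lower.
            hit = widest(columns, pos, upper) or widest(columns, pos, lower)
        if hit is None:
            raise LookupError(f"unrecognised at column {pos} ({context})")
        text, width = hit
        out.append(text)
        pos += width
        # Assumption: a definition ending in ` ' marks an inter-word gap.
        context = "interword" if text.endswith(" ") else "lower"
    return "".join(out)


# Hypothetical toy data: 0 is the unusually narrow space column.
UPPER = {(1,): "M", (3, 4): "G"}
LOWER = {(5, 6): "y"}
LOWER_FIXED = {**LOWER, (5, 6, 0): "y "}   # the new `y<space>' entry

CELL = [1, 5, 6, 0, 3, 4]                  # "My G" with a narrow gap
print(recognise_cell(CELL, UPPER, LOWER_FIXED))
```

With the original LOWER dictionary the sketch stops at the narrow
space in lowercase context; with the `y ' entry added, the longest
match consumes the space and the following `G' is looked up as an
uppercase letter.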


Overlapping characters - ligatures
----------------------------------

Some of the characters in the font used overlap with the next
character. When this happens, select both the characters and enter
them together as one glyph with a multi-character definition, as a new
entry in the Lower or Upper dictionary.

For example `yw' is rendered with the top right corner of the `y' and
the top left corner of the `w' overlapping. This is dealt with by
matching the whole merged thing - select the region of the screen
containing `yw' and define it as `yw'.

Such a combined entry - a ligature - is only needed if the letters
cannot be separated at all. It's not needed if they merely abut.

The OCR query UI allows you to delete things from the local glyph
dictionary. However, you are not guaranteed to get an OCR query at
all, and it is not possible to override the presence of an entry in
the master database with the absence of one in the local database.
So this is not a reliable way to fix mistakes.

If you think you have made mistakes answering OCR queries (for
example, the recognised data is wrong), you should delete the file
_local-char*.txt, which contains your local updates. The program will
then use only the centrally provided (and vetted) master file (which
is automatically updated when you run the yarrg client, by default).

It is also possible to have the OCR system reject particular strings.
If you put a regexp in _local-reject.txt, any OCR result which
matches this regexp will instead cause an OCR failure, invoking the
OCR dictionary editor if appropriate. _master-reject.txt is the
centrally maintained version of this file.
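
The idea of rejecting results by regexp can be sketched in Python.
The layout of the reject file is not described in this document, so
the sketch assumes one regexp per line and assumes that a match
anywhere in the result counts; the function names and the example
pattern are hypothetical, and the program's own regexp dialect may
differ from Python's.

```python
import re


def load_reject_patterns(lines):
    """Compile one regexp per non-blank line (assumed file layout)."""
    return [re.compile(line.strip()) for line in lines if line.strip()]


def check_result(text, patterns):
    """Return text unchanged, or raise ValueError if a pattern rejects it."""
    for pattern in patterns:
        if pattern.search(text):
            raise ValueError(f"OCR result {text!r} rejected by {pattern.pattern!r}")
    return text


# Hypothetical reject list: `lron' is never a real word, so an OCR
# result containing it must be a misread `Iron'.
PATTERNS = load_reject_patterns([r"\blron\b", ""])

print(check_result("Iron", PATTERNS))
```

Calling check_result("lron", PATTERNS) raises ValueError in this
sketch, standing in for the OCR failure that would bring up the
dictionary editor.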

Alternatively you can edit _local-char*.txt with a text editor. The
format is not documented at the moment.


Enabling interactive character set update
-----------------------------------------

Now that you have read this document, you should rerun your OCR job
with the --edit-charset option. So run
   ./yarrg --edit-charset
In future, this option is not usually needed, because it is the
default if there is a local character set dictionary _local-<h>.txt
for the relevant character height.

With --edit-charset, when the OCR finds characters it does not
understand, it will put up an OCR resolution query window. This will
display the part of the text it is having trouble with, showing where
it has got to, and allow you to edit the character set dictionary it
uses for recognising the text.

The process is subtle and it is important to understand the way the
machinery works, and the possible mistakes you can make, before
answering the program. So *please read this documentation*, which
explains the meaning of the entries you make.

The character set updates you make will by default be submitted to my
server so that they can be checked by me and shared with other users.

If you need help please ask me (ijackson@chiark.greenend.org.uk, or
Aristarchus on Cerulean in game if I'm online, or ask any pirate of
the crew Special Circumstances if they happen to know where I am
and/or can get in touch).