Handling OCR failures
---------------------

Sometimes the OCR will not be able to recognise some text.  By
default, when this happens, the program will stop with a fatal error
and refer you to this document.

It is possible to fix this by editing the character set dictionary used
by the OCR algorithm.  However, it is important to get these entries
right, or your client may misrecognise text in future.  You *must* read
the documentation here first.


Recognition algorithm
---------------------

We recognise the text in the commodity screen by exact matching of
`glyph' bitmaps against the bitmap in each cell of the commodity
table.  We match from left to right.

We do not insist that each glyph is followed by whitespace, and nor do
we insist that glyphs do not contain whitespace.  Our glyph dictionary
can contain entries which are strict prefixes of other entries - that
is, a glyph for (say) `v' which is the leftmost part of another glyph
for (say) `w'.  We resolve these ambiguities by taking the longest
(widest) glyph which matches.

So you should not be surprised if the program has matched the
left-hand half of some letter and thinks it is a different letter.  If
the part that it did recognise does look like the letter in question,
that isn't wrong.  All you need to do is insert the whole of the
actual letter in the dictionary - move the LH cursor to the start of the
letter, and the RH cursor to its end, and hit `return' and enter the
correct character.  The longest match rule will mean it will prefer
the entry you have just made.
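
The longest match rule can be sketched in a few lines of Python.  This
is an illustration only, not the program's actual code: the function
name `recognise', the encoding of a cell as one integer per pixel
column, and the toy bitmaps are all assumptions made for the example,
and the handling of whitespace between glyphs is left out.

```python
def recognise(columns, glyphs):
    """Match a cell left to right, always taking the widest matching glyph.

    columns: sequence of ints, one per pixel column (hypothetical encoding).
    glyphs:  dict mapping a tuple of column ints to the text it stands for.
    Raises LookupError at the first position where no glyph matches.
    """
    widest = max((len(g) for g in glyphs), default=0)
    text = []
    pos = 0
    while pos < len(columns):
        # Try the widest candidate first, then progressively narrower ones.
        for width in range(min(widest, len(columns) - pos), 0, -1):
            key = tuple(columns[pos:pos + width])
            if key in glyphs:
                text.append(glyphs[key])
                pos += width
                break
        else:
            raise LookupError(f"no glyph matches at column {pos}")
    return "".join(text)

# Toy data: the two-column `v' is a strict prefix of the four-column `w'.
V = (0b110, 0b001)
W = (0b110, 0b001, 0b110, 0b001)

glyphs = {V: "v"}
cell = list(W)
before = recognise(cell, glyphs)   # only `v' is known, so the `w' reads as "vv"
glyphs[W] = "w"                    # enter the whole letter in the dictionary
after = recognise(cell, glyphs)    # the wider entry is now preferred: "w"
```

Note that the narrower `v' entry does not need to be deleted; adding
the wider entry is enough.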


Upper vs lower case - important note regarding `l' and `I'
----------------------------------------------------------

We maintain separate dictionaries for upper and lower case.  At the
beginning of each cell in the table, we expect uppercase; in the
middle of a word we expect lowercase; and, unfortunately, after an
inter-word gap, we are not sure.

This is troublesome because `l' and `I' look identical on the screen.
So any time we see a word starting with `l' or `I', the program has to
ask about it.

*Do not* make an entry in the character set dictionary mapping `vertical
stick' to `l' or `I'.  Instead, select enough of the whole word in
question that no word would start with the other letter, and enter the
whole word or part of it as a new glyph.

For example, in the supplied dictionary there is already a glyph for
`Iron'; this is OK because there are no words which start `lron'.
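
To see why a bare `vertical stick' entry is harmful and a whole-word
entry is safe, here is a toy Python model.  It is a sketch under
simplifying assumptions: text is modelled as strings in which `l' and
`I' are both drawn as `|', the separate case dictionaries are collapsed
into one, and the names `render' and `read_word' are made up for the
example.

```python
def render(text):
    """Pretend screen rendering: `l' and `I' both come out as a stick."""
    return text.replace("l", "|").replace("I", "|")

def read_word(shapes, glyphs):
    """Longest-match a rendered word; None means `ask the user'."""
    out, pos = [], 0
    while pos < len(shapes):
        for end in range(len(shapes), pos, -1):   # widest candidate first
            if shapes[pos:end] in glyphs:
                out.append(glyphs[shapes[pos:end]])
                pos = end
                break
        else:
            return None
    return "".join(out)

# Ordinary letters are unambiguous and map to themselves.
base = {c: c for c in "ronime"}

# Wrong: a stick entry silently turns every leading `I' into `l'.
bad = {**base, "|": "l"}
misread = read_word(render("Iron"), bad)

# Right: a whole-word entry; no word starts `lron', so this is unambiguous.
good = {**base, "|ron": "Iron"}
iron = read_word(render("Iron"), good)
lime = read_word(render("lime"), good)   # still unknown, so the user is asked
```

With the stick entry, `Iron' comes back as `lron' without any query.
With the whole-word entry, `Iron' is read correctly and a word like
`lime' still produces a query instead of a silent misreading.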


Short inter-word gaps
---------------------

It can happen that the problem you are being asked about is caused by
the program failing to spot an inter-word gap: it mistakenly thinks
that the next word is necessarily in lowercase, and so fails to
recognise an uppercase letter.  The context in which each glyph was
recognised is shown on the screen, underneath the text which shows
what it was recognised as.

*You should check the alleged context before entering a character*.
If it is wrong, you should fix it, rather than just making an entry
for the uppercase letter in the lowercase dictionary.

Instead, make a new glyph for the last letter of the previous word
plus the (unusually narrow) inter-word space, and end that entry with
\x20 (yes, type \ x 20).

For example, you might find that `y<space>G' is treated as
`y<??lowercase>' and the G doesn't get matched.  Select the `y<space>'
region of the bitmap and type `y\x20' into the string box.
Sorry for this rather poor UI!
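
The effect of such an entry can be pictured roughly as follows.  This
Python sketch is an assumption about the idea, not a description of the
program's internals: `decode_entry', `next_context' and the context
names are hypothetical.  It shows the typed string `y\x20' becoming `y'
plus a space, and a definition ending in a space meaning that the next
glyph is no longer assumed to be lowercase.

```python
import re

def decode_entry(typed):
    r"""Turn \xHH escapes in a typed definition into the characters meant."""
    return re.sub(r"\\x([0-9A-Fa-f]{2})",
                  lambda m: chr(int(m.group(1), 16)),
                  typed)

def next_context(definition):
    """Context for the glyph that follows (names are hypothetical)."""
    return "after-gap" if definition.endswith(" ") else "lowercase"

entry = decode_entry(r"y\x20")        # what you type into the string box
plain = decode_entry("y")

context_after_entry = next_context(entry)   # "after-gap": `G' can now match
context_after_plain = next_context(plain)   # "lowercase": `G' would not match
```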


Overlapping characters - ligatures
----------------------------------

Some of the characters in the font used overlap with the next
character.  When this happens, select both the characters and enter
them together as one glyph with a multi-character definition.

For example `yw' is rendered with the top right corner of the `y' and
the top left corner of the `w' overlapping.  This is dealt with by
matching the whole merged thing - select the region of the screen
containing `yw' and define it as `yw'.
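
The reason separate entries for `y' and `w' cannot match the merged
shape can be shown with toy data.  In this Python sketch the column
values are invented, not real font data, and `overlap' is a
hypothetical helper which assumes that exactly one screen column is
shared between the two letters.

```python
def overlap(left, right):
    """Join two glyphs whose edge columns share one screen column."""
    return left[:-1] + (left[-1] | right[0],) + right[1:]

# One integer per pixel column; toy values only.
Y = (0b1001, 0b0110, 0b1000)
W = (0b0001, 0b0110, 0b1111)

merged = overlap(Y, W)                 # what appears on screen for `yw'
glyphs = {Y: "y", W: "w"}

# The shared column differs from the last column of a lone `y', so the
# existing `y' entry is not a prefix of the merged bitmap.
starts_with_y = merged[:len(Y)] == Y

# The fix: one glyph for the whole merged region, defined as two characters.
glyphs[merged] = "yw"
```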


Fixing mistakes
---------------

The OCR query UI allows you to delete things from the glyph dictionary.
However, since you are not guaranteed to actually get an OCR query at
all if the dictionary contains errors, you shouldn't rely on this.

If you think you have made mistakes answering OCR queries (for
example, the recognised data is wrong), you should download a fresh
copy of charset-15.txt from
 http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt


Enabling interactive character set update
-----------------------------------------

Now that you have read this document, you should rerun your OCR job
with the --edit-charset option.  You probably want to supply --same as
well, to avoid having to wait for it to page through and recapture all
the screenshots.  So, this time,
   ./ypp-commodities --edit-charset --same
and in future, just always run it with the --edit-charset option.

With --edit-charset, when the OCR finds characters it does not
understand, it will put up an OCR resolution query window.  This will
display the part of the text it is having trouble with, showing where
it has got to, and allow you to edit the character set dictionary it
uses for recognising the text.

*This is subtle* and it is important to understand the way the
machinery works, and the possible mistakes you can make, before
answering the program.  *Please read this documentation*, which
explains the meaning of the entries you make.

Also, the character set updates you make will by default be submitted
to my server so that they can be checked by me and shared with other
users.  See README.privacy.

If you need help please ask me (ijackson@chiark.greenend.org.uk, or
Aristarchus on Midnight in game if I'm on line, or ask any pirate of
the crew Special Circumstances if they happen to know where I am
and/or can get in touch).