pctb/README.charset

Character set query tool, and semantics of the glyphs
-----------------------------------------------------

Sometimes the OCR will not be able to recognise some text and you will
have to help it out.  It will display the part it is having trouble
with, showing where it has got to, and allow you to edit the character
set database it uses for recognising the text.

*This is subtle* and it is important to understand the way the
machinery works, and the possible mistakes you can make, before
answering the program.  *Please read this documentation.*

If you need help please ask me (ijackson@chiark.greenend.org.uk, or
Aristarchus on Midnight in game if I'm on line, or ask any pirate of
the crew Special Circumstances if they happen to know where I am
and/or can get in touch).


Recognition algorithm
---------------------

We recognise the text in the commodity screen by doing exact matching
of `glyph' bitmaps, against the bitmap in each cell in the commodity
table.  We match from left to right.

We do not insist that each glyph is followed by whitespace, and nor do
we insist that glyphs do not contain whitespace.  Our glyph database
can contain entries which are strict prefixes of other entries - that
is, a glyph for (say) `v' which is the leftmost part of another glyph
for (say) `w'.  We resolve these ambiguities by taking the longest
(widest) glyph which matches.

So you should not be surprised if the program has matched the
left-hand half of some letter and thinks it is a different letter.  If
the part that it did recognise does look like the letter in question,
that isn't wrong.  All you need to do is insert the whole of the
actual letter in the database - move the LH cursor to the start of the
letter, and the RH cursor to its end, and hit `return' and enter the
correct character.  The longest match rule will mean it will prefer
the entry you have just made.
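The longest-match rule described in this section can be sketched
roughly as follows.  This is only an illustration, not the program's
actual code: the representation (one integer per pixel column, a
dictionary keyed by tuples of columns), the name `recognise', and the
skipping of blank columns before a glyph are all assumptions made for
the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a glyph is a tuple of pixel columns,
# each column an int whose bits are the pixels in that column.
Columns = Tuple[int, ...]


def recognise(cell: List[int], glyphs: Dict[Columns, str]) -> Optional[str]:
    """Match a cell left to right, always taking the widest glyph.

    Returns the recognised text, or None if some part of the cell
    matches no glyph (the point at which a user would be queried).
    """
    out = []
    pos = 0
    max_width = max((len(g) for g in glyphs), default=0)
    while pos < len(cell):
        if cell[pos] == 0:
            # Assumption: blank columns before a glyph are skipped.
            pos += 1
            continue
        # Try the widest candidate first, so a longer entry wins
        # over any entry that is merely a prefix of it.
        for width in range(min(max_width, len(cell) - pos), 0, -1):
            key = tuple(cell[pos:pos + width])
            if key in glyphs:
                out.append(glyphs[key])
                pos += width
                break
        else:
            return None  # nothing in the database matches here
    return "".join(out)


# Made-up bitmaps in which `v' is a strict prefix of `w'.
V = (1, 2, 1)
W = (1, 2, 1, 2, 1)
```

With only V in the database, a cell containing the W columns gets its
left-hand part read as `v' and the rest is left unrecognised; once W
is added as well, the whole thing is read as `w'.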


Upper vs lower case - important note regarding `l' and `I'
----------------------------------------------------------

We maintain separate databases for upper and lower case.  At the
beginning of each cell in the table, we expect uppercase; in the
middle of a word we expect lowercase; and, unfortunately, after an
inter-word gap, we are not sure.

This is troublesome because `l' and `I' look identical on the screen.
So any time we see a word starting with `l' or `I', the program has to
ask about it.
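As a rough illustration of the context rule above, the following
sketch labels each character of an already-known cell text with the
case that would be expected at that point.  The names `Context' and
`contexts' are invented for this example, and the real program works
on bitmaps rather than on text.

```python
from enum import Enum
from typing import List, Tuple


class Context(Enum):
    UPPER = "upper"    # start of a cell
    LOWER = "lower"    # middle of a word
    EITHER = "either"  # just after an inter-word gap


def contexts(cell_text: str) -> List[Tuple[str, Context]]:
    """Pair each non-space character with the context it is seen in."""
    result = []
    at_start = True
    after_gap = False
    for ch in cell_text:
        if ch == " ":
            # A gap only counts once something has been read.
            after_gap = not at_start
            continue
        if at_start:
            ctx = Context.UPPER
        elif after_gap:
            ctx = Context.EITHER
        else:
            ctx = Context.LOWER
        result.append((ch, ctx))
        at_start = False
        after_gap = False
    return result
```

For a cell reading `Iron ore', the `I' is in the upper context, the
`o' of `ore' is in the either context, and every other letter is in
the lower context.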

*Do not* make an entry in the character set database mapping `vertical
stick' to `l' or `I'.  Instead, select enough of the whole word in
question that no word would start with the other letter, and enter the
whole word or part of it as a new glyph.

For example, in the supplied database there is already a glyph for
`Iron'; this is OK because there are no words which start `lron'.

Do not make an entry for a string more than 7 characters long;
currently we cannot cope (and you'll have to remove it manually from
the charset-15.txt file).


Short inter-word gaps
---------------------

It can happen that the problem you are being asked about is caused by
the program failing to spot an inter-word gap: it then mistakenly
thinks that the next word is necessarily in lowercase, and so fails to
recognise an uppercase letter.  The context in which each glyph was
recognised is shown on the screen, underneath the text which shows
what it was recognised as.

*You should check the alleged context before entering a character*.
If it is wrong, you should fix it, rather than just making an entry
for the uppercase letter in the lowercase database.

To fix it, make a new glyph for the last letter of the previous word
plus the (unusually narrow) inter-word space, and end that entry with
\x20 (yes, type \ x 20).

For example, you might find that `y<space>G' is treated as
`y<??lowercase>' and the G doesn't get matched.  Select the `y<space>'
region of the bitmap and type `y\x20' into the string box.
Sorry for this rather poor UI!
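To make the `\x20' notation concrete: 20 is the hexadecimal code of
the ASCII space character, so `y\x20' stands for `y' followed by a
space.  The sketch below decodes such a string.  The function name
`decode_entry' is invented for this example, and how the program
itself parses the string box is not described here, so treat this as
an illustration of the notation only.

```python
import re


def decode_entry(text: str) -> str:
    """Replace each \\xNN escape with the character it denotes."""
    return re.sub(
        r"\\x([0-9A-Fa-f]{2})",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )
```

For instance, decode_entry(r"y\x20") gives the two-character string
consisting of `y' and a space.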


Overlapping characters - ligatures
----------------------------------

Some of the characters in the font used overlap with the next
character.  When this happens, select both the characters and enter
them together as one glyph with a multi-character definition.

For example `yw' is rendered with the top right corner of the `y' and
the top left corner of the `w' overlapping.  This is dealt with by
matching the whole merged thing - select the region of the screen
containing `yw' and define it as `yw'.
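The reason a merged glyph is needed can be seen in a small sketch.
It assumes, purely for illustration, that a glyph is a tuple of pixel
columns (one integer per column) and that overlapping characters
share one or more screen columns; the bitmaps and the name `overlap'
are made up and are not taken from the real font or program.

```python
from typing import Tuple

Columns = Tuple[int, ...]


def overlap(left: Columns, right: Columns, shared: int) -> Columns:
    """Render two glyphs so that the last `shared` columns of `left`
    occupy the same screen columns as the first `shared` columns of
    `right`; pixels from both are combined with bitwise OR."""
    if shared == 0:
        return left + right
    merged = tuple(a | b for a, b in zip(left[-shared:], right[:shared]))
    return left[:-shared] + merged + right[shared:]


# Made-up bitmaps standing in for `y' and `w'.
Y = (1, 2, 12)
W = (3, 4, 2, 4, 3)
YW = overlap(Y, W, 1)
```

Here the shared column of YW differs from the last column of Y, so an
exact match against the separate `y' glyph fails at that column; only
an entry for the whole merged bitmap, defined as `yw', can match.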


Fixing mistakes
---------------

The OCR query UI allows you to delete things from the glyph database.
However, since you are not guaranteed to actually get an OCR query at
all if the database contains errors, you shouldn't rely on this.

If you think you have made mistakes answering OCR queries (for
example, the recognised data is wrong), you should download a fresh
copy of charset-15.txt from
 http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt


Send me your updates
--------------------

The character set is in the file `charset-15.txt'.  When you enter new
characters, they are added there.  If you do this, please email me
your charset file (ijackson@chiark.greenend.org.uk) so that I can
include your contributions in future versions.  This will also let me
check that they seem right :-).