From: Ian Jackson
Date: Mon, 8 Jun 2009 13:20:58 +0000 (+0100)
Subject: Much expanded and improved pctb client docs
X-Git-Tag: 1.9.2~130
X-Git-Url: http://www.chiark.greenend.org.uk/ucgi/~yarrgweb/git?p=ypp-sc-tools.db-test.git;a=commitdiff_plain;h=2337ae5465a29659b44037dcbdaf6fa03eb46d84

Much expanded and improved pctb client docs
---

diff --git a/pctb/README b/pctb/README
index 5f1cea2..0d83e39 100644
--- a/pctb/README
+++ b/pctb/README
@@ -1,3 +1,6 @@
+Overview
+--------
+
 This tool can:
  - screenscrape the commodities trading screen
  - produce the results as a tab separated values file
@@ -6,35 +9,153 @@ This tool can:
 To run it, change to this directory, type `make', and then:
   ./ypp-commodities --tsv >commods.tsv
 
+While it is capturing the screenshots, do not move the mouse or use
+the keyboard. Keyboard focus must stay in the YPP client window.
+
+*IMPORTANT*
 It may put up a window asking about characters it does not understand.
-It is important to get these inputs right or it may misrecognise
-things in future. **TODO** write actual useful instructuions to cover the
-subtleties. The results are stored in the file `charset-15.txt'.
-
-If you need to report a bug, please be sure to remember the exact
-error message and circumstances. Also, for recognition problems there
-will probably be a very useful screenshot file called `#pages#.pnm'.
-This is likely to be very large so don't just email it to me, but if
-you can put it up on a webpage for me to download that will help.
-
-Options available:
-
- Setting the operation mode:
-   --find-window-only       Just check that we can find the YPP client window.
-   --screenshot-only        Page through and take screenshots, do not OCR
-   --analyse-only | --same  Process previously taken screenshots
-   --everything (default)   Take screenshots and process them
-
- Options to vary the processing:
-   --single-page            One screenful, no paging - results will be incomplete
-   --quiet                  Suppress progress messages
-   --screenshot-file F      Store or read screenshots in F rather than #pages#.pnm
-   --window-id ID           Specified X window is the YPP client - do not search
-
- Setting the output processing:
-   --raw-tsv                Dump the raw not deduped unsorted OCR'd data
-   --upload (default)       Upload to the PCTB server
-   --tsv                    Print data as clean tab-separated-values file
-   --best-prices            Print best buy and sell price for each commodity
-   --arbitrage              Print arbitrage opportunityes
+It is important to get these inputs right or your client may
+misrecognise text in future. You *must* read the documentation in
+README.charset before answering these questions.
+
+
+Command-line options
+--------------------
+
+Setting the operation mode:
+  --find-window-only       Just check that we can find the YPP client window.
+  --screenshot-only        Page through and take screenshots, do not OCR
+  --analyse-only | --same  Process previously taken screenshots
+  --everything (default)   Take screenshots and process them
+
+Options to vary the processing:
+  --single-page            One screenful, no paging - results will be incomplete
+  --quiet                  Suppress progress messages
+  --screenshot-file F      Store or read screenshots in F rather than #pages#.pnm
+  --window-id ID           Specified X window is the YPP client - do not search
+
+Controlling what happens to the results:
+  --upload (default)       Upload to the PCTB server
+  --tsv                    Print data as clean tab-separated-values file
+  --raw-tsv                Dump the raw (not deduped, unsorted) OCR'd data
+  --best-prices            Print best buy and sell price for each commodity
+  --arbitrage              Print arbitrage opportunities
+
+
+Files we use and update
+-----------------------
+
+The program reads and writes the following files:
+
+ * #pages#.pnm
+
+   Contains one or more images (as raw ppms, end-to-end) which are the
+   screenshots taken in the last run. This is (over)written whenever
+   we take screenshots from the YPP client. You can reprocess an
+   existing set of screenshots with the --same (aka --analyse-only)
+   option; in that case we just read the screenshots file.
+
+   You can specify a different file with --screenshot-file.
+
+   If you want to display the contents of this file, `display' can do
+   it. Don't try `display vid:#pages#.pnm' as this will consume
+   truly stupendous quantities of RAM - it wedged my laptop.
+
+ * charset-15.txt
+
+   Character set database. For the semantics of the contents of this
+   file see README.charset. There is not currently any accurate
+   documentation of this database format.
+
+   If you delete this file you'll have to re-enter a lot of glyph data
+   (and probably get it wrong and make the program misrecognise
+   things). If you want to undo any mistakes you may have made
+   answering OCR questions you can safely revert this to the version
+   I've supplied.
+
+ * #commodmap#.tsv
+
+   Map from commodity names to the numbers required by the PCTB
+   server. This is fetched and updated automatically as necessary.
+   It can safely be deleted as it will then be refetched.
+
+ * .new
+
+   When any of these tools overwrite one of the persistent database
+   files, they temporarily write to .new.
+
+These files are all in the current working directory. There is not
+yet any feature to have them be somewhere else. The helper programs
+  yppsc-ocr-resolver
+  yppsc-commod-processor
+must (currently) also be in the current directory.
+
+Future versions may have more helpers and more data files.
+
+
+Installation requirements
+-------------------------
+
+This program has quite a few dependencies:
+                                                       Package (Debian etch)
+
+ - For building, C compiler and build environment      build-essential
+ - pnm library, including dev files for building       libnetpbm10-dev
+ - pnm command line utilities for image manipulation   netpbm
+ - X11 libraries, including dev files for building     libx11-dev
+ - XTEST library, including dev files for building     libxtst-dev
+ - Tk interpreter /usr/bin/wish                        tk8.4
+ - Perl module XML::Parser                             libxml-parser-perl
+ - Perl module JSON::Parser                            libjson-perl
+ - XTEST extension in the X server                     (part of X package)
+ - Perl interpreter and basic modules                  perl (usu.installed)
+
+On other Linux distros the packages may have different names, but
+these should be roughly right for Debian and its derivatives.
+
+
+Reporting problems
+------------------
+
+If you need to report a bug, for example an inability to recognise,
+please be sure to remember the exact error message and circumstances.
+Also, for recognition problems there will probably be a very useful
+screenshot file called `#pages#.pnm'. This is likely to be very large
+so don't just email it to me, but if you can put it up on a webpage
+for me to download that will help. At least keep a copy of it.
+
+If the problem is a failure to cope with some particular YPP client
+display and is reproducible, try running:
+  ./ypp-commodities --raw-tsv --single-page
+If this reproduces the problem, please email me the screenshot file
+#pages#.pnm, which will consist only of the single screen, plus the
+error message. I'll then be able to understand what's wrong,
+hopefully.
+
+
+Phoning home - privacy
+----------------------
+
+The main purpose of this program is to connect to the PCTB server and
+upload data. The program does not currently phone home at all in
+modes other than --upload, and when it does it connects to the
+PCTB server not to a system of mine.
+
+However, there are some improvements which I may introduce in the
+future which may change this. I am considering:
+
+ * Having the ocr character resolver talk to a server run by me
+   to look for missing glyphs, and/or upload those glyphs back
+   to that server so that they can be shared.
+
+ * Having the upload client upload a copy of the data to a server run
+   by me, when run in --upload mode.
+
+If I do do this these new functions may be enabled by default, but it
+will be possible to turn them off, or direct them to different
+servers, with command-line options, and they will be documented here.
+
+ - Ian Jackson
+   ijackson@chiark.greenend.org.uk
+   Aristarchus on the Midnight ocean
diff --git a/pctb/README.charset b/pctb/README.charset
new file mode 100644
index 0000000..bbabb05
--- /dev/null
+++ b/pctb/README.charset
@@ -0,0 +1,125 @@
+Character set query tool, and semantics of the glyphs
+-----------------------------------------------------
+
+Sometimes the OCR will not be able to recognise some text and you will
+have to help it out. It will display the part it is having trouble
+with, showing where it has got to, and allow you to edit the character
+set database it uses for recognising the text.
+
+*This is subtle* and it is important to understand the way the
+machinery works, and the possible mistakes you can make, before
+answering the program. *Please read this documentation*
+
+If you need help please ask me (ijackson@chiark.greenend.org.uk, or
+Aristarchus on Midnight in game if I'm on line, or ask any pirate of
+the crew Special Circumstances if they happen to know where I am
+and/or can get in touch).
+
+
+Recognition algorithm
+---------------------
+
+We recognise the text in the commodity screen by doing exact matching
+of `glyph' bitmaps, against the bitmap in each cell in the commodity
+table. We match from left to right.
+
+We do not insist that each glyph is followed by whitespace, and nor do
+we insist that glyphs do not contain whitespace. Our glyph database
+can contain entries which are strict prefixes of other entries - that
+is, a glyph for (say) `v' which is the leftmost part of another glyph
+for (say) `w'. We resolve these ambiguities by taking the longest
+(widest) glyph which matches.
+
+So you should not be surprised if the program has matched the
+left-hand half of some letter and thinks it is a different letter. If
+the part that it did recognise does look like the letter in question,
+that isn't wrong. All you need to do is insert the whole of the
+actual letter in the database - move the LH cursor to the start of the
+letter, and the RH cursor to its end, and hit `return' and enter the
+correct character. The longest match rule will mean it will prefer
+the entry you have just made.
+
+
+Upper vs lower case - important note regarding `l' and `I'
+----------------------------------------------------------
+
+We maintain separate databases for upper and lower case. At the
+beginning of each cell in the table, we expect uppercase; in the
+middle of a word we expect lowercase; and, unfortunately, after an
+inter-word gap, we are not sure.
+
+This is troublesome because `l' and `I' look identical on the screen.
+So any time we see a word starting with `l' or `I', the program has to
+ask about it.
+
+*Do not* make an entry in the character set database mapping `vertical
+stick' to `l' or `I'. Instead, select enough of the whole word in
+question that no word would start with the other letter, and enter the
+whole word or part of it as a new glyph.
+
+For example, in the supplied database there is already a glyph for
+`Iron'; this is OK because there are no words which start `lron'.
+
+Do not make an entry for a string more than 7 characters long;
+currently we cannot cope (and you'll have to remove it manually from
+the charset-15.txt file).
+
+
+Short inter-word gaps
+---------------------
+
+It can happen that the problem you are being asked about is caused by
+the program failing to spot an inter-word gap and mistakenly thinking
+that the next word is necessarily in lowercase, so failing to recognise
+an uppercase letter. The context in which each glyph was recognised
+is shown on the screen, underneath the text which shows what it was
+recognised as.
+
+*You should check the alleged context before entering a character*.
+If it is wrong, you should fix it, rather than just making an entry
+for the uppercase letter in the lowercase database.
+
+Instead, make a new glyph for the last letter of the previous word
+plus the (unusually narrow) inter-word space, and end that entry with
+\x20 (yes, type \ x 20).
+
+For example, you might find that `yG' is treated as
+`y' and the G doesn't get matched. Select the `y'
+region of the bitmap and type `y\x20' into the string box.
+Sorry for this rather poor UI!
+
+
+Overlapping characters - ligatures
+----------------------------------
+
+Some of the characters in the font used overlap with the next
+character. When this happens, select both the characters and enter
+them together as one glyph with a multi-character definition.
+
+For example `yw' is rendered with the top right corner of the `y' and
+the top left corner of the `w' overlapping. This is dealt with by
+matching the whole merged thing - select the region of the screen
+containing `yw' and define it as `yw'.
+
+
+Fixing mistakes
+---------------
+
+The OCR query UI allows you to delete things from the glyph database.
+However since you are not guaranteed to actually get an OCR query at
+all if the database contains errors, you shouldn't rely on this.
+
+If you think you have made mistakes answering OCR queries (for
+example, the recognised data is wrong), you should download a fresh
+copy of charset-15.txt from
+  http://www.chiark.greenend.org.uk/~ijackson/ypp-sc-tools/master/pctb/charset-15.txt
+
+
+Send me your updates
+--------------------
+
+The character set is in the file `charset-15.txt'. When you enter new
+characters, they are added there. If you do this, please email me
+your charset file (ijackson@chiark.greenend.org.uk) so that I can
+include your contributions in future versions. This will also let me
+check that they seem right :-).
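
Illustration (not part of the patch): the longest-match rule described
under `Recognition algorithm' in README.charset can be sketched in
Python. This is a minimal sketch under assumptions, not the code the
tool uses. The names recognise and GLYPHS, the representation of a
cell as one integer bitmask per pixel column, and the toy bitmaps for
`v' and `w' are all hypothetical. Upper/lower case context and
inter-word gaps are left out; blank columns are simply skipped.

```python
def recognise(columns, glyphs):
    """Match glyph bitmaps left to right, preferring the widest match.

    columns: list of ints, one bitmask per pixel column of a table cell.
    glyphs:  dict mapping a tuple of column bitmasks to the text it means.
    Raises LookupError at the point where no glyph matches (the place
    where, per the README, the real tool would ask the user instead).
    """
    text = []
    pos = 0
    while pos < len(columns):
        if columns[pos] == 0:
            # Blank column: skipped here; real gap handling is more subtle.
            pos += 1
            continue
        best = None
        for bitmap, meaning in glyphs.items():
            width = len(bitmap)
            # Exact match of this glyph against the columns starting at pos.
            if tuple(columns[pos:pos + width]) == bitmap:
                # Longest (widest) matching glyph wins.
                if best is None or width > len(best[0]):
                    best = (bitmap, meaning)
        if best is None:
            raise LookupError(f"no glyph matches at column {pos}")
        text.append(best[1])
        pos += len(best[0])
    return "".join(text)


# Hypothetical toy bitmaps: the `v' glyph is a strict prefix of `w'.
GLYPHS = {
    (6, 1, 6): "v",
    (6, 1, 6, 1, 6): "w",
}

# A `w', one blank column, then a `v'.
print(recognise([6, 1, 6, 1, 6, 0, 6, 1, 6], GLYPHS))
```

With only the `v' entry present, the same input matches `v' on the
left-hand part of the `w' and then gets stuck on the remaining
columns. That is the situation the README describes, and adding the
wider `w' entry resolves it because the widest match is preferred.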