This README describes analog0.9beta3. For the latest version, see the
analog home page (address below).

This program analyses logfiles in both the common log format and NCSA old
format from WWW servers. It is designed to be fast on long logfiles and to
produce attractive statistics. For more details, see
  http://www.statslab.cam.ac.uk/~sret1/analog/
For examples of the output see
  http://www.statslab.cam.ac.uk/~sret1/stats/stats.html
  http://www.statslab.cam.ac.uk/~sret1/stats/statsme.html

This program may be freely distributed and modified provided that full
credit is given to Stephen Turner (sret1@cam.ac.uk), and that this condition
is retained. However, please only distribute it intact, including the
domains file and this README. No warranty of any sort is given or implied
for this program.

Thanks are due to the author of getstats
(http://www.eit.com/software/getstats/getstats.html). We (and other people)
have found that getstats got buggy when the logfile got big, but you may
notice that my output (although not my program) is based on his to some
extent. I have added some extra options that I wanted to see, and taken away
a couple that I couldn't see the use of.

Thanks are also due to the following, who helped in the early stages of
writing this program: Quentin Stafford-Fraser (qs101@cl.cam.ac.uk) and Dave
Stanworth (djh@gamesdom.demon.co.uk) for beta testing, spotting bugs and
suggesting new features; and especially Gareth McCaughan
(gjm11@pmms.cam.ac.uk) for programming advice, particularly in making the
code faster.

The following files should have come with this README. If not, get them from
the above WWW address.
  * analog.c      The program.
  * analhead.h    A header file, with various settable options.
  * domains.tab   A file matching internet domains to their locations.

The file analhead.h should be pretty self-explanatory. You will need to
change the values of DOMAINSFILE and LOGFILE, and will want to change
HOSTNAME and HOSTURL.
You can change any of the rest as you wish, but equally it's probably OK to
leave them as they are, at least at first.

The file domains.tab should be in the following format:
  ad   Andorra
  ae   United Arab Emirates
  [...]
There are 3 spaces between the code and the corresponding location. The
codes must be in lower case. Use ? for the name if you want the domain to be
recognised, but don't want the name to be printed out.

After each domain can come subdomains of that domain in the same format.
These will be printed out in the domain report after the domain to which
they apply, in that order. So, for example, the domains file I use contains
the lines
  ug   Uganda
  uk   United Kingdom
  cam.ac.uk   University of Cambridge
  statslab.cam.ac.uk   ?
  um   US Minor Outlying Islands
and the output looks like
   103118 :  46.60% : .uk (United Kingdom)
  ( 64975):( 35.46%): cam.ac.uk (University of Cambridge)
  ( 47133):( 20.56%): statslab.cam.ac.uk
    49271 :  12.49% : .edu (USA Educational)

Numerical subdomains (which have the most significant part on the left) can
occur before the first domain. They will look like
  131   The Ever-Popular 131 domain
or
  131.111   ?

Now compile the program analog.c using your favourite ANSI C compiler.

The program usage is
  analog [logfile | -] [options]
The name of the logfile can be omitted, in which case the default LOGFILE
will be used. - (or 'stdin') will use stdin. This is useful for constructing
pipes, e.g.,
  grep 'interesting' access_log | analog -

The legal options are as follows. All of them have default values which can
be changed in the header file. If two contradictory options are specified,
the later one is heeded (this might be useful if you have an alias set up,
for example). See the pages above for the meaning of the various reports.

  -m                  Don't do a monthly report.
  +m[n]               Do a monthly report. Make each character in the graph
                      worth n requests; 0 means the program will choose
                      something sensible.
  -d | +d[n]          Don't do/Do a daily summary.
  -h | +h[n]          Don't do/Do an hourly summary.
  -o                  Don't do a domain report.
  +o[a|b|r][n]        Do a domain report. Sort alphabetically/by bytes/by
                      requests. Only print domains with at least n requests
                      (r or a) or at least n/100ths of a percent of the
                      traffic (b).
  -i | +i[a|b|r][n]   Directory report.
  -r | +r[a|b|r][n]   Request report.
  -s | +s             Don't/do count number of distinct hosts.
  -7 | +7             Don't/do give statistics for last 7 days.
  -ln                 Level (or depth) of directory report.
  -k | +k | +kk       Don't link in the request report / link to pages /
                      link to everything.
  -cchar              The character to use in the graphical displays.
  -wn                 The width of the page for output. 65 is about right.
  -nhostname          The name of your organisation or WWW server. (Printed
                      in the title of the output page).
  -uhosturl           The URL of your front page. Use -u- if you don't want
                      the top line of your output to be linked to anywhere.
  -fdomainsfile       The domains file you want to use.

Warnings:

Filenames longer than a certain limit (which can be specified) are regarded
as corrupt lines and discarded.

The figures for the last 7 days are for the 7 days before the program is
run, not before the last entry in the logfile. Moreover, they are actually
for any times after 7 days before the program is started; so 'future'
requests will be included if some more requests come in while the program is
reading the logfile (or if the clock on the computer running the program is
not synchronised with the one on the computer that recorded the logs).

If the logfile contains any 'old-style' lines, they will be read and
understood, but as they do not contain any data about bytes transferred,
such data will no longer be collected. If any of the reports was to be
sorted by bytes, a warning will be issued and sorting will be by requests
instead.

There is at present no facility to sort by a second criterion (bytes within
equal requests or something). If two entries are tied on the sorting
criterion, they will be printed out in random order.
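
As an illustration of the last-7-days warning above, the window test can be
sketched in C as follows. This is only a sketch under assumptions: the names
in_last_seven_days and SEVEN_DAYS are hypothetical and are not taken from
analog.c, and time_t is assumed to count seconds. What it shows is that the
window has a lower bound (7 days before the program started) but no upper
bound, which is why 'future' requests are counted.

```c
#include <time.h>

/* Number of seconds in seven days. */
#define SEVEN_DAYS (7L * 24L * 60L * 60L)

/* Hypothetical helper, not taken from analog.c.
   Returns 1 if a request stamped request_time counts towards the
   "last 7 days" figures for a run that started at start_time, else 0.
   There is deliberately no upper bound: a request logged after the
   program started (or stamped by a clock that runs ahead) still counts. */
int in_last_seven_days(time_t request_time, time_t start_time)
{
    /* difftime(a, b) gives a - b in seconds as a double. */
    return difftime(request_time, start_time) > -(double)SEVEN_DAYS;
}
```

Comparing against the time of the last logfile entry instead of the start
time would give the other behaviour mentioned in the warning.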

The reported 'running time' is elapsed real time, not CPU time.

Although /%7Euser/page.html is translated to /~user/page.html, no other %nm
conversions are attempted.

Subdomains will be reported in the domain report if and only if their parent
domain is, independent of their usage. There is no way of specifying
subdomains *.cam.ac.uk or (not statslab).cam.ac.uk.

Known bugs:

Everything is done in local time. This means that the last seven days can be
167 or 169 hours long in the week after timezone change times.

The bytes aren't reported correctly. This is really a bug in the logfile.
Servers don't actually measure the number of bytes transferred, they measure
the size of the file they are about to transfer, so if a connection is
interrupted, they may write down more bytes than were actually transferred.

Please send me feedback on this program: whether it works on your system,
ease of use, what other options you would like to see, etc. I shall also
then send you news on updates or bug fixes.

Stephen Turner (sret1@cam.ac.uk)
Written: 21-Jun-95
Last modified: 29-Jun-95