README for analog0.93beta

Introduction

This README describes analog0.93beta. For the latest version of analog, see the analog home page.

This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics. For more details, see the

For examples of the output see

This program may be freely distributed and modified provided full credit is given to Stephen Turner (sret1@cam.ac.uk), and that this condition is retained. However, please only distribute it intact, including the domains file and this README. No warranty of any sort is given or implied for this program.

What's new?

This section describes the main changes in each version of analog. If you are using analog for the first time, you can skip this section.
0.93beta
Approximate hostname counting now possible in fixed memory.
New configuration commands ISPAGE and ISNOTPAGE.
New commandline option -v.
New configuration command WEEKBEGINSON.
Proper error message when memory exceeded.
Program split into several files.
0.92beta
New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.
0.91beta
Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
README converted to HTML.
0.9beta
More speed improvements, and some bug fixes.
Introduced -u option.
Introduced subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3.
0.89beta
Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.
0.8beta
Initial program, just default options.

Compiling and running the program

If you want to try the program straight away, you can leave most of this README until another time. The one thing you need to do is look at the file analhead.h, which contains the user-settable options. Most of them you can leave alone for the moment, but you will need to check the values of DOMAINSFILE and LOGFILE, and you will want to change HOSTNAME and HOSTURL.

When you have done that, compile the program by typing

make
(It may take a while, as the program is rather big.) If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.

Types of report

Analog can produce the following types of report. See the analog home page for examples of these reports. There are many options specifying, for example, how to sort these reports and what exactly to print. Details are below.

The domains file

The file domains.tab, to translate internet country codes to locations, should have come with the program. It should be in the following format:
ad   Andorra
ae   United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.

Subdomains can also be analysed in the domain report. They are specified in the configuration file.

Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen
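
The rules above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not analog's own code (analog is a C program); the function name parse_domains and the sample code zz are hypothetical. How analog treats a line that has a code but no name is not stated here, so the sketch simply stores an empty name for it.

```python
def parse_domains(lines):
    """Parse domains-file lines into {code: name}; name is None when hidden."""
    table = {}
    for raw in lines:
        # Everything from '#' onwards is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Arbitrary whitespace separates the code from the location.
        parts = line.split(None, 1)
        code = parts[0].lower()  # codes are converted to lower case
        name = parts[1].strip() if len(parts) > 1 else ""  # assumption: empty name
        # A name starting with '?' means: recognise the domain, print no name.
        table[code] = None if name.startswith("?") else name
    return table


sample = [
    "ad   Andorra",
    "AE   United Arab Emirates",
    "uk  United Kingdom  # God save the Queen",
    "zz  ?unknown",  # hypothetical entry, recognised but never printed
]
print(parse_domains(sample))
```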

Configuration files

When analog starts running, it can read instructions from a configuration file (or from stdin). The following instructions can be in this file.


Commandline arguments

The program usage is

  analog [logfile | -] [options]
The name of the logfile can be omitted, in which case the default (as specified in analhead.h) will be used. Giving - as the logfile makes analog read from stdin. This is useful for constructing pipes, e.g.,
  cat access_log1 access_log2 | analog -
(This is safe, because the program works properly even if the logfile it is analysing is not in chronological order).

The legal options are as follows. (Again, you don't need any of these; you can just use the defaults.) All of them have default values, which can be changed in analhead.h or the configuration file. If two contradictory options are specified, the later one is heeded (this might be useful if you have an alias set up, for example). Note that items in square brackets are optional, and that there can be no space between an option and any arguments to it. See the pages cited above for the meaning of the various reports.

  -v            Just print out the values of all the variables based on
                the defaults, configuration file and commandline options,
                then exit.
  -m            Don't do a monthly report.
  +m[n]         Do a monthly report. Make each character in the graph
                worth n requests; if n = 0, choose something sensible.
  -d | +d[n]    Don't do/Do a daily summary.
  -h | +h[n]    Don't do/Do an hourly summary.
  -o            Don't do a domain report.
  +o[a|b|r][n]  Do a domain report. Sort alphabetically/by bytes/
                by requests. If n is positive, only print domains with
                at least n requests (r or a) or at least n/100ths of a
                percent of the traffic (b). If n is negative, print the
                top n (well, top -n actually) on the report.
  -i | +i[a|b|r][n]   Directory report.
  -r | +r[a|b|r][n]   Request report; report pages only. (But other files
                      are still counted in the other reports).
  -R | +R[a|b|r][n]   Request report; report all filenames.
  -S | +S[a|b|r][n]   Host report; one line for every host. This is slow
                and produces copious output unless n is set high or only
                a small (part of a) logfile is being analysed.
  +s            Count number of distinct hosts. This can use a lot of
                memory! +S forces +s.
  +ss           Do an approximate count in fixed memory instead.
  -s            Don't do any host count.
  -7 | +7       Don't/do give statistics for last 7 days.
  -ln           Level (or depth) of directory report.
  -k | +k | +kk Don't link in the request report / link to pages /
                link to everything.
  -cchar        The character to use in the graphical displays.
  -wn           The width of the page for output. 65 is about right.
  -nhostname    The name of your organisation or WWW server. (Printed
                in the title of the output page).
  -uhosturl     The URL of your front page. Use -u- if you don't want
                the top line of your output to be linked to anywhere.
  -fdomainsfile The domains file you want to use.
  +gconffile    The configuration file you want to use.
                Use +g- to read configuration instructions from stdin.
  +Gconffile    Use another configuration file in addition to the
                default one.
  -g | -G       Don't use a configuration file.
  +Hheaderfile  Include the specified header file near the top of the output.
  -H            Don't use a header file.
  +Ffooterfile  Put this file at the bottom of the output.
  -F            Don't use a footer file.
  -p            Don't put a logo at the top (p for picture)
  +p[logourl]   Do put a logo. Find it at the given URL.
  -1 / +1       Don't / do try and look up all numerical addresses.
                (May be slow).
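
As a rough illustration of how the [a|b|r][n] suffix reads, here is a small Python sketch. It is not analog's actual argument parser; the names parse_report_option, describe_n and last_wins are hypothetical. It only covers the report options that take this suffix (o, i, r, R and S), it is more lenient than the table above in that it also accepts a suffix after a minus sign, and it says nothing about what happens when the sort letter or n is left out.

```python
import re

# Letters of the reports that take the [a|b|r][n] suffix in the option list.
REPORT_OPTION = re.compile(r"([+-])([oirRS])([abr])?(-?\d+)?")


def parse_report_option(arg):
    """Split e.g. '+ob50' into on/off, report letter, sort key and n."""
    m = REPORT_OPTION.fullmatch(arg)
    if m is None:
        raise ValueError("not a report option: " + arg)
    sign, report, sort, n = m.groups()
    return {
        "report": report,
        "enabled": sign == "+",
        "sort": sort,  # None if no sort letter was given
        "n": int(n) if n is not None else None,
    }


def describe_n(sort, n):
    """Put the documented meaning of a non-zero n into words."""
    if n < 0:
        return "top %d entries" % -n
    if sort == "b":
        # For byte sorting, n counts hundredths of a percent of the traffic.
        return "at least %g%% of the traffic" % (n / 100)
    return "at least %d requests" % n


def last_wins(args):
    """When options contradict each other, the later one takes effect."""
    chosen = {}
    for arg in args:
        opt = parse_report_option(arg)
        chosen[opt["report"]] = opt
    return chosen


print(parse_report_option("+oa-10"))
print(describe_n("b", 50))
print(last_wins(["+or20", "-o"])["o"]["enabled"])
```

Note that there is no space anywhere inside an option: +ob50 is one argument, not three.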

Warnings

Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

The figures for the last 7 days are for the 7 days before the program is run, not before the last entry in the logfile. Moreover, they are actually for any times after 7 days before the program is started; so `future' requests will be included if some more requests come in while the program is reading the logfile (or if the clock on the computer running the program is not synchronised with the one on the computer that recorded the logs).
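
The test described above amounts to a single comparison against a cutoff, with no upper limit. The following Python sketch shows the idea; it is an illustration only, not analog's code, and the name in_last_7_days and the sample dates are made up. Whether a request exactly on the cutoff counts is not stated here, so the strict comparison is an assumption.

```python
from datetime import datetime, timedelta


def in_last_7_days(request_time, program_start):
    """True if the request is later than 7 days before the program started."""
    cutoff = program_start - timedelta(days=7)
    # There is no upper bound, so requests after program_start also count.
    return request_time > cutoff


start = datetime(1995, 7, 27, 12, 0)
print(in_last_7_days(datetime(1995, 7, 21, 9, 0), start))    # inside the window
print(in_last_7_days(datetime(1995, 7, 20, 11, 59), start))  # just too old
print(in_last_7_days(datetime(1995, 7, 27, 12, 5), start))   # a `future' request
```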

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.
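
A tiny Python sketch of why a tie gets cut off: if the list is sorted and then truncated to n entries, the second of two tied entries simply falls off the end. This is an illustration of the effect, not analog's code; the function name top_n and the filenames are hypothetical, and which of the tied entries survives here is a property of the sketch, not something this README specifies.

```python
def top_n(counts, n):
    """Return the n entries with the most requests, highest first."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    # Truncating after sorting drops any entry tied with the nth one.
    return ranked[:n]


requests = {"/a.html": 30, "/b.html": 20, "/c.html": 20}
print(top_n(requests, 2))
```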

The reported `running time' is elapsed real time, not CPU time.

If you specify +oa-10 you really do get the top ten domains alphabetically. This is almost certainly useless!

Subdomains will be reported in the domain report if and only if their parent domain is, independent of their usage.


Known bugs

Everything is done in local time. This means that the last seven days can be 167 or 169 hours long in the week after timezone change times.

The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.


Wishlist

Please send me (sret1@cam.ac.uk) feedback on this program: whether it works on your system (yes, even if it does I should like to know!), ease of use, what other options you would like to see, etc. I shall also then send you news on updates or bug fixes.

I expect this version of analog to be very close to version 1.0, so this is your last chance to suggest what should be in it before beta testing ends. The following list contains in no particular order things that people have suggested up to now. I welcome comments on how high a priority people place on the various items, and what other things would be useful. I won't spend time doing things unless there is a demand for them.


Acknowledgements

Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats gets buggy and very slow when the logfile gets big, but you may notice that my output (although not my program) is based on his to some extent.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.


Stephen Turner
University of Cambridge Statistical Laboratory
E-mail: sret1@cam.ac.uk

Page last modified: 27-Jul-95