README for analog0.92

Introduction

This README describes analog0.92beta. For the latest version of analog, see the analog home page.

This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics. For more details, see the analog home page.

For examples of the output see

This program may be freely distributed and modified provided full credit is given to Stephen Turner (sret1@cam.ac.uk), and that this condition is retained. However, please only distribute it intact, including the domains file and this README. No warranty of any sort is given or implied for this program.

What's new?

This section describes the main changes in each version of analog. If you are using analog for the first time, you can skip this section.

0.92beta: New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.
0.91beta3: Efficiency savings in terms of memory used; should make host count possible on more machines.
New commandline option +G.
Scrapped "microseconds per request"; it doesn't make much sense with the ability to analyse only certain lines from the logfile.
0.91beta2: Minor bug fixes.
Logo given transparent background; better on some screens.
0.91beta: Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
README converted to HTML.
0.9beta: More speed improvements, and some bug fixes.
Introduced -u option, and subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3.
0.89beta: Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.
0.8beta: Initial program, just default options.

Types of report

Analog can produce the following types of report. See the analog home page for examples of these reports.

Summary statistics; total number of requests etc. These are always produced.
Monthly report. How many requests in each month.
Weekly report. How many requests each week.
Daily summary. How many requests on each day of the week.
Daily report. One entry for each day in the logfile.
Hourly summary. How many requests at each hour of the day.
Domain report. Which countries they came from.
Host report. How many requests from each separate host.
Directory report. Which of your directories they wanted to read.
Request report. And which of your files.

There are many options specifying, for example, how to sort these reports and what exactly to print. Details are below.

Auxiliary files

The following files should have come with this README. In any case, the latest versions are available from the analog home page.

analog.c     The program.
analogo.gif  A logo for the top of the output.
analhead.h   A header file, with various settable options.
domains.tab  A file matching internet domains to their locations.
Makefile     A Makefile

analogo.gif

A logo that you can use at the top of your reports. You can specify a different logo if you want, for example, your organisation's logo to appear instead.

analhead.h

The file analhead.h should be pretty self-explanatory. You will need to change the values of DOMAINSFILE and LOGFILE, and will want to change HOSTNAME and HOSTURL. You can change any of the rest as you wish, but it's probably OK to leave them as they are, at least at first.

You don't need to know about the domains.tab file (as long as you've got one) or configuration files (at all). If you just want to try out the program, you can go straight to the section on compiling and running.

domains.tab

The file domains.tab should be in the following format:

ad   Andorra
ae   United Arab Emirates
[...]

There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. Subdomains can also be analysed in the domain report. They are specified in the configuration file. Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen
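The format rules above can be sketched in a few lines of Python. This is only an illustration of the file format as described here, not analog's own parsing code (analog is written in C); the function name parse_domains is hypothetical, and treating a line with no name as a nameless domain is an assumption.

```python
def parse_domains(lines):
    """Parse domains.tab-style lines into {code: name or None}.

    Hypothetical helper illustrating the format described above;
    it is not analog's own parser.
    """
    domains = {}
    for line in lines:
        # Everything from '#' onwards is a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Arbitrary whitespace separates the code from the location.
        parts = line.split(None, 1)
        code = parts[0].lower()  # codes are converted to lower case
        name = parts[1].strip() if len(parts) > 1 else ""
        # A name starting with '?' means: recognise the domain but print
        # no name. A missing name is treated the same way (assumption).
        if not name or name.startswith("?"):
            domains[code] = None
        else:
            domains[code] = name
    return domains


sample = [
    "ad   Andorra",
    "AE   United Arab Emirates",
    "uk  United Kingdom  # God save the Queen",
    "zz   ?",
]
print(parse_domains(sample))
```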

Header and footer files

Analog has the ability to insert a header file of your choice between the top line of the page and the start of the statistics. (See below for how to specify this option). This file is inserted directly into the output, so it can contain HTML markup; in fact, it should if it is more than one paragraph long. Similarly, a footer file can be inserted at the bottom of the output.

Configuration files

You can safely skip this section on a first reading if you want.

When analog starts running, it can read instructions from a configuration file (or from stdin). At the moment the following instructions can appear in this file. (You are welcome to suggest others for the wishlist.)

FILEALIAS file1 file2

Whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately.
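As a rough illustration of the built-in translations just mentioned, the following Python sketch unescapes a filename and folds /dir/index.html into /dir/. The function name normalise and the order of the two steps are assumptions; analog's own code may handle more cases.

```python
from urllib.parse import unquote


def normalise(filename):
    """Apply the two translations described above (illustrative only)."""
    # Translate escaped entities, e.g. %7E becomes ~.
    name = unquote(filename)
    # Treat /dir/index.html as the same file as /dir/.
    if name.endswith("/index.html"):
        name = name[: -len("index.html")]
    return name


print(normalise("/%7Esret1/index.html"))
```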

If * is placed at the end of the first entry, then all filenames starting with file1 will be changed to start with file2. So, for example, after the command

FILEALIAS /~sret1/statprog/* /~sret1/analog/

a filename looking like /~sret1/statprog/statprog/stat.c will be understood as /~sret1/analog/statprog/stat.c. (Note that the conversion is done only once for each filename; you don't get /~sret1/analog/analog/stat.c).
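The prefix rule and the "only once" rule can be sketched as follows in Python. The function name apply_file_alias is hypothetical, and letting the first matching alias win when several match is an assumption, since the text above does not say how competing aliases are ordered.

```python
def apply_file_alias(filename, aliases):
    """Apply at most one FILEALIAS-style rule to filename.

    aliases is a list of (file1, file2) pairs. Illustrative sketch only.
    """
    for file1, file2 in aliases:
        if file1.endswith("*"):
            prefix = file1[:-1]
            if filename.startswith(prefix):
                # Replace the prefix once and stop; the result is not
                # fed back through the aliases.
                return file2 + filename[len(prefix):]
        elif filename == file1:
            return file2
    return filename


aliases = [("/~sret1/statprog/*", "/~sret1/analog/")]
print(apply_file_alias("/~sret1/statprog/statprog/stat.c", aliases))
```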

The second command that can occur in the configuration file is

HOSTALIAS host1 host2

This is useful if your server records local hosts in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use

HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk

Again, only one conversion is done per host, which is why I need both the second and the third line.
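To see why chaining aliases does not work, here is a small Python sketch of a single-lookup alias table. The function name apply_host_alias is hypothetical, and exact (case-sensitive) matching of host names is an assumption.

```python
def apply_host_alias(host, aliases):
    """Look the host up once; the result is not looked up again."""
    return aliases.get(host, host)


# The three HOSTALIAS lines above, as a table.
direct = {
    "lion": "lion.statslab.cam.ac.uk",
    "www": "lion.statslab.cam.ac.uk",
    "www.statslab.cam.ac.uk": "lion.statslab.cam.ac.uk",
}

# A chained table, relying on a second conversion that never happens.
chained = {
    "www": "www.statslab.cam.ac.uk",
    "www.statslab.cam.ac.uk": "lion.statslab.cam.ac.uk",
}

print(apply_host_alias("www", direct))
print(apply_host_alias("www", chained))
```

With the chained table, www stops at www.statslab.cam.ac.uk and is never combined with lion.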

Next there are some commands for analysing only part of the logfile. Use of these commands can slow the program down a bit. The command

FILEIGNORE filename

ignores altogether logfile entries due to certain files. Again, an asterisk (*) can appear at the end of the name. Another command

FILEONLY filename

looks at only that file. If two FILEONLY commands appear, the program will look at entries matching either of them. For example

FILEONLY /~sret1/*
FILEONLY /~sret2/*
FILEIGNORE   /~sret1/backgammon/*
FILEIGNORE   /~sret2/home.html

will look at my files and sret2's files, but not my backgammon files or sret2's home page. These commands are applied after any aliasing of filenames has been done, but the files given in these names are not subject to any aliasing.
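The way FILEONLY and FILEIGNORE combine can be sketched in Python as below. The names matches and file_wanted are hypothetical; the sketch assumes that, when no FILEONLY command is given, every file is a candidate, and that the filename passed in has already been aliased.

```python
def matches(pattern, filename):
    """True if filename matches pattern; a trailing '*' matches any ending."""
    if pattern.endswith("*"):
        return filename.startswith(pattern[:-1])
    return filename == pattern


def file_wanted(filename, only, ignore):
    """Decide whether a logfile entry for filename should be analysed."""
    # If any FILEONLY patterns exist, the file must match at least one.
    if only and not any(matches(p, filename) for p in only):
        return False
    # A matching FILEIGNORE pattern always excludes the file.
    return not any(matches(p, filename) for p in ignore)


only = ["/~sret1/*", "/~sret2/*"]
ignore = ["/~sret1/backgammon/*", "/~sret2/home.html"]
for name in ["/~sret1/analog/", "/~sret1/backgammon/rules.html", "/~sret2/home.html"]:
    print(name, file_wanted(name, only, ignore))
```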

There are similar commands for including or excluding certain hosts. For example,

HOSTONLY        *.statslab.cam.ac.uk
HOSTIGNORE   lion.statslab.cam.ac.uk

will only include hosts from my local site, but not lion.statslab. In this case, *.statslab.cam.ac.uk matches the host statslab.cam.ac.uk. If you want to exclude it, you can give a HOSTIGNORE line for it.
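The host wildcard differs from the filename one in that the asterisk comes first, and in that *.statslab.cam.ac.uk also matches the bare name statslab.cam.ac.uk. A minimal Python sketch of that behaviour follows; the names host_matches and host_wanted are hypothetical, and only a leading "*." wildcard is handled.

```python
def host_matches(pattern, host):
    """Match a host against a pattern with an optional leading '*.'."""
    if pattern.startswith("*."):
        domain = pattern[2:]
        # The bare domain itself counts as a match, as described above.
        return host == domain or host.endswith("." + domain)
    return host == pattern


def host_wanted(host, only, ignore):
    """Combine HOSTONLY and HOSTIGNORE patterns (illustrative only)."""
    if only and not any(host_matches(p, host) for p in only):
        return False
    return not any(host_matches(p, host) for p in ignore)


only = ["*.statslab.cam.ac.uk"]
ignore = ["lion.statslab.cam.ac.uk"]
for host in ["statslab.cam.ac.uk", "lion.statslab.cam.ac.uk", "www.cam.ac.uk"]:
    print(host, host_wanted(host, only, ignore))
```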

Next there are commands to analyse only the logfile entries from certain dates. The commands are FROM yymmdd and TO yymmdd. So, for example, if I wanted to analyse only requests in June 1995 I could use the configuration

FROM 950601
TO   950630
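A date filter of this kind might look like the following Python sketch. The names parse_yymmdd and in_range are hypothetical. Two assumptions are made: that both end dates are inclusive (which the June 1995 example suggests), and that two-digit years are expanded the way Python's %y does it, which may not be how analog treats them.

```python
from datetime import date, datetime


def parse_yymmdd(text):
    """Convert a yymmdd string such as 950601 into a date."""
    return datetime.strptime(text, "%y%m%d").date()


def in_range(day, from_text=None, to_text=None):
    """True if day lies between the FROM and TO dates, inclusive."""
    if from_text is not None and day < parse_yymmdd(from_text):
        return False
    if to_text is not None and day > parse_yymmdd(to_text):
        return False
    return True


print(in_range(date(1995, 6, 15), "950601", "950630"))
print(in_range(date(1995, 7, 1), "950601", "950630"))
```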

There is a command SUBDOMAIN to analyse subdomains within the domain report. The syntax is

SUBDOMAIN subdomain subdomain_name

If the subdomain name has spaces in, it must be enclosed in quotes. The symbol ? can be used as the subdomain name, indicating a nameless subdomain. For example, I typically run analog with a configuration file including the lines

SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk   ?

which produces output like

 103118 :  46.60% : .uk (United Kingdom)
( 64975):( 35.46%):     cam.ac.uk (University of Cambridge)
( 47133):( 20.56%):       statslab.cam.ac.uk
  49271 :  12.49% : .edu (USA Educational)

Numerical subdomains (which have most significant part on the left) can also occur. They will look like

131   The Ever-Popular 131 domain

131.111   ?

Finally, all of the commandline arguments can have values specified in the configuration file. These values are overridden by the commandline values, but override the default values given in analhead.h. This is useful if you often process the logfile in two different ways; then you can get one of them just by supplying a configuration file without the need to recompile the program or specify a long list of commandline arguments. The legal values for variables are exemplified below.

LOGFILE  logfile          # the logfile to be analysed
DOMAINSFILE  domainsfile  # where to get the domains from
HEADERFILE headerfile # Put this file near the top of the output
FOOTERFILE none       # But none at the bottom
LOGOURL logourl    # the URL where the logo for the report title is.
LOGO on            # or off. Whether to use the logo.
HOSTNAME "Stephen's pages"   # our organisation
HOSTURL http://www.statslab.cam.ac.uk/  # where to find us; - for off.
REQFLOOR 1  # if sorting by requests or alphabetically, the
            # fewest requests needed to get on the request
            # report; if by bytes, the min traffic in 1/100ths
            # of a percent; if negative, do a 'top n' report.
DIRFLOOR 1    # directory report
DOMFLOOR -10  # domain report
HOSTFLOOR 100 # if this is too low, host report will be slow and long.
REQSORTBY byrequests   # how to sort the request report
DIRSORTBY alphabetical # directory report
DOMSORTBY bybytes      # domain report
HOSTSORTBY alphabetical # host report. Alphabetical is with
              # domain as most significant part.
MARKCHAR *    # the character for graphical displays
MARKCHAR '#'  # in quotes so that it isn't a comment
PAGEWIDTH 65  # the width of the output pages 
MONTHLY on    # do a monthly report?
DAILY off     # a daily summary?
FULLDAILY off # a full daily report (one line for each day)?
WEEKLY off    # a weekly report?
HOURLY on     # an hourly summary?
DOMAIN on     # a domain report?
FULLHOSTS off # a full hostname report (can be very long)?
DIRECTORY on  # a directory report?
REQUEST off   # a request report?
DIRLEVEL 2    # the level of the directory report
REQTYPE pages # what to print in the request report; pages or all
PAGELINKS on  # whether to link to pages in the request report
COUNTHOSTS on # count the total number of hosts who have visited?
LASTSEVEN on  # statistics for the last 7 days?
NUMLOOKUP off # try and resolve numerical addresses; NB SLOW!
MONTHLYUNIT 1000  # the size of the character in the graphical displays
HOURLYUNIT 0  # 0 represents 'choose something sensible automatically'
DAILYUNIT 0
FULLDAILYUNIT 0
WEEKLYUNIT 0
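The order of precedence described above (commandline, then configuration file, then the defaults in analhead.h) can be modelled in Python as a chain of lookups. This is a sketch only: read_config is a hypothetical helper, all values are made up for illustration, and Python's shlex is used because its handling of quotes and # comments resembles the examples above, not because analog parses its configuration this way.

```python
import shlex
from collections import ChainMap


def read_config(text):
    """Turn configuration text into a {VARIABLE: value} dictionary.

    Only simple "NAME value" lines are handled in this sketch.
    """
    settings = {}
    for line in text.splitlines():
        # comments=True drops everything after an unquoted '#'.
        words = shlex.split(line, comments=True)
        if len(words) == 2:
            settings[words[0]] = words[1]
    return settings


# Hypothetical values standing in for the three sources of settings.
compiled_defaults = {"MONTHLY": "on", "PAGEWIDTH": "65", "MARKCHAR": "+"}
config_file = read_config(
    "PAGEWIDTH 80   # wider pages\n"
    "MARKCHAR '#'   # in quotes so that it isn't a comment\n"
    "MONTHLY off\n"
)
command_line = {"MONTHLY": "on"}  # as if +m had been given

# Earlier maps win: commandline, then configuration file, then defaults.
settings = ChainMap(command_line, config_file, compiled_defaults)
print(dict(settings))
```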

Makefile

You can use the supplied Makefile to help you compile the program, with whichever options you need or want.

Compiling and running the program

When you have changed the analhead.h and domains.tab files to your liking, compile the program by typing

make

(It may take a while, as the program is rather big.) If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.

The program usage is

  analog [logfile | -] [options]

The name of the logfile can be omitted, in which case the default (as specified in analhead.h) will be used. - will use stdin. This is useful for constructing pipes, e.g.,

  cat access_log1 access_log2 | analog -

(This is safe, because the program works properly even if the logfile it is analysing is not in chronological order).

The legal options are as follows. (Again, you don't need any of these; you can just use the defaults.) All of them have default values, which can be changed in analhead.h or the configuration file. If two contradictory options are specified, the later one is heeded (this might be useful if you have an alias set up, for example). Note that items in square brackets are optional, and that there can be no space between an option and any arguments to it. See the pages cited above for the meaning of the various reports.

  -m            Don't do a monthly report.
  +m[n]         Do a monthly report. Make each character in the graph
                worth n requests; if n = 0, choose something sensible.
  -d | +d[n]    Don't do/Do a daily summary.
  -h | +h[n]    Don't do/Do an hourly summary.
  -o            Don't do a domain report.
  +o[a|b|r][n]  Do a domain report. Sort alphabetically/by bytes/
                by requests. If n is positive, only print domains with
                at least n requests (r or a) or at least n/100ths of a
                percent of the traffic (b). If n is negative, print the
                top n (well, top -n actually) on the report.
  -i | +i[a|b|r][n]   Directory report.
  -r | +r[a|b|r][n]   Request report; report pages only. (But other files
                      are still counted in the other reports).
  -R | +R[a|b|r][n]   Request report; report all filenames.
  -S | +S[a|b|r][n]   Host report; one line for every host. This is slow
                and produces copious output unless n is set high or only
                a small (part of a) logfile is being analysed.
  -s | +s       Don't/do count number of distinct hosts. Not doing this
                can save a lot of memory! +S forces +s.
  -7 | +7       Don't/do give statistics for last 7 days.
  -ln           Level (or depth) of directory report.
  -k | +k | +kk Don't link in the request report / link to pages /
                link to everything.
  -cchar        The character to use in the graphical displays.
  -wn           The width of the page for output. 65 is about right.
  -nhostname    The name of your organisation or WWW server. (Printed
                in the title of the output page).
  -uhosturl     The URL of your front page. Use -u- if you don't want
                the top line of your output to be linked to anywhere.
  -fdomainsfile The domains file you want to use.
  +gconffile    The configuration file you want to use.
                Use +g- to read configuration instructions from stdin.
  +Gconffile    Use another configuration file in addition to the
                default one.
  -g | -G       Don't use a configuration file.
  +Hheaderfile  Include the specified header file near the top of the output.
  -H            Don't use a header file.
  +Ffooterfile  Put this file at the bottom of the output.
  -F            Don't use a footer file.
  -p            Don't put a logo at the top (p for picture)
  +p[logourl]   Do put a logo. Find it at the given URL.
  -1 / +1       Don't / do try and look up all numerical addresses.
                (May be slow).

Warnings

Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

The figures for the last 7 days are for the 7 days before the program is run, not before the last entry in the logfile. Moreover, they are actually for any times after 7 days before the program is started; so `future' requests will be included if some more requests come in while the program is reading the logfile (or if the clock on the computer running the program is not synchronised with the one on the computer that recorded the logs).
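In other words, the test has a lower bound but no upper bound. A minimal Python sketch of such a test is given below; the function name in_last_seven_days is hypothetical, and whether the boundary instant itself is included is an assumption.

```python
from datetime import datetime, timedelta


def in_last_seven_days(entry_time, program_start):
    """True for any entry later than 7 days before the program started."""
    # There is deliberately no upper limit, so 'future' entries count too.
    return entry_time > program_start - timedelta(days=7)


start = datetime(1995, 7, 11, 12, 0)
print(in_last_seven_days(datetime(1995, 7, 5, 9, 0), start))
print(in_last_seven_days(datetime(1995, 7, 4, 11, 59), start))
print(in_last_seven_days(datetime(1995, 7, 11, 12, 5), start))
```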

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.

The reported `running time' is elapsed real time, not CPU time.

If you specify +oa-10 you really do get the top ten domains alphabetically.

Subdomains will be reported in the domain report if and only if their parent domain is, independent of their usage.

The program uses a rather naïve method to determine what is a `page' for the purpose of the request report. A page is anything that ends in .html, .htm, .shtml, .shtm or /.
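Expressed in Python, the page test amounts to a suffix check. The name is_page is hypothetical, and the sketch assumes the comparison is case-sensitive, which the text above does not state.

```python
PAGE_ENDINGS = (".html", ".htm", ".shtml", ".shtm", "/")


def is_page(filename):
    """True if filename counts as a page under the rule above."""
    return filename.endswith(PAGE_ENDINGS)


for name in ["/~sret1/", "/~sret1/analog/index.html", "/~sret1/analogo.gif"]:
    print(name, is_page(name))
```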

Known bugs

Everything is done in local time. This means that the last seven days can be 167 or 169 hours long in the week after timezone change times.

The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred.

Wishlist

Please send me (sret1@cam.ac.uk) feedback on this program: whether it works on your system (yes, even if it does I should like to know!), ease of use, what other options you would like to see, etc. I shall also then send you news on updates or bug fixes.

The following list contains, in no particular order, things that I think would be good to be able to do with analog, as well as things that people have suggested that I have not yet implemented. I welcome comments on how high a priority people place on the various items, and what other things would be useful. Some of these I want to do anyway, but some definitely won't get done unless other people want them.

Take account of timezones (see Bugs).
Do some sort of domain report even if there is no domains file.
Graph of number of new hosts each week/month; or number of hosts.
In the monthly report, extrapolate to give monthly rate for first and last months, not raw data.
Save some of the information so that we don't have to process the whole logfile next time.
It would be possible to do an approximate host count in a fixed amount of memory if people can't cope with recording all hosts. Is this worth doing?
In the configuration file, be able to specify what should count as a `page' for the request report.
Analysing subdomains such as *.cam.ac.uk and (not statslab).cam.ac.uk. (The exact user input could be tricky).
Sorting on a second criterion; e.g., bytes within equal requests.

Acknowledgements

Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats got buggy when the logfile got big, but you may notice that my output (although not my program) is based on his to some extent. I have added some extra options that I wanted to see, and taken away a couple that I couldn't see the use of.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice.

Stephen Turner
University of Cambridge Statistical Laboratory
E-mail: sret1@cam.ac.uk

Page last modified: 11-Jul-95
