README for analog0.94



This README describes analog0.94beta. For the latest version of analog, see the analog home page.

This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics.

This program may be freely distributed and modified provided full credit is given to Stephen Turner, and that this condition is retained. However, please only distribute it intact, including the domains file and this README. No warranty of any sort is given or implied for this program.

What's new?

This section describes the main changes in each version of analog. If you are using analog for the first time, you can skip this section.
New configuration variables SEPCHAR and REPORTORDER.
New configuration commands WITHARGS and WITHOUTARGS.
New commandline options +-A and +-x. (Config.: ALL and GENERAL).
Logfile entries with - as the return code are now regarded as successes, not corrupt entries.
Fixed bugs in host report when aliases or numerical hosts are present.
Documentation rewritten.
Approximate hostname counting now possible in fixed memory.
New configuration commands ISPAGE and ISNOTPAGE.
New commandline option -v.
New configuration command WEEKBEGINSON.
Proper error message when memory exceeded.
Program split into several files.
New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.
Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
README converted to HTML.
More speed improvements, and some bug fixes.
Introduced -u option.
Introduced subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3.
Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.
Initial program, just default options.

Compiling and running the program

If you want to get on with trying out the program straight away, you can leave most of this README until another time. The one thing you need to do is to look at the file analhead.h. These are all user-settable options, but most of them you can leave alone for the moment. You will need to check the values of DOMAINSFILE and LOGFILE, and you will want to change HOSTNAME and HOSTURL.

When you have done that, compile the program by typing

(It may take a while as the program is rather big). If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.

Then just type

to run the program. To send the output to a particular file instead of to the screen, type, e.g.,
analog > outfile.html

Customising analog

Pretty soon you will want to customise the output of analog to your personal preferences. How to do that is explained in this section. There are lots of options, so this section is rather long.

There are three ways in which customising can be done. First, the file analhead.h contains various settable parameters. These can be changed before compiling the program. They are explained in that file, so they will not be documented again here.

Secondly, there are commandline options, given after the commandname in the usual way. So, for example, the command

analog +d
uses the +d option to tell analog to include a daily summary in its output. All the commandline options are explained below.

Thirdly, you can tell analog to use a configuration file to read in extra options. This is specified by means of the commandline option +g. For example,

analog +gextra.conf
tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline options). If +G is used instead of +g, the default configuration file as specified in analhead.h is read first, then the one specified after +G. You can specify standard input as the configuration file by the options +g- and +G-.

The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.

DAILY      OFF   # We don't want a daily summary
FULLDAILY  ON    # We want a full daily report instead 
An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. The various commands which can occur in the configuration file are explained below.

Why three separate methods to specify options? Although some options can be set in two or even three ways, the three methods have different functions. The file analhead.h contains default values for the variables, which you want always to apply when you don't set anything else. The configuration file is appropriate for options you often use. For example, I run three jobs every night to calculate different sets of statistics from our server; each of these different formats is controlled by a configuration file. Commandline options, on the other hand, are the quickest thing to use if you want to run the program on line, or if you want to override one of the options set in a configuration file.

In order to use the three separate methods together, you have to know which takes precedence over which. The default values in analhead.h have the lowest priority. They are overridden by the values in the configuration file (if two configuration files are read, the one specified by +G takes precedence over the default one). And they in turn are overridden by the commandline arguments. If two contradictory options are specified in one configuration file or on the commandline, the later one is obeyed.

If this is all a bit confusing, just run

analog -v [other options]
That will tell you what the values of all the variables will be based on analhead.h, the configuration options and the commandline options.
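
The precedence rules above can be pictured with a short Python sketch. This is only an illustration, not part of analog: the function name effective_settings and the dictionary layout are my own assumptions, and the option names are just examples taken from this README.

```python
from collections import ChainMap


def effective_settings(header_defaults, default_config=None,
                       extra_config=None, commandline=None):
    """Combine the layers of settings; hypothetical helper, not analog code.

    ChainMap looks keys up in its maps from left to right, so the
    commandline layer is listed first (highest priority) and the
    analhead.h defaults last (lowest priority).
    """
    layers = [
        commandline or {},      # commandline options win over everything
        extra_config or {},     # file named after +g or +G
        default_config or {},   # default configuration file (read with +G)
        header_defaults,        # compiled-in values from analhead.h
    ]
    return dict(ChainMap(*layers))


settings = effective_settings(
    header_defaults={"DAILY": "ON", "FULLDAILY": "OFF", "WEEKLY": "OFF"},
    default_config={"DAILY": "OFF"},
    extra_config={"FULLDAILY": "ON"},
    commandline={"DAILY": "ON"},
)
print(settings)
```

Here DAILY ends up ON because the commandline overrides the configuration file, FULLDAILY is ON from the extra configuration file, and WEEKLY keeps its compiled-in default.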

Now we shall look at options which affect one of the reports; after that we shall see options which affect several or all of the reports. We shall look at the options under the following headings.

General Summary

Program started at Mon-26-Jun-1995 17:09 local time.
Analysed requests from Thu-28-Jul-1994 20:31 to Mon-26-Jun-1995 17:09 (332.8 days).
Total completed requests: 368 063 (12 872)
Total failed requests: 4 089 (139)
Total redirected requests: 35 277 (1 838)
Average requests per day: 1 219 (2 121)
Number of distinct files requested: 966 (336)
Number of distinct hosts served: 28 589 (1 589)
Number of new hosts served in last 7 days: 1 037
Corrupt logfile entries: 869
Total bytes transferred: 1 852 029 300 (85 752 881)
Average bytes transferred per day: 5 544 997 (12 250 411)
(Figures in parentheses refer to the last 7 days).

The general summary can be turned off by the commandline option -x or the configuration command

or on by +x or GENERAL ON. If the general summary is off, all the `Go To' links in the output are also omitted.

The figures for the last 7 days can be turned on and off with +7 and -7 or the configuration command


Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command

Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or
and you can specify the amount of memory to be used by
APPROXHOSTSIZE 100000  # or whatever number, in bytes
About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used.
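
As a rough worked example of that arithmetic, here is a small Python sketch. The function name approx_host_memory is hypothetical, and the figure of 28 589 hosts is simply the one from the sample general summary above.

```python
def approx_host_memory(expected_hosts, bytes_per_host=3, last_seven_days=True):
    """Return (APPROXHOSTSIZE value, total bytes used); illustrative only.

    About 3 bytes per host is the rule of thumb quoted in this README,
    and the space is doubled when the last-7-days statistics are on.
    """
    approxhostsize = expected_hosts * bytes_per_host
    total = approxhostsize * 2 if last_seven_days else approxhostsize
    return approxhostsize, total


size, total = approx_host_memory(28589)
print(size, total)  # 85767 171534
```

So for a server like the one in the example, an APPROXHOSTSIZE of about 86000 would be a reasonable starting point under these assumptions.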

Monthly report

Each + represents 1000 requests, or part thereof.

   month: #reqs
--------  -----
Nov 1994: 24784: +++++++++++++++++++++++++
Dec 1994: 32767: +++++++++++++++++++++++++++++++++
Jan 1995: 37656: ++++++++++++++++++++++++++++++++++++++
Feb 1995: 41666: ++++++++++++++++++++++++++++++++++++++++++
Mar 1995: 45113: ++++++++++++++++++++++++++++++++++++++++++++++
The monthly report can be turned on and off with +m and -m or

The value of + can be specified by a number after the +m option; e.g., +m1000 for the above display. If you specify +m0 (or if 0 is the default setting from analhead.h) the program will choose something sensible automatically. The equivalent configuration command is

MONTHLYUNIT 1000   # or 0, or whatever
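
The phrase "or part thereof" means the number of marks is rounded up. A minimal Python sketch of that rule follows; the function name bar is my own, and it does not attempt the automatic choice made when the unit is 0, since this README does not say how that is calculated.

```python
def bar(requests, unit, markchar="+"):
    """Return the bar for one line of a report; illustrative only.

    Each mark stands for `unit` requests or part thereof, so the count
    is the ceiling of requests / unit, done here in integer arithmetic.
    """
    if unit <= 0:
        raise ValueError("unit must be positive; automatic units are not modelled")
    return markchar * ((requests + unit - 1) // unit)


print(len(bar(24784, 1000)))  # 25
print(len(bar(32767, 1000)))  # 33
print(repr(bar(0, 1000)))     # ''
```

A line with no requests gets an empty bar, and a line with even a single request gets one mark.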

Weekly report

week beg.: #reqs
---------  -----
24/Jul/94:   187: +
31/Jul/94:  3909: +++++++++++++++++
 7/Aug/94:  3550: ++++++++++++++++
14/Aug/94:  3920: +++++++++++++++++
21/Aug/94:  5220: +++++++++++++++++++++
This is configured in exactly the same way as the previous report, but with +W and -W in place of +m and -m, and configuration commands WEEKLY and WEEKLYUNIT.

Daily summary

day: #reqs
---  -----
Sun: 29488: ++++++++++++++++++++
Mon: 55680: ++++++++++++++++++++++++++++++++++++++
Tue: 58162: +++++++++++++++++++++++++++++++++++++++
Wed: 59157: ++++++++++++++++++++++++++++++++++++++++
Thu: 61907: ++++++++++++++++++++++++++++++++++++++++++
Fri: 60827: +++++++++++++++++++++++++++++++++++++++++
Sat: 32573: ++++++++++++++++++++++
Again as before, with +d and -d, and DAILY and DAILYUNIT.

Daily report

     date: #reqs
---------  -----
28/Jul/94:    11: +
29/Jul/94:   174: ++++
30/Jul/94:     2: +
31/Jul/94:     0: 
 1/Aug/94:   104: +++
 2/Aug/94:   517: +++++++++++
This report has one line for each day from the first to the last request, so it can be very large. The appropriate commands are +D, -D, FULLDAILY and FULLDAILYUNIT.

Hourly summary

hr: #reqs
--  -----
 0: 12245: ++++++++++++++++++++++++++++++++++++++++++
 1: 10163: ++++++++++++++++++++++++++++++++++
 2:  9137: ++++++++++++++++++++++++++++++++
 3:  8899: ++++++++++++++++++++++++++++++
 4:  8070: ++++++++++++++++++++++++++++
 5:  7713: ++++++++++++++++++++++++++
+h, -h, HOURLY and HOURLYUNIT are the appropriate commands for this report.

Domain report

  #reqs :  %bytes : domain
--------  --------  ------
 103125 :  46.58% : .uk (United Kingdom)
( 64982):( 35.45%): (University of Cambridge)
( 47138):( 20.55%):
  49290 :  12.49% : .edu (USA Educational)
  54879 :   9.35% : .com (USA Commercial)
  39812 :   6.97% : (Numerical domains)
  15186 :   2.84% : .de (Germany)
This report is turned on and off with the commandline options +o and -o, or the configuration command

The report can be sorted by number of requests, percentage of bytes, or alphabetically. This is achieved on the commandline by adding a letter after the +o option; +or, +ob or +oa respectively. In the configuration file, the command

can be given.

The report can be listed to any required depth by putting a number after the +o, +or, +ob or +oa option. If sorting is by requests or alphabetical, the number is interpreted as the minimum number of requests required to get onto the report. If sorting is by bytes, it is hundredths of a percent of bytes. For example, +oa15 will list all domains with at least 15 requests, sorted alphabetically, whereas +ob15 will list all domains with at least 0.15% of the traffic, sorted by bytes. If a negative number is given, a `top n' report is calculated; so, for example, +or-20 will list the 20 domains with the highest numbers of requests. The number can also be supplied by means of the configuration command

DOMFLOOR 15  # or -20, or whatever
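
The way the sort letter and the floor interact can be sketched in Python as follows. This is an illustration under my own assumptions, not analog's code: select_rows is a hypothetical name, and the percentage of bytes is held as an integer number of hundredths of a percent to match the units of the floor.

```python
def select_rows(rows, sort_by="r", floor=0):
    """rows is a list of (name, requests, bytes_hundredths); illustrative only.

    sort_by is "r" (requests), "b" (bytes) or "a" (alphabetical).
    A negative floor gives a "top n" report; otherwise the floor is a
    minimum number of requests, or hundredths of a percent of bytes
    when sorting by bytes.
    """
    keys = {
        "r": lambda row: -row[1],
        "b": lambda row: -row[2],
        "a": lambda row: row[0],
    }
    ordered = sorted(rows, key=keys[sort_by])
    if floor < 0:
        return ordered[:-floor]          # first n rows in the chosen order
    if sort_by == "b":
        return [row for row in ordered if row[2] >= floor]
    return [row for row in ordered if row[1] >= floor]


# Figures taken from the sample domain report above.
rows = [("uk", 103125, 4658), ("edu", 49290, 1249),
        ("com", 54879, 935), ("de", 15186, 284)]

print([r[0] for r in select_rows(rows, "r", -2)])     # ['uk', 'com']
print([r[0] for r in select_rows(rows, "b", 1000)])   # ['uk', 'edu']
print([r[0] for r in select_rows(rows, "a", 50000)])  # ['com', 'uk']
```

Note that a negative floor with alphabetical sorting simply returns the first n names in alphabetical order, whatever their traffic.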

Subdomains can be specified for each domain. This can only be done in the configuration file. The syntax of the command is

SUBDOMAIN subdomain subdomain_name
If the subdomain name has spaces in, it must be enclosed in quotes. The symbol ? can be used as the subdomain name, indicating a nameless subdomain. For example, to produce the above output, I would include the following lines in the configuration file
SUBDOMAIN 'University of Cambridge'

Numerical subdomains (which have most significant part on the left) can also occur. They will look like

131   The Ever-Popular 131 domain
131.111   ?
Within a domain, subdomains will be output in the order in which they occur in the configuration file.

The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the commandline option -ffilename or the configuration command

DOMAINSFILE domainsfile
The correct format of the domains file is explained in a separate section.

Host report

#reqs: %bytes: host
-----  ------  ----
   10:  0.03%:
   11:  0.04%:
    1:       :
    2:  0.01%:
    1:       :
This is much the same as the domain report, with commandline options +S and -S, and configuration commands FULLHOSTS, HOSTSORTBY and HOSTFLOOR. Note that in this report, alphabetical sorting is by domain as most significant part. This report can be very long and slow to sort, and should be used with a high floor if at all.

Directory report

 #reqs: %bytes: directory
------  ------  ---------
237985: 35.40%: /~sret1/
 18596: 17.60%: /~rrw1/
  3574: 11.89%: /~richard/
  2376:  7.92%: /~steve/
 13518:  7.42%: /Dept/
Again, this is much the same as the domain report, with commandline options +i and -i, and configuration commands DIRECTORY, DIRSORTBY and DIRFLOOR. There is one further variable for this report, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like
 #reqs: %bytes: directory
------  ------  ---------
 43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
 11298:  4.14%: /~sret1/
  5322:  1.71%: /~sret1/backgammon/books/
  2773:  1.22%: /~sret1/images/
   728:  0.66%: /~sret1/backgammon/clubs/
This can be specified by the commandline option +l3 or the configuration command

Request report

#reqs: %bytes: filename
-----  ------  --------
33980: 23.66%: /~sret1/backgammon/main.html
21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
20422:  0.49%: /~sret1/backgammon/bitmaps/dice1.xbm
20187:  0.49%: /~sret1/backgammon/bitmaps/dice2b.xbm
12690:  0.86%: /
 8457:  1.09%: /header.gif
 7198:  0.81%: /~sret1/coldlist.html
 5461:  0.48%: /home.xbm
 3550:  0.32%: /~sret1/home.html
 3370:  0.23%: /~mcmc/html/
Commandline options +r and -r, and configuration commands REQUEST, REQSORTBY and REQFLOOR work analogously to the last three reports. There are also various options to control which files are printed and which are given links.

In fact, if the commandline option +r is used, only pages will be displayed in the report. If you want to list all files, including, for example, graphics, then you should use +R instead; alternatively, if neither +r nor +R is specified on the commandline, the configuration command

will control whether pages or all files are listed.

There are three possible modes of linking in the request report; you can link to none of the files, or pages only, or all files. The commandline options for these are -k, +k and +kk respectively; or you can use the configuration command


You can also specify in the configuration file what should be counted as a page in the requests report (thus giving you complete control over what goes in the report, or what is linked to). At the beginning, the following are `pages': *.html, *.htm, *.shtml and *.shtm. The command

ISPAGE filename
will specify that some other file is a `page'. Filenames can begin with an asterisk (*) as a wild card; so, for example,
ISPAGE *.ps.gz
would mean that Postscript files and compressed Postscript files are to be regarded as pages. You can also use
ISNOTPAGE filename
to specify that something which would otherwise be a page is not to be regarded as a page.
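
The matching rule can be sketched in Python like this. The names matches and is_page are hypothetical, and I am assuming here that when several ISPAGE and ISNOTPAGE lines match the same file the later one wins; this README does not actually say how such conflicts are resolved.

```python
def matches(pattern, filename):
    """A leading asterisk is a wild card, so *.html matches any name ending in .html."""
    if pattern.startswith("*"):
        return filename.endswith(pattern[1:])
    return filename == pattern


def is_page(filename, rules=()):
    """rules is a list of (pattern, is_a_page) pairs in configuration-file order.

    Assumption for this sketch: later matching rules override earlier ones.
    """
    defaults = [("*.html", True), ("*.htm", True),
                ("*.shtml", True), ("*.shtm", True)]
    result = False
    for pattern, page in defaults + list(rules):
        if matches(pattern, filename):
            result = page
    return result


print(is_page("/~sret1/backgammon/main.html"))                   # True
print(is_page("/paper.ps.gz"))                                   # False
print(is_page("/paper.ps.gz", [("*.ps.gz", True)]))              # True
print(is_page("/old/index.html", [("/old/index.html", False)]))  # False
```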

Analysing parts of the logfile

The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command

analog logfile.log
will use that logfile for its report. You can also write
analog -
to use standard input as the logfile. This is useful in constructing pipes; for example, if you want to analyse an old compressed logfile, you could type
gzcat logfile.old.gz | analog -
(gzcat might be called zcat on some systems). You can also specify which logfile to use in the configuration file by means of a command like
LOGFILE logfile.log   # or ...
LOGFILE stdin         # for standard input

There are various commands which instruct the program only to analyse part of the logfile. These are all configuration commands only; they have no commandline or analhead.h equivalents.

First, you can instruct the program only to take into account certain files. This is done by means of the FILEONLY and FILEIGNORE commands. Asterisks can appear at the end of the filenames specified, as wildcards. For example, the configuration

FILEONLY /~sret1/*
FILEIGNORE /~sret1/backgammon/*
FILEIGNORE /~sret1/home.html
would instruct the program to examine only my files, but excluding my backgammon files and home page. (This should not be confused with excluding them from the request report, which still includes them in other reports; this excludes them altogether from the whole analysis).

There are similar commands HOSTONLY and HOSTIGNORE to analyse only the requests from certain sites. Here an asterisk can occur at the start of a hostname. For example,

would ignore accesses from my site (including itself).

Finally, there are commands to analyse only a subset of the dates in the logfile. The relevant commands are FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration

FROM 950701
TO   950731
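
For anyone preparing such dates in a script, here is a small Python sketch of how a yymmdd value can be read and compared. The names parse_yymmdd and in_range are my own, and I am assuming both limits are inclusive, as the July example suggests. Python's %y reads 95 as 1995.

```python
from datetime import date, datetime


def parse_yymmdd(text):
    """Turn a string such as '950701' into a date (here, 1 July 1995)."""
    return datetime.strptime(text, "%y%m%d").date()


def in_range(day, from_=None, to=None):
    """True if day lies between the FROM and TO dates, both inclusive.

    Either limit may be omitted, in which case that side is unbounded.
    """
    if from_ is not None and day < parse_yymmdd(from_):
        return False
    if to is not None and day > parse_yymmdd(to):
        return False
    return True


print(parse_yymmdd("950701"))                           # 1995-07-01
print(in_range(date(1995, 7, 31), "950701", "950731"))  # True
print(in_range(date(1995, 8, 1), "950701", "950731"))   # False
```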

Aliases etc.

There are commands to give aliases for filenames and hostnames. The configuration line
FILEALIAS file1 file2
says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately.

If * is placed at the end of the first entry, then all filenames starting with file1 will be changed to start with file2. So, for example, after the command

FILEALIAS /~sret1/statprog/* /~sret1/analog/
a filename looking like /~sret1/statprog/statprog/stat.c will be understood as /~sret1/analog/statprog/stat.c. (Note that the conversion is done only once for each filename; you don't get /~sret1/analog/analog/stat.c).
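
The single-conversion rule can be sketched in Python as below. The function name apply_file_alias is hypothetical, and I am assuming that the first alias in the configuration file that matches is the one applied; this README only says that one conversion is done.

```python
def apply_file_alias(filename, aliases):
    """aliases is a list of (file1, file2) pairs; illustrative only.

    A trailing asterisk on file1 makes it a prefix match. The function
    returns as soon as one alias has been applied, so the result is
    never converted a second time.
    """
    for old, new in aliases:
        if old.endswith("*"):
            prefix = old[:-1]
            if filename.startswith(prefix):
                return new + filename[len(prefix):]
        elif filename == old:
            return new
    return filename


aliases = [("/~sret1/statprog/*", "/~sret1/analog/")]
print(apply_file_alias("/~sret1/statprog/statprog/stat.c", aliases))
# /~sret1/analog/statprog/stat.c
```

Only the leading part that matched the wildcard entry is replaced; the second statprog in the path is left alone.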

A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like

WITHARGS /cgi-bin/prog.cgi
is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks can again occur at the end of the filename, for example in commands like
WITHARGS /cgi-bin/*
There is also a parallel command WITHOUTARGS; for example,
WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi
would read the arguments for all files in /cgi-bin/ except spam.cgi.
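
In Python terms, the effect of these two commands might be sketched as follows. The name canonical_name is my own, and I am assuming that the last matching WITHARGS or WITHOUTARGS line decides, which fits the example above but is not stated explicitly in this README.

```python
def canonical_name(url, rules=()):
    """rules is a list of (command, pattern) pairs in configuration order.

    By default everything after the question mark is dropped. A matching
    WITHARGS rule keeps the arguments; a later matching WITHOUTARGS rule
    drops them again (assumption: the last matching rule wins).
    """
    path, question_mark, _args = url.partition("?")
    keep = False
    for command, pattern in rules:
        if pattern.endswith("*"):
            hit = path.startswith(pattern[:-1])
        else:
            hit = path == pattern
        if hit:
            keep = command == "WITHARGS"
    return url if (keep and question_mark) else path


rules = [("WITHARGS", "/cgi-bin/*"), ("WITHOUTARGS", "/cgi-bin/spam.cgi")]
print(canonical_name("/cgi-bin/prog.cgi?a", rules))  # /cgi-bin/prog.cgi?a
print(canonical_name("/cgi-bin/spam.cgi?x", rules))  # /cgi-bin/spam.cgi
print(canonical_name("/cgi-bin/prog.cgi?a"))         # /cgi-bin/prog.cgi
```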

There is a command HOSTALIAS, similar to FILEALIAS, which is useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use

Again, only one conversion is done per host, which is why I need both the second and the third line. There is no wildcard conversion for this command.

One more related command is the command to tell analog to try and look up the names of hosts that appear only as numerical addresses in the logfile. Note, however, that not all hosts have names, or we may not be able to discover them. The commandline option to try and translate numerical addresses is +1 (or -1 to turn it off); the equivalent configuration command is

Looking up hostnames is a slow business. If this option is used, be prepared for analog to take a very long time to compile its report.


The final group of options is those which affect the layout of the output. First, there is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like
This says that the reports should occur in the order hourly summary (h), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i) and request report (r). It is important to include all the above nine letters exactly once each.

There is a commandline option +A to include all reports (particular ones can then be omitted with -d or whatever); likewise -A omits all reports (and particular ones can then be included). The equivalent configuration commands are

ALL ON  # or OFF

The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the commandline arguments -p (no logo: mnemonic, p for picture), +p (use the default logo) and +pURL (use the logo at the given URL). The equivalent configuration commands are

LOGO     ON    # or OFF
LOGOURL  url   # where it is
The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are
HOSTNAME  name  # must be in quotes if it contains spaces
HOSTURL   -     # for no link

A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commandline options to achieve this are +Hfilename and +Ffilename, or -H and -F to turn them off, and the configuration commands are

FOOTERFILE none      # if you don't want one

A command like

says which day should be regarded as the first day of the week. This is used in the daily report, daily summary and weekly report.

There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,

will give 123,456,789, whereas
will give 123 456 789.
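
The grouping itself is easy to picture with a line of Python. This sketch is only an illustration of the output format, and the function name group_digits is my own.

```python
def group_digits(number, sepchar):
    """Separate each group of three digits with sepchar; illustrative only."""
    # Python can group with commas directly; swap the comma for the
    # requested separator afterwards.
    return f"{number:,}".replace(",", sepchar)


print(group_digits(123456789, ","))  # 123,456,789
print(group_digits(123456789, " "))  # 123 456 789
print(group_digits(966, " "))        # 966
```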

The character which is used in the barcharts in some of the reports can be changed to, for example, a hash by -c# or

MARKCHAR '#'  # put in quotes so that it isn't a comment

Those graphical reports also need to know how many characters wide the output page is. Although a normal page is 80 characters wide, for Web pages about -w65 or

seems to be about right.

The domains file

The file, to translate internet country codes to locations, should have come with the program. It should be in the following format:
ad   Andorra
ae   United Arab Emirates
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.

Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen!

Frequently asked questions

Why don't I get such-and-such a report in the output even though I asked for it?
Maybe the floor for the report is set too high. For example, if you ask for a request report for all pages with at least 50 accesses and no page has that many, no report will be produced.
Why are no data on bytes included in my output?
You have some old-style logfile lines that do not include that information, so the analysis cannot be done.
I ran out of memory when trying to run analog. What can I do?
Try using approximate (instead of exact) hostname counting with the +ss option, or turning hostname counting off altogether with -s.
I have some old compressed logfiles and a current logfile. How can I analyse them both together?
The command gzcat or zcat has an option -f to uncompress compressed files and leave other files alone, and then stick them all together. So
gzcat -f log1.gz log2.gz log3 | analog -
is the required command.
My logfile is getting too big. Can you record some of the data in a convenient format so that analog can read it without having to process the whole logfile, and I can throw the old logfile away?
Not at the moment. There are various technical problems with how much information to save, and in what format, that I haven't yet resolved. A better solution might be to compress your old logfile (see the previous question). If even that doesn't save enough space, you could produce an analog page with the old data and then throw it away. If anyone has any good ideas about how to include such a feature in a later version of analog, I should be interested to hear them.
Is there a form interface for analog, so that people can generate the statistics for their own pages without having to log in to the computer running the Web server, or knowing how to construct the command?
Not at the moment, though this is the next thing I intend to do. Send me e-mail if you want me to tell you when I have written one.


Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

The figures for the last 7 days are for the 7 days before the program is run, not before the last entry in the logfile. Moreover, they are actually for any times after 7 days before the program is started; so `future' requests will be included if some more requests come in while the program is reading the logfile (or if the clock on the computer running the program is not synchronised with the one on the computer that recorded the logs).

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.

The reported `running time' is elapsed real time, not CPU time.

If you specify +oa-10 you really do get the top ten domains alphabetically. This is almost certainly useless!

Subdomains will be reported in the domain report if and only if their parent domain is, independent of their usage.

Known bugs

Everything is done in local time. This means that the last seven days can be 167 or 169 hours long in the week after timezone change times.

The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.

Do not alias a file to itself (e.g., FILEALIAS /home.html /home.html) or a host to itself, or it will get lost.


I always welcome mail on analog: whether it works on your system (yes, even if it does!), any bug reports or requests for new features. If you send me mail, I shall keep you informed about future releases.

This version of analog, 0.94beta, is intended to be the same as version 1.0. In other words, I do not intend to add any more features to analog before releasing version 1.0 in a couple of weeks' time. (I will, of course, fix bugs; so you just have to convince me that the lack of your favourite feature is really a bug...!). However, you are welcome to suggest features for any future releases that there may be.


Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats gets buggy and very slow when the logfile gets big, but you may notice that my output (although not my program) is based on his to some extent.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.

Stephen Turner
University of Cambridge Statistical Laboratory

Page last modified: 31-Aug-95