Readme for analog1.9beta3

Introduction

This Readme describes analog1.9beta3. For the latest version of analog, see the analog home page.

This program analyses logfiles from WWW servers. It should work on any Unix system. It is designed to be fast and to produce attractive statistics. For more details, see the analog home page.

For examples of the output see

This program is free, and may be freely distributed and modified provided full credit is given to Stephen Turner (sret1@cam.ac.uk), and that this condition is retained. (I should, however, be grateful if you would let me know what modifications you have made). No warranty of any sort is given or implied for this program or its use. This is a beta test version, and some bugs can be expected.

What's new?

This section describes the main changes in each version of analog. If you are using analog for the first time, you can skip this section.

1.9beta3

Mainly bug fixes and improved documentation.
Browser and referer reports now include failed requests.
The WARNINGS option can now be specified on the form.

1.9beta2

Small bug fixes

1.9beta

Lots of changes. The most important new features are

Six new reports (hourly report, browser report, browser summary, referer report, status code report and error report).
Analysis of NCSA/Apache referer log, agent log and combined log formats.
Graphical time reports that still work on text-based browsers.
Configurable columns in the time reports.
Time reports can run backwards.
Time graphs can be plotted by bytes instead of by requests.
Can cache old data so that old logfiles need not be kept.
Can process several logfiles.
Can combine logfiles from several different hosts.
Will uncompress compressed logfiles.
All configuration options can now be specified on the commandline.
Mandatory configuration file added.
Lots of new options in the form processing program.
Wildcards greatly improved throughout.
Alphabetical host report right-aligned.
Bytes now quoted as MBytes etc. instead of long number.
Produces HTML2.0 compliant output.
New sort method RANDOM (saves time for long reports).
Floors for reports now work properly.
Can now specify a report FROM 100 or more days ago.
Option to turn off warnings.
Considerable savings in code length over previous versions.

As far as possible the options are backwards compatible with previous versions, but some changes have been necessary.

Commandline options +1, +c, +f, +F, +G and +H removed or given new meanings.
Options to +d, +D, +h, +i, +m, +o, +r, +S and +W changed.
BACKGROUND and NUMLOOKUP removed.
FILEINCLUDE and FILEEXCLUDE must now be used in place of FILEONLY and FILEIGNORE; similarly for HOST options.
Syntax for wild ALIAS matching changed.
Use DOMMINREQS and DOMMINBYTES instead of DOMFLOOR; similarly for the other reports.
Use REQUESTS and BYTES instead of BYREQUESTS and BYBYTES in the SORTBYs.
The configuration files and commandline arguments are now read in a different order. This shouldn't cause any trouble for most people. The system configuration file is now always read unless explicitly excluded with -G.

1.2.5

Minor bug fix for weekly report.

1.2.4

Patch for Spyglass server logfile format.

1.2.3

A couple of bug fixes (wild subdomains sometimes caused crashes).
-v option now gives the version number.

1.2.2

Patch for proxy servers: http:// not translated to http:/

1.2

Can configure columns in reports to give percentage requests and number of bytes.
Wild subdomains (e.g., *.com).
Nameless subdomains.
Subdomains now listed in alphabetical order.
Proper support for numerical hostnames in HOSTIGNORE, HOSTONLY, SUBDOMAIN and alphabetical sorting.
New BASEURL command allowing statistics to be displayed on other servers.
Output always says how things are sorted.
"Last 7 days" now behaves sensibly with TO.
Filenames containing /../, /./ and // translated.
Header and footer options removed from form (for security reasons).

1.1

Form interface introduced.
ASCII output now possible as well as HTML.
Output file can now be specified in the configuration file.
FROM and TO commands more powerful.
DEBUG and BACKGROUND introduced.
One bug fix: alphabetical sorting doesn't now swap some hostnames.
List of primes included in distribution.

1.0

Only minor changes since 0.94beta.

0.94beta

New configuration variables SEPCHAR and REPORTORDER.
New configuration commands WITHARGS and WITHOUTARGS.
New commandline options +-A and +-x. (Config.: ALL and GENERAL).
Logfile entries with - as the return code are now regarded as successes, not corrupt entries.
Fixed bugs in host report when aliases or numerical hosts are present.
Documentation rewritten.

0.93beta

Approximate hostname counting now possible in fixed memory.
New configuration commands ISPAGE and ISNOTPAGE.
New commandline option -v.
New configuration command WEEKBEGINSON.
Proper error message when memory exceeded.
Program split into several files.

0.92beta

New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.

0.91beta

Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
Readme converted to HTML.

0.9beta

More speed improvements, and some bug fixes.
Introduced -u option.
Introduced subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3.

0.89beta

Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.

0.8beta

Initial program, just default options.

Compiling and running the program

If you want to get on with trying out the program straight away, you can leave most of this Readme until another time. The one thing you need to do is to look at the file analhead.h. It contains the user-settable compile-time options, most of which you can leave alone for the moment. You will probably want to check the first few options in the file, but even most of those can wait until later.

Next you must move the images that came with the analog program (in the directory images) into the IMAGEDIR specified in analhead.h.

When you have done that, compile the program by typing

make

(It may take a while as the program is rather big). If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again. NB: There is a known problem with BSD/OS reporting a "yacc stack overflow" with some versions of gcc. Switching to a newer gcc should cure it.

Then just type

analog

to run the program. To send the output to a particular file instead of to the screen, type, e.g.,

analog > outfile.html

(This assumes that . is in your $PATH, but it should be).

Customising analog

Pretty soon you will want to customise the output of analog to your personal preferences. How to do that is explained in this section. There are lots of options, so this section is rather long. (However, you can bypass this section to some extent if you set up a form interface to allow you to choose the main options from a Web page).

Many options can be set in the file analhead.h. These can be changed before compiling the program. They are explained in that file, so they will not be documented again here.

Otherwise, analog takes its options from configuration files. Many of the configuration commands also have abbreviations as commandline arguments. So, for example, the configuration command

DAILY OFF

tells analog not to include a daily summary in the output. But this can also be specified by the command

analog -d

because the -d option is an abbreviation for DAILY OFF. In fact any configuration command can be specified on the commandline by means of the +C option; you could write

analog +C"DAILY OFF"

(This is most useful for running analog from a script or cron job).

To specify a configuration file, you use the commandline argument +g followed by the name of the file. For example,

analog +gextra.conf

tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline arguments). (You can also specify standard input as the configuration file by the option +g-).

The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.

DAILY      OFF   # We don't want a daily summary
FULLDAILY  ON    # We want a full daily report instead

An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space.
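
The comment and quoting rules above can be illustrated with a short Python sketch. This is not analog's own parser (analog is not written in Python); the function name parse_config_line is hypothetical, and the sketch assumes that the standard shlex module's handling of quotes and # comments is close enough to the rules described here.

```python
import shlex

def parse_config_line(line):
    """Split one configuration line into (command, [arguments]).

    Text after an unquoted '#' is dropped as a comment; single or double
    quotes protect spaces and hashes inside an argument.
    Returns None for blank or comment-only lines.
    """
    tokens = shlex.split(line, comments=True)
    if not tokens:
        return None
    return tokens[0], tokens[1:]

# The two lines of the example configuration file above:
print(parse_config_line("DAILY      OFF   # We don't want a daily summary"))
print(parse_config_line("FULLDAILY  ON    # We want a full daily report instead"))
```

A quoted hash survives as an argument, so parse_config_line("MARKCHAR '#'") yields the argument # rather than an empty comment.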

Commandline arguments are read in the order in which they occur, and configuration files are read when the +g argument is reached. If commands conflict, later commands override earlier ones.

There are also two special configuration files which can be specified in analhead.h. The default configuration file is run before all other configuration files. You can put in there configuration commands that you normally want to include but which you can override. You can stop analog running the default configuration file by the commandline option -G.

The mandatory configuration file is run after all other configuration commands have been read, and overrides them all. If the mandatory configuration file cannot be found, the program exits immediately. This can be used by system administrators to stop users analysing certain files or producing certain reports, for example. (Note, however, that the only way to stop it completely is to deny users read access to the logfile. Otherwise there is nothing to stop them analysing it by another copy of analog or another program).
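
The order of precedence described above (default configuration first, then the user's commands in the order given, then the mandatory configuration last) can be sketched in Python as follows. The function effective_settings and its arguments are hypothetical names used only for illustration, and the sketch deliberately treats every command as a simple "last value wins" setting, which ignores cumulative commands such as FILEINCLUDE.

```python
def effective_settings(default_cfg, user_cmds, mandatory_cfg, skip_default=False):
    """Combine (command, value) pairs so that later commands override earlier ones.

    default_cfg   -- commands from the default configuration file
    user_cmds     -- commandline arguments and +g files, in the order read
    mandatory_cfg -- commands from the mandatory configuration file
    skip_default  -- True models the -G commandline option
    """
    layers = [] if skip_default else [default_cfg]
    layers += [user_cmds, mandatory_cfg]
    settings = {}
    for layer in layers:
        for command, value in layer:
            settings[command] = value  # a later command replaces an earlier one
    return settings

settings = effective_settings(
    default_cfg=[("DAILY", "ON"), ("WEEKLY", "ON")],
    user_cmds=[("DAILY", "OFF"), ("FULLDAILY", "ON")],
    mandatory_cfg=[("FULLDAILY", "OFF")],
)
print(settings)
```

Here the user's DAILY OFF beats the default, but the mandatory file's FULLDAILY OFF beats the user.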

If this is all a bit confusing, just run

analog -v [other options]

That will tell you what the values of all the variables will be, based on analhead.h, the configuration options and the commandline options.

We shall now look at all the configuration commands and their commandline equivalents under the following headings. There is a summary list of all of them in the reference section.

General Summary

Program started at Mon-26-Jun-1995 17:09 local time.
Analysed requests from Thu-28-Jul-1994 20:31 to Mon-26-Jun-1995 17:09 (332.8 days).
Total completed requests: 368 063 (12 872)
Total failed requests: 4 089 (139)
Total redirected requests: 35 277 (1 838)
Average requests per day: 1 219 (2 121)
Number of distinct files requested: 966 (336)
Number of distinct hosts served: 28 589 (1 589)
Number of new hosts served in last 7 days: 1 037
Corrupt logfile entries: 869
Total data transferred: 1 766 Mbytes (83 743 kbytes)
Average data transferred per day: 5 415 kbytes (11 963 kbytes)
(Figures in parentheses refer to the last 7 days).

The general summary can be turned off by the command

GENERAL OFF

(or the commandline argument -x) or on by GENERAL ON (or +x). If the general summary is off, all the `Go To' links in the output are also omitted.

The figures in parentheses refer to the last 7 days. They can be turned on and off with

LASTSEVEN ON    # or OFF

or with the commandline arguments +7 and -7. Note that the last 7 days refers to the last 7 days before the program is run, not before the last entry in the logfile. (If a TO command is specified, however, the last 7 days will be until that date).

Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command

COUNTHOSTS OFF

Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or

COUNTHOSTS APPROX

and you can specify the amount of memory to be used by

APPROXHOSTSIZE 100000  # or whatever number, in bytes

About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used.

Time reports

Each unit () represents 4 000 requests, or part thereof.


   month:  #reqs: 
--------  ------  
Nov 1995: 119865: 
Dec 1995: 121214: 
Jan 1996: 144960:

The above display is of a monthly report. In this category, we also have the weekly report (one line for each week), daily summary (one line for Sundays, one for Mondays etc.), daily report (one line for each day ever), hourly summary (one line for midnight, one for 1am etc.) and hourly report (one line for each hour ever).

The following configuration commands show how to turn these reports on and off.

MONTHLY ON
WEEKLY  ON
DAILY   ON
FULLDAILY OFF
HOURLY ON
FULLHOURLY OFF

You can also use the corresponding commandline arguments +m, +W, +d, -D, +h, -H (use + to turn the corresponding reports on, - to turn them off).

You should use these reports sensibly. If your output is 200k long, people won't be able to download it. In particular, you probably don't want a daily report very often, and you certainly don't want an hourly report unless you have restricted the analysis to just a couple of days.

The graphs above are designed to produce coloured bars on graphical browsers and ASCII graphs on non-graphical browsers. They don't use tables or image-stretching properties, so should work on any browser. However, you can produce plain ASCII graphs instead by the command

GRAPHICAL OFF    # or ON to turn it back on again

This has the advantage of producing smaller output which does not require any images to be downloaded.

The graphs rely on having the images distributed with analog available in the directory IMAGEDIR specified in analhead.h; or you can override that choice with a command like

IMAGEDIR /Images/

You can change the character used in the graphs on non-graphical terminals by means of a command such as

MARKCHAR '#'  # put in quotes so that it isn't a comment

The graphs can be plotted by bytes transferred instead of by requests. This can be done by means of commands like

MONTHGRAPH B    # by bytes
WEEKGRAPH  R    # by requests

There are also commands DAYGRAPH, FULLDAYGRAPH, HOURGRAPH and FULLHOURGRAPH. Alternatively, you can add the letter after the relevant commandline argument; for example, +hB to turn on the hourly summary with a graph plotted by bytes.

You can display the graphs backwards (with most recent requests at the top) by means of commands like

MONTHLYBACK ON  # or OFF

There are also the commands WEEKLYBACK, FULLDAILYBACK and FULLHOURLYBACK. The hourly summary and daily summary cannot be displayed backwards. I find it confusing to have some of the reports going backwards and some forwards, so you can also use

ALLBACK ON  # or OFF

to change all four of the reports to backwards or forwards together.

You can specify which columns appear in the various reports in which order. The above example showed the number of requests being given. You can also have the percentage of the requests, the number of bytes, and the percentage of the bytes. For example, the command

MONTHCOLS RBbr

tells analog to include in the monthly report columns for number of requests (R), number of bytes (B), percentage of bytes (b), and percentage of requests (r) in that order. The other commands are WEEKCOLS, DAYCOLS, FULLDAYCOLS, HOURCOLS and FULLHOURCOLS.

For some reports, analog needs to know where weeks begin and end. You can specify

WEEKBEGINSON WEDNESDAY

to change it to Wednesday, for example. (I guess Sunday or Monday is more likely).

In the graphs, analog will choose the value of the unit automatically based on the length of the largest bar and the width of the page. You can specify the page width with, for example,

PAGEWIDTH 70

or the commandline option +w70. (I find about 65 works well). Occasionally you may want to specify the value of the unit yourself (for example, to make it the same as on some other page). You can do this by a command like

MONTHLYUNIT 1000

Setting it to 0 makes analog choose it automatically again. Of course, the other reports have WEEKLYUNIT, DAILYUNIT, FULLDAILYUNIT, HOURLYUNIT and FULLHOURLYUNIT.
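
To make the meaning of the unit concrete, here is a small Python sketch of how a bar could be drawn in the plain ASCII form of a time report, given a unit and a MARKCHAR. It only illustrates the "one mark per unit, or part thereof" rule stated in the sample output; the function name bar is hypothetical, and how analog picks the unit automatically is not modelled.

```python
import math

def bar(value, unit, markchar):
    """Return one mark for each unit, or part thereof, in value."""
    return markchar * math.ceil(value / unit)

# Figures from the monthly report shown earlier, with a unit of 4 000 requests.
months = [("Nov 1995", 119865), ("Dec 1995", 121214), ("Jan 1996", 144960)]
for label, reqs in months:
    print(f"{label}: {reqs:6d}: {bar(reqs, 4000, '#')}")
```

With these figures the three bars are 30, 31 and 37 marks long, so a PAGEWIDTH of 70 leaves room for the label and count columns.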

Other reports

This section discusses the following reports.

Domain report

  #reqs :  %bytes : domain
--------  --------  ------
 103125 :  46.58% : .uk (United Kingdom)
( 64982):( 35.45%):     cam.ac.uk (University of Cambridge)
( 47138):( 20.55%):       statslab.cam.ac.uk
  49290 :  12.49% : .edu (USA Educational)

Host report

#reqs: %bytes: host
-----  ------  ----
   10:  0.03%:          zlsm03.arcs.ac.at
   11:  0.04%:           iki10.boku.ac.at
  158:  0.15%:       talus.maths.su.oz.au

Directory report

 #reqs: %bytes: directory
------  ------  ---------
237985: 35.40%: /~sret1/
 18596: 17.60%: /~rrw1/
  3574: 11.89%: /~richard/

Request report

#reqs: %bytes: filename
-----  ------  --------
33980: 23.66%: /~sret1/backgammon/main.html
21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
12690:  0.86%: /

Referer report

#reqs: refering URL
-----  ------------
  260: http://webcrawler.com/cgi-bin/WebQuery
  239: http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/
  185: http://guide-p.infoseek.com/WW/NS/Titles?qt=backgammon&col=WW
  149: http://www.yahoo.com/Recreation/Games/Board_Games/Backgammon/

Browser summary

#reqs: browser
-----  -------
16797: Netscape
 1532: Mosaic
  693: IWENG
  492: Lynx

Browser report

#reqs: browser
-----  -------
 3105: Mozilla/1.22 (Windows; I; 16bit)
 2785: Mozilla/1.1N (Windows; I; 16bit)
  458: IWENG/1.2.003

These reports can be turned on and off with commands like

DOMAIN    ON
FULLHOSTS OFF
DIRECTORY ON
REQUEST   ON
REFERER   OFF
BROWSER   ON
FULLBROWSER  OFF

or with the commandline arguments +o (domain report), -S (host report), +i (directory report), +r or +R (request report; see below), -f (referer report), +b (browser summary) and -B (browser report). (As in the date reports, use + to turn the corresponding reports on, - to turn them off).

Another similarity with the date reports is that you can tell analog which columns to print on each report with the commands DOMCOLS, HOSTCOLS, DIRCOLS, REQCOLS, REFCOLS, BROWCOLS and FULLBROWCOLS. Again, each command is followed by letters indicating which columns are wanted and in which order. For example,

DOMCOLS RrBb  # no. of reqs, %age reqs, no. of bytes, %age bytes

Each of these reports can be sorted in four different ways; by bytes, by requests, alphabetically or randomly (i.e., unsorted). (The only advantage of the last one is so as not to spend time sorting very long reports). The commands to change this look like

DOMSORTBY BYTES  # or REQUESTS or ALPHABETICAL or RANDOM

The commands for the other reports are HOSTSORTBY, DIRSORTBY, REQSORTBY, REFSORTBY, BROWSORTBY and FULLBROWSORTBY. You can also add a letter b, r, a or x after the relevant commandline option; for example, +Sa for a host report sorted alphabetically.

It is important to be able to specify how many entries you want printed in each report. This is done by means of two variables for each report, one specifying the minimum number of bytes if the sorting is by bytes, and the other specifying the minimum number of requests if the sorting is by any of the other three methods. The following configuration commands illustrate the possible usages.

DOMMINREQS 20      # all items with at least 20 requests
HOSTMINREQS -20    # the first 20 items
                   # NB: useless if alphabetical or random sort
REQMINREQS 0.01%   # all items with at least 0.01% of the requests
DIRMINBYTES 100000 # all items with at least 100000 bytes
REFMINBYTES 100k   # all items with at least 100 kbytes
                   # (10M etc. also work)
BROWMINBYTES -40   # Top 40 if sorting is by bytes
FULLBROWMINBYTES 0.005%   # all with at least 0.005% of the traffic

You can also specify the amount on the commandline by adding it after the sort method. For example, +Sr-50 turns on a host report, sorted by requests, with only the top 50 items included, and +ib20k gives a directory report, sorted by bytes, including all directories with at least 20 kilobytes transferred.
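
The different forms a floor can take (a plain minimum, a negative number meaning "the first n items", a percentage, or a size with a k or M suffix) can be sketched in Python as below. The names parse_floor and apply_floor are hypothetical, and so is the assumption that k means 1024 bytes; pass kilo=1000 if that matches your reading better. This is only an illustration of the rules above, not analog's code.

```python
def parse_floor(text, kilo=1024):
    """Classify a floor such as '20', '-20', '0.01%', '100k' or '10M'.

    Returns ('min', n), ('top', n) or ('percent', p).
    The value of kilo is an assumption; the Readme only says 'kbytes'.
    """
    text = text.strip()
    if text.endswith("%"):
        return ("percent", float(text[:-1]))
    if text.startswith("-"):
        return ("top", int(text[1:]))
    multipliers = {"k": kilo, "M": kilo * kilo}
    if text[-1] in multipliers:
        return ("min", int(text[:-1]) * multipliers[text[-1]])
    return ("min", int(text))

def apply_floor(items, floor, total):
    """Filter (name, amount) pairs that are already sorted, largest first."""
    kind, n = floor
    if kind == "top":
        return items[:n]
    threshold = n if kind == "min" else total * n / 100.0
    return [(name, amount) for name, amount in items if amount >= threshold]

# Request counts taken from the domain report example.
domains = [(".uk", 103125), (".edu", 49290)]
print(apply_floor(domains, parse_floor("-1"), total=152415))
```

The "top n" form depends on the items being sorted by size, which is why it is useless with an alphabetical or random sort.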

We now describe features unique to a particular one of the reports. First the domain report.

Subdomains can be specified for each domain. The syntax of the command is

SUBDOMAIN subdomain subdomain_name

If the subdomain name has spaces in, it must be enclosed in quotes. The subdomain name can be omitted, indicating a nameless subdomain. For example, to produce the example above, I would include the following lines in the configuration file

SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk

Numerical subdomains (which have most significant part on the left) can also occur. They will look like

131   The Ever-Popular 131 domain
131.111   # Nameless

Also subdomains with wildcards in can occur. The following are examples:

SUBDOMAIN *.edu       # mit.edu, umn.edu  etc.
SUBDOMAIN 131.111.*   # 131.111.1, 131.111.2 etc.
SUBDOMAIN %           # all top-level numerical domains, from 1 to 255

The variables SUBDOMMINREQS and SUBDOMMINBYTES can be specified in the same way as above, except they can't be negative. If you ask for wild subdomains, you will probably want to set the minimum requests and minimum bytes quite high. However, you cannot alter the sort order; within a domain, subdomains will always be output in alphabetical order.

There is a command NOTSUBDOMAIN to erase a previously requested subdomain. For example, you can write

NOTSUBDOMAIN *.edu
NOTSUBDOMAIN cam.ac.uk

However, if you request, for example, *.edu, then NOTSUBDOMAIN mit.edu will not override it.

The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the command

DOMAINSFILE domainsfile

The correct format of the domains file is explained in a separate section.

There is little to say about the host report, except to note that alphabetical sorting is by domain as most significant part. This report can be very long and slow to sort, and should be used with a high floor if at all.
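
Sorting "by domain as most significant part" can be illustrated with a Python sort key that compares hostnames from right to left. The function host_sort_key is a hypothetical name; the treatment of numerical addresses (compared left to right as numbers, and placed before named hosts) is an assumption made for the sketch, not something this Readme specifies.

```python
def host_sort_key(host):
    """Sort key: named hosts by their parts right to left, numerical hosts left to right."""
    parts = host.split(".")
    if all(part.isdigit() for part in parts):
        # Numerical addresses have the most significant part on the left.
        return (0, [int(part) for part in parts])
    # Named hosts have the most significant part (the domain) on the right.
    return (1, parts[::-1])

hosts = ["talus.maths.su.oz.au", "iki10.boku.ac.at", "zlsm03.arcs.ac.at"]
for host in sorted(hosts, key=host_sort_key):
    print(host)
```

This prints the three hosts in the same order as the host report example: the two .ac.at hosts (arcs before boku), then the .au host.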

The directory report has one further variable, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like

 #reqs: %bytes: directory
------  ------  ---------
 43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
 11298:  4.14%: /~sret1/

This can be specified by the commandline option +l3 or the configuration command

DIRLEVEL 3

Note that the figures for each directory do not include those for the subdirectories of that directory, except where the directory is at the deepest level. So in the above example, /~sret1/backgammon/bitmaps/dice/d1.xbm would be reckoned in the directory /~sret1/backgammon/bitmaps/ (which is at the deepest level) but not in the other two directories.
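
The rule for deciding which directory a file is reckoned in can be sketched in Python: take the file's own directory, but cut it off at DIRLEVEL components. The function name report_directory is hypothetical, and reporting files in the server root under "/" is an assumption of the sketch.

```python
def report_directory(filename, level):
    """Return the directory, at most `level` components deep, that a file counts towards."""
    components = filename.split("/")[1:-1]  # directory parts only, without the filename
    kept = components[:level]
    return "/" + "".join(part + "/" for part in kept)

for name in ["/~sret1/backgammon/bitmaps/dice/d1.xbm",
             "/~sret1/backgammon/main.html",
             "/~sret1/index.html"]:
    print(name, "->", report_directory(name, 3))
```

With level 3, the file in bitmaps/dice/ is counted under /~sret1/backgammon/bitmaps/, while the other two files are counted only under their own directories and not under any parent.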

We mentioned above that the request report has two commandline arguments, +r and +R. The difference is that if the commandline option +r is used, only pages will be displayed in the report. If you want to list all files, including, for example, graphics, then you should use +R instead. Alternatively the configuration command

REQTYPE PAGES  # or ALL

will control whether pages or all files are listed.

There are three possible modes of linking in the request report; you can link to none of the files, or pages only, or all files. The commandline options for these are -k, +k and +kk respectively; or you can use the configuration command

PAGELINKS OFF   # or ON, or ALL

There is also a related command BASEURL to specify a URL to prepend to the links. For example, if

BASEURL http://www.statslab.cam.ac.uk

were specified, then /~sret1/analog/ would be linked to http://www.statslab.cam.ac.uk/~sret1/analog/. This is useful if you want to display the statistics on a different server than the one they belong to. (See below for combining logfiles from two different servers).

You can also specify in the configuration file what should be counted as a `page' in the requests report. At the beginning, the following are `pages': *.html, *.htm, *.shtml, *.shtm, *.html3, *.ht3 and directories (*/). The command

ISPAGE filename

will specify that some other file is a `page'. You can give a list of filenames, separated by commas (without spaces). For example,

ISPAGE *.ps,*.ps.gz

would mean that Postscript files and gzipped Postscript files are to be regarded as pages. You can also use

ISNOTPAGE filename

to specify that something which would otherwise be a page is not to be regarded as a page.
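
As an illustration of the page rules, the following Python sketch tests a filename against the initial list of page patterns plus any ISPAGE and ISNOTPAGE additions. It is only an approximation: the name is_page is hypothetical, Python's fnmatch is used as a stand-in for analog's wildcard matching (fnmatch also gives special meaning to square brackets), and letting ISNOTPAGE always win over ISPAGE is an assumption.

```python
from fnmatch import fnmatchcase

# The initial page patterns listed above.
DEFAULT_PAGES = ["*.html", "*.htm", "*.shtml", "*.shtm", "*.html3", "*.ht3", "*/"]

def is_page(filename, extra_pages=(), not_pages=()):
    """Decide whether a requested filename counts as a page."""
    if any(fnmatchcase(filename, pattern) for pattern in not_pages):
        return False
    patterns = DEFAULT_PAGES + list(extra_pages)
    return any(fnmatchcase(filename, pattern) for pattern in patterns)

# ISPAGE *.ps,*.ps.gz  (a comma-separated list, without spaces)
extra = "*.ps,*.ps.gz".split(",")
print(is_page("/~sret1/backgammon/main.html"))
print(is_page("/~sret1/backgammon/bitmaps/board.xbm"))
print(is_page("/~sret1/paper.ps.gz", extra_pages=extra))
```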

The referer report, browser summary and browser report have no special commands, although the relevant logfiles must be present on the system (see below for how to specify where they are). Note that if you are using separate logfiles, rather than the NCSA combined log, you cannot sort these reports by bytes, or include bytes columns in the reports.

It is important to note that the referer report and browser reports are notoriously inaccurate. For the referer report, many browsers do not pass this information to the server, and many pass it wrongly (sending the URL of the previous page even when your page was not reached by selecting a link from that page). For the browser reports, some browsers even lie deliberately about what sort of browser they are, or let users configure the browser name. Furthermore, there is no fixed format for browser information. (NB: I have combined all Mosaics as a special case). In addition, graphical browsers automatically generate more requests than non-graphical browsers by loading the graphics, so it is not a very good guide to browser usage. For all these reasons many people would argue that the browser reports are so unhelpful as to be worse than useless. At best, interpret them with extreme caution.

Error and status code reports

The error report lists all the errors found in your error log:

#occs: error type
-----  ----------
19360: Send timed out
11286: Send aborted
 7962: File does not exist

The status code report lists how many of each type of status code occurred in your logfile:

#occs: no. description
-----  ---------------
35564: 200 OK
  173: 301 Document moved
    3: 302 Document found elsewhere
 5732: 304 Not modified since last retrieval

They are turned on and off by commands like

STATUS ON
ERROR  OFF

or by the commandline arguments +c and +e. There is a command ERRMINOCCS which says how many occurrences of an error there must be before it appears on the error report. For example

ERRMINOCCS 20

What to analyse

The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command

analog logfile.log

will use that logfile for its report. Analog will read the common log format (which most servers write) as well as the old NCSA format and the NCSA combined log format (which includes referer and agent information). Detection of which format each line of the logfile is in is automatic. You can also write

analog -

to use standard input as the logfile. (This is useful in constructing pipes). You can also specify which logfile to use in the configuration file by means of a command like

LOGFILE logfile.log   # or stdin for standard input

You can specify several logfiles on one configuration line by separating their names with commas (no spaces). For example

LOGFILE log1,log2,log3

Sometimes it is necessary to combine logfiles from two different servers, without getting filenames that happen to be the same on both servers confused. To do this you can use a second argument to the LOGFILE command, specifying a prefix for each filename. For example

LOGFILE log1,log2  http://www.a.com   # These logfiles from a.com
LOGFILE log3       http://www.b.com   # This one from b.com

If you use this, the directory report will need specifying to a deeper level.

Logfiles specified in the user's configuration files and commandline options replace any specified in the default configuration file, and are in turn overridden by any in the mandatory configuration file. In addition you can use none as the name of the logfile to overwrite the specification of all previous logfiles.

Analog can also read the NCSA/Apache referer log, agent log and error log formats. Logfiles of these types can be specified by commands like

REFLOG   referer_log
BROWLOG  agent_log.old,agent_log
ERRLOG   error_log

The same comments about which logfiles replace which apply as in the last paragraph.

Analog can uncompress compressed logfiles. You need to tell it how to uncompress each type of file by supplying a command that sends the uncompressed file to standard output (rather than uncompressing it into a file). The file type can be given as a list of patterns, separated by commas. For example, depending on what commands are on your system, you can use

UNCOMPRESS *.gz      "gunzip -c"  # or
UNCOMPRESS *.gz,*.Z  gzcat

This would be a suitable command to include in the default configuration file.

There are various commands which instruct the program to analyse only part of the logfile. First, you can instruct the program only to take into account certain files. This is done by means of the FILEINCLUDE and FILEEXCLUDE commands. Each command can have a list of filenames, separated by commas (no spaces). One asterisk and any number of question marks can appear in each of the filenames specified, as wildcards. Each file is included and excluded as each new command is reached. Unspecified files are included if the first command found was an exclusion, and excluded if the first command found was an inclusion. For example, the configuration

FILEINCLUDE /~sret1/*
FILEEXCLUDE /~sret1/backgammon/*,/~sret1/analog/*
FILEINCLUDE /~sret1/backgammon/*.gif

would instruct the program to examine only my files, excluding my backgammon and analog files, but including gifs in my backgammon directory. On the other hand,

FILEEXCLUDE /~sret1/*

would analyse all files except mine. Remember you can always run analog -v to see what the options you have specified represent.
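
The way successive FILEINCLUDE and FILEEXCLUDE commands interact can be sketched in Python: the starting state depends on the first command, and each later command that matches a file changes that file's state. The name is_included and the (kind, patterns) representation are hypothetical, and Python's fnmatch is only a stand-in for analog's wildcards (here * also matches across slashes).

```python
from fnmatch import fnmatchcase

def is_included(filename, rules):
    """Apply ('INCLUDE' | 'EXCLUDE', 'pattern,pattern,...') rules in order."""
    if not rules:
        return True
    # Unspecified files are included if the first command is an exclusion,
    # and excluded if the first command is an inclusion.
    included = rules[0][0] == "EXCLUDE"
    for kind, patterns in rules:
        if any(fnmatchcase(filename, p) for p in patterns.split(",")):
            included = (kind == "INCLUDE")  # the latest matching command decides
    return included

rules = [
    ("INCLUDE", "/~sret1/*"),
    ("EXCLUDE", "/~sret1/backgammon/*,/~sret1/analog/*"),
    ("INCLUDE", "/~sret1/backgammon/*.gif"),
]
for name in ["/~sret1/cv.html", "/~sret1/backgammon/main.html",
             "/~sret1/backgammon/board.gif", "/~rrw1/index.html"]:
    print(name, is_included(name, rules))
```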

You can exclude all gifs with FILEEXCLUDE *.gif but this may not be what you want to do. This will then exclude them from all the reports, and not count the bytes transferred due to them. More likely, you just want to exclude them from the request report while still including them in the other reports, which you can do by means of REQTYPE PAGES.

There are similar commands HOSTINCLUDE and HOSTEXCLUDE to analyse only the requests from certain sites. For example,

HOSTEXCLUDE emu.pmms.cam.ac.uk
HOSTEXCLUDE *.statslab.cam.ac.uk

would ignore accesses from emu and from the whole of the statslab.

There are also commands REFINCLUDE and REFEXCLUDE for referers. You probably want to ignore referers from your own site. For example, I use

REFEXCLUDE http://www.statslab.cam.ac.uk/*

This would be a suitable command to put in your default configuration file.

Finally, there are commands to analyse only a subset of the dates in the logfile. The simplest usage is FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration

FROM 950701
TO   950731

Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like

FROM -01-00+01   # from tomorrow last year
TO -00-0131  # to the end of last month (OK even if last month
             # didn't have 31 days)
FROM -00-00-112
TO   -00-00-01  #statistics for the last 16 weeks

There are commandline abbreviations +F and +T for these commands; for example +T-00-00-01 looks at statistics until the end of yesterday. -F and -T turn off the from and to, as do FROM OFF and TO OFF.

If a TO command is given, the figures for the last 7 days refer to the time until then.
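
The relative forms of FROM and TO can be easier to follow with a worked sketch. The Python function below is one possible reading of the rules above, not analog's actual implementation. The name resolve_date, the choice of century for two-digit years, and the clamping of a day such as 31 to the end of a shorter month are all assumptions made for illustration.

```python
import calendar
import re
from datetime import date, timedelta

# Year: optional "-", two digits. Month: optional sign, two digits.
# Day: optional sign, then digits (relative days may need three, as in -112).
_SPEC = re.compile(r"(-?)(\d\d)([+-]?)(\d\d)([+-]?)(\d+)$")


def resolve_date(spec, today):
    """Turn a FROM/TO argument such as 950701 or -00-0131 into a date."""
    m = _SPEC.match(spec)
    if m is None:
        raise ValueError("unrecognised date: " + spec)
    ysign, yy, msign, mm, dsign, dd = m.groups()

    if ysign:
        year = today.year - int(yy)
    else:
        # Assumption: two-digit years from 70 upwards mean 19yy.
        year = (1900 if int(yy) >= 70 else 2000) + int(yy)

    if msign:
        offset = int(mm) if msign == "+" else -int(mm)
        # Count months from year 0 so that offsets wrap across year ends.
        year, month0 = divmod(year * 12 + today.month - 1 + offset, 12)
        month = month0 + 1
    else:
        month = int(mm)

    last_day = calendar.monthrange(year, month)[1]
    if dsign:
        offset = int(dd) if dsign == "+" else -int(dd)
        start = date(year, month, min(today.day, last_day))
        return start + timedelta(days=offset)
    # Assumption: an absolute day past the end of the month is clamped.
    return date(year, month, min(int(dd), last_day))
```

For instance, taking today to be 1 March 1996, resolve_date("-00-0131", today) gives 29 February 1996 under these assumptions, since February 1996 has only 29 days.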

Aliases etc.

There are commands to give aliases for filenames, hostnames and referers. The configuration line

FILEALIAS file1 file2

says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately. It also understands that .. means `parent directory,' . means `this directory' and // is the same as /, and translates those filenames to their canonical forms.

Wildcards can occur in the aliases. For example, after

FILEALIAS   /~sret1/*.gif   /images/*g.gif
FILEALIAS   /~sret2/a?c*    /sa/*

/~sret1/a.gif would be translated to /images/ag.gif and /~sret2/abcd.txt would become /sa/d.txt.
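
The wildcard substitution in these two examples can be sketched as follows. This is an illustration in Python, not analog's own code; the function name apply_alias is hypothetical, and the sketch assumes (as in the examples) that the text matched by the asterisk in the first name is copied into the asterisk of the second.

```python
import re


def apply_alias(name, pattern, target):
    """Return name rewritten by one alias, or unchanged if it does not match."""
    # Build a regular expression: * captures any run of characters,
    # ? matches exactly one character, everything else is literal.
    regex = "".join(
        "(.*)" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    match = re.fullmatch(regex, name)
    if match is None:
        return name
    if "*" in pattern:
        # Copy the text matched by * into the * of the replacement.
        return target.replace("*", match.group(1), 1)
    return target
```

Calling apply_alias("/~sret2/abcd.txt", "/~sret2/a?c*", "/sa/*") returns "/sa/d.txt", matching the translation described above.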

There are also the commands HOSTALIAS and REFALIAS (for referers) which work in the same way. HOSTALIAS is particularly useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use

HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk

REFALIAS could be used to combine several referers from one site. For example

REFALIAS http://www.webcrawler.com/* http://webcrawler.com/
REFALIAS http://webcrawler.com/* http://webcrawler.com/

A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like

WITHARGS /cgi-bin/prog.cgi

is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks and lists of files can again occur, and there is also a parallel command WITHOUTARGS; for example,

WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi

would read the arguments for all files in /cgi-bin/ except spam.cgi.

Commands REFWITHARGS and REFWITHOUTARGS work in the same way for referers, except that in this case the default is to include all the arguments (so that you can see what people are requesting from search engines).

The ability to look up numerical IP addresses and translate them to hostnames has been removed in this version of analog because it didn't work well and caused problems on some systems. I recommend instead pre-processing the logfile with the program logresolve.c (which is distributed with the Apache server).

Cache files

Analog has the ability to archive some of the data in your logfile in a cache file so that the logfile can be thrown away without losing the most important data. Important: The information that is recorded is only that which identifies how many successful requests there were at each time. No information on which files the requests were for, or where the requests were from, is kept. Neither is information on failed requests. So from the cache file you can reconstruct the time reports but not any other reports.

To produce a cache file instead of the normal output, use the command

OUTPUT CACHE

To read data from a cache file, use, e.g.,

CACHEFILE cache.out

(This will still read the ordinary logfile as well). You can also use the commandline argument +Ucache.out. You can specify several cache files by putting them in a comma-separated list, or using several +U commands.

To use this feature without losing entries or double counting them, I suggest the following procedure.

Stop the server.
Move the old logfile to a new location.
Restart the server with a fresh logfile.
Make a cache file from the old logfile.
Make an ordinary report from the old logfile too.
Make a report from the cache file and no other logfile to check it works.

Although it should now be safe to throw away the old logfile, I can take no responsibility if something goes wrong. This is beta test software and is expected to contain bugs. Also if you are going to use this feature please make sure you understand what information is and is not recorded in the cache file. You may find that the cache file is not the right feature for you. Compressing logfiles (with gzip -9) is very efficient owing to the large number of repeated strings. That in itself may solve your filespace problems.

Miscellaneous options

The final group of options is those which affect the layout of the output and other miscellaneous options. First, you can choose whether you want ASCII (plain text) or HTML output, using the commandline option +a or -a, or the configuration command

OUTPUT ASCII  # or HTML

If you choose ASCII output, some of the other options are ignored, but it should be obvious which ones they will be.

You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of

analog > outfile.html

you can use the configuration command

OUTFILE outfile.html

or the commandline option +Ooutfile.html.

There is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like

REPORTORDER hHDdWmoSirfbBec

This says that the reports should occur in the order hourly summary (h), hourly report (H), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i), request report (r), referer report (f), browser summary (b), browser report (B), error report (e) and status code report (c). It is important to include all the above fifteen letters exactly once each.
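
Since a missing or repeated letter is easy to overlook, a small check along the following lines may be useful before running analog. It is only a sketch in Python; the function name check_report_order is hypothetical, and the set of letters is simply the fifteen listed above.

```python
REPORT_LETTERS = "hHDdWmoSirfbBec"  # the fifteen letters listed above


def check_report_order(order):
    """Return a list of problems with a REPORTORDER string (empty if valid)."""
    problems = []
    for letter in REPORT_LETTERS:
        count = order.count(letter)
        if count == 0:
            problems.append("missing " + letter)
        elif count > 1:
            problems.append("repeated " + letter)
    # Anything that is not one of the fifteen letters is reported too.
    for letter in sorted(set(order) - set(REPORT_LETTERS)):
        problems.append("unknown " + letter)
    return problems
```

An empty list means that every letter occurs exactly once, in whatever order you chose.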

There is a command

ALL ON

to include all reports except the hourly report and the general report (particular ones can then be omitted with -d or similar); likewise ALL OFF omits them (and particular ones can then be included). The equivalent commandline arguments are +A and -A. The hourly report and general report must be turned on separately, with +H and +x. Note also that order is important; for example, +i -A +r will include the request report but not the directory report.

The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the command

LOGOURL  url   # or none

or by the commandline arguments -p (no logo: mnemonic, p for picture) and +pURL. The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are

HOSTNAME  name  # must be in quotes if it contains spaces
HOSTURL   URL
HOSTURL   -     # for no link

A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML or ASCII according to whether your output is HTML or ASCII, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commands to achieve this are

HEADERFILE filename
FOOTERFILE none      # if you don't want one

There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,

SEPCHAR ,

will give 123,456,789, whereas

SEPCHAR ' '

will give 123 456 789.
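
The effect of SEPCHAR is just the usual grouping of digits in threes from the right. As a sketch of the idea in Python (the function name group_digits is hypothetical, and this is not how analog itself does it):

```python
def group_digits(number, sepchar):
    """Format a non-negative integer with sepchar between groups of three digits."""
    # Python's "," format inserts commas every three digits; swap in sepchar.
    return format(number, ",").replace(",", sepchar)
```

Here group_digits(123456789, " ") corresponds to the SEPCHAR ' ' example above.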

You can specify whether analog prints long numbers of bytes as exact numbers (e.g., 5,053,234) or as kilobytes, megabytes etc. (e.g., 4934k) by the command

RAWBYTES ON  # for exact, OFF for abbreviated

There is a debugging command, for printing (to stderr) problems with your logfile. There are currently three levels of debugging: 0 for no debugging, 1 for printing corrupt logfile lines (prepended by "C:"), and 2 which also prints hosts for which the domain is unknown (prepended by "U:") and errors which cannot be classified (prepended by "E:"). The command for level n debugging is

DEBUG n

and the equivalent commandline argument is +Vn (V for verbose). You can also use commandline options +V for level 1 and -V for level 0.

Finally, there is an option to turn off warnings. It is

WARNINGS OFF  # or ON

The equivalent commandline argument is -q to turn warnings off (q for quiet) and +q to turn them on again. This is useful in scripts or cron jobs if you really do want to give a configuration that you know will generate a warning.

The domains file

The file domains.tab, to translate internet country codes to locations, should have come with the program. If you haven't got one, you can download one from http://www.statslab.cam.ac.uk/~sret1/analog/analog/domains.tab. It should be in the following format:

ad   Andorra
ae   United Arab Emirates
[...]

There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.

Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen!
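
If you want to check a domains file before using it, the format is simple enough to read with a few lines of code. The Python sketch below follows the rules just described; the function name parse_domains is hypothetical, and skipping a line that has a code but no name is an assumption, since the text does not say what analog does in that case.

```python
def parse_domains(lines):
    """Map lower-cased domain codes to names (None if the name is hidden)."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split(None, 1)  # code, then the rest of the line
        if len(fields) < 2:
            continue  # assumption: a code with no name is skipped
        code, name = fields[0].lower(), fields[1]
        # A name starting with ? means: recognise the domain, print no name.
        table[code] = None if name.startswith("?") else name
    return table
```

It can be applied to an open file directly, for example parse_domains(open("domains.tab")).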

The form interface

Another way to run analog is via the form interface; this allows users to select which options they want via a Web page.

To set up the form interface, go to the directory where the analog source code lives, and follow these steps.

In analhead.h, make sure that the FORMPROG is set to be the URL of the form processing program, which will be wherever cgi-bin programs live on your server; normally in the cgi-bin directory.
Edit the top of analform.c to indicate where the analog program lives (the program name within your computer's whole filespace, not a URL).
Type make form.
Move the program analform.cgi to the place you specified as the FORMPROG. Make sure it is executable by the server.
Make sure analog itself is executable by the server too, and that domains.tab is readable.
The file analogform.html is the actual form interface; move it to wherever you want people to get at it. Make sure it is world readable.

If the third step above fails to generate a form, you can generate one yourself by means of the command analog -form +Oanalogform.html. You might also want to run this command yourself if you want to supply different default options from normal for the form user: if you run the command with extra commandline or configuration file options, they will be respected in the construction of the form.

It is expected that system administrators may want to provide different options on the forms from the default ones. For this reason, the cgi program understands various other options that are not normally on the form. These can be added to the form by hand. For example, you may want to allow a choice of logfiles, perhaps via a <select>. Or you may want form users to use certain default options; these could be specified as <input type=hidden>. Because the form uses GET not POST you can also construct links to it. For experts, here follows a complete list of form options. [*] marks a default value.

bq  browser summary? 0 for off [*], 1 for on, 2 for default.
ba  +ve MINBROWREQS
bb  -ve MINBROWREQS
bc  +ve MINBROWBYTES
bd  -ve MINBROWBYTES
bs  BROWSORTBY (0 = REQUESTS [*], 1 = BYTES, 2 = ALPHABETICAL, 3 = RANDOM)
Other reports similarly with initial B, f, i, o, r, S in place of b.
cq  status code report?
dq  daily summary?
dg  DAYGRAPH (R or B)
Other time reports similarly with D, h, H, m, W in place of d.
eq  error report?
fi  FILEIGNORE; list, separated by commas
fr  FROM
fy  FILEONLY; list, separated by commas
hi  HOSTIGNORE; list, separated by commas
ho  HOSTURL
hy  HOSTONLY; list, separated by commas
ie  DIRLEVEL
lb  BROWLOG; list, separated by commas
lc  CACHEFILE; list, separated by commas
le  ERRLOG; list, separated by commas
lf  REFLOG; list, separated by commas
lo  LOGFILE; list, separated by commas
or  HOSTNAME
ou  OUTPUT -- 0 for HTML [*], 1 for ASCII
rl  REQLINKS -- f for ALL (files) [*], p for PAGES, n for OFF (none)
rt  REQTYPE -- f for ALL [*], p for PAGES
to  TO
TZ  timezone
wa  WARNINGS (to error_log) -- 0 for OFF, 1 for ON [*].
xq  general report?
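
Because the form uses GET, a link to the form program is just its URL followed by a query string made from the options listed above. The following Python sketch builds such a link. The location /cgi-bin/analform.cgi is hypothetical (substitute whatever you set FORMPROG to), and the function name form_link is made up for the example.

```python
from urllib.parse import urlencode

# Hypothetical location of the form program; use your own FORMPROG URL.
FORMPROG = "/cgi-bin/analform.cgi"


def form_link(options):
    """Build a GET link to the form program from a dict of form options."""
    return FORMPROG + "?" + urlencode(options)


# General report on, ASCII output, pages only in the request report.
link = form_link({"xq": "1", "ou": "1", "rt": "p"})
```

The resulting string can be used as the href of an ordinary link on one of your pages.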

If the form doesn't seem to work, check the following:

Look in the server's error_log for clues.
Do other cgi-bin programs work on your server?
Are all the files in the right places, with the right access permissions, as specified above?
Try the following. setenv QUERY_STRING "xq=1" (C Shell) or export QUERY_STRING="xq=1" (other shells), then run analform from the shell.
If the local time doesn't seem to be correct in the output, you may have to set the timezone yourself in the form. Four lines from the bottom, there is a line like <input type=hidden name="TZ" value=""> For the value you should insert your timezone, in standard format. Usually this looks like your winter timezone name, followed by hours west of Greenwich, followed by your summer timezone name. So the East Coast of the USA should have value="EST5EDT", and Germany value="MEZ-1MESZ".

It is better, although not essential, if when you change the default options for your analog, you remake the form.

Note that you probably want to restrict access to the form and form program to certain users; if it is world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running.

Glossary and reference

Many people have asked exactly what counts as a request, and the meaning of other terms used in the analog output. Here is an explanation. Each time someone requests a file from your server, that is a request. It may be that the page contains some inline images; then they must be loaded separately by people with graphical browsers, which counts as further requests.

Unfortunately, you cannot tell how many times your file has been read from this. The user may in fact request the file from a proxy server which already has a copy of it, or retrieve it from a local cache. In these cases no connection is made to your server, and no request is scored.

There are three categories of request, which can be seen in the status code report. Completed (or successful) requests are those with codes in the 200s (where the document was returned) or with code 304 (where the document was not needed because it had not been recently modified and the user could use a cached copy). Redirected requests are those with other codes in the 300s. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). Failed requests are those with codes in the 400s (error in request) or 500s (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.
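
The three categories can be summarised by a small function on the status code. This is a sketch of the classification as described in the paragraph above, written in Python; the function name classify_status and the "other" category for codes outside the 200s to 500s are assumptions for the example.

```python
def classify_status(code):
    """Place an HTTP status code in one of the three categories above."""
    if 200 <= code <= 299 or code == 304:
        return "completed"
    if 300 <= code <= 399:
        return "redirected"
    if 400 <= code <= 599:
        return "failed"
    return "other"  # assumption: the text does not cover codes outside 200-599
```

Note that 304 is tested before the other codes in the 300s, so that it counts as completed and not as redirected.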

The total data transferred refers only to successful requests, and does not include the message header, only the actual data. The detailed reports also only include successes, except for the referer report and browser reports which include all request types.

Corrupt logfile lines are those we can't understand, and unwanted lines are those that refer to files, hosts or dates that we have specifically excluded.

A host is a computer that has requested something from you. Analog gives the number of distinct (different) hosts that have made a successful request, and the number of distinct files they have requested.

Here is a complete list of all 121 configuration commands. For their usage, see the full documentation.

Specifying files to analyse: BROWLOG, CACHEFILE, DOMAINSFILE, ERRLOG, LOGFILE, REFLOG
Turning reports on and off: ALL, BROWSER, COUNTHOSTS, DAILY, DIRECTORY, DOMAIN, ERROR, FULLBROWSER, FULLDAILY, FULLHOSTS, FULLHOURLY, GENERAL, HOURLY, LASTSEVEN, MONTHLY, REFERER, REQUEST, STATUS, WEEKLY
Columns in reports: BROWCOLS, DAYCOLS, DIRCOLS, DOMCOLS, FULLBROWCOLS, FULLDAYCOLS, FULLHOURCOLS, HOSTCOLS, HOURCOLS, MONTHCOLS, REFCOLS, REQCOLS, WEEKCOLS
Graphs by requests or bytes: DAYGRAPH, FULLDAYGRAPH, FULLHOURGRAPH, HOURGRAPH, MONTHGRAPH, WEEKGRAPH
Graphs forwards or backwards in time: ALLBACK, FULLDAILYBACK, FULLHOURLYBACK, MONTHLYBACK, WEEKLYBACK
Value of unit in graphs: DAILYUNIT, FULLDAILYUNIT, FULLHOURLYUNIT, HOURLYUNIT, MONTHLYUNIT, WEEKLYUNIT
How to sort reports: BROWSORTBY, DIRSORTBY, DOMSORTBY, FULLBROWSORTBY, HOSTSORTBY, REFSORTBY, REQSORTBY
Floors for reports: BROWMINREQS, BROWMINBYTES, DIRMINREQS, DIRMINBYTES, DOMMINREQS, DOMMINBYTES, ERRMINOCCS, FULLBROWMINREQS, FULLBROWMINBYTES, HOSTMINREQS, HOSTMINBYTES, REFMINREQS, REFMINBYTES, REQMINREQS, REQMINBYTES, SUBDOMMINREQS, SUBDOMMINBYTES
Inclusions and exclusions: FILEINCLUDE, FILEEXCLUDE, FROM, HOSTINCLUDE, HOSTEXCLUDE, REFINCLUDE, REFEXCLUDE, TO
Aliases: FILEALIAS, HOSTALIAS, REFALIAS
Other output configuration: BASEURL, FOOTERFILE, GRAPHICAL, HEADERFILE, HOSTNAME, HOSTURL, IMAGEDIR, LOGOURL, MARKCHAR, OUTPUT, OUTPUTFILE, PAGELINKS, PAGEWIDTH, RAWBYTES, REPORTORDER, REQTYPE, SEPCHAR
Other commands: APPROXHOSTSIZE, DEBUG, DIRLEVEL, ISPAGE, ISNOTPAGE, NOTSUBDOMAIN, REFWITHARGS, REFWITHOUTARGS, SUBDOMAIN, UNCOMPRESS, WARNINGS, WEEKBEGINSON, WITHARGS, WITHOUTARGS

Here is a summary of all 40 commandline arguments. Again, for their usage, see the full documentation. Many of them can be given a - instead of a + to turn something off.

 +7  stats for last 7 days
 +a  ASCII output
 +A  all reports (except hourly report)
 +b  browser summary
 +B  browser report
 +c  status code report
 +C  configuration command
 +d  daily summary
 +D  daily report
 +e  error report
 +f  referer report
 +form  do a form
 +F  from
 +g  configuration file
 -G  default config file off
 +help  help message
 +h  hourly summary
 +H  hourly report
 +i  directory report
 +k  pagelinks
 +l  dirlevel
 +m  monthly report
 +n  hostname
 +o  domain report
 +O  outfile
 +p  logo
 -q  no warnings
 +r  request report, pages only
 +R  request report, all files
 +s  host count
 +ss approximate host count
 +S  host report
 +T  to
 +u  host url
 +U  cache file
 +v  printvbles
 +V  debug level
 +w  pagewidth
 +W  weekly report
 +x  general summary

Frequently asked questions

When I try and compile analog, it gives me an error: Look in the Makefile to see if you need to include any extra options. Also, make sure you are using an ANSI C compiler (like gcc) or have included the right CFLAGS in the Makefile to turn on the ANSI option in a compiler like cc. For BSD/OS see the note above.
Why don't I get such-and-such a report in the output even though I asked for it? (or why don't I get the subdomains I requested in the domain report?): Maybe the floor for the report is set too high. For example, if you ask for a request report for all pages with at least 50 accesses and no page has that many, no report will be produced. See also the next question.
Why doesn't such-and-such a file appear in the request report?: You've asked for only pages, and this file is not a page. The remedy is to use REQTYPE ALL to list all files in the request report, or ISPAGE to say that this file is a `page.'
Why don't the total requests in the request report add up to the grand total?: Two reasons. Maybe the request report only lists files above a certain threshold of requests. Or maybe it is listing only pages and some requests are due to other files. See the previous two questions.
Can I count the number of individual visitors to my site?: No. This information is (usually) not recorded in the logfile. The number of hosts people came from is the best estimate.
Well, can I generate the number of visits then?: The usual suggestion is to count all requests from the same machine as one visit until there is a gap of, say, 30 minutes. Unfortunately this would not only be very difficult to collect, but is also a very unsound method of counting visits. (Local hosts score far too few, and other hosts too many).
Why are directories listed in the request report?: They are not directories, they are pages with the same name as the directory. (They arise because if you ask the server for a directory, it typically returns the page in that directory called index.html).
Why are no data on bytes included in my output?: You have some old-style logfile lines that do not include that information, so the analysis cannot be done. Browser logs and referer logs never include that information either.
Why don't you make proper graphs or tables?: Because lots of people then couldn't read them. Analog produces HTML2.0 output so that people with any browser can read it. Also, I don't want to assume that people have any particular graphics creation tools.
Can you extrapolate from the current month's partial data to produce a prediction for the whole month, based on the rate so far?: No. There are too many problems in trying to produce anything sensible, especially near the beginning of the month. Different days of the week and different times of day cause lots of problems. I would prefer to produce raw accurate data than suspect derived data.
I ran out of memory when trying to run analog. What can I do?: Try using approximate (instead of exact) hostname counting with the +ss option, or turning hostname counting off altogether with -s.
You're processing 1,000,000 requests in under 2 minutes. Why is mine much slower?: It probably depends on the speed of your computer and disks. But you could try increasing the hash sizes in analhead.h.

Warnings

Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.

The reported `running time' is elapsed real time, not CPU time.

You can sort a report by requests even when you have turned off the request columns. This may confuse your readers.

The behaviour of FILEALIAS a b; FILEALIAS b c is undefined.

Known bugs

The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.

Do not alias a file to itself (e.g., FILEALIAS /home.html /home.html) or a host to itself, or it will get lost.

Wishlist and discussion

I always welcome mail on analog (my e-mail address is sret1@cam.ac.uk): whether it works on your system (yes, even if it does!), any bug reports, patches or requests for new features. If you send me mail, I shall keep you informed about future releases.

I am happy to help people who have trouble with analog, but please read the FAQ and list of known bugs first. Also, you might be able to diagnose the problem yourself if you run

analog -v [your usual options]

which lists the value of all variables. But if you still can't get it to work, ask me. It helps me find bugs, and to know where the documentation is unclear. When submitting bug reports, please include the version number (which you can find out by the command analog -v).

The following features are already on the list to be done by version 2.0. Let me know if you have any comments on them.

Output preformatted data. Contact me if you have any comments on my proposal (see http://www.statslab.cam.ac.uk/~sret1/analog/proposal.html).
In the general summary, number of requests for pages.
Make what is included in the request report and what is linked to more flexible; allow them to be independent of each other and of what is a page.
File type (i.e., file extension) report.
Specify maximum number of lines for each time report.
Extend TO and FROM to allow times to be specified.

I would also welcome discussion on the following issues.

What other errors can occur in the error log that I have not included?
What other options should be allowed in the form cgi program?

Acknowledgements

Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats gets buggy and very slow when the logfile gets big, but you may notice that my output (although not my program) is based on his to some extent.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.

Stephen Turner
University of Cambridge Statistical Laboratory
E-mail: sret1@cam.ac.uk

Page last modified: 01-Mar-96
