README for analog1.1

Introduction

This README describes analog1.1. For the latest version of analog, see the analog home page.

This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics. For more details, see the analog home page.

For examples of the output see

This program may be freely distributed and modified provided full credit is given to Stephen Turner (sret1@cam.ac.uk), and that this condition is retained. However, please only distribute it intact, including the domains file and this README. No warranty of any sort is given or implied for this program.

What's new?

This section describes the main changes in each version of analog. If you are using analog for the first time, you can skip this section.

1.1: Form interface introduced.
ASCII output now possible as well as HTML.
Output file can now be specified in the configuration file.
FROM and TO commands more powerful.
DEBUG and BACKGROUND introduced.
One bug fix: alphabetical sorting doesn't now swap some hostnames.
List of primes included in distribution.
1.0: Only minor changes since 0.94beta.
0.94beta: New configuration variables SEPCHAR and REPORTORDER.
New configuration commands WITHARGS and WITHOUTARGS.
New commandline options +-A and +-x. (Config.: ALL and GENERAL).
Logfile entries with - as the return code are now regarded as successes, not corrupt entries.
Fixed bugs in host report when aliases or numerical hosts are present.
Documentation rewritten.
0.93beta: Approximate hostname counting now possible in fixed memory.
New configuration commands ISPAGE and ISNOTPAGE.
New commandline option -v.
New configuration command WEEKBEGINSON.
Proper error message when memory exceeded.
Program split into several files.
0.92beta: New reports introduced: hostname, full daily, and weekly.
FROM and TO commands introduced.
Header and footer files introduced.
More helpful warning messages.
Ability to read configuration instructions from stdin.
Subdomain commands moved from domains file to configuration file.
Makefile provided.
0.91beta: Configuration file introduced, enabling many new options.
Some bug fixes and speed improvements.
Ability to print "top n" reports (rather than "everything higher than n").
Request report can print only pages.
Ability to try and resolve numerical addresses.
Now less fussy about the format of the domains file.
Logo added.
README converted to HTML.
0.9beta: More speed improvements, and some bug fixes.
Introduced -u option.
Introduced subdomain analysis.
Included "not modified" replies as successes, not redirects.
First public release at 0.9beta3.
0.89beta: Commandline arguments.
Efficiency improvements.
Host count and "last 7 day" statistics.
0.8beta: Initial program, just default options.

Compiling and running the program

If you want to get on with trying out the program straight away, you can leave most of this README until another time. The one thing you need to do is to look at the file analhead.h. It contains all the user-settable options, but most of them you can leave alone for the moment. You will need to check the values of DOMAINSFILE and LOGFILE, and you will want to change HOSTNAME and HOSTURL.

When you have done that, compile the program by typing

make

(It may take a while as the program is rather big). If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.

Then just type

analog

to run the program. To send the output to a particular file instead of to the screen, type, e.g.,

analog > outfile.html

Customising analog

Pretty soon you will want to customise the output of analog to your personal preferences. How to do that is explained in this section. There are lots of options, so this section is rather long. (However, you can bypass this section to some extent if you set up a form interface to allow you to choose the main options from a Web page).

There are three ways in which customising can be done. First, the file analhead.h contains various settable parameters. These can be changed before compiling the program. They are explained in that file, so they will not be documented again here.

Secondly, there are commandline options, given after the commandname in the usual way. So, for example, the command

analog +d

uses the +d option to tell analog to include a daily summary in its output. All the commandline options are explained below.

Thirdly, you can tell analog to use a configuration file to read in extra options. This is specified by means of the commandline option +g. For example,

analog +gextra.conf

tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline options). If +G is used instead of +g, the default configuration file as specified in analhead.h is read first, then the one specified after +G. You can specify standard input as the configuration file by the options +g- and +G-.

The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.

DAILY      OFF   # We don't want a daily summary
FULLDAILY  ON    # We want a full daily report instead

An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. The various commands which can occur in the configuration file are explained below.

Why three separate methods to specify options? Although some options can be set in two or even three ways, the three methods have different functions. The file analhead.h contains default values for the variables, which you want always to apply when you don't set anything else. The configuration file is appropriate for options you often use. For example, I run three jobs every night to calculate different sets of statistics from our server; each of these different formats is controlled by a configuration file. Commandline options, on the other hand, are the quickest thing to use if you want to run the program on line, or if you want to override one of the options set in a configuration file.

In order to use the three separate methods together, you have to know which takes precedence over which. The default values in analhead.h have the lowest priority. They are overridden by the values in the configuration file (if two configuration files are read, the one specified by +G takes precedence over the default one). And they in turn are overridden by the commandline arguments. If two contradictory options are specified in one configuration file or on the commandline, the later one is obeyed.
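The precedence rules can be pictured as layers, each one overriding the one before. The following is only an illustrative sketch in Python, not analog's own code; the function name effective_options and the use of dictionaries to hold the options are assumptions made for the example.

```python
def effective_options(header_defaults, config_files, commandline):
    """Merge option layers; later layers override earlier ones.

    header_defaults: dict of compiled-in defaults (as in analhead.h)
    config_files:    list of dicts, default configuration file first,
                     then the one named after +G (if any)
    commandline:     dict of options given on the commandline
    """
    result = dict(header_defaults)
    for layer in config_files:
        result.update(layer)
    result.update(commandline)
    return result


options = effective_options(
    {"DAILY": "ON", "MONTHLY": "ON"},          # analhead.h defaults
    [{"DAILY": "OFF"}, {"FULLDAILY": "ON"}],    # default file, then +G file
    {"MONTHLY": "OFF"},                         # e.g. -m on the commandline
)
print(options)
```

Because a later assignment to the same key simply replaces the earlier one, the sketch also mirrors the rule that the later of two contradictory options is obeyed.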

If this is all a bit confusing, just run

analog -v [other options]

That will tell you what the values of all the variables will be based on analhead.h, the configuration options and the commandline options.

Now we shall look at options which affect one of the reports; after that we shall see options which affect several or all of the reports. We shall look at the options under the following headings.

General Summary

Program started at Mon-26-Jun-1995 17:09 local time.
Analysed requests from Thu-28-Jul-1994 20:31 to Mon-26-Jun-1995 17:09 (332.8 days).
Total completed requests: 368 063 (12 872)
Total failed requests: 4 089 (139)
Total redirected requests: 35 277 (1 838)
Average requests per day: 1 219 (2 121)
Number of distinct files requested: 966 (336)
Number of distinct hosts served: 28 589 (1 589)
Number of new hosts served in last 7 days: 1 037
Corrupt logfile entries: 869
Total bytes transferred: 1 852 029 300 (85 752 881)
Average bytes transferred per day: 5 544 997 (12 250 411)
(Figures in parentheses refer to the last 7 days).

The general summary can be turned off by the commandline option -x or the configuration command

GENERAL OFF

or on by +x or GENERAL ON. If the general summary is off, all the `Go To' links in the output are also omitted.

The figures for the last 7 days can be turned on and off with +7 and -7 or the configuration command

LASTSEVEN ON    # or OFF

Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command

COUNTHOSTS OFF

Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or

COUNTHOSTS APPROX

and you can specify the amount of memory to be used by

APPROXHOSTSIZE 100000  # or whatever number, in bytes

About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used.
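As a worked example of the arithmetic, the following Python sketch suggests a value for APPROXHOSTSIZE from an expected number of hosts. The helper approx_host_bytes is hypothetical and is not part of analog; it only applies the rule of thumb of about 3 bytes per host, and doubles the space when the last 7 days statistics are on.

```python
def approx_host_bytes(expected_hosts, bytes_per_host=3, last_seven=True):
    """Return (APPROXHOSTSIZE value, total space used) in bytes."""
    size = expected_hosts * bytes_per_host
    total = 2 * size if last_seven else size
    return size, total


# 28 589 distinct hosts, as in the general summary shown earlier
size, total = approx_host_bytes(28589)
print("APPROXHOSTSIZE", size)
print("total bytes used:", total)
```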

Monthly report

Each + represents 1000 requests, or part thereof.

   month: #reqs
--------  -----
Nov 1994: 24784: +++++++++++++++++++++++++
Dec 1994: 32767: +++++++++++++++++++++++++++++++++
Jan 1995: 37656: ++++++++++++++++++++++++++++++++++++++
Feb 1995: 41666: ++++++++++++++++++++++++++++++++++++++++++
Mar 1995: 45113: ++++++++++++++++++++++++++++++++++++++++++++++
...

The monthly report can be turned on and off with +m and -m or

MONTHLY ON  # or OFF

The value of + can be specified by a number after the +m option; e.g., +m1000 for the above display. If you specify +m0 (or if 0 is the default setting from analhead.h) the program will choose something sensible automatically. The equivalent configuration command is

MONTHLYUNIT 1000   # or 0, or whatever
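To make "or part thereof" concrete, here is a small Python sketch of how the length of a bar follows from the unit. It is an illustration only, not analog's own code, and the function name bar is made up; it does not attempt the automatic choice of unit that 0 asks for.

```python
import math


def bar(requests, unit, mark="+"):
    """One mark per `unit` requests, or part thereof."""
    return mark * math.ceil(requests / unit)


print(len(bar(24784, 1000)))   # 25 marks for 24784 requests
print(repr(bar(0, 1000)))      # no marks at all for a period with no requests
```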

Weekly report

week beg.: #reqs
---------  -----
24/Jul/94:   187: +
31/Jul/94:  3909: +++++++++++++++++
 7/Aug/94:  3550: ++++++++++++++++
14/Aug/94:  3920: +++++++++++++++++
21/Aug/94:  5220: +++++++++++++++++++++
...

This is configured in exactly the same way as the previous report, but with +W and -W in place of +m and -m, and configuration commands WEEKLY and WEEKLYUNIT.

Daily summary

day: #reqs
---  -----
Sun: 29488: ++++++++++++++++++++
Mon: 55680: ++++++++++++++++++++++++++++++++++++++
Tue: 58162: +++++++++++++++++++++++++++++++++++++++
Wed: 59157: ++++++++++++++++++++++++++++++++++++++++
Thu: 61907: ++++++++++++++++++++++++++++++++++++++++++
Fri: 60827: +++++++++++++++++++++++++++++++++++++++++
Sat: 32573: ++++++++++++++++++++++

Again as before, with +d and -d, and DAILY and DAILYUNIT.

Daily report

     date: #reqs
---------  -----
28/Jul/94:    11: +
29/Jul/94:   174: ++++
30/Jul/94:     2: +
31/Jul/94:     0: 
 1/Aug/94:   104: +++
 2/Aug/94:   517: +++++++++++
...

This report has one line for each day from the first to the last request, so it can be very large. The appropriate commands are +D, -D, FULLDAILY and FULLDAILYUNIT.

Hourly summary

hr: #reqs
--  -----
 0: 12245: ++++++++++++++++++++++++++++++++++++++++++
 1: 10163: ++++++++++++++++++++++++++++++++++
 2:  9137: ++++++++++++++++++++++++++++++++
 3:  8899: ++++++++++++++++++++++++++++++
 4:  8070: ++++++++++++++++++++++++++++
 5:  7713: ++++++++++++++++++++++++++
...

+h, -h, HOURLY and HOURLYUNIT are the appropriate commands for this report.

Domain report

  #reqs :  %bytes : domain
--------  --------  ------
 103125 :  46.58% : .uk (United Kingdom)
( 64982):( 35.45%):     cam.ac.uk (University of Cambridge)
( 47138):( 20.55%):       statslab.cam.ac.uk
  49290 :  12.49% : .edu (USA Educational)
  54879 :   9.35% : .com (USA Commercial)
  39812 :   6.97% : (Numerical domains)
  15186 :   2.84% : .de (Germany)
...

This report is turned on and off with the commandline options +o and -o, or the configuration command

DOMAIN ON  # or OFF

The report can be sorted by number of requests, percentage of bytes, or alphabetically. This is achieved on the commandline by adding a letter after the +o option; +or, +ob or +oa respectively. In the configuration file, the command

DOMSORTBY BYREQUESTS  # or BYBYTES, or ALPHABETICAL

can be given.

The report can be listed to any required depth by putting a number after the +o, +or, +ob or +oa option. If sorting is by requests or alphabetical, the number is interpreted as the minimum number of requests required to get onto the report. If sorting is by bytes, it is hundredths of a percent of bytes. For example, +oa15 will list all domains with at least 15 requests, sorted alphabetically, whereas +ob15 will list all domains with at least 0.15% of the traffic, sorted by bytes. If a negative number is given, a `top n' report is calculated; so, for example, +or-20 will list the 20 domains with the highest numbers of requests. The number can also be supplied by means of the configuration command

DOMFLOOR 15  # or -20, or whatever
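The three meanings of the floor can be summarised in a short Python sketch. This is only my illustration of the rules described above, not the program's actual code; the function select_rows and the tuple layout are assumptions made for the example.

```python
def select_rows(rows, sortby, floor):
    """rows: list of (name, requests, percent_bytes) tuples.

    sortby is BYREQUESTS, BYBYTES or ALPHABETICAL.
    """
    if sortby == "BYBYTES":
        key = lambda r: r[2]
        threshold = floor / 100.0      # floor is in hundredths of a percent
    else:
        key = lambda r: r[1]
        threshold = floor              # floor is a number of requests
    if sortby == "ALPHABETICAL":
        ordered = sorted(rows, key=lambda r: r[0])
    else:
        ordered = sorted(rows, key=key, reverse=True)
    if floor < 0:
        return ordered[:-floor]        # negative floor: a top n report
    return [r for r in ordered if key(r) >= threshold]


rows = [(".uk", 103125, 46.58), (".edu", 49290, 12.49),
        (".com", 54879, 9.35), (".de", 15186, 2.84)]
print([r[0] for r in select_rows(rows, "BYREQUESTS", -2)])
print([r[0] for r in select_rows(rows, "BYBYTES", 700)])
```

The first call corresponds to +or-2 and the second to +ob700 (at least 7% of the bytes).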

Subdomains can be specified for each domain. This can only be done in the configuration file. The syntax of the command is

SUBDOMAIN subdomain subdomain_name

If the subdomain name has spaces in, it must be enclosed in quotes. The symbol ? can be used as the subdomain name, indicating a nameless subdomain. For example, to produce the above output, I would include the following lines in the configuration file

SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk   ?

Numerical subdomains (which have most significant part on the left) can also occur. They will look like

131       The Ever-Popular 131 domain
131.111   ?

Within a domain, subdomains will be output in the order in which they occur in the configuration file.

The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the commandline option -ffilename or the configuration command

DOMAINSFILE domainsfile

The correct format of the domains file is explained in a separate section.

Host report

#reqs: %bytes: host
-----  ------  ----
   10:  0.03%: zlsm03.arcs.ac.at
   11:  0.04%: iki10.boku.ac.at
    1:       : oeh1.boku.ac.at
    2:  0.01%: dopefish.esi.ac.at
    1:       : piassun1.joanneum.ac.at
...

This is much the same as the domain report, with commandline options +S and -S, and configuration commands FULLHOSTS, HOSTSORTBY and HOSTFLOOR. Note that in this report, alphabetical sorting is by domain as most significant part. This report can be very long and slow to sort, and should be used with a high floor if at all.
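Sorting "by domain as most significant part" means comparing hostnames from right to left. A minimal Python sketch of such an ordering follows; host_sort_key is a hypothetical helper, not a function in analog, and the sketch makes no attempt to treat numerical addresses specially.

```python
def host_sort_key(hostname):
    """Compare hostnames right to left: top-level domain first."""
    return hostname.lower().split(".")[::-1]


hosts = ["oeh1.boku.ac.at", "dopefish.esi.ac.at", "zlsm03.arcs.ac.at",
         "piassun1.joanneum.ac.at", "iki10.boku.ac.at"]
for host in sorted(hosts, key=host_sort_key):
    print(host)
```

With this key the five hosts come out in the same order as in the sample report above.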

Directory report

 #reqs: %bytes: directory
------  ------  ---------
237985: 35.40%: /~sret1/
 18596: 17.60%: /~rrw1/
  3574: 11.89%: /~richard/
  2376:  7.92%: /~steve/
 13518:  7.42%: /Dept/
...

Again, this is much the same as the domain report, with commandline options +i and -i, and configuration commands DIRECTORY, DIRSORTBY and DIRFLOOR. There is one further variable for this report, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like

 #reqs: %bytes: directory
------  ------  ---------
 43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
 11298:  4.14%: /~sret1/
  5322:  1.71%: /~sret1/backgammon/books/
  2773:  1.22%: /~sret1/images/
   728:  0.66%: /~sret1/backgammon/clubs/
...

This can be specified by the commandline option +l3 or the configuration command

DIRLEVEL 3
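The effect of the level can be sketched in Python as follows. This is an illustration under my own assumptions rather than analog's code: the name directory_at_level is made up, and reporting files in the root directory under "/" is a guess, since the examples above do not show that case.

```python
def directory_at_level(path, level):
    """Return the directory of `path`, cut off after `level` components."""
    parts = path.split("/")[1:-1]      # directory components, no filename
    kept = parts[:level]
    return "/" + "".join(p + "/" for p in kept)


name = "/~sret1/backgammon/bitmaps/board.xbm"
print(directory_at_level(name, 1))
print(directory_at_level(name, 3))
print(directory_at_level("/~sret1/home.html", 3))
```

A file that lies less deep than the requested level is simply counted under its own directory, which is why /~sret1/ still appears in the level 3 report.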

Request report

#reqs: %bytes: filename
-----  ------  --------
33980: 23.66%: /~sret1/backgammon/main.html
21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
20422:  0.49%: /~sret1/backgammon/bitmaps/dice1.xbm
20187:  0.49%: /~sret1/backgammon/bitmaps/dice2b.xbm
12690:  0.86%: /
 8457:  1.09%: /header.gif
 7198:  0.81%: /~sret1/coldlist.html
 5461:  0.48%: /home.xbm
 3550:  0.32%: /~sret1/home.html
 3370:  0.23%: /~mcmc/html/
...

Commandline options +r and -r, and configuration commands REQUEST, REQSORTBY and REQFLOOR work analogously to the last three reports. There are also various options to control which files are printed and which are given links.

In fact, if the commandline option +r is used, only pages will be displayed in the report. If you want to list all files, including, for example, graphics, then you should use +R instead; alternatively, if neither +r nor +R is specified on the commandline, the configuration command

REQTYPE PAGES  # or ALL

will control whether pages or all files are listed.

There are three possible modes of linking in the request report; you can link to none of the files, or pages only, or all files. The commandline options for these are -k, +k and +kk respectively; or you can use the configuration command

PAGELINKS OFF   # or ON, or ALL

You can also specify in the configuration file what should be counted as a page in the requests report (thus giving you complete control over what goes in the report, or what is linked to). At the beginning, the following are `pages': *.html, *.htm, *.shtml, *.shtm, *.html3, *.ht3 and directories (*/). The command

ISPAGE filename

will specify that some other file is a `page'. Filenames can begin with an asterisk (*) as a wild card; so, for example,

ISPAGE *.ps
ISPAGE *.ps.gz

would mean that Postscript files and compressed Postscript files are to be regarded as pages. You can also use

ISNOTPAGE filename

to specify that something which would otherwise be a page is not to be regarded as a page.
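Here is a rough Python sketch of how the page rules could be applied to a filename. It is not taken from analog itself: the names matches and is_page are hypothetical, and the rule that the last matching ISPAGE or ISNOTPAGE command wins is an assumption, as the order of evaluation is not documented here.

```python
DEFAULT_PAGES = ["*.html", "*.htm", "*.shtml", "*.shtm",
                 "*.html3", "*.ht3", "*/"]


def matches(pattern, filename):
    """A leading asterisk matches any beginning of the filename."""
    if pattern.startswith("*"):
        return filename.endswith(pattern[1:])
    return filename == pattern


def is_page(filename, rules=()):
    """rules: sequence of ("ISPAGE" or "ISNOTPAGE", pattern), in file order."""
    page = any(matches(p, filename) for p in DEFAULT_PAGES)
    for command, pattern in rules:
        if matches(pattern, filename):
            page = (command == "ISPAGE")   # assumed: last matching rule wins
    return page


rules = [("ISPAGE", "*.ps"), ("ISPAGE", "*.ps.gz")]
print(is_page("/~sret1/home.html"))       # a page by default
print(is_page("/header.gif"))             # not a page
print(is_page("/paper.ps.gz", rules))     # a page because of ISPAGE *.ps.gz
```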

Analysing parts of the logfile

The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command

analog logfile.log

will use that logfile for its report. You can also write

analog -

to use standard input as the logfile. This is useful in constructing pipes; for example, if you want to analyse an old compressed logfile, you could type

gzcat logfile.old.gz | analog -

(gzcat might be called zcat on some systems). You can also specify which logfile to use in the configuration file by means of a command like

LOGFILE logfile.log   # or ...
LOGFILE stdin         # for standard input

There are various commands which instruct the program only to analyse part of the logfile. These are all configuration commands only; they have no commandline or analhead.h equivalents.

First, you can instruct the program only to take into account certain files. This is done by means of the FILEONLY and FILEIGNORE commands. Asterisks can appear at the end of the filenames specified, as wildcards. For example, the configuration

FILEONLY /~sret1/*
FILEIGNORE /~sret1/backgammon/*
FILEIGNORE /~sret1/home.html

would instruct the program to examine only my files, excluding my backgammon files and home page. (This should not be confused with excluding them from the request report, which still includes them in other reports; this excludes them altogether from the whole analysis).

There are similar commands HOSTONLY and HOSTIGNORE to analyse only the requests from certain sites. Here an asterisk can occur at the start of a hostname. For example,

HOSTIGNORE *.statslab.cam.ac.uk

would ignore accesses from my site (including statslab.cam.ac.uk itself).

Finally, there are commands to analyse only a subset of the dates in the logfile. The simplest usage is FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration

FROM 950701
TO   950731

Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like

FROM -01-00+01   # from tomorrow last year
TO -00-0131  # to the end of last month (OK even if last month
             # didn't have 31 days)
FROM -00-00-56
TO   -00-00-01  #statistics for the last 8 weeks
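To show how such relative dates might be worked out, here is a Python sketch. It is only my reading of the rules above, not analog's actual code. The function resolve is hypothetical; taking an absolute year yy to mean 19yy, and clamping an impossible day (such as the 31st of a 30-day month) to the end of the month, are both assumptions.

```python
import calendar
import re
from datetime import date, timedelta

PATTERN = re.compile(r"^(-?)(\d\d)([+-]?)(\d\d)([+-]?)(\d\d)$")


def resolve(spec, today):
    """Turn a FROM/TO argument such as 950701 or -00-00-56 into a date."""
    m = PATTERN.match(spec)
    if m is None:
        raise ValueError("bad date: " + spec)
    ysign, yy, msign, mm, dsign, dd = m.groups()
    yy, mm, dd = int(yy), int(mm), int(dd)

    # Year: relative (-nn) or absolute (assumed here to mean 19nn).
    year = today.year - yy if ysign else 1900 + yy

    # Month: a relative month may carry into the previous or next year.
    if msign:
        months = year * 12 + (today.month - 1) + (mm if msign == "+" else -mm)
        year, month = divmod(months, 12)
        month += 1
    else:
        month = mm

    last = calendar.monthrange(year, month)[1]   # days in that month
    if dsign:
        base = date(year, month, min(today.day, last))
        return base + timedelta(days=dd if dsign == "+" else -dd)
    return date(year, month, min(dd, last))      # assumed: clamp 31 to 30 etc.


today = date(1995, 10, 2)
for spec in ["950701", "-01-00+01", "-00-0131", "-00-00-56"]:
    print(spec, resolve(spec, today))
```

Run on 2 October 1995, the four examples resolve to 1 July 1995, 3 October 1994, 30 September 1995 and 7 August 1995 respectively.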

Aliases etc.

There are commands to give aliases for filenames and hostnames. The configuration line

FILEALIAS file1 file2

says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately.

If * is placed at the end of the first entry, then all filenames starting with file1 will be changed to start with file2. So, for example, after the command

FILEALIAS /~sret1/statprog/* /~sret1/analog/

a filename looking like /~sret1/statprog/statprog/stat.c will be understood as /~sret1/analog/statprog/stat.c. (Note that the conversion is done only once for each filename; you don't get /~sret1/analog/analog/stat.c).
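The once-only conversion can be illustrated with a few lines of Python. This is a sketch and not the code analog uses; apply_filealias is a made-up name, and stopping at the first alias that matches is an assumption.

```python
def apply_filealias(filename, aliases):
    """aliases: list of (file1, file2) pairs, in configuration order."""
    for old, new in aliases:
        if old.endswith("*"):
            prefix = old[:-1]
            if filename.startswith(prefix):
                # Replace the leading part once, then stop.
                return new + filename[len(prefix):]
        elif filename == old:
            return new
    return filename


aliases = [("/~sret1/statprog/*", "/~sret1/analog/")]
print(apply_filealias("/~sret1/statprog/statprog/stat.c", aliases))
```

Only the leading /~sret1/statprog/ is rewritten, so the result is /~sret1/analog/statprog/stat.c, as described above.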

A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like

WITHARGS /cgi-bin/prog.cgi

is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks can again occur at the end of the filename, for example in commands like

WITHARGS /cgi-bin/*

There is also a parallel command WITHOUTARGS; for example,

WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi

would read the arguments for all files in /cgi-bin/ except spam.cgi.
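The combined effect of WITHARGS and WITHOUTARGS can be sketched in Python like this. It is an illustration only: request_name is a hypothetical helper, the rule that the last matching command wins is my assumption, and the maximum filename length is not modelled.

```python
def request_name(url, rules):
    """rules: list of ("WITHARGS" or "WITHOUTARGS", pattern), in file order."""
    filename, sep, args = url.partition("?")
    keep = False                       # normally the arguments are ignored
    for command, pattern in rules:
        if pattern.endswith("*"):
            hit = filename.startswith(pattern[:-1])
        else:
            hit = filename == pattern
        if hit:
            keep = (command == "WITHARGS")   # assumed: last match wins
    return url if keep and sep else filename


rules = [("WITHARGS", "/cgi-bin/*"), ("WITHOUTARGS", "/cgi-bin/spam.cgi")]
print(request_name("/cgi-bin/prog.cgi?a", rules))
print(request_name("/cgi-bin/spam.cgi?a", rules))
```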

There is a command HOSTALIAS, similar to FILEALIAS, which is useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use

HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk

Again, only one conversion is done per host, which is why I need both the second and the third line. There is no wildcard conversion for this command.

One more related command is the command to tell analog to try and look up the names of hosts that appear only as numerical addresses in the logfile; so, for example, 131.111.20.59 will be translated to lion.statslab.cam.ac.uk. Note, however, that not all hosts have names, or we may not be able to discover them. The commandline option to try and translate numerical addresses is +1 (or -1 to turn it off); the equivalent configuration command is

NUMLOOKUP ON  # or OFF

Looking up hostnames is a slow business. If this option is used, be prepared for analog to take a very long time to compile its report.

Layout

The final group of options is those which affect the layout of the output. First, you can choose whether you want ASCII (plain text) or HTML output, using the commandline option +a or -a, or the configuration commands

ASCII ON   # or OFF
HTML OFF   # or ON  (equivalent to previous line)

If you choose ASCII output, some of the other options are ignored, but it should be obvious which ones they will be.

You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of

analog > outfile.html

you can use the commandline option +Ooutfile.html or the configuration command

OUTFILE outfile.html

There is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like

REPORTORDER hDdWmoSir

This says that the reports should occur in the order hourly summary (h), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i) and request report (r). It is important to include all the above nine letters exactly once each.
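If you generate configuration files automatically, it may be worth checking the REPORTORDER argument before handing it to analog. The following Python sketch does that; check_reportorder is a hypothetical helper of my own, not something supplied with analog.

```python
REPORT_LETTERS = {
    "h": "hourly summary", "D": "daily report", "d": "daily summary",
    "W": "weekly report", "m": "monthly report", "o": "domain report",
    "S": "host report", "i": "directory report", "r": "request report",
}


def check_reportorder(order):
    """Return the report names in order, or raise ValueError."""
    if sorted(order) != sorted(REPORT_LETTERS):
        raise ValueError("REPORTORDER needs each of hDdWmoSir exactly once")
    return [REPORT_LETTERS[c] for c in order]


print(check_reportorder("hDdWmoSir"))
```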

There is a commandline option +A to include all reports (particular ones can then be omitted with -d or whatever); likewise -A omits all reports (and particular ones can then be included). The equivalent configuration commands are

ALL ON  # or OFF

Note that order is important; for example, +i -A +r will include the request report but not the directory report.
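The left-to-right processing can be pictured with the following Python sketch. It is only an illustration of the ordering rule, not analog's option parser: reports_enabled is a made-up name, only the plain on/off letters are modelled, and whether +A and -A also affect the general summary is not something this sketch tries to answer.

```python
ALL_REPORTS = set("hDdWmoSir")


def reports_enabled(args, defaults=()):
    """Apply +X / -X report options from left to right."""
    enabled = set(defaults)
    for arg in args:
        sign, letter = arg[0], arg[1]
        if letter == "A":
            enabled = set(ALL_REPORTS) if sign == "+" else set()
        elif sign == "+":
            enabled.add(letter)
        else:
            enabled.discard(letter)
    return enabled


print(sorted(reports_enabled(["+i", "-A", "+r"])))   # only the request report
```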

The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the commandline arguments -p (no logo: mnemonic, p for picture), +p (use the default logo) and +pURL (use the logo at the given URL). The equivalent configuration commands are

LOGO     ON    # or OFF
LOGOURL  url   # where it is

The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are

HOSTNAME  name  # must be in quotes if it contains spaces
HOSTURL   URL
HOSTURL   -     # for no link

A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commandline options to achieve this are +Hfilename and +Ffilename, or -H and -F to turn them off, and the configuration commands are

HEADERFILE filename
FOOTERFILE none      # if you don't want one

There is also a configuration command to use a certain image as the background to the output page. If you insist on using one it should be small, otherwise people with slow lines won't be able to load your page, and it should not stop people with low resolution monochrome screens being able to read your page. The command is

BACKGROUND none   # preferably!
BACKGROUND URL    # to use that URL

A command like

WEEKBEGINSON SUNDAY

says which day should be regarded as the first day of the week. This is used in the daily report, daily summary and weekly report.

There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,

SEPCHAR ,

will give 123,456,789, whereas

SEPCHAR ' '

will give 123 456 789.
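For illustration, grouping the digits in threes can be written in a few lines of Python. This is a sketch of the idea, not analog's code, and the name group_digits is made up.

```python
def group_digits(n, sepchar):
    """Insert sepchar between each group of three digits of n."""
    digits = str(n)
    groups = []
    while digits:
        groups.insert(0, digits[-3:])   # take three digits from the right
        digits = digits[:-3]
    return sepchar.join(groups)


print(group_digits(123456789, ","))
print(group_digits(123456789, " "))
```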

The character which is used in the barcharts in some of the reports can be changed to, for example, a hash by -c# or

MARKCHAR '#'  # put in quotes so that it isn't a comment

Those graphical reports also need to know how many characters wide the output page is. Although a normal page is 80 characters wide, for Web pages about -w65 or

PAGEWIDTH 65

seems to be about right.

Finally, there is a debugging command, for printing (to stderr) problems with your logfile. There are currently three levels of debugging: 0 for no debugging, 1 for printing corrupt logfile lines (prepended by "C:"), and 2 which also prints hosts for which the domain is unknown (prepended by "U:"). The commandline option for level 1 debugging is +V1 (V for verbose) and the configuration command is

DEBUG 1

You can also use commandline options +V for level 1 and -V for level 0.

The domains file

The file domains.tab, to translate internet country codes to locations, should have come with the program. If you haven't got one, you can download one from http://www.statslab.cam.ac.uk/~sret1/analog/analog/domains.tab. It should be in the following format:

ad   Andorra
ae   United Arab Emirates
[...]

There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.

Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,

uk  United Kingdom  # God save the Queen!
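The format is simple enough that a domains file can be read with a few lines of Python, which may be handy if you want to check your own file. This is a sketch of the rules just described, not the reader analog uses; parse_domains is a hypothetical name, and treating a code with no name at all like ? is my assumption.

```python
def parse_domains(lines):
    """Return {code: name}; name is None when it should not be printed."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop any comment
        if not line:
            continue
        fields = line.split(None, 1)           # arbitrary space after the code
        code = fields[0].lower()
        name = fields[1].strip() if len(fields) > 1 else "?"   # assumption
        table[code] = None if name.startswith("?") else name
    return table


sample = [
    "ad   Andorra",
    "ae   United Arab Emirates",
    "uk  United Kingdom  # God save the Queen!",
    "# a line with only a comment",
]
print(parse_domains(sample))
```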

The form interface

Another way to run analog is via the form interface; this allows users to select which options they want via a Web page. This section describes how to set up a form interface. The form interface is new, and is therefore still in beta test; so I welcome comments on how well it works, whether it is easy to set up, and whether I have got the balance right between providing enough options and keeping the form simple. You can e-mail me at sret1@cam.ac.uk.

To set up the form interface, go to the directory where the analog source code lives, and follow these steps.

1. In analhead.h, make sure that FORMPROG is set correctly.
2. Edit the top of analform.c to indicate where the analog program lives.
3. Type make form.
4. Move the program analform to the place you specified as the FORMPROG (normally your server's cgi-bin directory). Make sure it is executable by the server.
5. The file analogform.html is the actual form interface; move it to wherever you want people to get at it. Make sure it is world readable.

What the third step above in fact does to make the form is to run the command analog -form. The reason that I supply a form-generation option to analog instead of a ready-made form is that it can take account of your various default options. If you want to supply different default options for the form user, you can run the command analog -form directly with extra commandline or configuration file options, and they will be respected in the construction of the form; or you can specify them in the Makefile. It is better, although not essential, if when you change the default options for your analog, you remake the form.

Note that you might want to restrict access to the form and form program to certain users; if it is world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running. You might also want to specify a default configuration file in analhead.h (which the form user cannot override except where options are provided on the form) or remove some options from the form.

Frequently asked questions

When I try and compile analog, it gives me an error.

Look in the Makefile to see if you need to include any extra libraries.

Why don't I get such-and-such a report in the output even though I asked for it?

Maybe the floor for the report is set too high. For example, if you ask for a request report for all pages with at least 50 accesses and no page has that many, no report will be produced.

Why are no data on bytes included in my output?

You have some old-style logfile lines that do not include that information, so the analysis cannot be done.

Can I ignore all gifs in the analysis?

No, but it's probably not what you want to do anyway (that would make the total bytes transferred go wrong, for example). If you just want them not to appear in the request report, read about the configuration commands REQTYPE and ISPAGE above.

Can I change the background colour of the output page?

No, and you can't make the top request blink either. For such a widespread program, it's only appropriate to use true HTML, not things which one company has added on to HTML on its products.

I ran out of memory when trying to run analog. What can I do?

Try using approximate (instead of exact) hostname counting with the +ss option, or turning hostname counting off altogether with -s.

I have some old compressed logfiles and a current logfile. How can I analyse them both together?

The command gzcat or zcat has an option -f to uncompress compressed files and leave other files alone, and then stick them all together. So

gzcat -f log1.gz log2.gz log3 | analog -

is the required command.

My logfile is getting too big. Can you record some of the data in a convenient format so that analog can read it without having to process the whole logfile, and I can throw away the old logfile?

No. Try compressing your old logfile entries instead using gzip -9 (see the previous question). That is likely to reduce your usage by a factor of about 12.

Warnings

Lines with filenames longer than a certain limit (which can be specified in analhead.h) are regarded as corrupt lines and discarded.

The figures for the last 7 days are for the 7 days before the program is run, not before the last entry in the logfile. Moreover, they are actually for any times after 7 days before the program is started; so `future' requests will be included if some more requests come in while the program is reading the logfile (or if the clock on the computer running the program is not synchronised with the one on the computer that recorded the logs).
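In other words, the test has a lower bound but no upper bound. A minimal Python sketch of that test follows; it is not analog's own code, and in_last_seven_days is a made-up name.

```python
from datetime import datetime, timedelta


def in_last_seven_days(request_time, program_start):
    """True if the request counts towards the last 7 days figures."""
    return request_time > program_start - timedelta(days=7)


start = datetime(1995, 6, 26, 17, 9)
print(in_last_seven_days(datetime(1995, 6, 20, 0, 0), start))    # inside the window
print(in_last_seven_days(datetime(1995, 6, 19, 17, 8), start))   # just too old
print(in_last_seven_days(datetime(1995, 6, 26, 18, 0), start))   # a future request
```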

If we are doing a `top n' report and two entries tie for nth place, only one will be printed.

The reported `running time' is elapsed real time, not CPU time.

If you specify +oa-10 you really do get the top ten domains alphabetically. This is almost certainly useless!

Subdomains will be reported in the domain report if and only if their parent domain is, independent of their usage.

The behaviour of FILEALIAS a b; FILEALIAS b c is undefined.

Known bugs

The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.

Do not alias a file to itself (e.g., FILEALIAS /home.html /home.html) or a host to itself, or it will get lost.

Wishlist

I always welcome mail on analog (my e-mail address is sret1@cam.ac.uk); whether it works on your system (yes, even if it does!), any bug reports or requests for new features. I am happy to help people who have trouble with analog, but please read the FAQ first. If you send me mail, I shall keep you informed about future releases.

The following features are already on the list to be done in the next version. Let me know if you have any comments on them.

options to list %requests and number of bytes for each file.
SUBDOMAIN *.com etc.
HOSTIGNORE 131.111.*
Proper alphabetical sort for numerical hosts.
Always output how reports are sorted.

Acknowledgements

Thanks are due to the author of getstats, Kevin Hughes. We (and other people) have found that getstats gets buggy and very slow when the logfile gets big, but you may notice that my output (although not my program) is based on his to some extent.

Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.

Stephen Turner
University of Cambridge Statistical Laboratory
E-mail: sret1@cam.ac.uk

Page last modified: 02-Oct-95
