This README describes analog1.2. For the latest version of analog, see the analog home page.
This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics. For more details, see the
SUBDOMAIN and alphabetical sorting.
BASEURL command allowing statistics to be displayed on other servers.
TO commands more powerful.
LOGFILE, and you will want to change
When you have done that, compile the program by typing
make
(It may take a while, as the program is rather big.) If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.
Then just type
analog
to run the program. To send the output to a particular file instead of to the screen, type, e.g.,
analog > outfile.html
There are three ways in which customising can be done. First, the file analhead.h contains various settable parameters. These can be changed before compiling the program. They are explained in that file, so they will not be documented again here.
Secondly, there are commandline options, given after the command name in the usual way. So, for example, the command
analog +d
uses the +d option to tell analog to include a daily summary in its output. All the commandline options are explained below.
Thirdly, you can tell analog to use a configuration file to read in
extra options. This is specified by means of the commandline option
+g. For example,
analog +gextra.conf
tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline options.) If +G is used instead of +g, the default configuration file as specified in analhead.h is read first, then the one specified after +G. You can specify standard input as the configuration file by the options
file can contain several commands on separate lines; any text after a hash
(#) on a line is ignored as a comment. So the following is an
example of a configuration file.
DAILY OFF # We don't want a daily summary
FULLDAILY ON # We want a full daily report instead
An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. The various commands which can occur in the configuration file are explained below.
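The comment and quoting rules just described can be illustrated with a short Python sketch. This is not analog's own parser (analog is a compiled program, and its parsing code is not shown here); it borrows Python's shlex module, whose shell-style quoting is assumed to be close enough to the rules above, and the function name parse_config_line is hypothetical.

```python
import shlex

def parse_config_line(line):
    """Split one configuration line into [command, argument, ...].

    Text after an unquoted '#' is dropped as a comment, and single or
    double quotes protect an argument that contains a hash or a space.
    """
    return shlex.split(line, comments=True)

example = """\
DAILY OFF       # We don't want a daily summary
FULLDAILY ON    # We want a full daily report instead
MARKCHAR '#'    # quoted, so this hash is an argument
"""

for line in example.splitlines():
    print(parse_config_line(line))
```

A line that holds only a comment comes back as an empty list, so a caller can simply skip it.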
Why three separate methods to specify options? Although some options can be set in two or even three ways, the three methods have different functions. The file analhead.h contains default values for the variables, which you want always to apply when you don't set anything else. The configuration file is appropriate for options you often use. For example, I run three jobs every night to calculate different sets of statistics from our server; each of these different formats is controlled by a configuration file. Commandline options, on the other hand, are the quickest thing to use if you want to run the program on line, or if you want to override one of the options set in a configuration file.
In order to use the three separate methods together, you have to know which
takes precedence over which. The default values in analhead.h have the lowest
priority. They are overridden by the values in the configuration file (if two
configuration files are read, the one named on the commandline takes
precedence over the default one). And they in turn are overridden by the
commandline arguments. If two contradictory options are specified in one
configuration file or on the commandline, the later one is obeyed.
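The precedence order can be pictured as a series of dictionary updates, lowest priority first. The Python sketch below is only an illustration of that order; the function name and the sample option values are hypothetical and are not taken from analog itself.

```python
def effective_options(compiled_defaults, config_files, commandline):
    """Merge option sources from lowest to highest priority.

    compiled_defaults: values from analhead.h
    config_files:      dicts in the order the files are read
                       (default configuration file first)
    commandline:       options given on the commandline
    """
    options = dict(compiled_defaults)   # lowest priority
    for config in config_files:         # a later file overrides an earlier one
        options.update(config)
    options.update(commandline)         # commandline arguments win
    return options

defaults = {"DAILY": "ON", "MONTHLY": "ON", "PAGEWIDTH": "65"}
default_conf = {"DAILY": "OFF"}
extra_conf = {"DAILY": "ON", "MONTHLY": "OFF"}
cmdline = {"MONTHLY": "ON"}

print(effective_options(defaults, [default_conf, extra_conf], cmdline))
```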
If this is all a bit confusing, just run
analog -v [other options]
That will tell you what the values of all the variables will be, based on analhead.h, the configuration options and the commandline options.
Now we shall look at options which affect one of the reports; after that we shall see options which affect several or all of the reports. We shall look at the options under the following headings.
The general summary can be turned off by the commandline option
-x or the configuration command
GENERAL OFF
or on by
GENERAL ON. If the general summary is off, all the `Go To' links in the output are also omitted.
The figures for the last 7 days can be turned on and off with
-7 or the configuration command
LASTSEVEN ON # or OFF
Counting hosts is something which can take a lot of memory (we have to
store the name of every host that has accessed our server). If memory is a problem,
you can turn the host counting off with the commandline option
-s or the configuration command
COUNTHOSTS OFF
Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using
COUNTHOSTS APPROX
and you can specify the amount of memory to be used by
APPROXHOSTSIZE 100000 # or whatever number, in bytes
About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used.
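To see how a fixed amount of memory can still give a usable host count, here is a Python sketch of one well-known technique, linear counting over a bitmap. This README does not say which method analog uses, so this is a stand-in and not a description of analog's algorithm; the function name and the example hostnames are hypothetical.

```python
import hashlib
import math

def approx_host_count(hosts, size_bytes):
    """Estimate the number of distinct hosts using a bitmap of size_bytes."""
    nbits = size_bytes * 8
    bitmap = bytearray(size_bytes)
    for host in hosts:
        # Hash each hostname to one bit position and set that bit.
        digest = hashlib.md5(host.encode("ascii")).digest()
        bit = int.from_bytes(digest[:8], "big") % nbits
        bitmap[bit // 8] |= 1 << (bit % 8)
    zero_bits = nbits - sum(bin(byte).count("1") for byte in bitmap)
    if zero_bits == 0:
        raise ValueError("bitmap is full; allow more memory")
    # Linear counting: n is roughly -m * ln(fraction of bits still zero).
    return round(-nbits * math.log(zero_bits / nbits))

# 1000 distinct hosts, each seen five times, counted in 3000 bytes
# (that is, about 3 bytes per host).
hosts = [f"host{i % 1000}.example.com" for i in range(5000)]
print(approx_host_count(hosts, 3000))
```

Repeated requests from the same host set the same bit again, so only distinct hosts affect the estimate.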
Each + represents 1000 requests, or part thereof.
month: #reqs
-------- -----
Nov 1994: 24784: +++++++++++++++++++++++++
Dec 1994: 32767: +++++++++++++++++++++++++++++++++
Jan 1995: 37656: ++++++++++++++++++++++++++++++++++++++
Feb 1995: 41666: ++++++++++++++++++++++++++++++++++++++++++
Mar 1995: 45113: ++++++++++++++++++++++++++++++++++++++++++++++
...
The monthly report can be turned on and off with
MONTHLY ON # or OFF
The value of
+ can be specified by a number after the
+m option; e.g.,
+m1000 for the above display.
If you specify
+m0 (or if 0 is the default setting from
analhead.h) the program will choose something sensible automatically.
The equivalent configuration command is
MONTHLYUNIT 1000 # or 0, or whatever
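The rule that each mark stands for a given number of requests "or part thereof" amounts to rounding up. A minimal Python sketch follows; the function name bar is hypothetical, and the automatic choice made for a unit of 0 is not modelled.

```python
def bar(requests, unit, markchar="+"):
    """Return one mark per `unit` requests, or part thereof."""
    marks = -(-requests // unit)   # integer division rounded up
    return markchar * marks

for month, requests in [("Nov 1994", 24784), ("Dec 1994", 32767)]:
    print(f"{month}: {requests}: {bar(requests, 1000)}")
```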
week beg.: #reqs
--------- -----
24/Jul/94: 187: +
31/Jul/94: 3909: +++++++++++++++++
7/Aug/94: 3550: ++++++++++++++++
14/Aug/94: 3920: +++++++++++++++++
21/Aug/94: 5220: +++++++++++++++++++++
...
This is configured in exactly the same way as the previous report, but with
-W in place of
-m, and configuration commands
day: #reqs
--- -----
Sun: 29488: ++++++++++++++++++++
Mon: 55680: ++++++++++++++++++++++++++++++++++++++
Tue: 58162: +++++++++++++++++++++++++++++++++++++++
Wed: 59157: ++++++++++++++++++++++++++++++++++++++++
Thu: 61907: ++++++++++++++++++++++++++++++++++++++++++
Fri: 60827: +++++++++++++++++++++++++++++++++++++++++
Sat: 32573: ++++++++++++++++++++++
Again as before, with
date: #reqs
--------- -----
28/Jul/94: 11: +
29/Jul/94: 174: ++++
30/Jul/94: 2: +
31/Jul/94: 0:
1/Aug/94: 104: +++
2/Aug/94: 517: +++++++++++
...
This report has one line for each day from the first to the last request, so it can be very large. The appropriate commands are
hr: #reqs
-- -----
0: 12245: ++++++++++++++++++++++++++++++++++++++++++
1: 10163: ++++++++++++++++++++++++++++++++++
2: 9137: ++++++++++++++++++++++++++++++++
3: 8899: ++++++++++++++++++++++++++++++
4: 8070: ++++++++++++++++++++++++++++
5: 7713: ++++++++++++++++++++++++++
...
HOURLYUNIT are the appropriate commands for this report.
#reqs : %bytes : domain
-------- -------- ------
103125 : 46.58% : .uk (United Kingdom)
( 64982):( 35.45%): cam.ac.uk (University of Cambridge)
( 47138):( 20.55%): statslab.cam.ac.uk
49290 : 12.49% : .edu (USA Educational)
54879 : 9.35% : .com (USA Commercial)
39812 : 6.97% : (Numerical domains)
15186 : 2.84% : .de (Germany)
...
This report is turned on and off with the commandline options
-o, or the configuration command
DOMAIN ON # or OFF
The report can be sorted by number of requests, percentage of bytes, or
alphabetically. This is achieved on the commandline by adding a letter
+oa respectively. In the configuration file, the command
DOMSORTBY BYREQUESTS # or BYBYTES, or ALPHABETICAL
can be given.
You can control which columns appear in the domain report by means of
the configuration command
DOMCOLS. There are four possible
columns of data: number of requests due to each domain (R), percentage of
requests (r), number of bytes (B) and percentage of bytes (b). A command
DOMCOLS Rrb
would instruct the program to display columns for number of requests, percentage of requests, and percentage of bytes in that order for each domain. (This can only be done in the configuration file, not on the commandline.)
The report can be listed to any required depth by putting a number after the
option. If sorting is by requests or alphabetical, the number is interpreted
as the minimum number of requests required to get onto the report. If sorting
is by bytes, it is hundredths of a percent of bytes. For example,
+oa15 will list all domains with at least 15
requests, sorted alphabetically, whereas
+ob15 will list all
domains with at least 0.15% of the traffic, sorted by bytes. If a negative
number is given, a `top n' report
is calculated; so, for example,
+or-20 will list the 20
domains with the highest numbers of requests. The number can also be supplied
by means of the configuration command
DOMFLOOR 15 # or -20, or whatever
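The three meanings of the floor (a minimum number of requests, hundredths of a percent of bytes, or a `top n' cut-off when negative) can be summarised in a Python sketch. The function name, the tuple layout and the sample figures are hypothetical; only the rules themselves come from the description above.

```python
def apply_floor(entries, sort_by, floor):
    """Select and order report lines.

    entries: list of (name, requests, byte_count) tuples
    sort_by: 'requests', 'bytes' or 'alphabetical'
    floor:   the number given after the option, or to DOMFLOOR
    """
    total_bytes = sum(entry[2] for entry in entries)
    if sort_by == "requests":
        ordered = sorted(entries, key=lambda e: e[1], reverse=True)
    elif sort_by == "bytes":
        ordered = sorted(entries, key=lambda e: e[2], reverse=True)
    else:
        ordered = sorted(entries, key=lambda e: e[0])
    if floor < 0:
        return ordered[:-floor]            # `top n' report
    if sort_by == "bytes":
        # floor is in hundredths of a percent of all bytes
        return [e for e in ordered if e[2] * 10000 >= floor * total_bytes]
    return [e for e in ordered if e[1] >= floor]

domains = [(".uk", 103125, 4658), (".edu", 49290, 1249),
           (".com", 54879, 935), (".de", 15186, 284), (".fr", 10, 1)]

print([name for name, _, _ in apply_floor(domains, "alphabetical", 15)])
print([name for name, _, _ in apply_floor(domains, "requests", -2)])
```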
Subdomains can be specified for each domain. This can only be done in the configuration file. The syntax of the command is
SUBDOMAIN subdomain subdomain_name
If the subdomain name has spaces in it, it must be enclosed in quotes. The subdomain name can be omitted, indicating a nameless subdomain. For example, to produce the above output, I would include the following lines in the configuration file
SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk
Numerical subdomains (which have most significant part on the left) can also occur. They will look like
131 The Ever-Popular 131 domain
131.111 # Nameless
Also subdomains with wildcards in can occur. The following are examples:
SUBDOMAIN *.edu # mit.edu, umn.edu etc.
SUBDOMAIN 131.111.* # 131.111.1, 131.111.2 etc.
SUBDOMAIN % # all top-level numerical domains, from 1 to 255
If you specify wild subdomains, you will probably want to set quite a high
There is a command
NOTSUBDOMAIN to erase a previously requested
subdomain. For example, you can write
NOTSUBDOMAIN *.edu
NOTSUBDOMAIN cam.ac.uk
However, if you request, for example,
NOTSUBDOMAIN mit.edu
will be unable to override it.
There is a configuration command
SUBDOMFLOOR to specify how
much traffic or how
many requests a subdomain needs to be included in the output. It works the
same way as the
DOMFLOOR command above.
Within a domain, subdomains will be output in alphabetical order.
The domain report relies on having a domains file available,
listing which geographical locations correspond to which domains. Which file
is to be used as the domains file can be specified by the commandline option
-ffilename or the configuration command
DOMAINSFILE domainsfile
The correct format of the domains file is explained in a separate section.
#reqs: %bytes: host
----- ------ ----
10: 0.03%: zlsm03.arcs.ac.at
11: 0.04%: iki10.boku.ac.at
1: : oeh1.boku.ac.at
2: 0.01%: dopefish.esi.ac.at
1: : piassun1.joanneum.ac.at
...
This is much the same as the domain report, with commandline options
-S, and configuration commands
HOSTCOLS. Note that in this report, alphabetical sorting is by domain as most significant part. This report can be very long and slow to sort, and should be used with a high floor if at all.
#reqs: %bytes: directory
------ ------ ---------
237985: 35.40%: /~sret1/
18596: 17.60%: /~rrw1/
3574: 11.89%: /~richard/
2376: 7.92%: /~steve/
13518: 7.42%: /Dept/
...
Again, this is much the same as the domain report, with commandline options
-i, and configuration commands
DIRCOLS. There is one further variable for this report, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like
#reqs: %bytes: directory
------ ------ ---------
43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
11298: 4.14%: /~sret1/
5322: 1.71%: /~sret1/backgammon/books/
2773: 1.22%: /~sret1/images/
728: 0.66%: /~sret1/backgammon/clubs/
...
This can be specified by the commandline option
+l3 or the configuration command
#reqs: %bytes: filename
----- ------ --------
33980: 23.66%: /~sret1/backgammon/main.html
21162: 2.69%: /~sret1/backgammon/bitmaps/board.xbm
20422: 0.49%: /~sret1/backgammon/bitmaps/dice1.xbm
20187: 0.49%: /~sret1/backgammon/bitmaps/dice2b.xbm
12690: 0.86%: /
8457: 1.09%: /header.gif
7198: 0.81%: /~sret1/coldlist.html
5461: 0.48%: /home.xbm
3550: 0.32%: /~sret1/home.html
3370: 0.23%: /~mcmc/html/
...
Commandline options
-r, and configuration commands
REQCOLS work analogously to the last three reports. There are also various options to control which files are printed and which are given links.
In fact, if the commandline option
used, only pages will be displayed in the report. If you want to list all
files, including, for example, graphics, then you should use
instead; alternatively, if neither
specified on the commandline, the configuration command
REQTYPE PAGES # or ALL
will control whether pages or all files are listed.
There are three possible modes of linking in the request report; you can link
to none of the files, or pages only, or all files. The commandline options
for these are
respectively; or you can use the configuration command
PAGELINKS OFF # or ON, or ALL
There is also a related command BASEURL to specify a URL to prepend to the links. For example, if
BASEURL http://www.statslab.cam.ac.uk
were specified, then /~sret1/analog/ would be linked to http://www.statslab.cam.ac.uk/~sret1/analog/. This is useful if you want to display the statistics on a different server from the one they belong to.
You can also specify in the configuration file what should be counted as a
page in the requests report (thus giving you complete control over what goes
in the report, or what is linked to). At the beginning, the following are
*/). The command
ISPAGE filename
will specify that some other file is a `page'. Filenames can begin with an asterisk (*) as a wild card; so, for example,
ISPAGE *.ps
ISPAGE *.ps.gz
would mean that Postscript files and compressed Postscript files are to be regarded as pages. You can also use
ISNOTPAGE filename
to specify that something which would otherwise be a page is not to be regarded as a page.
The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command
analog logfile.log
will use that logfile for its report. You can also write
analog -
to use standard input as the logfile. This is useful in constructing pipes; for example, if you want to analyse an old compressed logfile, you could type
gzcat logfile.old.gz | analog -
(gzcat might be called zcat on some systems). You can also specify which logfile to use in the configuration file by means of a command like
LOGFILE logfile.log
# or ...
LOGFILE stdin # for standard input
There are various commands which instruct the program only to analyse part of the logfile. These are all configuration commands only; they have no commandline or analhead.h equivalents.
First, you can instruct the program to take into account only certain files.
This is done by means of the
commands. Asterisks can appear at the end of the filenames specified, as
wildcards. For example, the configuration
FILEONLY /~sret1/*
FILEIGNORE /~sret1/backgammon/*
FILEIGNORE /~sret1/home.html
would instruct the program to examine only my files, excluding my backgammon files and home page. (This should not be confused with excluding them from the request report, which still includes them in other reports; this excludes them altogether from the whole analysis.)
There are similar commands
to analyse only the requests from certain sites. Here an asterisk can occur at
the start of a hostname. For example,
HOSTIGNORE emu.pmms.cam.ac.uk
HOSTIGNORE *.statslab.cam.ac.uk
would ignore accesses from emu and from my site (including statslab.cam.ac.uk itself). For numerical domains, the asterisk occurs at the end, not the beginning: for example
HOSTIGNORE 131.111.20.* # ignore unresolved addresses from statslab
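The wildcard rules for hostnames (an asterisk at the start of a name, which also covers the bare domain, or at the end of a numerical address) can be written down as a small Python sketch. The function name host_matches is hypothetical, and whether analog compares names case-sensitively is not stated here, so the sketch compares them exactly as given.

```python
def host_matches(host, pattern):
    """Return True if `host` is covered by a HOSTIGNORE-style pattern."""
    if pattern.startswith("*."):
        suffix = pattern[1:]                   # e.g. ".statslab.cam.ac.uk"
        # Matches any host under the domain, and the domain itself.
        return host.endswith(suffix) or host == suffix[1:]
    if pattern.endswith(".*"):
        return host.startswith(pattern[:-1])   # e.g. "131.111.20."
    return host == pattern

for host in ["lion.statslab.cam.ac.uk", "statslab.cam.ac.uk", "131.111.20.7"]:
    print(host, host_matches(host, "*.statslab.cam.ac.uk"),
          host_matches(host, "131.111.20.*"))
```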
Finally, there are commands to analyse only a subset of the dates in the
logfile. The simplest usage is
FROM yymmdd and
TO yymmdd. So, for example, to analyse only requests in July
1995 I would use the configuration
FROM 950701
TO 950731
Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like
FROM -01-00+01 # from tomorrow last year
TO -00-0131 # to the end of last month (OK even if last month
            # didn't have 31 days)
FROM -00-00-56
TO -00-00-01 # statistics for the last 8 weeks
If a TO command is given, the figures for the last 7 days refer
to the time until then.
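One way to read the FROM and TO arguments is sketched below in Python. This is an interpretation of the rules above, not analog's own code: the function name resolve_date is hypothetical, a plain two-digit year is assumed to mean 19yy, and an absolute day beyond the end of the month is assumed to be clamped to the last day of that month.

```python
import calendar
import re
from datetime import date, timedelta

FIELD = re.compile(r"([+-]?)(\d\d)")

def resolve_date(spec, today):
    """Turn an argument such as '950701' or '-00-0131' into a date."""
    fields = FIELD.findall(spec)
    if len(fields) != 3 or "".join(s + d for s, d in fields) != spec:
        raise ValueError(f"bad date: {spec!r}")
    (ysign, yy), (msign, mm), (dsign, dd) = [(s, int(d)) for s, d in fields]
    if ysign == "+":
        raise ValueError("the year cannot be preceded by +")

    # Assumption: a plain two-digit year means 19yy.
    year = today.year - yy if ysign == "-" else 1900 + yy

    if msign:   # relative month, carrying into the year where necessary
        months = year * 12 + (today.month - 1) + (mm if msign == "+" else -mm)
        year, month = months // 12, months % 12 + 1
    else:
        month = mm

    last_day = calendar.monthrange(year, month)[1]
    if dsign:   # relative day, counted from today's day of the month
        base = date(year, month, min(today.day, last_day))
        return base + timedelta(days=dd if dsign == "+" else -dd)
    # Assumption: an absolute day past the end of the month is clamped.
    return date(year, month, min(dd, last_day))

today = date(1995, 11, 11)
for spec in ["950701", "-01-00+01", "-00-0131", "-00-00-56"]:
    print(spec, resolve_date(spec, today))
```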
FILEALIAS file1 file2
says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately. It also understands that .. means `parent directory,' . means `this directory' and // is the same as /, and translates those filenames to their canonical forms.
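The built-in translations listed above can be sketched in Python as follows. This is an illustration of the rules only, not analog's implementation; the function name canonical is hypothetical, and details such as the order in which the steps are applied are assumptions.

```python
from urllib.parse import unquote

def canonical(path):
    """Reduce a requested filename to a canonical form."""
    path = unquote(path)                      # %7E becomes ~
    trailing = path.endswith(("/", "/.", "/.."))
    parts = []
    for segment in path.split("/"):
        if segment in ("", "."):              # '//' and '.' add nothing
            continue
        if segment == "..":                   # parent directory
            if parts:
                parts.pop()
            continue
        parts.append(segment)
    if parts and parts[-1] == "index.html" and not trailing:
        parts.pop()                           # /dir/index.html is /dir/
        trailing = True
    result = "/" + "/".join(parts)
    if trailing and not result.endswith("/"):
        result += "/"
    return result

print(canonical("/%7Esret1//analog/./index.html"))
print(canonical("/a/b/../c.html"))
```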
If a * is placed at the end of the first entry, then all
filenames starting with file1 will be changed to start with
file2. So, for example, after the command
FILEALIAS /~sret1/statprog/* /~sret1/analog/
a filename looking like /~sret1/statprog/statprog/stat.c will be understood as /~sret1/analog/statprog/stat.c. (Note that the conversion is done only once for each filename; you don't get
A pair of related commands is
WITHOUTARGS. Normally any arguments given as part of a URL (after
a question mark) are ignored. However, if a configuration command like
WITHARGS /cgi-bin/prog.cgi
is given, then the arguments to that file will form part of the filename. So
/cgi-bin/prog.cgi?b
will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks can again occur at the end of the filename, for example in commands like
WITHARGS /cgi-bin/*
There is also a parallel command WITHOUTARGS; for example,
WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi
would read the arguments for all files in
There is a command
HOSTALIAS, similar to
which is useful if your
server records local hostnames in the logfile
instead of full internet names. Also, if a host has two names, they
can be combined in this way. So, for example, I might find it
convenient to use
HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk
Again, only one conversion is done per host, which is why I need both the second and the third line. There is no wildcard conversion for this command.
One more related command is the command to tell analog to try and look up the
names of hosts that appear only as numerical addresses in the logfile; so, for
example, 22.214.171.124 will be translated to lion.statslab.cam.ac.uk. Note,
however, that not all hosts have names, or we may not be able to discover
them. The commandline option to try and translate numerical addresses is
-1 to turn it off); the equivalent
configuration command is
NUMLOOKUP ON # or OFF
Looking up hostnames is a slow business. If this option is used, be prepared for analog to take a very long time to compile its report.
-a, or the configuration commands
ASCII ON # or OFF
HTML OFF # or ON (equivalent to previous line)
If you choose ASCII output, some of the other options are ignored, but it should be obvious which ones they will be.
You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of
analog > outfile.html
you can use the commandline option
+Ooutfile.html or the configuration command
There is a configuration command
specifies which order the reports should occur in. The usage is a line like
REPORTORDER hDdWmoSir
This says that the reports should occur in the order hourly summary (h), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i) and request report (r). It is important to include all the above nine letters exactly once each.
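The requirement that all nine letters appear exactly once each is easy to check mechanically. The Python sketch below is an illustration only; the function name report_order is hypothetical, while the letters and report names are those listed above.

```python
REPORT_LETTERS = {
    "h": "hourly summary", "D": "daily report", "d": "daily summary",
    "W": "weekly report", "m": "monthly report", "o": "domain report",
    "S": "host report", "i": "directory report", "r": "request report",
}

def report_order(order):
    """Check a REPORTORDER argument and return the report names in order."""
    if sorted(order) != sorted(REPORT_LETTERS):
        raise ValueError("each of the letters hDdWmoSir must occur exactly once")
    return [REPORT_LETTERS[letter] for letter in order]

print(report_order("hDdWmoSir"))
```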
There is a commandline option
+A to include all reports
(particular ones can then be omitted with
-d or whatever);
-A omits all reports (and particular ones can then be
included). The equivalent configuration commands are
ALL ON # or OFF
Note that order is important; for example,
+i -A +r
will include the request report but not the directory report.
The title line of the output page contains three adjustable variables. First,
the logo in the top left hand corner can be turned on or off, or any other
logo substituted (for example, your organisation's logo). This is accomplished
by the commandline arguments
-p (no logo: mnemonic, p for
+p (use the default logo) and
use the logo at the given URL. The equivalent configuration commands are
LOGO ON # or OFF
LOGOURL url # where it is
The organisation name on the title line can be specified by means of the option
-nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option
-u- if you don't want any link. The equivalent configuration options are
HOSTNAME name # must be in quotes if it contains spaces
HOSTURL URL
HOSTURL - # for no link
A header file and footer file can be inserted near the top and bottom of your
output. These should be written in HTML, and can contain anything you want.
Possible uses include providing information about your organisation or about
the way the statistics were calculated, linking to related pages, and no doubt
many other things. The commandline options to achieve this are
-F to turn them off, and the configuration commands are
HEADERFILE filename
FOOTERFILE none # if you don't want one
There is also a configuration command to use a certain image as the background to the output page. If you insist on using one it should be small, otherwise people with slow lines won't be able to load your page, and it should not stop people with low resolution monochrome screens being able to read your page. The command is
BACKGROUND none # preferably!
BACKGROUND URL # to use that URL
A command like
WEEKBEGINSON SUNDAY
says which day should be regarded as the first day of the week. This is used in the daily report, daily summary and weekly report.
There is a command
SEPCHAR to say which character should separate
each group of three digits in long numbers. For example,
SEPCHAR ,
will give 123,456,789, whereas
SEPCHAR ' '
will give 123 456 789.
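The effect of the separator character can be reproduced in a couple of lines of Python. This only illustrates the grouping into threes; the function name group_digits is hypothetical.

```python
def group_digits(number, sepchar):
    """Separate each group of three digits with sepchar."""
    return f"{number:,}".replace(",", sepchar)

print(group_digits(123456789, ","))
print(group_digits(123456789, " "))
```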
The character which is used in the barcharts in some of the reports can be
changed to, for example, a hash by
MARKCHAR '#' # put in quotes so that it isn't a comment
Those graphical reports also need to know how many characters wide the output
page is. Although a normal page is 80 characters wide, for Web pages about
PAGEWIDTH 65 seems to be about right.
Finally, there is a debugging command, for printing (to stderr) problems with
your logfile. There are currently three levels of debugging: 0 for no
debugging, 1 for printing corrupt logfile lines (prepended by "C:"), and 2
which also prints hosts for which the domain is unknown (prepended by "U:").
The commandline option for level 1 debugging is
+V1 (V for
verbose) and the configuration command is
DEBUG 1
You can also use commandline options
+V for level 1 and
-V for level 0.
ad Andorra
ae United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.
Comments can occur in the domains file. They are introduced by the
So you could write, for example,
uk United Kingdom # God save the Queen!
To set up the form interface, go to the directory where the analog source code lives, and follow these steps.
FORMPROG is set to be the URL of the form processing program, which will be wherever cgi-bin programs live on your server; normally in the cgi-bin directory.
FORMPROG. Make sure it is executable by the server.
If the third step above fails to generate a form, you can generate one
yourself by means of the command
analog -form +Oanalogform.html.
You might also want to run this command yourself if you want to supply
different default options from normal for the form user: if you run the
command with extra commandline or configuration file options, they will be
respected in the construction of the form.
If the form doesn't seem to work, check the following:
setenv QUERY_STRING "xq=1"
(C Shell) or
export QUERY_STRING="xq=1"
(other shells), then run analform from the shell.
<input type=hidden name="TZ" value="">
For the value you should insert your timezone, in standard format. Usually this looks like your winter timezone name, followed by hours west of Greenwich, followed by your summer timezone name. So the East Coast of the USA should have
value="EST5EDT", and Germany
It is better, although not essential, if when you change the default options for your analog, you remake the form.
Note that you probably want to restrict access to the form and form program to certain users; if it is world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running. You might also want to specify a default configuration file in analhead.h (which the form user cannot override except where options are provided on the form) or remove some options from the form.
CFLAGS in the Makefile to turn on the ANSI option in a compiler like cc.
REQTYPE ALL to list all files in the request report, or
ISPAGE to say that this file is a `page.'
+ssoption, or turning hostname counting off altogether with
zcat has an option
-f to uncompress compressed files and leave other files alone, and then stick them all together. So
gzcat -f log1.gz log2.gz log3 | analog -
is the required command.
gzip -9 (see the previous question). Because of the high number of repeated strings in the logfile, compression is very efficient.
If we are doing a `top n' report and two entries tie for nth place, only one will be printed.
The reported `running time' is elapsed real time, not CPU time.
If you specify
+oa-10 you really do get the top ten domains
alphabetically. This is almost certainly useless!
You can sort a report by requests even when you have turned off the request columns. This may confuse your readers.
The behaviour of
FILEALIAS a b;
FILEALIAS b c is
The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.
Do not alias a file to itself
(FILEALIAS /home.html /home.html) or a host to itself, or
it will get lost.
I am happy to help people who have trouble with analog, but please read the FAQ and list of known bugs first. Also, you might be able to diagnose the problem yourself if you run
analog -v [your usual options]
which lists the value of all variables. But if you still can't get it to work, ask me. It helps me find bugs, and to know where the documentation is unclear.
The following feature is already on the list to be done in the next version. Let me know if you have any comments on it.
Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.
Page last modified: 11-Nov-95