This README describes analog 0.93beta. For the latest version of analog, see the analog home page.
This program analyses logfiles in both the common log format and NCSA old format from WWW servers. It is designed to be fast on long logfiles and to produce attractive statistics. For more details, see the
LOGFILE, and you will want to change
When you have done that, compile the program by typing
make
(It may take a while, as the program is rather big.) If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again.
ad Andorra
ae United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.
Subdomains can also be analysed in the domain report. They are specified in the configuration file.
Comments can occur in the domains file. They are introduced by the # character. So you could write, for example,
uk United Kingdom # God save the Queen
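The domains file format described above can be sketched in a few lines of Python. This is only an illustration of the rules in this README, not analog's own code (analog is not written in Python): the function name parse_domains is hypothetical, and treating a line with a code but no name as a nameless domain is an assumption.

```python
def parse_domains(lines):
    """Return {code: name} from domains-file lines; name is None if hidden."""
    domains = {}
    for line in lines:
        # Anything from '#' to the end of the line is a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # Arbitrary space may separate the code from the location name.
        parts = line.split(None, 1)
        code = parts[0].lower()          # codes are converted to lower case
        # Assumption: a missing name is treated like '?'.
        name = parts[1].strip() if len(parts) > 1 else "?"
        # '?' (or anything starting with '?') means: recognise, don't print.
        domains[code] = None if name.startswith("?") else name
    return domains
```

For instance, parse_domains(["uk United Kingdom # God save the Queen"]) maps "uk" to "United Kingdom".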
When analog starts running, it can read instructions from a configuration file (or from stdin). The following instructions can be in this file.
FILEALIAS file1 file2
If file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/, and translates `escaped' entities (e.g., %7E is the same as ~), so these don't need to be specified separately. If a * is placed at the end of the first entry, then all filenames starting with file1 will be changed to start with file2. So, for example, after the command
FILEALIAS /~sret1/statprog/* /~sret1/analog/
a filename looking like /~sret1/statprog/statprog/stat.c will be understood as /~sret1/analog/statprog/stat.c. (Note that the conversion is done only once for each filename.)
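The aliasing rule can be illustrated with a short Python sketch. It is not analog's implementation: the function name apply_file_alias is hypothetical, and trying the aliases in configuration order and stopping at the first match is an assumption about how "only once" is applied.

```python
def apply_file_alias(filename, aliases):
    """aliases: list of (file1, file2) pairs, assumed to be in config order."""
    for old, new in aliases:
        if old.endswith("*"):
            prefix = old[:-1]
            if filename.startswith(prefix):
                # Swap the prefix and return at once: one conversion only.
                return new + filename[len(prefix):]
        elif filename == old:
            return new
    return filename  # no alias applies
```

With the alias from the example, apply_file_alias("/~sret1/statprog/statprog/stat.c", [("/~sret1/statprog/*", "/~sret1/analog/")]) gives "/~sret1/analog/statprog/stat.c"; the second "statprog" is untouched because only the leading prefix is replaced.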
HOSTALIAS host1 host2
This is useful if your server records local hosts in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use
HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk
Again, only one conversion is done per host, which is why I need both the second and the third line.
FILEIGNORE ignores altogether logfile entries due to certain files. Again, an asterisk (*) can appear at the end of the name. Another command, FILEONLY, looks at only that file. If two FILEONLY commands appear, the program will look at entries matching either of them. For example
FILEONLY /~sret1/*
FILEONLY /~sret2/*
FILEIGNORE /~sret1/backgammon/*
FILEIGNORE /~sret2/home.html
will look at my files and sret2's files, but not my backgammon files or sret2's home page. These commands are applied after any aliasing of filenames has been done, but the filenames given in these commands are not themselves subject to any aliasing.
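How the two kinds of command combine can be sketched as follows. This is a minimal Python illustration of the behaviour described above, not analog's code; the names matches and file_wanted are hypothetical, and only a trailing * is treated as a wildcard.

```python
def matches(pattern, filename):
    """A trailing '*' matches any continuation; otherwise compare exactly."""
    if pattern.endswith("*"):
        return filename.startswith(pattern[:-1])
    return filename == pattern


def file_wanted(filename, only, ignore):
    """only / ignore: lists of FILEONLY / FILEIGNORE patterns."""
    # With FILEONLY lines present, the file must match at least one of them.
    if only and not any(matches(p, filename) for p in only):
        return False
    # Any matching FILEIGNORE line excludes the file.
    return not any(matches(p, filename) for p in ignore)
```

The filename passed in would be the one obtained after aliasing, since the commands are applied after any aliasing has been done.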
HOSTONLY *.statslab.cam.ac.uk
HOSTIGNORE lion.statslab.cam.ac.uk
will only include hosts from my local site, but not lion.statslab. (In this case, *.statslab.cam.ac.uk matches the host statslab.cam.ac.uk. If you want to exclude it, you can give a HOSTIGNORE line for it.)
TO yymmdd. So, for example, if I wanted to analyse only requests in June 1995 I could use the configuration
FROM 950601
TO 950630
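As an illustration of the yymmdd form, here is a small Python sketch of such a date filter. It is not analog's code: the names parse_yymmdd and in_range are hypothetical, treating both end dates as inclusive is an assumption, and the two-digit year is expanded by Python's own strptime rule (69 to 99 become 1969 to 1999), which may not be the rule analog uses.

```python
from datetime import date, datetime


def parse_yymmdd(text):
    """Parse a yymmdd string such as '950601' into a date."""
    return datetime.strptime(text, "%y%m%d").date()


def in_range(day, from_text=None, to_text=None):
    """True if day lies between the FROM and TO dates (assumed inclusive)."""
    if from_text is not None and day < parse_yymmdd(from_text):
        return False
    if to_text is not None and day > parse_yymmdd(to_text):
        return False
    return True
```

With the June 1995 configuration, in_range(date(1995, 6, 15), "950601", "950630") is true and in_range(date(1995, 7, 1), "950601", "950630") is false.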
Use SUBDOMAIN to analyse subdomains within the domain report. The syntax is
SUBDOMAIN subdomain subdomain_name
If the subdomain name contains spaces, it must be enclosed in quotes. The symbol ? can be used as the subdomain name, indicating a nameless subdomain. For example, I typically run analog with a configuration file including the lines
SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk ?
which produces output like
 103118 :  46.60% : .uk (United Kingdom)
( 64975):( 35.46%): cam.ac.uk (University of Cambridge)
( 47133):( 20.56%): statslab.cam.ac.uk
  49271 :  12.49% : .edu (USA Educational)
Numerical subdomains (which have most significant part on the left) can also occur. They will look like
131 The Ever-Popular 131 domain
or
131.111 ?
Within a domain, subdomains will be output in the order in which they occur in the configuration file.
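The difference between named and numerical subdomains can be sketched in Python. This is an illustration only, with the hypothetical name subdomains_for: named subdomains are matched at the right-hand end of the host name, numerical ones at the left-hand end, and results come back in configuration order.

```python
def subdomains_for(host, configured):
    """Return the configured subdomains containing host, in config order."""
    host_numeric = host.replace(".", "").isdigit()
    found = []
    for sub in configured:
        sub_numeric = sub.replace(".", "").isdigit()
        if host_numeric and sub_numeric:
            # Numerical: most significant part on the left.
            if host == sub or host.startswith(sub + "."):
                found.append(sub)
        elif not host_numeric and not sub_numeric:
            # Named: most significant part on the right.
            if host == sub or host.endswith("." + sub):
                found.append(sub)
    return found
```

Matching on whole dot-separated components means that 131 does not claim a host such as 13.111.1.1.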
Use WEEKBEGINSON to say which day is to be considered the first day of the week. For example
WEEKBEGINSON SUNDAY
This is used in the daily report, daily summary and weekly report. This variable can also be set in the file analhead.h.
APPROXHOSTSIZE 100000
says how many bytes of memory to devote to this task; the more bytes, the greater the accuracy. About 3 bytes per host should give a very good estimate, and even 1 byte per host is fair. If statistics for the last 7 days are requested, twice this amount of space will be used.
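This README does not say how analog counts hosts in fixed memory. To show why more bytes give a more accurate figure, here is a Python sketch of one well-known fixed-memory technique, linear counting over a bitmap. It is a stand-in chosen for illustration, not analog's actual method; the class name ApproxHostCounter and the choice of SHA-256 as the hash are assumptions.

```python
import hashlib
import math


class ApproxHostCounter:
    """Linear counting: estimate distinct hosts from a fixed-size bitmap."""

    def __init__(self, size_bytes):
        self.bits = bytearray(size_bytes)
        self.nbits = size_bytes * 8

    def add(self, host):
        # Hash the host name to one bit position and set that bit.
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.nbits
        self.bits[index // 8] |= 1 << (index % 8)

    def estimate(self):
        set_bits = sum(bin(b).count("1") for b in self.bits)
        zero_fraction = (self.nbits - set_bits) / self.nbits
        if zero_fraction == 0:
            return float("inf")  # bitmap is full; no estimate possible
        # Expected fraction of zero bits after n hosts is exp(-n / nbits).
        return -self.nbits * math.log(zero_fraction)
```

Repeated visits from the same host set the same bit, so they do not inflate the estimate; a larger bitmap means fewer collisions between different hosts.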
Use ISPAGE filename to specify what is to be regarded as a `page' for the purposes of constructing the request report. At the beginning, the following are pages:
*.shtm. You can also use ISNOTPAGE filename to specify that something which would otherwise be a page is not a page. Together with the ability to specify that only pages should be included in the request report, these commands give you complete control over what goes in the request report.
LOGFILE logfile          # the logfile to be analysed
DOMAINSFILE domainsfile  # where to get the domains from
HEADERFILE headerfile    # Put this file near the top of the output
FOOTERFILE none          # But none at the bottom
LOGOURL logourl          # the URL where the logo for the report title is.
LOGO on                  # or off. Whether to use the logo.
HOSTNAME "Stephen's pages"              # our organisation
HOSTURL http://www.statslab.cam.ac.uk/  # where to find us; - for off.
REQFLOOR 1               # if sorting by requests or alphabetically, the
                         # fewest requests needed to get on the request
                         # report; if by bytes, the min traffic in 1/100ths
                         # of a percent; if negative, do a 'top n' report.
DIRFLOOR 1               # directory report
DOMFLOOR -10             # domain report
HOSTFLOOR 100            # if this is too low, host report will be slow and long.
REQSORTBY byrequests     # how to sort the request report
DIRSORTBY alphabetical   # directory report
DOMSORTBY bybytes        # domain report
HOSTSORTBY alphabetical  # host report. Alphabetical is with
                         # domain as most significant part.
MARKCHAR *               # the character for graphical displays
MARKCHAR '#'             # in quotes so that it isn't a comment
PAGEWIDTH 65             # the width of the output pages
MONTHLY on               # do a monthly report?
DAILY off                # a daily summary?
FULLDAILY off            # a full daily report (one line for each day)?
WEEKLY off               # a weekly report?
HOURLY on                # an hourly summary?
DOMAIN on                # a domain report?
FULLHOSTS off            # a full hostname report (can be very long)?
DIRECTORY on             # a directory report?
REQUEST off              # a request report?
DIRLEVEL 2               # the level of the directory report
REQTYPE pages            # what to print in the request report; pages or all
PAGELINKS on             # whether to link to pages in the request report
COUNTHOSTS on            # count the total number of hosts who have visited? Use
                         # 'COUNTHOSTS approx' for approx count in fixed memory.
LASTSEVEN on             # statistics for the last 7 days?
NUMLOOKUP on             # try and resolve numerical addresses; NB SLOW!
MONTHLYUNIT 1000         # the size of the character in the graphical displays
HOURLYUNIT 0             # 0 represents 'choose something sensible automatically'
DAILYUNIT 0
FULLDAILYUNIT 0
WEEKLYUNIT 0
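The meaning of the FLOOR values (as described in the REQFLOOR comment above) depends on both the sign of the floor and the sort method. The following Python sketch spells that out. It is an illustration rather than analog's code; the function name apply_floor and the row layout (name, requests, bytes) are hypothetical.

```python
def apply_floor(rows, floor, sort_by):
    """rows: (name, requests, nbytes) tuples.
    sort_by: 'byrequests', 'bybytes' or 'alphabetical'."""
    if sort_by == "bybytes":
        ordered = sorted(rows, key=lambda r: r[2], reverse=True)
    elif sort_by == "byrequests":
        ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    else:
        ordered = sorted(rows, key=lambda r: r[0])
    if floor < 0:
        # Negative floor: a 'top n' report with n = -floor.
        return ordered[:-floor]
    if sort_by == "bybytes":
        # Floor is in 1/100ths of a percent of the total traffic.
        total = sum(r[2] for r in rows)
        return [r for r in ordered if r[2] * 10000 >= floor * total]
    # Sorting by requests or alphabetically: floor is a request count.
    return [r for r in ordered if r[1] >= floor]
```

So DOMFLOOR -10 with DOMSORTBY bybytes keeps the ten domains with the most traffic, while REQFLOOR 1 with REQSORTBY byrequests keeps every file requested at least once.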
The program usage is
analog [logfile | -] [options]
The name of the logfile can be omitted, in which case the default (as specified in analhead.h) will be used. A logfile name of - will use stdin. This is useful for constructing pipes, e.g.,
cat access_log1 access_log2 | analog -
(This is safe, because the program works properly even if the logfile it is analysing is not in chronological order.)
The legal options are as follows. (Again, you don't need any of these; you can just use the defaults. All of them have default values which can be changed in analhead.h or the configuration file.) If two contradictory options are specified, the latter is heeded (this might be useful if you have an alias set up, for example). Note that items in square brackets are optional; also that there can be no space between an option and any arguments to it. See the pages cited above for the meaning of the various reports.
-v                  Just print out the values of all the variables based on
                    the defaults, configuration file and commandline options,
                    then exit.
-m                  Don't do a monthly report.
+m[n]               Do a monthly report. Make each character in the graph
                    worth n requests; if n = 0, choose something sensible.
-d | +d[n]          Don't do/Do a daily summary.
-h | +h[n]          Don't do/Do an hourly summary.
-o                  Don't do a domain report.
+o[a|b|r][n]        Do a domain report. Sort alphabetically/by bytes/by
                    requests. If n is positive, only print domains with at
                    least n requests (r or a) or at least n/100ths of a
                    percent of the traffic (b). If n is negative, print the
                    top n (well, top -n actually) on the report.
-i | +i[a|b|r][n]   Directory report.
-r | +r[a|b|r][n]   Request report; report pages only. (But other files are
                    still counted in the other reports).
-R | +R[a|b|r][n]   Request report; report all filenames.
-S | +S[a|b|r][n]   Host report; one line for every host. This is slow and
                    produces copious output unless n is set high or only a
                    small (part of a) logfile is being analysed.
+s                  Count number of distinct hosts. This can use a lot of
                    memory! +S forces +s.
+ss                 Do an approximate count in fixed memory instead.
-s                  Don't do any host count.
-7 | +7             Don't/do give statistics for last 7 days.
-ln                 Level (or depth) of directory report.
-k | +k | +kk       Don't link in the request report / link to pages / link
                    to everything.
-cchar              The character to use in the graphical displays.
-wn                 The width of the page for output. 65 is about right.
-nhostname          The name of your organisation or WWW server. (Printed in
                    the title of the output page).
-uhosturl           The URL of your front page. Use -u- if you don't want the
                    top line of your output to be linked to anywhere.
-fdomainsfile       The domains file you want to use.
+gconffile          The configuration file you want to use. Use +g- to read
                    configuration instructions from stdin.
+Gconffile          Use another configuration file in addition to the default
                    one.
-g | -G             Don't use a configuration file.
+Hheaderfile        Include the specified header file near the top of the
                    output.
-H                  Don't use a header file.
+Ffooterfile        Put this file at the bottom of the output.
-F                  Don't use a footer file.
-p                  Don't put a logo at the top (p for picture)
+p[logourl]         Do put a logo. Find it at the given URL.
-1 / +1             Don't / do try and look up all numerical addresses. (May
                    be slow).
The figures for the last 7 days are for the 7 days before the program is run, not before the last entry in the logfile. Moreover, they are actually for any times after 7 days before the program is started; so `future' requests will be included if some more requests come in while the program is reading the logfile (or if the clock on the computer running the program is not synchronised with the one on the computer that recorded the logs).
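The rule for the last 7 days amounts to a single comparison with no upper bound, which is why `future' requests are counted. A minimal Python sketch of that comparison, with the hypothetical name in_last_seven:

```python
from datetime import datetime, timedelta


def in_last_seven(request_time, program_start):
    """True for any request later than 7 days before the program started."""
    # There is deliberately no upper limit, so requests timestamped after
    # program_start are included as well.
    return request_time > program_start - timedelta(days=7)
```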
If we are doing a `top n' report and two entries tie for nth place, only one will be printed.
The reported `running time' is elapsed real time, not CPU time.
If you specify +oa-10 you really do get the top ten domains alphabetically. This is almost certainly useless!
Subdomains will be reported in the domain report if and only if their parent domain is, independent of their usage.
The bytes aren't reported correctly. This is really a bug in the logfile. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate.
I expect this version of analog to be very close to version 1.0, so this is your last chance to suggest what should be in it before beta testing ends. The following list contains in no particular order things that people have suggested up to now. I welcome comments on how high a priority people place on the various items, and what other things would be useful. I won't spend time doing things unless there is a demand for them.
(not statslab).cam.ac.uk. (The exact user input could be tricky).
Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.
Page last modified: 27-Jul-95