This Readme describes analog1.92beta. For the latest version of analog, see the analog home page.
This program analyses logfiles from WWW servers. It works on most Unix, DOS, Mac and VMS machines. It is designed to be fast and to produce attractive statistics. For more details, see the
Sorry about the length of this Readme. It includes documentation on everything the program can do. It's not as complicated as it looks, and you don't have to read all of it before using the program anyway!
This program is freeware, but its use is covered by a licence which is at the bottom of this file. You must agree to the terms of the licence before using the program. This is a beta test version, and although I believe it to be reliable, some bugs can still be expected.
*.htm files are now pages on all machines.
This section describes how to compile analog on Unix and VMS. If you've got the Mac or DOS version, the program comes already compiled, so you can skip to the next section.
If you want to get on with trying out the program straight away, you can leave most of this Readme until another time. The one thing you need to do is to look at the file analhead.h. It contains the user-settable options, but most of them you can leave alone for the moment. You will probably want to check the first few options in the file, but you can even leave most of them until later.
Next you must move the images that came with the analog program (in the directory images) into the IMAGEDIR specified in analhead.h.
When you have done that, compile the program by typing
    make
under Unix, or
    MMS
using MMS under VMS. If that doesn't work, and you're on Unix, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again. (In particular, Solaris 2 users need to change the LIBS= line). If it still doesn't compile, try DEFS=-DNODNS to ignore the DNS lookup code.
Then just type
    analog
to run the program. To send the output to a particular file instead of to the screen, type, e.g.,
    analog > outfile.html
(This assumes that . is in your $PATH, but it should be. Otherwise try ./analog instead of just analog).
Many options can be set in the file analhead.h. If you're on Unix or VMS (or compiling your own version on another platform), these can be changed before compiling the program. They are explained in that file, so they will not be documented again here.
Otherwise, analog takes its options from configuration files. Many of the configuration commands also have abbreviations as commandline arguments. (There aren't any commandline arguments on the Mac; you have to use the configuration commands. Ignore all talk of commandline arguments below).
So, for example, the configuration command
    DAILY OFF
tells analog not to include a daily summary in the output. But this can also be specified by the command
    analog -d
because the -d option is an abbreviation for DAILY OFF.
In fact any configuration command can be specified on the commandline by means of the +C option; you could write
    analog +C"DAILY OFF"
(This is most useful for running analog from a script or cron job).
Analog comes with a small configuration file to get you started. To specify a configuration file, you use the commandline argument +g followed by the name of the file. (Mac users have no commandline arguments, so can only use the default configuration file). For example,
    analog +gextra.conf
tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline arguments). (You can also specify standard input as the configuration file by the option +g-).
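For running analog from a script, the rules above (one +C per configuration command, +g for a configuration file, no space between the option and its value) can be wrapped in a small helper. This is only a sketch: the function name and its parameters are made up for illustration, and it merely builds the argument list without running anything.

```python
def build_analog_argv(config_commands=(), config_file=None, program="analog"):
    """Build an argument list for analog from the options described above."""
    argv = [program]
    for command in config_commands:
        # +C carries one configuration command; no space after the option.
        argv.append("+C" + command)
    if config_file is not None:
        # +g names a configuration file, again with no space in between.
        argv.append("+g" + config_file)
    return argv


argv = build_analog_argv(["DAILY OFF"], config_file="extra.conf")
print(argv)  # ['analog', '+CDAILY OFF', '+gextra.conf']
```

Passing the list to something like Python's subprocess.run avoids shell quoting problems with commands such as "DAILY OFF" that contain a space. Since arguments are read in the order given, the helper puts the +C commands before the +g file; swap the two loops if the file should be read first.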
The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.
    DAILY OFF        # We don't want a daily summary
    FULLDAILY ON     # We want a full daily report instead
An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. Note that configuration commands are not generally the same as those in analhead.h, although many have the same name.
Commandline arguments are read in the order in which they occur, and configuration files are read when the +g argument is reached. If commands conflict, later commands override earlier ones, so the order does matter.
There are also two special configuration files which can be specified in analhead.h. The default configuration file is run before all other configuration files. You can put in there configuration commands that you normally want to include but which you can override. You can stop analog running the default configuration file by the commandline option -G.
The mandatory configuration file is run after all other configuration commands have been read, and overrides them all. If the mandatory configuration file cannot be found, the program exits immediately. This can be used by system administrators to stop users analysing certain files or producing certain reports, for example. (Note, however, that the only way to stop it completely is to deny users read access to the logfile. Otherwise there is nothing to stop them analysing it by another copy of analog or another program).
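The reading order described above (default file, then the user's commands, then the mandatory file, with later commands overriding earlier ones) can be illustrated with a short Python sketch. It is not analog's parser: it leans on Python's shlex module for the comment and quoting rules, which also understands backslash escapes that analog may not, and it assumes every command is a simple single-valued setting. The function names and the sample file contents are invented for the example.

```python
import shlex


def parse_config(text):
    """Return the (command, arguments) pairs found in configuration text."""
    commands = []
    for line in text.splitlines():
        # comments=True drops everything after an unquoted '#';
        # quoted arguments may still contain '#' or spaces.
        words = shlex.split(line, comments=True)
        if words:
            commands.append((words[0], words[1:]))
    return commands


def effective_settings(default, user, mandatory, skip_default=False):
    """Apply the three layers in order; skip_default mimics the -G option."""
    settings = {}
    layers = ([] if skip_default else [default]) + [user, mandatory]
    for layer in layers:
        for name, args in parse_config(layer):
            settings[name] = args  # a later command overrides an earlier one
    return settings


default_conf = "DAILY ON\nWEEKLY ON   # site-wide defaults"
user_conf = "DAILY OFF        # We don't want a daily summary\nFULLDAILY ON"
mandatory_conf = "WEEKLY OFF"
settings = effective_settings(default_conf, user_conf, mandatory_conf)
# settings == {'DAILY': ['OFF'], 'WEEKLY': ['OFF'], 'FULLDAILY': ['ON']}
```

Commands that accumulate rather than replace would need a list per command instead of a single slot. To see the real resolved values, run analog -v.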
If this is all a bit confusing, just run
    analog -v [other options]
That will tell you what the values of all the variables will be, based on analhead.h, the configuration options and the commandline options.
We shall now look at all the configuration commands and their commandline equivalents under the following headings. There is a summary list of all of them in the reference section.
See the Glossary for the meaning of these data.
The general summary can be turned off by the command
    GENERAL OFF
(or the commandline argument -x) or on by GENERAL ON (or +x). If the general summary is off, all the `Go To' links in the output are also omitted.
The figures in parentheses refer to the last 7 days. They can be turned on and off with
    LASTSEVEN ON     # or OFF
or with the commandline arguments +7 and -7. Note that the last 7 days refers to the last 7 days before the program is run, not before the last entry in the logfile. (If a TO command is specified, however, the last 7 days will be until that date).
Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command
    COUNTHOSTS OFF
Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or
    COUNTHOSTS APPROX
and you can specify the amount of memory to be used by
    APPROXHOSTSIZE 100000     # or whatever number, in bytes
About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used. COUNTHOSTS APPROX uses much less memory than an accurate count, but can be a bit slower. If the host report is on, COUNTHOSTS will always be turned on automatically, so to turn COUNTHOSTS to off or approximate, you also need to turn the host report off.
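This Readme does not say how the approximate count works internally. To give a feel for how distinct hosts can be counted in a fixed amount of memory at all, here is a Python sketch of one well-known technique, linear counting over a bitmap. It is an illustration under that assumption, not a description of analog's algorithm; the class name and the use of MD5 as the hash are choices made for the example.

```python
import hashlib
import math


class ApproxHostCounter:
    """Estimate the number of distinct hosts using a fixed-size bitmap."""

    def __init__(self, size_bytes):
        self.bits = bytearray(size_bytes)
        self.nbits = size_bytes * 8

    def add(self, host):
        # Hash the hostname to one bit position and set that bit.
        digest = hashlib.md5(host.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.nbits
        self.bits[index // 8] |= 1 << (index % 8)

    def estimate(self):
        # Linear counting: n is roughly -m * ln(fraction of bits still zero).
        set_bits = sum(bin(byte).count("1") for byte in self.bits)
        zero_bits = self.nbits - set_bits
        if zero_bits == 0:
            return float("inf")  # bitmap saturated; more memory is needed
        return -self.nbits * math.log(zero_bits / self.nbits)


counter = ApproxHostCounter(1024)  # 1024 bytes of memory
for i in range(1000):
    counter.add("host%d.example.com" % i)
    counter.add("host%d.example.com" % i)  # repeat visits do not add hosts
print(round(counter.estimate()))
```

The memory used never grows with the number of hosts; only the accuracy degrades as the bitmap fills up, which matches the general trade-off described above.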
Each unit represents 4 000 requests, or part thereof.

       month:  #reqs:
    --------  ------
    Nov 1995: 119865:
    Dec 1995: 121214:
    Jan 1996: 144960:
The above display is of a monthly report. In this category, we also have the weekly report (one line for each week), daily summary (one line for Sundays, one for Mondays etc.), daily report (one line for each day ever), hourly summary (one line for midnight, one for 1am etc.) and hourly report (one line for each hour ever).
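The key line of the monthly report says that each unit stands for 4 000 requests "or part thereof", which means the bar lengths are rounded up. A small Python sketch of that arithmetic follows; the function names and the '+' mark character are hypothetical, chosen only for the example.

```python
def bar_units(count, unit):
    """Number of units needed for count, rounding any remainder up."""
    return -(-count // unit)  # ceiling division on integers


def ascii_bar(count, unit, mark="+"):
    """Render a plain-text bar; the mark character is an assumption."""
    return mark * bar_units(count, unit)


for month, requests in [("Nov 1995", 119865), ("Dec 1995", 121214),
                        ("Jan 1996", 144960)]:
    print("%s: %6d: %s" % (month, requests, ascii_bar(requests, 4000)))
```

With a unit of 4 000, the three months in the example need 30, 31 and 37 units respectively.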
The following configuration commands show how to turn these reports on and off.
    MONTHLY ON
    WEEKLY ON
    DAILY ON
    FULLDAILY OFF
    HOURLY ON
    FULLHOURLY OFF
You can also use the corresponding commandline arguments +m, +W, +d, -D, +h, -H (use + to turn the corresponding reports on, - to turn them off).
You can specify the maximum number of rows in one of these reports by a line like
    FULLHOURROWS 72     # restrict the hourly report to the last 72 hours
    MONTHROWS 0         # 0 means no restriction
The other commands are WEEKROWS and FULLDAYROWS.
You should use these reports sensibly. If your output is 500k long, people won't be able to download it. In particular, you probably don't want a daily report or hourly report unless you have restricted it to just a few rows.
The graphs above are designed to produce coloured bars on graphical browsers and ASCII graphs on non-graphical browsers. They don't use tables or image-stretching properties, so should work on any browser. However, you can produce plain ASCII graphs instead by the command
    GRAPHICAL OFF     # or ON to turn it back on again
This has the advantage of producing smaller output which does not require any images to be downloaded.
The graphs rely on having the images distributed with analog available in the directory IMAGEDIR specified in analhead.h; or you can override that choice with a command like
You can change the character used in the graphs on non-graphical terminals by means of a command such as
MARKCHAR '#' # put in quotes so that it isn't a comment
The graphs can be plotted by bytes transferred or requests for pages instead of by raw requests. This can be done by means of commands like
    MONTHGRAPH B     # by bytes
    WEEKGRAPH R      # by requests
    DAYGRAPH P       # by page requests
There are also commands FULLDAYGRAPH, HOURGRAPH and FULLHOURGRAPH. Alternatively, you can add the letter after the relevant commandline argument; for example, +hB to turn on the hourly summary with a graph sorted by bytes. To specify what counts as a page, see the ISPAGE command below. These commands do not change which columns are displayed on each line, so if you use these commands, you might also want to use the COLS commands explained below.
You can display the graphs backwards (with most recent requests at the top) by means of commands like
    MONTHLYBACK ON     # or OFF
There are also the commands WEEKLYBACK, FULLDAILYBACK and FULLHOURLYBACK. The hourly summary and daily summary cannot be displayed backwards. I find it confusing to have some of the reports going backwards and some forwards, so you can also use
    ALLBACK ON     # or OFF
to change all four of the reports to backwards or forwards together.
You can specify which columns appear in the various reports in which order. The above example showed the number of requests being given. You can also have the percentage of the requests, the number and percentage of bytes, and the number and percentage of requests for pages. For example, the command
    MONTHCOLS RBbrpP
tells analog to include in the monthly report columns for number of requests (R), number of bytes (B), percentage of bytes (b), percentage of requests (r), percentage of page requests (p) and number of page requests (P) in that order. The other commands are WEEKCOLS, DAYCOLS, FULLDAYCOLS, HOURCOLS and FULLHOURCOLS. If you use these commands, you might also want to use the GRAPH commands explained above.
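If you generate COLS commands from a script, it can help to check the letters before handing them to analog. The following Python sketch simply encodes the six letters listed above; the table and function names are invented for the example, and analog itself may treat unknown letters differently.

```python
COLUMN_MEANINGS = {
    "R": "number of requests",
    "r": "percentage of requests",
    "B": "number of bytes",
    "b": "percentage of bytes",
    "P": "number of page requests",
    "p": "percentage of page requests",
}


def describe_columns(spec):
    """Expand a COLS argument such as 'RBbrpP' into column descriptions."""
    unknown = [letter for letter in spec if letter not in COLUMN_MEANINGS]
    if unknown:
        raise ValueError("unknown column letters: " + "".join(unknown))
    return [COLUMN_MEANINGS[letter] for letter in spec]


for description in describe_columns("RBbrpP"):
    print(description)
```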
For some reports, analog needs to know where weeks begin and end. You can specify
    WEEKBEGINSON WEDNESDAY
to change it to Wednesday, for example. (I guess Sunday or Monday is more likely).
In the graphs, analog will choose the value of the unit automatically based on the length of the largest bar and the width of the page. You can specify the page width with, for example,
    PAGEWIDTH 70
or the commandline option +w70. (I find about 65 works well). (Note that the PAGEWIDTH may not be strictly obeyed with GRAPHICAL ON, as the graphics are measured in pixels not characters). Occasionally you may want to specify the value of the unit yourself (for example, to make it the same as on some other page). You can do this by a command like
    MONTHLYUNIT 1000
Setting it to 0 makes analog choose it automatically again. Of course, the other reports have WEEKLYUNIT, DAILYUNIT, FULLDAILYUNIT, HOURLYUNIT and FULLHOURLYUNIT.
      #reqs :  %bytes : domain
    --------  --------  ------
     103125 :  46.58% : .uk (United Kingdom)
    ( 64982):( 35.45%): cam.ac.uk (University of Cambridge)
    ( 47138):( 20.55%): statslab.cam.ac.uk
      49290 :  12.49% : .edu (USA Educational)

    #reqs: %bytes: host
    -----  ------  ----
       10:  0.03%: zlsm03.arcs.ac.at
       11:  0.04%: iki10.boku.ac.at
      158:  0.15%: talus.maths.su.oz.au

     #reqs: %bytes: directory
    ------  ------  ---------
    237985: 35.40%: /~sret1/
     18596: 17.60%: /~rrw1/
      3574: 11.89%: /~richard/

    #reqs: %bytes: filename
    -----  ------  --------
    33980: 23.66%: /~sret1/backgammon/main.html
    21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
    12690:  0.86%: /
File Type report
     #reqs: %bytes: extension
    ------  ------  ---------
     25592: 35.68%: .html
     23311: 20.15%: (directories)
      1080: 17.13%: .ps
    175575: 13.63%: .gif

    #reqs: referring URL
    -----  ------------
      260: http://webcrawler.com/cgi-bin/WebQuery
      239: http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/
      185: http://guide-p.infoseek.com/WW/NS/Titles?qt=backgammon&col=WW
      149: http://www.yahoo.com/Recreation/Games/Board_Games/Backgammon/

    #reqs: browser
    -----  -------
    16797: Mozilla
     1532: Mosaic
      693: IWENG
      492: Lynx

    #reqs: browser
    -----  -------
     3105: Mozilla/1.22 (Windows; I; 16bit)
     2785: Mozilla/1.1N (Windows; I; 16bit)
      458: IWENG/1.2.003
These reports can be turned on and off with commands like
    DOMAIN ON
    FULLHOSTS OFF
    DIRECTORY ON
    REQUEST ON
    FILETYPE ON
    REFERRER OFF
    BROWSER ON
    FULLBROWSER OFF
or with the commandline arguments +o (domain report), -S (host report), +i (directory report), +r (request report; see below), +t (file type report), -f (referrer report), +b (browser summary) and -B (browser report). (As in the date reports, use + to turn the corresponding reports on, - to turn them off). Because of the widespread mis-spelling, REFERER is accepted as a synonym of REFERRER.
Another similarity with the date reports is that you can tell analog which columns to print on each report with the commands DOMCOLS, HOSTCOLS, DIRCOLS, REQCOLS, TYPECOLS, REFCOLS, BROWCOLS and FULLBROWCOLS. Again, each command is followed by letters indicating which columns are wanted and in which order. For example,
    DOMCOLS RrBb     # no. of reqs, %age reqs, no. of bytes, %age bytes
    DIRCOLS Pp       # no. of pages, %age pages
Each of these reports can be sorted in five different ways; by bytes, by requests, by requests for pages, alphabetically or randomly (i.e., unsorted). (The only advantage of the last one is so as not to spend time sorting very long reports). The commands to change this look like
    DOMSORTBY BYTES     # or REQUESTS or PAGES or ALPHABETICAL or RANDOM
The commands for the other reports are HOSTSORTBY, DIRSORTBY, REQSORTBY, TYPESORTBY, REFSORTBY, BROWSORTBY and FULLBROWSORTBY. You can also add a letter b, p, r, a or x after the relevant commandline option; for example, +Sa for a host report sorted alphabetically.
It is important to be able to specify how many entries you want printed in each report. This is done by means of three variables for each report, one specifying the minimum number of bytes if the sorting is by bytes, one the minimum number of page requests if it is by pages, and the third specifying the minimum number of requests if the sorting is by any of the other three methods. The following configuration commands illustrate the possible usages.
    DOMMINREQS 20               # all items with at least 20 requests
    HOSTMINREQS -20             # the first 20 items
                                # NB: useless if alphabetical or random sort
    REQMINREQS 0.01%            # all items with at least 0.01% of the requests
    TYPEMINPAGES 20             # at least 20 page requests; -20 and 0.01% also possible
    DIRMINBYTES 100000          # all items with at least 100000 bytes
    REFMINBYTES 100k            # all items with at least 100 kbytes
                                # (10M etc. also work)
    BROWMINBYTES -40            # Top 40 if sorting is by bytes
    FULLBROWMINBYTES 0.005%     # all with at least 0.005% of the traffic
You can also specify the amount on the commandline by adding it after the sort method. For example, +Sr-50 turns on a host report, sorted by requests, with only the top 50 items included, and +ib20k gives a directory report, sorted by bytes, including all directories with at least 20 kilobytes transferred.
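The four forms of threshold shown above (a plain floor, a negative number for "top N", a percentage, and a k or M suffix) can be summarised in a few lines of Python. This is a sketch of the rules as described here, not analog's code. Two details are assumptions: that k and M mean 1 000 and 1 000 000 (they might equally be powers of 1024), and that percentages are taken of the sum of the counts passed in.

```python
MULTIPLIERS = {"k": 1000, "M": 1000000}  # assumption: decimal multiples


def select_items(counts, spec):
    """Pick the entries of counts (item -> number) that satisfy spec."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    if spec.startswith("-"):
        return ranked[:int(spec[1:])]  # -20 means "the first 20 items"
    if spec.endswith("%"):
        floor = float(spec[:-1]) / 100 * sum(counts.values())
    elif spec[-1] in MULTIPLIERS:
        floor = float(spec[:-1]) * MULTIPLIERS[spec[-1]]
    else:
        floor = float(spec)
    return [(name, value) for name, value in ranked if value >= floor]


requests = {"/": 500, "/a.html": 465, "/b.gif": 30, "/c.ps": 5}
print(select_items(requests, "20"))   # everything with at least 20 requests
print(select_items(requests, "-2"))   # the top two items
print(select_items(requests, "1%"))   # at least 1% of the total
```

As the comment in the configuration example notes, the "top N" form only makes sense when the report is sorted by the quantity being counted.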
You can translate items in the reports for the benefit of your readers. For example, the command
    REQOUTPUTALIAS /~sret1/analog/ "Analog home page"
would make Analog home page appear instead of /~sret1/analog/ in the request report. Wildcards can appear in the aliases: for example
    REQOUTPUTALIAS /~sret1/* "Stephen's page (/*)"
would translate /~sret1/backgammon/ to Stephen's page (/backgammon/) etc. The commands for the other reports are DIROUTPUTALIAS, HOSTOUTPUTALIAS, REFOUTPUTALIAS, BROWOUTPUTALIAS and TYPEOUTPUTALIAS.
Each of the reports has a hash size associated with it, which is the size of the table in which it stores the data internally. You don't need to worry about this usually; it doesn't affect the output, but if analog starts running slowly, you might find that making the hash sizes larger or smaller helps. The command to do this for the request report is
    REQHASHSIZE 1009
The commands for the other reports are DIRHASHSIZE, TYPEHASHSIZE, HOSTHASHSIZE, REFHASHSIZE, BROWHASHSIZE, FULLBROWHASHSIZE and SUBDOMHASHSIZE (for subdomains; the top-level domains don't use this). On appropriate platforms, there is also DNSHASHSIZE for DNS lookups. You must choose a prime number for the hash size (there's a list of some primes distributed with the program). Maybe half the number of items of that type expected is a good number, but it shouldn't be critical.
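If the list of primes is not to hand, a suitable hash size is easy to compute. The Python sketch below follows the rule of thumb just given (about half the expected number of items, rounded up to a prime); the function names are made up for the example.

```python
def is_prime(n):
    """Trial division; fast enough for hash sizes of ordinary magnitude."""
    if n < 2:
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


def suggested_hash_size(expected_items):
    """Smallest prime that is at least half the expected number of items."""
    candidate = max(2, expected_items // 2)
    while not is_prime(candidate):
        candidate += 1
    return candidate


print(suggested_hash_size(2000))  # 1009
```

For about 2 000 distinct requested files this gives 1009, the value used in the REQHASHSIZE example.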
We now describe features unique to a particular one of the reports. First the domain report.
Subdomains can be specified for each domain. The syntax of the command is
    SUBDOMAIN subdomain subdomain_name
If the subdomain name has spaces in, it must be enclosed in quotes. The subdomain name can be omitted, indicating a nameless subdomain. For example, to produce the example above, I would include the following lines in the configuration file
    SUBDOMAIN cam.ac.uk 'University of Cambridge'
    SUBDOMAIN statslab.cam.ac.uk
Numerical subdomains (which have most significant part on the left) can also occur. They will look like
    131 The Ever-Popular 131 domain
    131.111     # Nameless
Also subdomains with wildcards in can occur; they can't have names. The following are examples:
    SUBDOMAIN *.edu         # mit.edu, umn.edu etc.
    SUBDOMAIN 131.111.*     # 131.111.1, 131.111.2 etc.
    SUBDOMAIN %             # all top-level numerical domains, from 1 to 255
The variables SUBDOMMINREQS and SUBDOMMINBYTES can be specified in the same way as above, except they can't be negative. If you ask for wild subdomains, you will probably want to set the minimum requests and minimum bytes quite high. However, you cannot alter the sort order; within a domain, subdomains will always be output in alphabetical order.
There is a command NOTSUBDOMAIN to erase a previously requested subdomain. For example, you can write
    NOTSUBDOMAIN *.edu
    NOTSUBDOMAIN cam.ac.uk
However, if you request, for example, *.edu, then NOTSUBDOMAIN mit.edu will not override it.
The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the command
    DOMAINSFILE domainsfile
The correct format of the domains file is explained in a separate section.
There is little to say about the host report, except to note that alphabetical sorting is by domain as most significant part. This report can be very long and so slow to sort, and should be used with a high floor if at all.
The directory report has one further variable, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like
     #reqs: %bytes: directory
    ------  ------  ---------
     43772: 72.06%: /~sret1/backgammon/
    173426: 19.93%: /~sret1/backgammon/bitmaps/
     11298:  4.14%: /~sret1/
This can be specified by the commandline option +l3 or the configuration command
    DIRLEVEL 3
Note that the figures for each directory do not include those for the subdirectories of that directory, except where the directory is at the deepest level. So in the above example, /~sret1/backgammon/bitmaps/dice/d1.xbm would be reckoned in the directory /~sret1/backgammon/bitmaps/ (which is at the deepest level) but not in the other two directories.
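The rule for which directory a file is counted under can be written down compactly. The following Python sketch follows the description above; the function name is hypothetical, and treating files at the top of the server as belonging to "/" is an assumption, since this section does not say how analog labels them.

```python
def directory_at_level(path, level):
    """Directory that path is counted under in a report of the given level."""
    # Directory components only: drop the empty piece before the first '/'
    # and the filename (or empty string) after the last '/'.
    components = path.split("/")[1:-1]
    kept = components[:level]  # anything deeper is folded into level N
    return "/" + "".join(component + "/" for component in kept)


for name in ["/~sret1/backgammon/bitmaps/dice/d1.xbm",
             "/~sret1/backgammon/main.html",
             "/~sret1/"]:
    print(name, "->", directory_at_level(name, 3))
```

With level 3 the first file is counted under /~sret1/backgammon/bitmaps/, the second under /~sret1/backgammon/ and the third under /~sret1/, so each request is counted in exactly one row of the report.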
You can control which items get listed, and which linked to, in the request report with the commands REQINCLUDE and LINKINCLUDE. These are explained below.
There is a command BASEURL to specify a URL to prepend to the links. For example, if
    BASEURL http://www.statslab.cam.ac.uk
were specified, then /~sret1/analog/ would be linked to http://www.statslab.cam.ac.uk/~sret1/analog/. This is useful if you want to display the statistics on a different server than the one they belong to. (See below for combining logfiles from two different servers).
There's nothing special to say about the file type report.
The referrer report only has one special command, REFLINKINCLUDE to say what should be linked to in the report. It is explained below. (However, it is important to note that many browsers do not pass this information to the server, and many pass it wrongly (sending the URL of the previous page even when your page was not reached by selecting a link from that page)).
For the referrer report and the browser reports the relevant logfiles must be present on the system (see below for how to specify where they are). Note that if you are using separate logfiles, rather than the NCSA combined log, you cannot sort these reports by bytes, or include bytes columns in the reports. Also the browser page requests will be inaccurate.
The browser summary and browser report have no special commands, but it is important to note the limitations of these reports. Some browsers even lie deliberately about what sort of browser they are, or let users configure the browser name. I have separated out those browsers that claim to be "Mozilla (compatible)" but that doesn't catch all of them. Furthermore, there is no fixed format for browser information. (NB: I have combined all Mosaics as a special case). In addition, graphical browsers automatically generate more requests than non-graphical browsers by loading the graphics, so it is not a very good guide to browser usage. For all these reasons many people would argue that the browser reports are so unhelpful as to be worse than useless. At best, interpret them with extreme caution.
    #occs: error type
    -----  ----------
    19360: Send timed out
    11286: Send aborted
     7962: File does not exist
The status code report lists how many of each type of status code occurred in your logfile:
    #occs: no. description
    -----  ---------------
    35564: 200 OK
      173: 301 Document moved
        3: 302 Document found elsewhere
     5732: 304 Not modified since last retrieval
They are turned on and off by commands like
    STATUS ON
    ERROR OFF
or by the commandline arguments +c and -e. (+ for on, - for off). There is a command ERRMINOCCS which says how many occurrences of an error there must be before it appears on the error report. For example
The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command
    analog logfile.log
will use that logfile for its report. Analog will read the common log format (which most servers write) as well as the old NCSA format and the NCSA combined log format (which includes referrer and agent information). Detection of which format each line of the logfile is in is automatic. You can also write
    analog -
to use standard input as the logfile. (This is useful in constructing pipes). You can also specify which logfile to use in the configuration file by means of a command like
    LOGFILE logfile.log     # or stdin for standard input
(This is the only method Mac users can use). You can specify several logfiles on one configuration line by separating their names with commas (no spaces). Except on Mac, you can also use wildcards. For example
Sometimes it is necessary to combine logfiles from two different servers, without getting filenames that happen to be the same on both servers confused. (This is particularly useful if you're running several virtual hosts on the same machine). To do this you can use a second argument to the LOGFILE command, specifying a prefix for each filename. For example
    LOGFILE log1,log2 http://www.a.com     # These logfiles from a.com
    LOGFILE log3 http://www.b.com          # This one from b.com
If you use this, the directory report will need specifying to a deeper level.
Logfiles specified in the user's configuration files and commandline options replace any specified in the default configuration file, and are in turn overridden by any in the mandatory configuration file. In addition you can use none as the name of the logfile to overwrite the specification of all previous logfiles.
Analog can also read the NCSA/Apache referrer log, agent log and error log formats. Logfiles of these types can be specified by commands like
    REFLOG referer_log
    BROWLOG agent_log.old,agent_log
    ERRLOG error_log
The same comments about which logfiles replace which apply as in the last paragraph. Analog can also read the NCSA/Apache combined log. This is just specified as a LOGFILE as above; analog will automatically recognise and parse the extra fields. Do not specify a combined log as a REFLOG or BROWLOG.
You can decide whether the filenames in the logfile should be regarded as being case sensitive or case insensitive. The default is usually to be case sensitive on Unix and case insensitive on other machines, but you can override it if you want to read a logfile that was created on another machine. The commands to do this are
    CASE INSENSITIVE
    CASE SENSITIVE
A note about reading logfiles created on other machines. Different machines have different ways of storing files, in particular with regard to ends of lines. If you are moving logfiles from one machine to another, they should be transferred as text, not as binary data.
Analog on Unix can uncompress compressed logfiles. You need to tell it how to uncompress each type of file by supplying a command that sends the uncompressed file to standard output (rather than uncompressing it into a file). The file type can be a list of file types, separated by commas. For example, depending what commands are on your system, you can use
    UNCOMPRESS *.gz "gunzip -c"
    # or
    UNCOMPRESS *.gz,*.Z gzcat
This would be a suitable command to include in the default configuration file. Note that if you are using the form interface, the http server needs to have execute access to the decompression program.
Several times I've referred to `page requests'. You can specify in the configuration file what should be counted as a `page'. At the beginning, only *.html, *.htm and directories (*/) are pages. The command
    ISPAGE filename
will specify that some other file is a `page'. You can give a list of filenames, separated by commas (without spaces). For example,
    ISPAGE *.ps,*.ps.gz
would mean that Postscript files and gzipped Postscript files are to be regarded as pages. You can also use
    ISNOTPAGE filename
to specify that something which would otherwise be a page is not to be regarded as a page.
There are various commands which instruct the program to analyse only part of the logfile. First, you can instruct the program only to take into account certain files. This is done by means of the FILEINCLUDE and FILEEXCLUDE commands. Each command can have a list of filenames, separated by commas (no spaces). One asterisk and any number of question marks can appear in each of the filenames specified, as wildcards. Each file is included and excluded as each new command is reached. Unspecified files are included if the first command found was an exclusion, and excluded if the first command found was an inclusion. For example, the configuration
    FILEINCLUDE /~sret1/*
    FILEEXCLUDE /~sret1/backgammon/*,/~sret1/analog/*
    FILEINCLUDE /~sret1/backgammon/*.gif
would instruct the program to examine only my files, excluding my backgammon and analog files, but including gifs in my backgammon directory. On the other hand,
    FILEEXCLUDE /~sret1/*/img/*
would analyse all files except images in my various directories. (Note that wildcards with two *'s in can be slow to process).
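The include and exclude rules can be modelled as a list that is walked in order, with the last matching command deciding. The Python sketch below implements that reading of the description; it is not analog's matching code. It uses Python's fnmatchcase for the wildcards, which is slightly more permissive than the rules above (it also gives square brackets a special meaning), and it does not model FILEALLOW.

```python
from fnmatch import fnmatchcase


def is_included(filename, rules):
    """rules is a list of ("include" | "exclude", "pattern,pattern,...")."""
    if not rules:
        return True  # no commands at all: everything is analysed
    # Unspecified files are included if the first command is an exclusion,
    # and excluded if the first command is an inclusion.
    included = rules[0][0] == "exclude"
    for action, patterns in rules:
        if any(fnmatchcase(filename, p) for p in patterns.split(",")):
            included = (action == "include")
    return included


rules = [
    ("include", "/~sret1/*"),
    ("exclude", "/~sret1/backgammon/*,/~sret1/analog/*"),
    ("include", "/~sret1/backgammon/*.gif"),
]
for name in ["/~sret1/index.html", "/~sret1/backgammon/main.html",
             "/~sret1/backgammon/board.gif", "/~rrw1/index.html"]:
    print(name, is_included(name, rules))
```

For the first example configuration this keeps /~sret1/index.html and the backgammon gifs, and drops the other backgammon files and everything outside /~sret1/.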
Remember you can always run analog -v to see what the options you have specified represent.
Included files can always be excluded later, but excluded files can't always be included easily. (For example, after FILEEXCLUDE /dir/* and FILEEXCLUDE *.gif,*.jpg, FILEINCLUDE *.gif would include all gifs, even those in /dir/, which is not what is wanted). For this reason, there is an extra command FILEALLOW which cancels an exclude. It must be exactly the same as a previous FILEEXCLUDE; in the above example FILEALLOW *.gif would work, but not FILEALLOW *if.
Note that although you can exclude all gifs with FILEEXCLUDE *.gif, this may not be what you want to do. This will then exclude them from all the reports, and not count the bytes transferred due to them. More likely, you just want to exclude them from the request report while still including them in the other reports, which you can do by means of REQEXCLUDE *.gif (or REQINCLUDE pages) which will be explained in a minute.
There are similar commands HOSTINCLUDE, HOSTEXCLUDE and HOSTALLOW to analyse only the requests from certain sites. For example,
    HOSTEXCLUDE emu.pmms.cam.ac.uk
    HOSTEXCLUDE *.statslab.cam.ac.uk
would ignore accesses from emu and from the whole of the statslab.
There are also commands REFINCLUDE, REFEXCLUDE and REFALLOW for referrers. You probably want to ignore referrers from your own site. For example, I use
    REFEXCLUDE http://www.statslab.cam.ac.uk/*
This would be a suitable command to put in your default configuration file.
There are some other include commands that are specified the same way, but behave slightly differently because they do not actually exclude something from the analysis. So to specify what is included in the request report, there are commands REQINCLUDE, REQEXCLUDE and REQALLOW. You can use the special name `pages' to mean all pages. So for example,
    REQINCLUDE pages
    REQINCLUDE *.ps
will only include pages and Postscript files in the request report, although other files will still be counted for the other reports. There are also commands LINKINCLUDE, LINKEXCLUDE and LINKALLOW, and REFLINKINCLUDE, REFLINKEXCLUDE and REFLINKALLOW to say what should be linked to in the request report and referrer report.
Finally, there are commands to analyse only a subset of the dates in the logfile. The simplest usage is FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration
    FROM 950701
    TO 950731
Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like
    FROM -01-00+01     # from tomorrow last year
    TO -00-0131        # to the end of last month (OK even if last month
                       # didn't have 31 days)
    FROM -00-00-112
    TO -00-00-01       # statistics for the last 16 weeks
There are commandline abbreviations +F and +T for these commands; for example +T-00-00-01 looks at statistics until the end of yesterday. -F and -T turn off the from and to, as do FROM OFF and TO OFF.
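To make the relative date syntax concrete, here is a Python sketch that resolves a FROM or TO argument against a given "today". It is one reading of the description above rather than analog's own parser. The function name is invented, and the mapping of two-digit absolute years to a century (70 and above means 19yy, otherwise 20yy) is purely an assumption.

```python
import calendar
import re
from datetime import date, timedelta

SPEC = re.compile(r"(-?)(\d\d)([-+]?)(\d\d)([-+]?)(\d+)\Z")


def resolve_date(spec, today):
    """Turn e.g. '950701' or '-00-00-112' into a calendar date."""
    match = SPEC.match(spec)
    if match is None:
        raise ValueError("unrecognised date: " + spec)
    ysign, yy, msign, mm, dsign, dd = match.groups()
    yy, mm, dd = int(yy), int(mm), int(dd)

    if ysign:
        year = today.year - yy  # -01 means last year
    else:
        year = 1900 + yy if yy >= 70 else 2000 + yy  # assumed century rule
    if msign:
        months = year * 12 + (today.month - 1) + (mm if msign == "+" else -mm)
        year, month = divmod(months, 12)
        month += 1
    else:
        month = mm
    last_day = calendar.monthrange(year, month)[1]
    if dsign:
        base = date(year, month, min(today.day, last_day))
        return base + timedelta(days=dd if dsign == "+" else -dd)
    # An absolute day past the end of the month is clamped, so "31"
    # works even for months that do not have 31 days.
    return date(year, month, min(dd, last_day))


today = date(1996, 3, 15)
for spec in ["950701", "-01-00+01", "-00-0131", "-00-00-112", "-00-00-01"]:
    print(spec, resolve_date(spec, today))
```

Taking 15 March 1996 as today, -00-0131 resolves to 29 February 1996, the last day of the previous month.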
If a TO command is given, the figures for the last 7 days refer to the time until then.
    FILEALIAS file1 file2
says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately. It also understands that .. means `parent directory,' . means `this directory' and // is the same as /, and translates those filenames to their canonical forms.
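The built-in translations listed above amount to putting each filename into a canonical form. A Python sketch of the idea follows. It is an approximation for illustration only: the function name and the dir_suffix parameter are hypothetical, and details such as the order in which analog unescapes and simplifies a name are not specified here.

```python
from urllib.parse import unquote


def canonical_filename(path, dir_suffix="index.html"):
    """Unescape %XX, resolve '.', '..' and '//', and fold /dir/index.html."""
    path = unquote(path)  # %7E becomes ~
    is_directory = path.endswith(("/", "/.", "/.."))
    components = []
    for part in path.split("/"):
        if part in ("", "."):  # '//' and '.' add nothing
            continue
        if part == "..":  # parent directory
            if components:
                components.pop()
            continue
        components.append(part)
    if components and not is_directory and components[-1] == dir_suffix:
        components.pop()  # /dir/index.html is the same as /dir/
        is_directory = True
    result = "/" + "/".join(components)
    if is_directory and components:
        result += "/"
    return result


for name in ["/%7Esret1/index.html", "/a//b/../c/./d.html", "/index.html"]:
    print(name, "->", canonical_filename(name))
```

The three sample names come out as /~sret1/, /a/c/d.html and / respectively.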
Actually, it's not quite true about index.html. You can make that into another file if you want, by use of the DIRSUFFIX command. For example, if all your directories return indexes called default.htm, you could write
Wildcards can occur in the aliases. For example, after
FILEALIAS /~sret1/*.gif /images/*g.gif
FILEALIAS /~sret2/a?c* /sa/*
/~sret1/a.gif would be translated to /images/ag.gif and /~sret2/abcd.txt would become /sa/d.txt.
If two aliases match one filename, only the first one is applied. So after FILEALIAS a b; FILEALIAS b c or FILEALIAS a b; FILEALIAS a c, a will be translated into b.
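The wildcard and first-match rules can be sketched in Python as follows. This is only a model of the behaviour described above, not the program's own matching code: the helper names are invented, and how several asterisks in one alias would be handled is an assumption.

```python
import re


def compile_alias(pattern, target):
    """Turn an alias pattern into a regular expression (illustrative only)."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += "(.*)"  # remember what the asterisk matched
        elif ch == "?":
            regex += "."     # any single character
        else:
            regex += re.escape(ch)
    return re.compile(regex), target


def apply_aliases(name, aliases):
    """Apply only the first alias that matches the whole name."""
    for regex, target in aliases:
        m = regex.fullmatch(name)
        if m is None:
            continue
        matched = iter(m.groups())
        # Each * in the target is replaced by the text a * matched.
        return "".join(next(matched, "") if ch == "*" else ch for ch in target)
    return name


aliases = [
    compile_alias("/~sret1/*.gif", "/images/*g.gif"),
    compile_alias("/~sret2/a?c*", "/sa/*"),
]
```

Here apply_aliases("/~sret1/a.gif", aliases) gives "/images/ag.gif". Because the loop returns at the first match, a name is never passed through a second alias.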
There are also the commands HOSTALIAS and REFALIAS (for referrers) which work in the same way. HOSTALIAS is particularly useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use
HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS bigcat lion.statslab.cam.ac.uk
HOSTALIAS bigcat.statslab.cam.ac.uk lion.statslab.cam.ac.uk
REFALIAS could be used to combine several referrers from one site. For example
REFALIAS http://www.webcrawler.com/* http://webcrawler.com/
REFALIAS http://webcrawler.com/* http://webcrawler.com/
There are also the OUTPUTALIAS commands, but I described them above.
A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like
WITHARGS /cgi-bin/prog.cgi
is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks and lists of files can again occur, and there is also a parallel command WITHOUTARGS; for example,
WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi
would read the arguments for all files in /cgi-bin/ except spam.cgi.
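As a rough picture of the effect of these two commands, here is a small Python sketch. It is not how analog itself is written: the function name logged_name is invented, and it assumes that a WITHOUTARGS match always overrides a WITHARGS match, which gives the right answer for the example above but may not hold in every case.

```python
from fnmatch import fnmatchcase


def logged_name(url, withargs=(), withoutargs=()):
    """Decide whether the arguments stay part of the filename (illustrative only)."""
    path, question_mark, _args = url.partition("?")
    wanted = any(fnmatchcase(path, p) for p in withargs)
    unwanted = any(fnmatchcase(path, p) for p in withoutargs)
    if question_mark and wanted and not unwanted:
        return url   # keep the arguments as part of the filename
    return path      # default: ignore everything after the question mark


withargs = ["/cgi-bin/*"]
withoutargs = ["/cgi-bin/spam.cgi"]
```

With these lists, /cgi-bin/prog.cgi?a keeps its argument, while /cgi-bin/spam.cgi?a is reduced to /cgi-bin/spam.cgi.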
Commands REFWITHARGS and REFWITHOUTARGS work in the same way for referrers, except that in this case the default is to include all the arguments (so that you can see what people are requesting from search engines).
To produce a cache file instead of the normal output, use the command
OUTPUT CACHE
To read data from a cache file, use, e.g.,
CACHEFILE cache.out
(This will still read the ordinary logfile as well). You can also use the commandline argument +Ucache.out. You can specify several cache files by putting them in a comma-separated list, or using several +U commands. Note that this doesn't write to that file. You still write to the normal output file.
To use this feature and avoid losing entries or double counting them, I suggest the following procedure.
Although it should now be safe to throw away the old logfile, I can take no responsibility if something goes wrong. See the licence. This is beta test software and is expected to contain bugs. Also if you are going to use this feature please make sure you understand what information is and is not recorded in the cache file. You may find that the cache file is not the right feature for you. Compressing logfiles (with gzip -9) is very efficient owing to the large number of repeated strings. That in itself may solve your filespace problems.
To turn DNS resolution on, use the configuration command
NUMLOOKUP ON
or the commandline option +1. (Turn it off with NUMLOOKUP OFF or -1).
The first time you use lookups, analog may be very slow. But it will record which addresses it looked up so that you do not need to look them up again next time you run analog. You need to specify a file for this purpose. This is done by the command
DNSFILE filename
The program will first read any old lookups that are recorded there, then overwrite it with a new version at the end.
However, any lookups that are too old will not be trusted, and will be thrown away. You can specify how old a lookup in hours you trust by a command like
DNSFRESHHOURS 168   # check them once a week
Note also that not all numbers can be resolved.
There is also a variable DNSHASHSIZE; see above.
First, one very important option: the language for the output! You can specify any of the following
LANGUAGE ENGLISH
LANGUAGE US-ENGLISH
LANGUAGE FRENCH
LANGUAGE GERMAN
LANGUAGE SPANISH
LANGUAGE ITALIAN
LANGUAGE DANISH
(If you are using a language other than English, you might also want to produce a local version of the domains.tab file, and of the form interface). If anyone wants to translate the output into another language, I would be delighted! (But contact me first, so that I can make sure that two people aren't working on the same language).
Next, you can choose whether you want ASCII (plain text), HTML or preformatted output. (Preformatted output is a special machine-readable format, used for importing into spreadsheets or graphics-creation programs. It is described in a separate section below). The output format is chosen using the commandline option +a (ASCII) or -a (HTML), or the configuration command
OUTPUT ASCII   # or HTML, or PREFORMATTED
If you choose ASCII or PREFORMATTED output, some of the other options are ignored, but it should be obvious which ones they will be.
You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of
analog > outfile.html
you can use the configuration command
OUTFILE outfile.html
or the commandline option +Ooutfile.html.
There is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like
REPORTORDER hHDdWmoSirtfbBec
This says that the reports should occur in the order hourly summary (h), hourly report (H), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i), request report (r), file type report (t), referrer report (f), browser summary (b), browser report (B), error report (e) and status code report (c). It is important to include all the above sixteen letters exactly once each.
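If you generate configuration files from a script, the `exactly once each' rule is easy to check before running the program. The following Python sketch shows one way; the function name is invented, and it says nothing about what analog itself does with a bad REPORTORDER line.

```python
REPORT_LETTERS = "hHDdWmoSirtfbBec"  # the sixteen letters listed above


def valid_report_order(order):
    """True if every report letter occurs exactly once (illustrative only)."""
    return sorted(order) == sorted(REPORT_LETTERS)
```

So valid_report_order("cebBftriSomWdDHh") is true, while a string with a letter missing or repeated is rejected.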
There is a command
ALL ON
to include all reports except the hourly report (particular ones can then be omitted with -d or whatever); likewise ALL OFF omits them (and particular ones can then be included). The equivalent commandline arguments are +A and -A. The hourly report and general report are not turned on by ALL ON or +A; they must be turned on separately with +H and +x. Note also that order is important; for example, +i -A +r will include the request report but not the directory report.
The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the command
LOGOURL url    # or
LOGOURL none   # for no logo
or by the commandline arguments -p (no logo: mnemonic, p for picture) and +pURL. The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are
HOSTNAME name   # must be in quotes if it contains spaces
HOSTURL URL
HOSTURL -       # for no link
Analog will normally translate characters in the hostname to HTML if necessary. So to include literal HTML, such as accented characters, in the output you need to precede them by a backslash, like this:
HOSTNAME "M\&uuml;ller & S\&ouml;hne"
A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML or ASCII according to whether your output is HTML or ASCII, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commands to achieve this are
HEADERFILE filename
FOOTERFILE none   # if you don't want one
You can also use HEADERFILE stdin or HEADERFILE - to use standard input.
There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,
SEPCHAR ,
will give 123,456,789, whereas
SEPCHAR ' '
will give 123 456 789. If you want the numbers just to run together (123456789) use
SEPCHAR ''
You can also specify a character for use within the tables in the reports. This is done in the same way by means of the command REPSEPCHAR.
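The effect of the separator character is the ordinary grouping of digits in threes. As a small Python illustration (the function name is made up, and this is not the program's own code):

```python
def group_digits(number, sepchar=","):
    """Write a whole number with sepchar between each group of three digits."""
    return format(number, ",").replace(",", sepchar)
```

Calling group_digits(123456789, " ") gives "123 456 789", and an empty separator makes the digits run together.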
You can also choose a character for the decimal point. For example, some languages use a comma instead of a full stop; you would specify this by the command
You can specify whether analog prints long numbers of bytes as exact numbers (e.g., 5,053,234) or as kilobytes, megabytes etc. (e.g., 4934k) by the command
RAWBYTES ON # for exact, OFF for abbreviated
There is a debugging command, for printing (to stderr) problems with your logfile. There are currently three levels of debugging: 0 for no debugging; 1 for printing corrupt logfile lines (prepended by "C:"), information on files opened and closed (prepended by "F:"), and some summary data (prepended by "S:"); and 2 which also prints hosts for which the domain is unknown (prepended by "U:"), errors which cannot be classified (prepended by "E:") and any DNS lookups carried out (prepended by "D:"). The command for level n debugging is
DEBUG n
and the equivalent commandline argument is +Vn (V for verbose). You can also use commandline options +V for level 1 and -V for level 0.
Finally, there is an option to turn off warnings. It is
WARNINGS OFF   # or ON
The equivalent commandline argument is -q to turn warnings off (q for quiet) and +q to turn them on again. This is useful in scripts or cron jobs if you really do want to give a configuration that you know will generate a warning.
ad Andorra
ae United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.
Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,
uk United Kingdom # God save the Queen!
OUTPUT PREFORMATTED
This type is designed to be easy to read into spreadsheets, or post-process with graphics creation tools. Each line is separated into columns with a special string of characters. You can specify this string with the PRESEP command; for example
PRESEP :::
if for some reason you wanted three colons between each column. Make sure not to use anything that might occur in the output: a space would not be suitable.
Each line in the preformatted output begins with a letter indicating what type of information the line contains. The possible letters are as follows:
The general summary is a bit different. After the initial x, follows a two-character code saying what the line contains. The possible codes are
If you do anything interesting with the preformatted output, I should like to hear about it.
The form interface probably only works on Unix at the moment, though if anyone manages to make it work on other systems, I should be interested to hear about it.
To set up the form interface, go to the directory where the analog source code lives, and follow these steps.
If the third step above fails to generate a form, you can generate one yourself by means of the command analog -form +Oanalogform.html. You might also want to run this command yourself if you want to supply different default options from normal for the form user: if you run the command with extra commandline or configuration file options, they will be respected in the construction of the form.
It is expected that system administrators may want to provide different options on the forms from the default ones. (For example, the form does not by default allow an hourly report). For this reason, the cgi program understands various other options that are not normally on the form. These can be added to the form by hand. For example, you may want to allow a choice of logfiles, perhaps via a <select>. Or you may want form users to use certain default options; these could be specified as <input type=hidden>. Because the form uses GET not POST you can also construct links to it. For experts, here follows a complete list of form options. [*] marks a default value, i.e., what is sent to analog if you don't send anything else to the cgi program. These defaults can override the analog program defaults. Items without a [*] use the program default if nothing else is specified.
bq   browser summary? 0 for off [*], 1 for on, 2 for program default.
ba   +ve BROWMINREQS/PAGES
bb   -ve BROWMINREQS/PAGES
bc   +ve BROWMINBYTES
bd   -ve BROWMINBYTES
bs   BROWSORTBY (0 = REQUESTS [*], 1 = BYTES, 2 = ALPHABETICAL, 3 = RANDOM, 4 = PAGES)
     Other reports similarly with initial B, f, i, o, r, S, t in place of b.
ch   COUNTHOSTS? 0 for off, 1 for on, 2 for approx
cq   status code report?
dq   daily summary?
dg   DAYGRAPH (R, B or P)
     Other time reports similarly with D, h, H, m, W in place of d.
eq   error report?
fi   FILEEXCLUDE; list, separated by commas
fr   FROM
fy   FILEINCLUDE; list, separated by commas
gr   GRAPHICAL? 0 for off, 1 for on
hi   HOSTEXCLUDE; list, separated by commas
ho   HOSTURL
hy   HOSTINCLUDE; list, separated by commas
ie   DIRLEVEL
lb   BROWLOG; list, separated by commas
lc   CACHEFILE; list, separated by commas
le   ERRLOG; list, separated by commas
lf   REFLOG; list, separated by commas
lo   LOGFILE; list, separated by commas
or   HOSTNAME
ou   OUTPUT -- 0 for HTML [*], 1 for ASCII, 2 for PREFORMATTED, 3 for program default
rl   LINKINCLUDE options -- f for ALL (files), p for PAGES, n for NONE, d for program default
rt   REQINCLUDE options -- f for ALL, p for PAGES, d for program default
to   TO
TZ   timezone
Vq   Equivalent to +V -- 0 for OFF [*], 1 for ON.
wa   WARNINGS (to error_log) -- 0 for OFF, 1 for ON [*].
xq   general report?
Important note: Do not add options for HEADERFILE and FOOTERFILE to the form. This would be a security risk.
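Since the form uses GET, a link to the cgi program is just its URL followed by a query string built from the option names above. Here is a short Python sketch of building such a link. The path /cgi-bin/analogcgi is only a placeholder for wherever your cgi program is installed, and the function name is invented.

```python
from urllib.parse import urlencode


def form_link(cgi_url, options):
    """Build a GET link to the form program from option names and values."""
    return cgi_url + "?" + urlencode(options)


# Placeholder URL; ASCII output, daily summary on, July 1995 only.
link = form_link(
    "/cgi-bin/analogcgi",
    {"ou": 1, "dq": 1, "fr": "950701", "to": "950731"},
)
```

The resulting link is /cgi-bin/analogcgi?ou=1&dq=1&fr=950701&to=950731.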
If the form doesn't seem to work, check the following:
It is better, although not essential, if when you change the default options for your analog, you remake the form.
Note that you probably want to restrict access to the form and form program to certain users; if it is world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running.
Unfortunately, you cannot tell how many times your file has been read from this. The user may in fact request the file from a proxy server which already has a copy of it, or retrieve it from a local cache. In these cases no connection is made to your server, and no request is scored.
There are three categories of request, which can be seen in the status code report. Completed (or successful) requests are those with codes in the 200s (where the document was returned) or with code 304 (where the document was not needed because it had not been recently modified and the user could use a cached copy). Redirected requests are those with other codes in the 300s. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). Failed requests are those with codes in the 400s (error in request) or 500s (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.
Note that 302 requests are not counted. Most of them come about by the user requesting a faulty URL, as explained above. However, some cgi scripts also return a 302 code; these are not included because there is no way to tell them apart from the first type.
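The categories above can be summarised in a few lines of Python. This is just a restatement of the rules in this section, not the program's own code; the function name and the category strings are invented, and treating 302 as a separate `not counted' category is my reading of the note above.

```python
def classify_status(code):
    """Sort an HTTP status code into the categories described above."""
    if 200 <= code <= 299 or code == 304:
        return "completed"    # document returned, or cached copy still good
    if code == 302:
        return "not counted"  # see the note on 302 requests
    if 300 <= code <= 399:
        return "redirected"
    if 400 <= code <= 599:
        return "failed"       # error in request, or server error
    return "unknown"
```

For example, classify_status(304) is "completed" and classify_status(404) is "failed".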
The total data transferred refers only to successful requests, and does not include the message header, only the actual data. The detailed reports also only include successes, except for the referrer report and browser reports which include all request types.
Corrupt logfile lines are those we can't understand, and unwanted lines are those that refer to files, hosts or dates that we have specifically excluded. (See the DEBUG command for how to list all corrupt lines).
A host is a computer that has requested something from you. Analog gives the number of distinct (different) hosts that have made a successful request, and the number of distinct files they have requested.
The common logfile format is written by most servers. Its lines look like
m45-6.gps.jussieu.fr - - [14/Mar/1996:17:45:35 +0000] "GET /~sret1/analog/ HTTP/1.0" 200 12435
(except all on one line). Most of the fields are obvious -- the two numbers at the end are the status code and number of bytes transferred.
The other logfiles are not the same on different servers. Analog understands the files written by the NCSA and Apache (and some other) servers. The browser (or agent) log looks like
[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The referrer log looks like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
In both of these the date may be omitted. The error log looks like
[Thu Jul 28 20:43:10 1994] httpd: successful restart
Analog also understands the NCSA/Apache combined log. This looks like the common log, except that it has the referrer and browser on the end, like this:
lion.statslab.cam.ac.uk - - [18/Jan/1996:12:04:23 +0000] "GET /~sret1/analog/ HTTP/1.0" 200 578 "http://www.statslab.cam.ac.uk/~sret1/" "Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
(except all on one line again). Note the quotes round the referrer and browser. It is usually better to use the combined log than separate logs, because it stores more information in less space.
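If you want to post-process such logfiles yourself, the common and combined formats can both be picked apart with one regular expression. The Python sketch below is only an illustration and is not how analog parses its input; the names LOG_LINE and parse_line are invented, and accepting - for the byte count is an assumption about what some servers write.

```python
import re

# Common log fields, optionally followed by the quoted referrer and browser.
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<browser>[^"]*)")?'
)


def parse_line(line):
    """Return the fields of one logfile line, or None if it is corrupt."""
    m = LOG_LINE.fullmatch(line.strip())
    if m is None:
        return None
    record = m.groupdict()
    record["status"] = int(record["status"])
    # Assumption: a byte count of - means nothing was transferred.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record
```

For a common log line the referrer and browser fields come back as None; for a combined log line they hold the two quoted strings.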
The Mac version of analog also reads the WebSTAR and Netpresenz format log files. The WebSTAR file must start with a line like
!!LOG_FORMAT DATE TIME RESULT URL FROM TRANSFER_TIME BYTES_SENT USER HOSTNAME
to tell us what to expect on subsequent lines. (You should add it manually if it's got lost). We require the following fields as a minimum: HOSTNAME, DATE, TIME, RESULT, URL, BYTES_SENT. We will also read AGENT and REFERER if present. Each line of the logfile contains the fields given in the header line, separated by tabs: for example
12/14/95 19:00:04 OK :pages:Downloads.html 72 3176 indy1.indy.net.
except all on one line.
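The way the header line drives the reading of the following lines can be sketched in Python like this. It is an illustration only, not the Mac version's code: the function name read_webstar is invented, and the sample used with it is a made-up, cut-down log with only the six required fields.

```python
REQUIRED = {"HOSTNAME", "DATE", "TIME", "RESULT", "URL", "BYTES_SENT"}


def read_webstar(lines):
    """Read tab-separated lines using the !!LOG_FORMAT header (illustrative only)."""
    fields = None
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("!!LOG_FORMAT"):
            fields = line.split()[1:]
            missing = REQUIRED - set(fields)
            if missing:
                raise ValueError("missing fields: " + ", ".join(sorted(missing)))
            continue
        if fields is None or not line:
            continue  # nothing can be read before the header line
        values = line.split("\t")
        if len(values) == len(fields):
            records.append(dict(zip(fields, values)))
    return records


# Hypothetical cut-down sample, not a real WebSTAR log.
sample = [
    "!!LOG_FORMAT DATE TIME RESULT URL BYTES_SENT HOSTNAME",
    "12/14/95\t19:00:04\tOK\t:pages:Downloads.html\t3176\tindy1.indy.net.",
]
records = read_webstar(sample)
```

Each record is then a dictionary keyed by the field names from the header, so records[0]["URL"] is ":pages:Downloads.html".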
Here is a complete list of all 173 configuration commands. For their usage, see the documentation above.
Here is a summary of all 39 commandline arguments. Again, for their usage, see the full documentation. Many of them can be given a - instead of a + to turn something off.
+1      Do DNS lookup
+7      stats for last 7 days
+a      ASCII output
+A      all reports (except hourly report)
+b      browser summary
+B      browser report
+c      status code report
+C      configuration command
+d      daily summary
+D      daily report
+e      error report
+f      referrer report
+form   do a form
+F      from
+g      configuration file
-G      default config file off
+help   help message
+h      hourly summary
+H      hourly report
+i      directory report
+l      dirlevel
+m      monthly report
+n      hostname
+o      domain report
+O      outfile
+p      logo
-q      no warnings
+r      request report
+s      host count
+ss     approximate host count
+S      host report
+t      file type report
+T      to
+u      host url
+U      cache file
+v      printvbles
+V      debug level
+w      pagewidth
+W      weekly report
+x      general summary
See also the glossary above.
If we are doing a `top n' report and two entries tie for nth place, only one will be printed.
The reported `running time' is elapsed real time, not CPU time.
You can sort a report by requests even when you have turned off the request columns. This may confuse your readers.
The bytes sometimes aren't reported correctly. This is really a server bug. Servers don't actually measure the number of bytes transferred, they measure the size of the file they are about to transfer, so if a connection is interrupted, they may write down more bytes than were actually transferred. (Actually, they sometimes do a bit better than that, but it's still likely to be an overestimate). Of course any inaccuracy in the logfile will make analog not tell the truth.
This is a beta test, and is likely to contain a couple of bugs. If you find any more, please tell me! Thanks.
I can no longer keep everyone who writes to me informed about updates. But if you want to be informed about updates, send me mail with "subscribe analog" in the subject line, and I'll make sure to put you on the mailing list. (You can write to me in the body of the message as well; I'll still read it).
I am happy to help people who have trouble with analog, but please read the FAQ first. Also, you might be able to diagnose the problem yourself if you run
analog -v [your usual options]
which lists the value of all variables. But if you still can't get it to work, ask me. It helps me find bugs, and to know where the documentation is unclear. When submitting bug reports, please include the version number (which you can find out by the command analog -v) and what type of computer you're using.
Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.
For recent developments I have to thank Dave Stanworth (again!) and all the other people who have provided mirror sites for analog. I have to thank Mark Roedel for the DOS version, Jason Linhart for the Mac version and Dave Jones for the VMS version. Stephan Somogyi and Nigel Perry have also contributed code for the Mac version. For the translations into other languages, I have to thank Patrice Lafont (French), Mario Ellebrecht (German), Furio Ercolessi (Italian), Javier Solis (Spanish) and Adrian Price (Danish).
Analog is copyright (C) Stephen R. E. Turner 1995, 1996, except those parts written by other people.
This licence describes the conditions under which you may use, modify and
distribute version 1.92beta of analog ("the program"). Except where stated,
the conditions of this licence apply equally to the source code for the
program and to any compiled version. If you are unable or unwilling
to accept these conditions in full, then, notwithstanding the conditions in
the remainder of this licence, you may not use, modify or distribute the
program at all. Text in square brackets is intended for guidance only and does
not form part of the licence in any way.
[Analog is free software. This licence is designed not to restrict your freedom except insofar as is necessary to ensure that the program remains free for all. Of course, I don't refuse donations!]
Stephen R. E. Turner
Page last modified: 08-Oct-96