This Readme describes analog1.9beta6. For the latest version of analog, see the analog home page.
This program analyses logfiles from WWW servers. It should work on any Unix system. It is designed to be fast and to produce attractive statistics. For more details, see the
For examples of the output see
This program is free, and may be freely distributed and modified provided that any distribution is accompanied by an unchanged copy of this Readme file, and that this condition remains in force. Where possible, please distribute source code instead of (or with) executables as this allows the user further configuration options. If you make modifications, I should be grateful if you would let me know what modifications you have made. No warranty of any sort is given or implied for this program or its use. This is a beta test version, and although I believe it to be reliable, some bugs can still be expected.
Next you must move the images that came with the analog program (in the directory images) into the IMAGEDIR specified in analhead.h.
When you have done that, compile the program by typing
make
(It may take a while as the program is rather big). If that doesn't work, have a look in the Makefile to see if there's anything that you need to change to suit your configuration, and try again. NB: There is a known problem with BSD/OS reporting a "yacc stack overflow" with some versions of gcc (in particular 2.6.3). Users have reported that switching to a newer gcc (2.7.0) or an older one (1.42, installed as /bin/cc) cured the problem.
Then just type
analog
to run the program. To send the output to a particular file instead of to the screen, type, e.g.,
analog > outfile.html
(This assumes that . is in your $PATH, but it should be).
Many options can be set in the file analhead.h. These can be changed before compiling the program. They are explained in that file, so they will not be documented again here.
Otherwise, analog takes its options from configuration files. Many of the configuration commands also have abbreviations as commandline arguments. So, for example, the configuration command
DAILY OFF
tells analog not to include a daily summary in the output. But this can also be specified by the command
analog -d
because the -d option is an abbreviation for DAILY OFF. In fact any configuration command can be specified on the commandline by means of the +C option; you could write
analog +C"DAILY OFF"
(This is most useful for running analog from a script or cron job).
To specify a configuration file, you use the commandline argument +g followed by the name of the file. For example,
analog +gextra.conf
tells analog to read configuration commands from the file extra.conf. (Note that there is no space between +g and the filename; this is true of all commandline arguments). (You can also specify standard input as the configuration file by the option +g-).
The configuration file can contain several commands on separate lines; any text after a hash (#) on a line is ignored as a comment. So the following is an example of a configuration file.
DAILY OFF      # We don't want a daily summary
FULLDAILY ON   # We want a full daily report instead
An argument to a command can be placed in single or double quotes, and it must be if the argument contains a hash or a space. Note that configuration commands are not generally the same as those in analhead.h, although many have the same name.
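As a rough illustration of the comment and quoting rules just described, the following Python sketch splits one configuration line into its command and arguments. It is not analog's own parser (analog is a C program); the function name split_config_line is hypothetical, and Python's shlex module is used only because its quoting and # comment handling happen to resemble the rules above.

```python
import shlex

def split_config_line(line):
    """Split one configuration line into words.

    Text after an unquoted '#' is dropped as a comment, and single or
    double quotes protect spaces and hashes inside an argument.
    (Illustrative only: shlex has extra rules, such as backslash
    escapes, that this Readme does not mention.)
    """
    return shlex.split(line, comments=True)

print(split_config_line("DAILY OFF      # We don't want a daily summary"))
print(split_config_line("MARKCHAR '#'   # put in quotes so that it isn't a comment"))
print(split_config_line("SUBDOMAIN cam.ac.uk 'University of Cambridge'"))
```

The second line shows why the hash has to be quoted: without the quotes, MARKCHAR would be left with no argument at all.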
Commandline arguments are read in the order in which they occur, and configuration files are read when the +g argument is reached. If commands conflict, later commands override earlier ones.
There are also two special configuration files which can be specified in analhead.h. The default configuration file is run before all other configuration files. You can put in there configuration commands that you normally want to include but which you can override. You can stop analog running the default configuration file by the commandline option -G.
The mandatory configuration file is run after all other configuration commands have been read, and overrides them all. If the mandatory configuration file cannot be found, the program exits immediately. This can be used by system administrators to stop users analysing certain files or producing certain reports, for example. (Note, however, that the only way to stop it completely is to deny users read access to the logfile. Otherwise there is nothing to stop them analysing it by another copy of analog or another program).
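The order in which the three sources of commands take effect can be pictured with a small Python sketch. It is only a model under simplifying assumptions: every command is treated as a plain "last one wins" setting (which is not true of commands that build up lists, such as the include and exclude commands), and the function name and the sample settings are made up for the illustration.

```python
def effective_settings(default_cmds, user_cmds, mandatory_cmds, skip_default=False):
    """Apply (command, value) pairs in the order described above.

    default_cmds   : from the default configuration file
    user_cmds      : from +g files and commandline options, in order
    mandatory_cmds : from the mandatory configuration file
    skip_default   : models the commandline option -G
    """
    settings = {}
    layers = ([] if skip_default else default_cmds) + user_cmds + mandatory_cmds
    for command, value in layers:
        settings[command] = value      # later commands override earlier ones
    return settings

default = [("DAILY", "ON"), ("WEEKLY", "ON")]
user = [("DAILY", "OFF"), ("FULLDAILY", "ON")]    # e.g. from +gextra.conf
mandatory = [("FULLDAILY", "OFF")]                # set by the administrator

print(effective_settings(default, user, mandatory))
```

Here the user's DAILY OFF overrides the default, but the user's FULLDAILY ON is in turn overridden by the mandatory file. Running analog -v remains the reliable way to see what a real combination of options amounts to.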
If this is all a bit confusing, just run
analog -v [other options]
That will tell you what the values of all the variables will be, based on analhead.h, the configuration options and the commandline options.
We shall now look at all the configuration commands and their commandline equivalents under the following headings. There is a summary list of all of them in the reference section.
The general summary can be turned off by the command
GENERAL OFF
(or the commandline argument -x) or on by GENERAL ON (or +x). If the general summary is off, all the `Go To' links in the output are also omitted.
The figures in parentheses refer to the last 7 days. They can be turned on and off with
LASTSEVEN ON   # or OFF
or with the commandline arguments +7 and -7. Note that the last 7 days refers to the last 7 days before the program is run, not before the last entry in the logfile. (If a TO command is specified, however, the last 7 days will be until that date).
Counting hosts is something which can take a lot of memory (we have to remember the name of every host that has accessed our server). If memory is a problem, you can turn the host counting off with the commandline option -s or the configuration command
COUNTHOSTS OFF
Alternatively, you can do an approximate host count in a fixed (pre-specified) amount of memory. You do this by using +ss or
COUNTHOSTS APPROX
and you can specify the amount of memory to be used by
APPROXHOSTSIZE 100000   # or whatever number, in bytes
About 3 bytes per host seems to give a very good estimate. Even 1 byte per host will give a fair estimate. If statistics for the last 7 days are on, twice this amount of space will be used.
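The arithmetic behind choosing APPROXHOSTSIZE can be written out as a short Python sketch. The function name and the figure of 30000 hosts are made up for the example; only the "3 bytes per host" rule of thumb and the doubling for the last 7 days come from the text above.

```python
def approx_host_memory(expected_hosts, bytes_per_host=3, last_seven=True):
    """Return (value to give APPROXHOSTSIZE, total bytes actually used)."""
    size = expected_hosts * bytes_per_host
    # With LASTSEVEN ON, twice the specified amount of space is used.
    total = size * 2 if last_seven else size
    return size, total

size, total = approx_host_memory(30000)
print("APPROXHOSTSIZE", size)
print("memory used:", total, "bytes")
```

So a server expecting about 30000 distinct hosts might use APPROXHOSTSIZE 90000, costing 180000 bytes if the last 7 days figures are also wanted.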
Each unit represents 4 000 requests, or part thereof.
   month:  #reqs:
--------  ------
Nov 1995: 119865:
Dec 1995: 121214:
Jan 1996: 144960:
The above display is of a monthly report. In this category, we also have the weekly report (one line for each week), daily summary (one line for Sundays, one for Mondays etc.), daily report (one line for each day ever), hourly summary (one line for midnight, one for 1am etc.) and hourly report (one line for each hour ever).
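To make "4 000 requests, or part thereof" concrete, here is a small Python sketch that works out how long each bar in the monthly report above would be. It assumes that a bar is simply the request count divided by the unit and rounded up; the function name and the + used as a bar character are illustrative choices, not analog's.

```python
def bar_length(requests, unit):
    """Number of units in a bar, counting any part of a unit as a whole one."""
    return (requests + unit - 1) // unit    # integer division, rounded up

months = [("Nov 1995", 119865), ("Dec 1995", 121214), ("Jan 1996", 144960)]
for month, reqs in months:
    print(f"{month}: {reqs:6d}: " + "+" * bar_length(reqs, 4000))
```

Under that assumption the three bars are 30, 31 and 37 units long.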
The following configuration commands show how to turn these reports on and off.
MONTHLY ON
WEEKLY ON
DAILY ON
FULLDAILY OFF
HOURLY ON
FULLHOURLY OFF
You can also use the corresponding commandline arguments +m, +W, +d, -D, +h, -H (use + to turn the corresponding reports on, - to turn them off).
You should use these reports sensitively. If your output is 200k long, people won't be able to download it. In particular, you probably don't want a daily report very often, and you certainly don't want an hourly report unless you have restricted the analysis to just a couple of days.
The graphs above are designed to produce coloured bars on graphical browsers and ASCII graphs on non-graphical browsers. They don't use tables or image-stretching properties, so should work on any browser. However, you can produce plain ASCII graphs instead by the command
GRAPHICAL OFF   # or ON to turn it back on again
This has the advantage of producing smaller output which does not require any images to be downloaded.
The graphs rely on having the images distributed with analog available in the directory IMAGEDIR specified in analhead.h; or you can override that choice with a command like
IMAGEDIR /Images/
You can change the character used in the graphs on non-graphical terminals by means of a command such as
MARKCHAR '#' # put in quotes so that it isn't a comment
The graphs can be plotted by bytes transferred instead of by requests. This can be done by means of commands like
MONTHGRAPH B   # by bytes
WEEKGRAPH R    # by requests
There are also commands DAYGRAPH, FULLDAYGRAPH, HOURGRAPH and FULLHOURGRAPH. Alternatively, you can add the letter after the relevant commandline argument; for example, +hB to turn on the hourly summary with a graph sorted by bytes.
You can display the graphs backwards (with most recent requests at the top) by means of commands like
MONTHLYBACK ON   # or OFF
There are also the commands WEEKLYBACK, FULLDAILYBACK and FULLHOURLYBACK. The hourly summary and daily summary cannot be displayed backwards. I find it confusing to have some of the reports going backwards and some forwards, so you can also use
ALLBACK ON   # or OFF
to change all four of the reports to backwards or forwards together.
You can specify which columns appear in the various reports in which order. The above example showed the number of requests being given. You can also have the percentage of the requests, the number of bytes, and the percentage of the bytes. For example, the command
MONTHCOLS RBbr
tells analog to include in the monthly report columns for number of requests (R), number of bytes (B), percentage of bytes (b), and percentage of requests (r) in that order. The other commands are WEEKCOLS, DAYCOLS, FULLDAYCOLS, HOURCOLS and FULLHOURCOLS.
For some reports, analog needs to know where weeks begin and end. You can specify
WEEKBEGINSON WEDNESDAY
to change it to Wednesday, for example. (I guess Sunday or Monday is more likely).
In the graphs, analog will choose the value of the unit automatically based on the length of the largest bar and the width of the page. You can specify the page width with, for example,
PAGEWIDTH 70
or the commandline option +w70. (I find about 65 works well). (Note that the PAGEWIDTH may not be strictly obeyed with GRAPHICAL ON, as the graphics are measured in pixels not characters). Occasionally you may want to specify the value of the unit yourself (for example, to make it the same as on some other page). You can do this by a command like
MONTHLYUNIT 1000
Setting it to 0 makes analog choose it automatically again. Of course, the other reports have WEEKLYUNIT, DAILYUNIT, FULLDAILYUNIT, HOURLYUNIT and FULLHOURLYUNIT.
Domain report
  #reqs :  %bytes : domain
--------  --------  ------
 103125 :  46.58% : .uk (United Kingdom)
( 64982):( 35.45%): cam.ac.uk (University of Cambridge)
( 47138):( 20.55%): statslab.cam.ac.uk
  49290 :  12.49% : .edu (USA Educational)
Host report
#reqs: %bytes: host
-----  ------  ----
   10:  0.03%: zlsm03.arcs.ac.at
   11:  0.04%: iki10.boku.ac.at
  158:  0.15%: talus.maths.su.oz.au
Directory report
 #reqs: %bytes: directory
------  ------  ---------
237985: 35.40%: /~sret1/
 18596: 17.60%: /~rrw1/
  3574: 11.89%: /~richard/
Request report
#reqs: %bytes: filename
-----  ------  --------
33980: 23.66%: /~sret1/backgammon/main.html
21162:  2.69%: /~sret1/backgammon/bitmaps/board.xbm
12690:  0.86%: /
Referer report
#reqs: refering URL
-----  ------------
  260: http://webcrawler.com/cgi-bin/WebQuery
  239: http://www.yahoo.com/Computers_and_Internet/Internet/World_Wide_Web/HTTP/Servers/Log_Analysis_Tools/
  185: http://guide-p.infoseek.com/WW/NS/Titles?qt=backgammon&col=WW
  149: http://www.yahoo.com/Recreation/Games/Board_Games/Backgammon/
Browser summary
#reqs: browser
-----  -------
16797: Netscape
 1532: Mosaic
  693: IWENG
  492: Lynx
Browser report
#reqs: browser
-----  -------
 3105: Mozilla/1.22 (Windows; I; 16bit)
 2785: Mozilla/1.1N (Windows; I; 16bit)
  458: IWENG/1.2.003
These reports can be turned on and off with commands like
DOMAIN ON
FULLHOSTS OFF
DIRECTORY ON
REQUEST ON
REFERER OFF
BROWSER ON
FULLBROWSER OFF
or with the commandline arguments +o (domain report), -S (host report), +i (directory report), +r or +R (request report; see below), -f (referer report), +b (browser summary) and -B (browser report). (As in the date reports, use + to turn the corresponding reports on, - to turn them off).
Another similarity with the date reports is that you can tell analog which columns to print on each report with the commands DOMCOLS, HOSTCOLS, DIRCOLS, REQCOLS, REFCOLS, BROWCOLS and FULLBROWCOLS. Again, each command is followed by letters indicating which columns are wanted and in which order. For example,
DOMCOLS RrBb # no. of reqs, %age reqs, no. of bytes, %age bytes
Each of these reports can be sorted in four different ways; by bytes, by requests, alphabetically or randomly (i.e., unsorted). (The only advantage of the last one is so as not to spend time sorting very long reports). The commands to change this look like
DOMSORTBY BYTES   # or REQUESTS or ALPHABETICAL or RANDOM
The commands for the other reports are HOSTSORTBY, DIRSORTBY, REQSORTBY, REFSORTBY, BROWSORTBY and FULLBROWSORTBY. You can also add a letter b, r, a or x after the relevant commandline option; for example, +Sa for a host report sorted alphabetically.
It is important to be able to specify how many entries you want printed in each report. This is done by means of two variables for each report, one specifying the minimum number of bytes if the sorting is by bytes, and the other specifying the minimum number of requests if the sorting is by any of the other three methods. The following configuration commands illustrate the possible usages.
DOMMINREQS 20             # all items with at least 20 requests
HOSTMINREQS -20           # the first 20 items
                          # NB: useless if alphabetical or random sort
REQMINREQS 0.01%          # all items with at least 0.01% of the requests
DIRMINBYTES 100000        # all items with at least 100000 bytes
REFMINBYTES 100k          # all items with at least 100 kbytes
                          # (10M etc. also work)
BROWMINBYTES -40          # Top 40 if sorting is by bytes
FULLBROWMINBYTES 0.005%   # all with at least 0.005% of the traffic
You can also specify the amount on the commandline by adding it after the sort method. For example, +Sr-50 turns on a host report, sorted by requests, with only the top 50 items included, and +ib20k gives a directory report, sorted by bytes, including all directories with at least 20 kilobytes transferred.
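The three forms a floor can take (a plain number, a negative number and a percentage) are perhaps easiest to compare side by side. The Python sketch below applies each form to a small table of request counts taken from the host report example earlier. It is a model of the behaviour described here, not analog's code: the function name is hypothetical, only the request-count case is covered, and how ties at the cut-off are treated is an assumption.

```python
def select_items(counts, floor):
    """Pick report entries from {name: requests} according to a floor.

    floor is a string in one of the three forms shown above:
      "20"     keep items with at least 20 requests
      "-20"    keep the 20 items with the most requests
      "0.01%"  keep items with at least 0.01% of all requests
    The result is a list of (name, requests), largest first.
    """
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    if floor.endswith("%"):
        threshold = sum(counts.values()) * float(floor[:-1]) / 100
        return [kv for kv in ranked if kv[1] >= threshold]
    n = int(floor)
    if n < 0:
        return ranked[:-n]              # the top -n items
    return [kv for kv in ranked if kv[1] >= n]

hosts = {
    "zlsm03.arcs.ac.at": 10,
    "iki10.boku.ac.at": 11,
    "talus.maths.su.oz.au": 158,
}
print(select_items(hosts, "11"))    # at least 11 requests
print(select_items(hosts, "-1"))    # the single busiest host
print(select_items(hosts, "10%"))   # at least 10% of all requests
```

The negative form only makes sense when the report is sorted by size, which is why the note above calls it useless for alphabetical or random sorting.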
We now describe features unique to a particular one of the reports. First the domain report.
Subdomains can be specified for each domain. The syntax of the command is
SUBDOMAIN subdomain subdomain_name
If the subdomain name has spaces in, it must be enclosed in quotes. The subdomain name can be omitted, indicating a nameless subdomain. For example, to produce the example above, I would include the following lines in the configuration file
SUBDOMAIN cam.ac.uk 'University of Cambridge'
SUBDOMAIN statslab.cam.ac.uk
Numerical subdomains (which have most significant part on the left) can also occur. They will look like
131 The Ever-Popular 131 domain
131.111   # Nameless
Also subdomains with wildcards in can occur. The following are examples:
SUBDOMAIN *.edu       # mit.edu, umn.edu etc.
SUBDOMAIN 131.111.*   # 131.111.1, 131.111.2 etc.
SUBDOMAIN %           # all top-level numerical domains, from 1 to 255
The variables SUBDOMMINREQS and SUBDOMMINBYTES can be specified in the same way as above, except they can't be negative. If you ask for wild subdomains, you will probably want to set the minimum requests and minimum bytes quite high. However, you cannot alter the sort order; within a domain, subdomains will always be output in alphabetical order.
There is a command NOTSUBDOMAIN to erase a previously requested subdomain. For example, you can write
NOTSUBDOMAIN *.edu
NOTSUBDOMAIN cam.ac.uk
However, if you request, for example, *.edu, then NOTSUBDOMAIN mit.edu will not override it.
The domain report relies on having a domains file available, listing which geographical locations correspond to which domains. Which file is to be used as the domains file can be specified by the command
DOMAINSFILE domainsfile
The correct format of the domains file is explained in a separate section.
There is little to say about the host report, except to note that alphabetical sorting is by domain as most significant part. This report can be very long and slow to sort, and should be used with a high floor if at all.
The directory report has one further variable, which is the level (or depth) of the directory report. The example above is a level 1 report; a level 3 report might look like
 #reqs: %bytes: directory
------  ------  ---------
 43772: 72.06%: /~sret1/backgammon/
173426: 19.93%: /~sret1/backgammon/bitmaps/
 11298:  4.14%: /~sret1/
This can be specified by the commandline option +l3 or the configuration command
DIRLEVEL 3
Note that the figures for each directory do not include those for the subdirectories of that directory, except where the directory is at the deepest level. So in the above example, /~sret1/backgammon/bitmaps/dice/d1.xbm would be reckoned in the directory /~sret1/backgammon/bitmaps/ (which is at the deepest level) but not in the other two directories.
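The rule for which directory a file is reckoned in can be expressed in a few lines of Python. This is a sketch of the rule as stated above rather than analog's implementation; the function name is hypothetical, and counting a file in the root directory under / is an assumption.

```python
def directory_at_level(filename, level):
    """Return the directory a file is counted under in a level-N report."""
    parts = filename.split("/")[1:-1]   # directory components, without the file
    kept = parts[:level]                # anything deeper is folded into level N
    return "/" + "".join(part + "/" for part in kept)

for name in ["/~sret1/backgammon/bitmaps/dice/d1.xbm",
             "/~sret1/backgammon/main.html",
             "/~sret1/index.html"]:
    print(name, "->", directory_at_level(name, 3))
```

With DIRLEVEL 3 the first file lands in /~sret1/backgammon/bitmaps/, the second in /~sret1/backgammon/ and the third in /~sret1/, matching the three rows of the example report. With DIRLEVEL 1 all three would be counted under /~sret1/.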
We mentioned above that the request report has two commandline arguments, +r and +R. The difference is that if the commandline option +r is used, only pages will be displayed in the report. If you want to list all files, including, for example, graphics, then you should use +R instead. Alternatively the configuration command
REQTYPE PAGES   # or ALL
will control whether pages or all files are listed.
There are three possible modes of linking in the request report; you can link to none of the files, or pages only, or all files. The commandline options for these are -k, +k and +kk respectively; or you can use the configuration command
PAGELINKS OFF   # or ON, or ALL
There is also a related command BASEURL to specify a URL to prepend to the links. For example, if
BASEURL http://www.statslab.cam.ac.uk
were specified, then /~sret1/analog/ would be linked to http://www.statslab.cam.ac.uk/~sret1/analog/. This is useful if you want to display the statistics on a different server than the one they belong to. (See below for combining logfiles from two different servers).
You can also specify in the configuration file what should be counted as a `page' in the requests report. At the beginning, the following are `pages': *.html, *.htm, *.shtml, *.shtm, *.html3, *.ht3 and directories (*/). The command
ISPAGE filename
will specify that some other file is a `page'. You can give a list of filenames, separated by commas (without spaces). For example,
ISPAGE *.ps,*.ps.gz
would mean that Postscript files and gzipped Postscript files are to be regarded as pages. You can also use
ISNOTPAGE filename
to specify that something which would otherwise be a page is not to be regarded as a page.
The referer report, browser summary and browser report have no special commands, although the relevant logfiles must be present on the system (see below for how to specify where they are). Note that if you are using separate logfiles, rather than the NCSA combined log, you cannot sort these reports by bytes, or include bytes columns in the reports.
However, it is important to note the limitations of these reports. For the referer report, many browsers do not pass this information to the server, and many pass it wrongly (sending the URL of the previous page even when your page was not reached by selecting a link from that page). The browser reports are much worse. Some browsers even lie deliberately about what sort of browser they are, or let users configure the browser name. Furthermore, there is no fixed format for browser information. (NB: I have combined all Mosaics as a special case). In addition, graphical browsers automatically generate more requests than non-graphical browsers by loading the graphics, so it is not a very good guide to browser usage. For all these reasons many people would argue that the browser reports are so unhelpful as to be worse than useless. At best, interpret them with extreme caution.
#occs: error type
-----  ----------
19360: Send timed out
11286: Send aborted
 7962: File does not exist
The status code report lists how many of each type of status code occurred in your logfile:
#occs: no. description
-----  ---------------
35564: 200 OK
  173: 301 Document moved
    3: 302 Document found elsewhere
 5732: 304 Not modified since last retrieval
They are turned on and off by commands like
STATUS ON
ERROR OFF
or by the commandline arguments +c and -e. (+ for on, - for off). There is a command ERRMINOCCS which says how many occurrences of an error there must be before it appears on the error report. For example
ERRMINOCCS 20
The first thing to know is how to specify a different logfile to analyse. A default one should have been specified in analhead.h, but you can also specify one by just putting its name on the commandline; so, for example, the command
analog logfile.log
will use that logfile for its report. Analog will read the common log format (which most servers write) as well as the old NCSA format and the NCSA combined log format (which includes referer and agent information). Detection of which format each line of the logfile is in is automatic. You can also write
analog -
to use standard input as the logfile. (This is useful in constructing pipes). You can also specify which logfile to use in the configuration file by means of a command like
LOGFILE logfile.log   # or stdin for standard input
You can specify several logfiles on one configuration line by separating their names with commas (no spaces). For example
LOGFILE log1,log2,log3
Sometimes it is necessary to combine logfiles from two different servers, without getting filenames that happen to be the same on both servers confused. To do this you can use a second argument to the LOGFILE command, specifying a prefix for each filename. For example
LOGFILE log1,log2 http://www.a.com   # These logfiles from a.com
LOGFILE log3 http://www.b.com        # This one from b.com
If you use this, the directory report will need specifying to a deeper level.
Logfiles specified in the user's configuration files and commandline options replace any specified in the default configuration file, and are in turn overridden by any in the mandatory configuration file. In addition you can use none as the name of the logfile to overwrite the specification of all previous logfiles.
Analog can also read the NCSA/Apache referer log, agent log and error log formats. Logfiles of these types can be specified by commands like
REFLOG referer_log
BROWLOG agent_log.old,agent_log
ERRLOG error_log
The same comments about which logfiles replace which apply as in the last paragraph.
Analog can also read the NCSA/Apache combined log. This is just specified as a LOGFILE as above; analog will automatically recognise and parse the extra fields.
Analog can uncompress compressed logfiles. You need to tell it how to uncompress each type of file by supplying a command that sends the uncompressed file to standard output (rather than uncompressing it into a file). The file can be a list of types of files, separated by commas. For example, depending what commands are on your system, you can use
UNCOMPRESS *.gz "gunzip -c"
# or
UNCOMPRESS *.gz,*.Z gzcat
This would be a suitable command to include in the default configuration file.
There are various commands which instruct the program to analyse only part of the logfile. First, you can instruct the program only to take into account certain files. This is done by means of the FILEINCLUDE and FILEEXCLUDE commands. Each command can have a list of filenames, separated by commas (no spaces). One asterisk and any number of question marks can appear in each of the filenames specified, as wildcards. Each file is included and excluded as each new command is reached. Unspecified files are included if the first command found was an exclusion, and excluded if the first command found was an inclusion. For example, the configuration
FILEINCLUDE /~sret1/*
FILEEXCLUDE /~sret1/backgammon/*,/~sret1/analog/*
FILEINCLUDE /~sret1/backgammon/*.gif
would instruct the program to examine only my files, excluding my backgammon and analog files, but including gifs in my backgammon directory. On the other hand,
FILEEXCLUDE /~sret1/*
would analyse all files except mine. Remember you can always run analog -v to see what the options you have specified represent.
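The way successive FILEINCLUDE and FILEEXCLUDE commands interact can be modelled in Python as follows. This is a sketch of the rules stated above, not analog's code. The function name and rule representation are hypothetical, and Python's fnmatchcase stands in for analog's wildcard matching (it is more permissive: it allows several asterisks and also [...] character sets).

```python
from fnmatch import fnmatchcase

def is_included(filename, rules):
    """Decide whether a file is analysed.

    rules is an ordered list of ("include" | "exclude", pattern) pairs,
    one per filename given to FILEINCLUDE or FILEEXCLUDE.
    """
    if not rules:
        return True
    # Unspecified files are included if the first command was an
    # exclusion, and excluded if it was an inclusion.
    included = rules[0][0] == "exclude"
    for kind, pattern in rules:
        if fnmatchcase(filename, pattern):
            included = kind == "include"   # the latest matching command wins
    return included

rules = [
    ("include", "/~sret1/*"),
    ("exclude", "/~sret1/backgammon/*"),
    ("exclude", "/~sret1/analog/*"),
    ("include", "/~sret1/backgammon/*.gif"),
]
for name in ["/~sret1/index.html",
             "/~sret1/backgammon/main.html",
             "/~sret1/backgammon/board.gif",
             "/~rrw1/index.html"]:
    print(name, is_included(name, rules))
```

Under these rules the first and third files are analysed and the second and fourth are not, which agrees with the description of the first example above.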
You can exclude all gifs with FILEEXCLUDE *.gif but this may not be what you want to do. This will then exclude them from all the reports, and not count the bytes transferred due to them. More likely, you just want to exclude them from the request report while still including them in the other reports, which you can do by means of REQTYPE PAGES.
There are similar commands HOSTINCLUDE and HOSTEXCLUDE to analyse only the requests from certain sites. For example,
HOSTEXCLUDE emu.pmms.cam.ac.uk
HOSTEXCLUDE *.statslab.cam.ac.uk
would ignore accesses from emu and from the whole of the statslab.
There are also commands REFINCLUDE and REFEXCLUDE for referers. You probably want to ignore referers from your own site. For example, I use
REFEXCLUDE http://www.statslab.cam.ac.uk/*
This would be a suitable command to put in your default configuration file.
Finally, there are commands to analyse only a subset of the dates in the logfile. The simplest usage is FROM yymmdd and TO yymmdd. So, for example, to analyse only requests in July 1995 I would use the configuration
FROM 950701
TO 950731
Also each of the pairs of digits can be preceded by - and the month and date can be preceded by + to represent time relative to the current date. This allows constructions like
FROM -01-00+01    # from tomorrow last year
TO -00-0131       # to the end of last month (OK even if last month
                  # didn't have 31 days)
FROM -00-00-112
TO -00-00-01      # statistics for the last 16 weeks
There are commandline abbreviations +F and +T for these commands; for example +T-00-00-01 looks at statistics until the end of yesterday. -F and -T turn off the from and to, as do FROM OFF and TO OFF.
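The relative date notation is compact enough to be worth working through. The Python sketch below turns a FROM or TO argument into a calendar date, given today's date explicitly so that the result is reproducible. It is one plausible reading of the rules above, not analog's own code. In particular, the function name is hypothetical, two-digit absolute years are assumed to mean 19yy, and relative days are assumed to be counted from today's day of the month.

```python
import calendar
import re
from datetime import date, timedelta

# Year may carry '-', month and day may carry '+' or '-'.
FIELD = re.compile(r"(-?)(\d\d)([+-]?)(\d\d)([+-]?)(\d+)$")

def resolve_date(spec, today):
    m = FIELD.match(spec)
    if m is None:
        raise ValueError("unrecognised date: " + spec)
    ysign, yy, msign, mm, dsign, dd = m.groups()
    yy, mm, dd = int(yy), int(mm), int(dd)

    # Year: relative to this year, or (assumption) 19yy.
    year = today.year - yy if ysign else 1900 + yy
    # Month: relative to this month, or an absolute month number.
    if msign:
        month = today.month + (mm if msign == "+" else -mm)
    else:
        month = mm
    # Let months outside 1..12 roll over into neighbouring years.
    year, month0 = divmod(year * 12 + month - 1, 12)
    month = month0 + 1

    if dsign:
        # Relative day: count from today's day of the month.
        offset = dd if dsign == "+" else -dd
        return date(year, month, 1) + timedelta(days=today.day - 1 + offset)
    # Absolute day: clamp to the last day of a short month.
    return date(year, month, min(dd, calendar.monthrange(year, month)[1]))

today = date(1996, 3, 15)          # an arbitrary "current date" for the example
for spec in ["950701", "-01-00+01", "-00-0131", "-00-00-112"]:
    print(spec, "->", resolve_date(spec, today))
```

Taking 15 March 1996 as the current date, -00-0131 comes out as 29 February 1996 (the month has no 31st, so the last day is used) and -00-00-112 as 24 November 1995, which is 16 weeks earlier.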
If a TO command is given, the figures for the last 7 days refer to the time until then.
FILEALIAS file1 file2
says that whenever file1 occurs in the logfile, it is to be replaced by file2. Analog already understands that /dir/index.html is the same as /dir/ and translates `escaped' entities (e.g., %7E is the same as ~) so these don't need to be specified separately. It also understands that .. means `parent directory,' . means `this directory' and // is the same as /, and translates those filenames to their canonical forms.
Wildcards can occur in the aliases. For example, after
FILEALIAS /~sret1/*.gif /images/*g.gif
FILEALIAS /~sret2/a?c* /sa/*
/~sret1/a.gif would be translated to /images/ag.gif and /~sret2/abcd.txt would become /sa/d.txt.
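The effect of wildcards in an alias can be reproduced with a short Python sketch. It assumes, on the evidence of the two examples above, that the text matched by the asterisk in the first argument is substituted for the asterisk in the second, and that a question mark matches exactly one character without being carried over. The function name is hypothetical and this is not analog's implementation.

```python
import re

def apply_alias(name, pattern, target):
    """Return name rewritten by one alias, or unchanged if it does not match."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("(.*)")       # remember what the asterisk matched
        elif ch == "?":
            parts.append(".")          # exactly one character, not remembered
        else:
            parts.append(re.escape(ch))
    m = re.fullmatch("".join(parts), name)
    if m is None:
        return name
    star = m.group(1) if m.groups() else ""
    return target.replace("*", star)

print(apply_alias("/~sret1/a.gif", "/~sret1/*.gif", "/images/*g.gif"))
print(apply_alias("/~sret2/abcd.txt", "/~sret2/a?c*", "/sa/*"))
```

The two calls print /images/ag.gif and /sa/d.txt, the same results as given in the text.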
There are also the commands HOSTALIAS and REFALIAS (for referers) which work in the same way. HOSTALIAS is particularly useful if your server records local hostnames in the logfile instead of full internet names. Also, if a host has two names, they can be combined in this way. So, for example, I might find it convenient to use
HOSTALIAS lion lion.statslab.cam.ac.uk
HOSTALIAS www lion.statslab.cam.ac.uk
HOSTALIAS www.statslab.cam.ac.uk lion.statslab.cam.ac.uk
REFALIAS could be used to combine several referers from one site. For example
REFALIAS http://www.webcrawler.com/* http://webcrawler.com/
REFALIAS http://webcrawler.com/* http://webcrawler.com/
A pair of related commands is WITHARGS and WITHOUTARGS. Normally any arguments given as part of a URL (after a question mark) are ignored. However, if a configuration command like
WITHARGS /cgi-bin/prog.cgi
is given, then the arguments to that file will form part of the filename. So /cgi-bin/prog.cgi?a and /cgi-bin/prog.cgi?b will be regarded as separate files, whereas without that command they would both have been translated to /cgi-bin/prog.cgi. Note that the filename with the arguments still has to fit inside the maximum length of a filename. Asterisks and lists of files can again occur, and there is also a parallel command WITHOUTARGS; for example,
WITHARGS /cgi-bin/*
WITHOUTARGS /cgi-bin/spam.cgi
would read the arguments for all files in /cgi-bin/ except spam.cgi.
Commands REFWITHARGS and REFWITHOUTARGS work in the same way for referers, except that in this case the default is to include all the arguments (so that you can see what people are requesting from search engines).
The ability to look up numerical IP addresses and translate them to hostnames has been removed in this version of analog because it didn't work well and caused problems on some systems. I might put it back in a later version, but for now I recommend instead pre-processing the logfile with the program logresolve.c (which is distributed with the Apache server).
To produce a cache file instead of the normal output, use the command
OUTPUT CACHE
To read data from a cache file, use, e.g.,
CACHEFILE cache.out
(This will still read the ordinary logfile as well). You can also use the commandline argument +Ucache.out. You can specify several cache files by putting them in a comma-separated list, or using several +U commands.
To use this feature and avoid losing entries or double counting them, I suggest you follow the following procedure.
Although it should now be safe to throw away the old logfile, I can take no responsibility if something goes wrong. This is beta test software and is expected to contain bugs. Also if you are going to use this feature please make sure you understand what information is and is not recorded in the cache file. You may find that the cache file is not the right feature for you. Compressing logfiles (with gzip -9) is very efficient owing to the large number of repeated strings. That in itself may solve your filespace problems.
OUTPUT ASCII   # or HTML
If you choose ASCII output, some of the other options are ignored, but it should be obvious which ones they will be.
You can select the file for the output to be sent to in the configuration file or on the commandline. So instead of
analog > outfile.html
you can use the configuration command
OUTFILE outfile.html
or the commandline option +Ooutfile.html.
There is a configuration command REPORTORDER which specifies which order the reports should occur in. The usage is a line like
REPORTORDER hHDdWmoSirfbBec
This says that the reports should occur in the order hourly summary (h), hourly report (H), daily report (D), daily summary (d), weekly report (W), monthly report (m), domain report (o), host report (S), directory report (i), request report (r), referer report (f), browser summary (b), browser report (B), error report (e) and status code report (c). It is important to include all the above fifteen letters exactly once each.
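Since each of the fifteen letters must appear exactly once, it can be handy to check a REPORTORDER string before using it. The following Python sketch does that; it is a stand-alone helper with a made-up name, not part of analog, and the wording of its messages is arbitrary.

```python
REPORT_LETTERS = "hHDdWmoSirfbBec"   # the fifteen letters listed above

def check_report_order(order):
    """Return a list of problems; an empty list means the order is valid."""
    problems = []
    for letter in REPORT_LETTERS:
        n = order.count(letter)
        if n == 0:
            problems.append("missing " + letter)
        elif n > 1:
            problems.append("repeated " + letter)
    for letter in sorted(set(order) - set(REPORT_LETTERS)):
        problems.append("unknown " + letter)
    return problems

print(check_report_order("hHDdWmoSirfbBec"))   # valid
print(check_report_order("mWdDhHoSirfbBe"))    # the c has been left out
```

Note that the letters are case-sensitive: d and D, h and H, and b and B are different reports.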
There is a command
ALL ON
to include all reports except the hourly report (particular ones can then be omitted with -d or whatever); likewise ALL OFF omits them (and particular ones can then be included). The equivalent commandline arguments are +A and -A. The hourly report and general report are not turned on by ALL ON or +A; they must be turned on separately with +H and +x. Note also that order is important; for example, +i -A +r will include the request report but not the directory report.
The title line of the output page contains three adjustable variables. First, the logo in the top left hand corner can be turned on or off, or any other logo substituted (for example, your organisation's logo). This is accomplished by the command
LOGOURL url   # or none
or by the commandline arguments -p (no logo: mnemonic, p for picture) and +pURL. The organisation name on the title line can be specified by means of the option -nname; the hostname of your server would also be an appropriate thing to put here. The name can have a link to your server's home page by use of the option -uURL; use -u- if you don't want any link. The equivalent configuration options are
HOSTNAME name   # must be in quotes if it contains spaces
HOSTURL URL
HOSTURL -       # for no link
A header file and footer file can be inserted near the top and bottom of your output. These should be written in HTML or ASCII according to whether your output is HTML or ASCII, and can contain anything you want. Possible uses include providing information about your organisation or about the way the statistics were calculated, linking to related pages, and no doubt many other things. The commands to achieve this are
HEADERFILE filename
FOOTERFILE none # if you don't want one
There is a command SEPCHAR to say which character should separate each group of three digits in long numbers. For example,
SEPCHAR ,
will give 123,456,789, whereas
SEPCHAR ' '
will give 123 456 789.
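The grouping that SEPCHAR controls can be sketched in a few lines of Python. This is an illustration of the idea only, with an invented function name (group_digits); it is not taken from the analog source.

```python
def group_digits(n, sepchar=","):
    """Insert sepchar between each group of three digits of a non-negative integer."""
    digits = str(n)
    groups = []
    # Peel off three digits at a time, starting from the right-hand end.
    while digits:
        groups.append(digits[-3:])
        digits = digits[:-3]
    return sepchar.join(reversed(groups))

print(group_digits(123456789, ","))  # prints 123,456,789
print(group_digits(123456789, " "))  # prints 123 456 789
```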
You can specify whether analog prints long numbers of bytes as exact numbers (e.g., 5,053,234) or as kilobytes, megabytes etc. (e.g., 4934k) by the command
RAWBYTES ON # for exact, OFF for abbreviated
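The two figures in the example above are consistent with dividing by 1024 and discarding the remainder, so the kilobyte case can be sketched as follows. This is an assumption made for illustration: the rounding rule and the points at which analog switches to megabytes and beyond are not described here, so the sketch stops at kilobytes. The function name format_bytes is invented.

```python
def format_bytes(nbytes, rawbytes):
    """Show a byte count exactly (rawbytes true) or abbreviated to kilobytes."""
    if rawbytes:
        # Exact figure, with a comma between each group of three digits.
        return format(nbytes, ",")
    # Assumption: 1k = 1024 bytes and the remainder is dropped.
    return str(nbytes // 1024) + "k"

print(format_bytes(5053234, True))   # prints 5,053,234
print(format_bytes(5053234, False))  # prints 4934k
```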
There is a debugging command, for printing (to stderr) problems with your logfile. There are currently three levels of debugging: 0 for no debugging, 1 for printing corrupt logfile lines (prepended by "C:"), and 2 which also prints hosts for which the domain is unknown (prepended by "U:") and errors which cannot be classified (prepended by "E:"). The command for level n debugging is
DEBUG n
and the equivalent commandline argument is +Vn (V for verbose). You can also use commandline options +V for level 1 and -V for level 0.
Finally, there is an option to turn off warnings. It is
WARNINGS OFF # or ON
The equivalent commandline argument is -q to turn warnings off (q for quiet) and +q to turn them on again. This is useful in scripts or cron jobs if you really do want to give a configuration that you know will generate a warning.
ad Andorra
ae United Arab Emirates
[...]
There can be arbitrary space between the code and the corresponding location. The codes are converted to lower case. Use ? (or anything starting with ?) for the name if you want the domain to be recognised, but don't want the name to be printed out. The domains do not need to be in alphabetical order, though humans may prefer it that way.
Comments can occur in the domains file. They are introduced by the character #. So you could write, for example,
uk United Kingdom # God save the Queen!
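The rules for the domains file given above (arbitrary space, lower-casing of codes, ? for an unprinted name, and # comments) can be summarised by a short reader written in Python. This is a sketch of the format as described, not analog's own parser. The function name parse_domains and the code xx in the usage example are made up, and treating a line with no name as an empty name is an assumption.

```python
def parse_domains(text):
    """Return {code: name}; name is None if the domain is recognised but not printed."""
    domains = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop any comment
        if not line:
            continue
        parts = line.split(None, 1)           # arbitrary space after the code
        code = parts[0].lower()               # codes are converted to lower case
        name = parts[1].strip() if len(parts) > 1 else ""
        domains[code] = None if name.startswith("?") else name
    return domains

sample = """ad    Andorra
AE United Arab Emirates
uk United Kingdom   # God save the Queen!
xx ?
"""
print(parse_domains(sample))
```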
To set up the form interface, go to the directory where the analog source code lives, and follow these steps.
If the third step above fails to generate a form, you can generate one yourself by means of the command analog -form +Oanalogform.html. You might also want to run this command yourself if you want to supply different default options from normal for the form user: if you run the command with extra commandline or configuration file options, they will be respected in the construction of the form.
It is expected that system administrators may want to provide different options on the forms from the default ones. (For example, the form does not by default allow an hourly report.) For this reason, the cgi program understands various other options that are not normally on the form. These can be added to the form by hand. For example, you may want to allow a choice of logfiles, perhaps via a <select>. Or you may want form users to use certain default options; these could be specified as <input type=hidden>. Because the form uses GET not POST you can also construct links to it. For experts, here follows a complete list of form options. [*] marks a default value, i.e., what is sent to analog if you don't send anything else to the cgi program. These defaults can override the analog program defaults. Items without a [*] use the program default if nothing else is specified.
bq    browser summary? 0 for off [*], 1 for on, 2 for program default.
ba    +ve MINBROWREQS
bb    -ve MINBROWREQS
bc    +ve MINBROWBYTES
bd    -ve MINBROWBYTES
bs    BROWSORTBY (0 = REQUESTS [*], 1 = BYTES, 2 = ALPHABETICAL, 3 = RANDOM)
      Other reports similarly with initial B, f, i, o, r, S in place of b.
ch    COUNTHOSTS? 0 for off, 1 for on, 2 for approx
cq    status code report?
dq    daily summary?
dg    DAYGRAPH (R or B)
      Other time reports similarly with D, h, H, m, W in place of d.
eq    error report?
fi    FILEEXCLUDE; list, separated by commas
fr    FROM
fy    FILEINCLUDE; list, separated by commas
gr    GRAPHICAL? 0 for off, 1 for on
hi    HOSTEXCLUDE; list, separated by commas
ho    HOSTURL
hy    HOSTINCLUDE; list, separated by commas
ie    DIRLEVEL
lb    BROWLOG; list, separated by commas
lc    CACHEFILE; list, separated by commas
le    ERRLOG; list, separated by commas
lf    REFLOG; list, separated by commas
lo    LOGFILE; list, separated by commas
or    HOSTNAME
ou    OUTPUT -- 0 for HTML [*], 1 for ASCII, (2 reserved for future use), 3 for program default
rl    REQLINKS -- f for ALL (files), p for PAGES, n for OFF (none), d for program default
rt    REQTYPE -- f for ALL, p for PAGES, d for program default
to    TO
TZ    timezone
wa    WARNINGS (to error_log) -- 0 for OFF, 1 for ON [*].
xq    general report?
Important note: Do not add options for HEADERFILE and FOOTERFILE to the form. This would be a security risk.
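Since the form uses GET, a link to the cgi program is just its URL followed by a query string built from the two-letter options above. Here is a sketch in Python of constructing such a link. The URL of the cgi program is hypothetical (substitute wherever yours is installed), the function name form_link is invented, and the option values shown are only examples.

```python
from urllib.parse import urlencode

# Hypothetical location of the cgi program; replace it with your own.
FORM_PROGRAM = "http://www.example.com/cgi-bin/analogcgi"

def form_link(options):
    """Build a GET link from (option, value) pairs taken from the list above."""
    return FORM_PROGRAM + "?" + urlencode(options)

# Browser summary on, daily summary on, ASCII output.
print(form_link([("bq", "1"), ("dq", "1"), ("ou", "1")]))
# prints http://www.example.com/cgi-bin/analogcgi?bq=1&dq=1&ou=1
```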
If the form doesn't seem to work, check the following:
It is better, although not essential, if when you change the default options for your analog, you remake the form.
Note that you probably want to restrict access to the form and form program to certain users; if it is world readable there could be considerable load on your server as well as potential confidentiality problems. Exactly how to do this depends on which server you are running.
Unfortunately, you cannot tell how many times your file has been read from this. The user may in fact request the file from a proxy server which already has a copy of it, or retrieve it from a local cache. In these cases no connection is made to your server, and no request is scored.
There are three categories of request, which can be seen in the status code report. Completed (or successful) requests are those with codes in the 200s (where the document was returned) or with code 304 (where the document was not needed because it had not been recently modified and the user could use a cached copy). Redirected requests are those with other codes in the 300s. The most common cause of these requests is that the user has incorrectly requested a directory name without the trailing slash. The server replies with a redirection ("you probably mean the following") and the user then makes a second connection to get the correct document (although usually the browser does it automatically without the user's intervention or knowledge). Failed requests are those with codes in the 400s (error in request) or 500s (server error). They come about for a variety of reasons, but the most common are when the requested file is not found or is read-protected.
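The three categories can be written down as a small Python function. It is only a restatement of the rules in the paragraph above, not code from analog, and the function name classify_status is invented.

```python
def classify_status(code):
    """Put an HTTP status code into one of the three categories described above."""
    if 200 <= code <= 299 or code == 304:
        return "completed"    # document returned, or cached copy still usable
    if 300 <= code <= 399:
        return "redirected"   # any other code in the 300s
    if 400 <= code <= 599:
        return "failed"       # error in request (400s) or server error (500s)
    return "unknown"          # anything else is outside the three categories

for code in (200, 304, 302, 404, 500):
    print(code, classify_status(code))
```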
The total data transferred refers only to successful requests, and does not include the message header, only the actual data. The detailed reports also only include successes, except for the referer report and browser reports which include all request types.
Corrupt logfile lines are those we can't understand, and unwanted lines are those that refer to files, hosts or dates that we have specifically excluded.
A host is a computer that has requested something from you. Analog gives the number of distinct (different) hosts that have made a successful request, and the number of distinct files they have requested.
The common logfile format is written by most servers. Its lines look like
m45-6.gps.jussieu.fr - - [14/Mar/1996:17:45:35 +0000] "GET /~sret1/analog/ HTTP/1.0" 200 12435
(except all on one line). Most of the fields are obvious -- the two numbers at the end are the status code and number of bytes transferred.
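For readers who want to pick such a line apart themselves, here is a sketch in Python using a regular expression. It is not how analog reads its logfiles; the field names and the function name parse_common are chosen for this example, and a byte count of - is treated as zero.

```python
import re

# host, ident, user, [date], "request", status, bytes
COMMON_LOG = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_common(line):
    """Split one common-format line into its fields; return None if it does not match."""
    match = COMMON_LOG.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

line = ('m45-6.gps.jussieu.fr - - [14/Mar/1996:17:45:35 +0000] '
        '"GET /~sret1/analog/ HTTP/1.0" 200 12435')
fields = parse_common(line)
print(fields["host"], fields["status"], fields["bytes"])
# prints m45-6.gps.jussieu.fr 200 12435
```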
The other logfiles are not the same on different servers. Analog understands the files written by the NCSA and Apache (and some other) servers. The browser (or agent) log looks like
[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The referer log looks like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
In both of these the date may be omitted. The error log looks like
[Thu Jul 28 20:43:10 1994] httpd: successful restart
Analog also understands the NCSA/Apache combined log. This looks like the common log, except that it has the referer and browser on the end, like this:
lion.statslab.cam.ac.uk - - [18/Jan/1996:12:04:23 +0000] "GET /~sret1/analog/ HTTP/1.0" 200 578 "http://www.statslab.cam.ac.uk/~sret1/" "Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
(except all on one line again).
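The relationship between the combined log and the common log can be shown by splitting a combined line into its common-format part and the two extra quoted fields. This is a sketch in Python under the assumption that neither extra field contains a double quote; the function name split_combined is invented and this is not analog's own code.

```python
import re

# A common-format line followed by "referer" and "browser" in double quotes.
COMBINED_LOG = re.compile(
    r'^(?P<common>\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (?:\d+|-))'
    r' "(?P<referer>[^"]*)" "(?P<browser>[^"]*)"$'
)

def split_combined(line):
    """Return (common part, referer, browser), or None if the line does not match."""
    match = COMBINED_LOG.match(line.strip())
    if match is None:
        return None
    return match.group("common"), match.group("referer"), match.group("browser")

line = ('lion.statslab.cam.ac.uk - - [18/Jan/1996:12:04:23 +0000] '
        '"GET /~sret1/analog/ HTTP/1.0" 200 578 '
        '"http://www.statslab.cam.ac.uk/~sret1/" '
        '"Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"')
common, referer, browser = split_combined(line)
print(referer)  # prints http://www.statslab.cam.ac.uk/~sret1/
print(browser)  # prints Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
```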
Here is a complete list of all 121 configuration commands. For their usage, see the documentation above.
Here is a summary of all 39 commandline arguments. Again, for their usage, see the full documentation. Many of them can be given a - instead of a + to turn something off.
+7      stats for last 7 days
+a      ASCII output
+A      all reports (except hourly report)
+b      browser summary
+B      browser report
+c      status code report
+C      configuration command
+d      daily summary
+D      daily report
+e      error report
+f      referer report
+form   do a form
+F      from
+g      configuration file
-G      default config file off
+help   help message
+h      hourly summary
+H      hourly report
+i      directory report
+k      pagelinks
+l      dirlevel
+m      monthly report
+n      hostname
+o      domain report
+O      outfile
+p      logo
-q      no warnings
+r      request report, pages only
+R      request report, all files
+s      host count
+ss     approximate host count
+S      host report
+T      to
+u      host url
+U      cache file
+v      printvbles
+V      debug level
+w      pagewidth
+W      weekly report
+x      general summary
See also the glossary above.
If we are doing a `top n' report and two entries tie for nth place, only one will be printed.
The reported `running time' is elapsed real time, not CPU time.
You can sort a report by requests even when you have turned off the request columns. This may confuse your readers.
The behaviour of FILEALIAS a b; FILEALIAS b c is undefined.
Do not alias a file to itself (e.g., FILEALIAS /home.html /home.html) or a host to itself, or it will get lost.
Average requests per day don't take account of daylight saving time changes.
At the moment, I do not have as much time as I should like to work on analog, because I am also trying to write my Ph.D. thesis. I still welcome bug reports and feature requests, but new versions will not always come out quickly.
I am happy to help people who have trouble with analog, but please read the FAQ and list of known bugs first. Also, you might be able to diagnose the problem yourself if you run
analog -v [your usual options]
which lists the value of all variables. But if you still can't get it to work, ask me. It helps me find bugs, and to know where the documentation is unclear. When submitting bug reports, please include the version number (which you can find out by the command analog -v).
The following features are already on the list to be done by version 2.0. Let me know if you have any comments on them.
I would also welcome discussion on the following issues.
Thanks are also due to all those who helped in the early stages of writing this program. Those who made helpful suggestions during beta testing are numerous, but I must mention particularly Dan Anderson, Martyn Johnson, Joe Ramey, Chris Ritson, Quentin Stafford-Fraser and Dave Stanworth; and above all Gareth McCaughan for lots of programming advice, particularly in making the code faster.
For recent developments I have to thank Dave Stanworth (again!) for providing the mirror sites for analog, and Mark Roedel for compiling the DOS version.
Page last modified: 10-Apr-96