
Readme for analog2.90beta4

Choosing a logfile

This is a rather long page, so here is a quick summary of the most important points:
The basic command for selecting a logfile is
LOGFILE logfilename
or you can simply put the logfile name on the command line on its own, e.g., analog logfilename. A - sign or the word stdin is interpreted as standard input: this is useful on Unix systems for constructing pipes. The word none means that the list of logfiles specified so far is erased. All logfiles must be on your local disk -- analog doesn't fetch them from across the network. In the Mac version, you can also analyse a single logfile by dragging it onto the analog icon.

You can have several LOGFILE commands. You can include wildcards in the logfile name (but not necessarily in the directory name: this is system-dependent), and you can use a list of logfiles separated by commas (without spaces). So the following commands would tell analog to read logfile1, c:\logs\logfile2, and all files ending in .log:

LOGFILE logfile1,*.log
LOGFILE c:\logs\logfile2
The LOGFILE commands are cumulative, except that any logfiles on the command line or in user-specified configuration files override any in the default configuration file, and are themselves overridden by any in the mandatory configuration file.
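
As an illustration only, here is a minimal Python sketch of how the arguments to successive LOGFILE commands could be combined. It is not analog's own code: the function name add_logfiles, the expand parameter, and the choice of * and ? as the wildcard characters are all assumptions made for the example, and it ignores the rules about which configuration file overrides which.

```python
import glob


def add_logfiles(current, argument, expand=glob.glob):
    """Return the logfile list after one more LOGFILE command.

    current  -- list of logfiles specified so far
    argument -- the argument of the new LOGFILE command
    expand   -- function expanding a wildcard pattern (assumed: glob.glob)
    """
    if argument == "none":
        return []  # 'none' erases the list specified so far
    result = list(current)
    for name in argument.split(","):  # comma-separated, without spaces
        if name in ("-", "stdin"):
            result.append("stdin")  # both spellings mean standard input
        elif "*" in name or "?" in name:
            result.extend(sorted(expand(name)))  # wildcard: add every match
        else:
            result.append(name)
    return result
```

Calling add_logfiles([], "logfile1,*.log") and then passing the result back in with "c:\logs\logfile2" mirrors the two commands above.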
Analog knows about several different types of logfile, and will attempt to see if your logfile is of one of the types it knows about, based on the first line. (Note: if the first line of your logfile is corrupt, or if your logfile has lines in different formats, you'll have to tell analog the logfile type yourself). The types it can diagnose are the common log format, the NCSA combined format, referrer log and browser log, the W3 extended log format, the Microsoft IIS format (sometimes), the Netscape format, the WebSTAR format, and the Netpresenz format (sometimes). Examples of all these formats are given at the end of this page. If you have debugging on, analog will report what type of logfile it thinks yours is.

The reason for the "sometimes" in the previous paragraph is as follows. The Microsoft and Netpresenz formats are extremely badly designed in that the date can occur in either of the forms day/month/year or month/day/year, and they don't say which they're using. Analog will detect them automatically if it can tell which date format is being used (e.g., 13/2/98 or 2/13/98), but if it can't, it will tell you to use one of the LOGFORMAT strings below. Also, the NCSA browser log can only be detected if it includes the date.
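
To make the ambiguity concrete, here is a small Python sketch of the kind of test that can tell the two date orders apart. The function name and the return values "INT" and "NA" are made up for this example; it shows the idea, not analog's actual detection code.

```python
def guess_date_order(date_text):
    """Guess the order of a date written as a/b/yy.

    Returns "INT" for day/month/year, "NA" for month/day/year,
    or None when both numbers are at most 12 and so cannot be told apart.
    """
    first, second, _year = (int(part) for part in date_text.split("/"))
    if first > 12 and second <= 12:
        return "INT"  # the first number cannot be a month
    if second > 12 and first <= 12:
        return "NA"   # the second number cannot be a month
    return None       # ambiguous (or not a valid date at all)
```

A date such as 3/4/98 gives None, which is the case where you have to specify the format yourself.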


You can also specify a different type of logfile, using the LOGFORMAT command. If all your logfiles are of formats that analog can diagnose, you need never use this command. If you change the log format, the change only applies to future logfiles in the same configuration file. This means that different logfiles can have different formats: but it also means that if you want to change the format, you must declare the logfiles after the LOGFORMAT command, and within the same configuration file.

There are two types of argument to the LOGFORMAT command: either you can specify a symbolic word, or a log format string. We'll look at the words first.

The command

LOGFORMAT COMMON
will select common format; you can replace COMMON with COMBINED, REFERRER, BROWSER, EXTENDED, MICROSOFT-NA (North American date format), MICROSOFT-INT (international date format), NETSCAPE, WEBSTAR, NETPRESENZ-NA (North American) or NETPRESENZ-INT (international) to get one of the above types of logfile. The command
LOGFORMAT AUTO
will return to automatic detection. The command LOGFILE none also returns the log format to AUTO.

If your logfile is not in one of the recognised formats, you can tell analog about your format using a log format string. You only ever need this if your logfile has lines which are not in one of the standard formats. The format string consists of a template for the logfile line, with the various fields and special characters replaced by codes as follows.

%S    host (computer making the request)
%r    file requested
%R    Mac-style filename, with colons instead of slashes
%B    browser
%f    referrer (URL referring to the file)
%u    user (tip: a cookie can usefully be defined as %u too)
%v    virtual host
%d    day of the month
%m    month in digits
%M    month, three letter abbreviation
%y    year, last two digits
%Y    year, four digits
%h    hour of the day
%n    minute of the hour
%a    a for am or p for pm (if %h is 12-hour clock)
%b    number of bytes transferred
%c    HTTP status code
%C    Special code, specific to particular servers
%q    query string (part of filename after ?, if recorded in a separate field)
%j    junk: ignore this field
%w    white space: spaces or tabs
%W    optional white space
%%    % sign
\n    new line
\t    tab stop
\\    single backslash
(I shall refer to the first seven things above as items.) So for example, the common log format, which looks like
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
can be represented by the LOGFORMAT command
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
including three items: host, user and file. (The parentheses are needed because the argument contains spaces.)

Logfiles often contain lines in several different formats, so you can specify several log formats for one file. For example, the definition of common format should also include the line

LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
to handle lines where the HTTP/1.0 part of the request is absent. Or you might use
LOGFORMAT COMMON
LOGFORMAT COMBINED
to represent a logfile which had lines in both those formats. Analog tries to match the line to the first format first, then if that fails the next, and so on, so the order of the formats is important. Usually you want to specify the most common one first, to minimise the time spent trying to match lines to inappropriate formats.

The log formats which analog can handle are those which are known as instantaneously decipherable: this means that the character which terminates a string can never occur in the string. In the above example, if the hostname ever contained a space, the line would be marked as corrupt, because analog terminates the host at the first space, not at the first occurrence of space-dash-space, and then the rest of the line wouldn't match. Of course, hostnames should never contain spaces, so this shouldn't be a problem. There are a couple of other restrictions: if there is any date or time information, then the year, month, day, hour and minute must all be present; and the same information may not occur twice in the format (so you can't have both %m and %M, for example).
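
The termination rule, and the way formats are tried in order, can be sketched in Python as follows. This is an illustration under stated assumptions, not analog's parser: it handles only a subset of the codes (no %*, %w, %W or backslash escapes), the field names in FIELD_NAMES are invented for the example, and compile_logformat and match_line are hypothetical helpers.

```python
import re

# Invented names for a subset of the codes; None marks %j (junk).
FIELD_NAMES = {
    "S": "host", "r": "file", "u": "user", "f": "referrer", "B": "browser",
    "d": "day", "M": "month", "Y": "year", "h": "hour", "n": "minute",
    "b": "bytes", "c": "status", "j": None,
}


def compile_logformat(template):
    """Translate a LOGFORMAT template into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch != "%":
            parts.append(re.escape(ch))  # literal character in the template
            i += 1
            continue
        code = template[i + 1]
        i += 2
        if code == "%":
            parts.append("%")  # %% stands for a literal % sign
            continue
        # A field runs up to the first occurrence of the literal character
        # that follows it in the template, or to end of line if it is last.
        if i < len(template):
            if template[i] == "%":
                raise ValueError("adjacent fields have no terminator")
            body = "[^" + re.escape(template[i]) + "]*"
        else:
            body = ".*"
        name = FIELD_NAMES[code]
        if name is None:
            parts.append("(?:" + body + ")")  # junk: matched but not kept
        else:
            parts.append("(?P<" + name + ">" + body + ")")
    return re.compile("".join(parts))


def match_line(line, formats):
    """Try each compiled format in order; return (index, fields) or None."""
    for index, pattern in enumerate(formats):
        found = pattern.fullmatch(line)
        if found:
            return index, found.groupdict()
    return None  # no format matched: the line would count as corrupt
```

Compiling the two common-format templates given earlier and passing them to match_line in that order matches the full request line with the first, and a line lacking the HTTP/1.0 part with the second. A hostname containing a space matches neither, because the host field stops at the first space.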

Sometimes you need to read one of the fields in a logfile, but not analyse it. For example, if you have a separate common log and referrer log, the referrer log might look like

[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
But the requests for /~sret1/analog/ would already have been counted when reading the main logfile, so you don't want to count them again now. You get round this by specifying a * in that item in the format string, like this:
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
Any of the seven items can be treated in this way.

Here are the exact rules about which logfile gets which log formats. Log formats accumulate until a LOGFILE command intervenes; or until you specify LOGFORMAT AUTO to return to automatic detection; or until a LOGFILE none command or the end of the command line or of a configuration file, when the format is reset to AUTO implicitly. Conversely, if you specify several logfiles, they will all use the same formats, unless there's another LOGFORMAT command or an implicit return to AUTO format between them.
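
Here is one reading of these rules as a short Python sketch, for a single configuration file. The function assign_formats and the representation of commands as (name, argument) pairs are assumptions for the example, and the sketch only covers the symbolic format words.

```python
def assign_formats(commands):
    """Work out which formats each logfile in one configuration file gets.

    commands -- list of (name, argument) pairs, in the order they appear
    Returns a list of (logfile, formats) pairs.
    """
    formats = []          # the empty list stands for AUTO
    logfile_seen = False  # has a LOGFILE intervened since the last LOGFORMAT?
    assigned = []
    for name, argument in commands:
        if name == "LOGFORMAT":
            if argument == "AUTO":
                formats = []                    # explicit return to AUTO
            elif logfile_seen:
                formats = [argument]            # start a new list of formats
            else:
                formats = formats + [argument]  # formats accumulate
            logfile_seen = False
        elif name == "LOGFILE":
            if argument == "none":
                assigned = []                   # erase the logfiles so far
                formats = []                    # and return to AUTO
            else:
                for logfile in argument.split(","):
                    assigned.append((logfile, list(formats) or ["AUTO"]))
            logfile_seen = True
    return assigned
```

For example, two LOGFORMAT commands followed by two LOGFILE commands give every one of those logfiles both formats, while a LOGFORMAT placed after them affects only the logfiles that follow it.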


There's also a second argument to the LOGFILE command, which specifies a prefix to add to all the filenames in that logfile. This is useful if you've got several different servers or virtual hosts, when the same filename may occur on each of the servers. The argument can contain a %v, and the name of the virtual host will then be inserted at that point. For example,
LOGFILE log1,log2 http://www.%v.mydomain.com
would translate a filename /file.html with virtual host spam in log1 or log2 to http://www.spam.mydomain.com/file.html. If you are using the second argument to the LOGFILE command, you will probably want to use the SUBDIR command as well.

If %v is included in the argument and the line doesn't have a virtual host, that line will be marked as corrupt. If VHOSTLOWMEM 3 is specified, the %v's will not be translated and will just appear as %v in the output.
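
The prefix substitution can be pictured with the following Python sketch. The function apply_prefix is hypothetical, returning None is just this example's way of saying "mark the line as corrupt", and the VHOSTLOWMEM 3 case is not modelled.

```python
def apply_prefix(prefix, filename, vhost=None):
    """Add the LOGFILE prefix to a filename, inserting the virtual host.

    Returns None when the prefix needs a virtual host but the logfile
    line did not record one (such a line would be marked as corrupt).
    """
    if "%v" in prefix:
        if vhost is None:
            return None
        prefix = prefix.replace("%v", vhost)
    return prefix + filename
```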


There is one other command which applies to individual logfiles, in a similar way to LOGFORMAT. Sometimes your server is not (or believes it is not) in the same timezone as you. So that you can give your statistics in your local time, there is a command LOGTIMEOFFSET to change the time by a certain number of minutes. You have to be careful using this. Because daylight saving time operates in different parts of the world at different times, analog cannot attempt to convert between timezones itself. So it's your responsibility to set the right offset for different times of year. For example, if you were in Chicago, but your server was recording time in GMT, you would need to specify two different time offsets, one of minus five hours for summer and one of minus six hours for winter. You would need to split your logfiles in the right places and then run commands like
LOGTIMEOFFSET -300
LOGFILE summer*.log
LOGTIMEOFFSET -360
LOGFILE winter*.log
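
The arithmetic is simply the addition of a fixed number of minutes to each logged time. As a Python illustration (the function name apply_offset is made up for the example):

```python
from datetime import datetime, timedelta


def apply_offset(logged_time, offset_minutes):
    """Shift a logged time by a LOGTIMEOFFSET-style number of minutes."""
    return logged_time + timedelta(minutes=offset_minutes)


# A request logged at 03:00 GMT on 15 January is shifted by -360 minutes,
# which moves it into the evening of the previous day in Chicago.
winter_local = apply_offset(datetime(1996, 1, 15, 3, 0), -360)
```

Note that a negative offset can move a request into the previous day, as in the winter example.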

While we're on the subject of time offsets, there is one other similar command, which is not directly to do with logfiles. You can specify a TIMEOFFSET command to say how much analog should offset the time of the computer on which it is running, to get your local time.


It is often convenient to store logfiles compressed to save disk space. Analog on the Mac can read logfiles compressed using gzip. And analog on Unix, Win32, and VMS 7.0 and above can read compressed logfiles provided that you use an UNCOMPRESS command to say how to uncompress them. You need to supply the types of file that you want to uncompress in a comma-separated list, together with the name of a command that will uncompress the files to standard output (rather than to a file). For example, on Unix you might use
UNCOMPRESS *.gz,*.Z  /usr/bin/gzcat
whereas on Windows NT, you might use
UNCOMPRESS *.gz "c:\Program Files\gzip\gzip -cd"
and on VMS, it could be
UNCOMPRESS *.LOG-GZ;*  "gunzip -c"
This would be a suitable command to include in the default configuration file.
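
To show how a filename might be paired with its uncompress command, here is a Python sketch using shell-style pattern matching. It is only a guess at the behaviour: the function find_uncompressor is hypothetical, and the assumption that the first matching UNCOMPRESS command wins is mine, not something stated above.

```python
from fnmatch import fnmatchcase


def find_uncompressor(filename, rules):
    """Return the uncompress command for a filename, or None.

    rules -- list of (patterns, command) pairs in the order given, where
             patterns is a comma-separated list such as "*.gz,*.Z"
    """
    for patterns, command in rules:
        if any(fnmatchcase(filename, p) for p in patterns.split(",")):
            return command  # assumed: the first matching rule is used
    return None  # no rule matched: read the file as it is
```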

If analog determines, when it starts to uncompress a logfile, that the file isn't wanted for the analysis, one of two undesirable things can happen. Either the program might pause until the logfile is fully uncompressed, or there might be a "broken pipe" error reported. This is system dependent, and out of analog's control.


Appendix: logfile formats

Here is a summary of the various logfile formats which analog knows about.

The common logfile format is written by most servers. Its lines look like

jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
Specifying LOGFORMAT COMMON is the same as specifying the three commands
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j" %c %b)

The NCSA referrer log looks like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
and the browser (or agent) log looks like
[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The respective LOGFORMAT commands are
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %B)
In both of these logfiles the date can be omitted. However, if the date is omitted in the browser log, analog will not be able to detect the log format automatically. (It doesn't contain enough clues, so there is too much danger of confusing other log formats with it; just use "LOGFORMAT %B").
The NCSA combined log is the same as the common log, except that it has the referrer and browser on the end in quotes, like this:
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
"http://www.statslab.cam.ac.uk/" "Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
except all on one line. If you are using the Apache server, you can generate this with the mod_log_config module, using the command
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\""
The corresponding LOGFORMAT commands are
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j" %c %b "%f" "%B")
It is usually better to use the combined log than separate logs, because it stores more information in less space.
The W3 extended log, the Netscape log, and the WebSTAR log can be recognised because they must include at or near the top a line telling analog what to expect on subsequent lines. Analog constructs a LOGFORMAT template based on this header line. (They may also contain later lines changing the format).

The extended log is described at http://www.w3.org/TR/WD-logfile.html. Its header line looks like

#Fields: date time cs-uri
In the rest of the logfile, the fields can be separated by spaces or tabs. The WebSTAR file has a header line like
!!LOG_FORMAT DATE TIME RESULT URL BYTES_SENT HOSTNAME
In the rest of the logfile, the fields are separated by tabs. Some other Mac servers also use the WebSTAR format, or something looking like it. Analog will understand these too. Finally, the Netscape header line looks like
format=%Ses->client.ip% [%SYSDATE%] "%Req->reqpb.clf-request%"
%Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%

Sometimes these three logfile formats can contain header lines which refer to the same item in two different ways. Analog doesn't know which one you want to count, so such header lines will generate a "corrupt format line" warning. You can then use a LOGFORMAT command to specify the format more precisely.
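
As a rough picture of how a header line can be turned into a template, here is a Python sketch for the extended log's #Fields line. Everything in the FIELD_CODES table is an assumption made for the example (in particular the date and time layouts, and that seconds are always present); it is not the mapping analog really uses, and unknown fields are simply treated as junk.

```python
# Assumed mapping from a few extended-log field names to (item, code).
FIELD_CODES = {
    "date": ("date", "%Y-%m-%d"),
    "time": ("time", "%h:%n:%j"),
    "c-ip": ("host", "%S"),
    "c-dns": ("host", "%S"),
    "cs-uri": ("file", "%r"),
    "cs-uri-stem": ("file", "%r"),
    "sc-status": ("status", "%c"),
}


def template_from_fields(header):
    """Build a LOGFORMAT-style template from a '#Fields:' header line."""
    if not header.startswith("#Fields:"):
        raise ValueError("not a #Fields line")
    seen = set()
    codes = []
    for field in header[len("#Fields:"):].split():
        item, code = FIELD_CODES.get(field, (None, "%j"))  # unknown: junk
        if item is not None:
            if item in seen:
                # The same item given in two different ways.
                raise ValueError("corrupt format line: %s given twice" % item)
            seen.add(item)
        codes.append(code)
    return "%w".join(codes)  # fields are separated by spaces or tabs
```

A header naming both c-ip and c-dns, for instance, refers to the host in two ways and raises the error here, which corresponds to the warning described above.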


The Microsoft IIS logfile looks like
192.64.25.41, -, 21/02/97, 00:03:46, W3SVC, SPIDER, 192.16.225.10,
30, 303, 1455, 200, 0, GET, /siege.htm, -,
(except all on one line), which corresponds to the LOGFORMAT command
LOGFORMAT (%S, %u, %d/%m/%y, %h:%n:%j, W3SVC, %j, %v, %j, %j, %b, %c, %j, %j, %r, %j,)
However, the format is extremely badly designed, in that the date follows local conventions: in other words, in North America the above example would have the date 02/21/97 instead. Analog will diagnose which form the logfile is in if possible: but if both the day and the month are at most 12, there is no way to tell which format it is. In this case, you need to use the LOGFORMAT command MICROSOFT-NA for North American date format, or MICROSOFT-INT for international date format.
The Netpresenz logfile is unusual in that each entry can spread over several lines. It looks like
5:54 pm  14/11/96  134.87.19.110  HTTP    get file  Research.html
Web:Research:Research.html
Referer: http://guide-p.infoseek.com/Titles
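The fields are separated by tabs. The format is equivalent to four LOGFORMAT commands, shown here for the North American form (upper case AM or PM, month before day); the example above is in the international form: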
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R\nReferer: %f)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%R)
LOGFORMAT (%j)
Again, the Netpresenz format uses local conventions for the date and time. Analog will diagnose it where it can: otherwise, you will have to use
LOGFORMAT NETPRESENZ-NA    # dates like 9:14 AM  3/23/98 (upper case AM)
or
LOGFORMAT NETPRESENZ-INT   # dates like 9:14 am  23/3/98 (lower case am)
It can even be that the date and time is in neither of these forms, in which case you will have to enter your own LOGFORMAT string.
Stephen Turner
E-mail: sret1@cam.ac.uk
