LOGFILE logfilename
or just put the logfile name on the command line without any arguments, e.g., analog logfilename. A - sign or the word stdin is interpreted as standard input: this is useful on Unix systems for constructing pipes. The word none means that the list of logfiles specified so far is erased. All logfiles must be on your local disk; analog doesn't fetch them from across the network. In the Mac version, you can also analyse a particular single logfile by dragging it onto the analog icon.
You can have several LOGFILE commands. You can include wildcards in the logfile name (but not necessarily in the directory name: this is system-dependent), and you can use a list of logfiles separated by commas (without spaces). So the following commands would tell analog to read logfile1, c:\logs\logfile2, and all files ending in .log:
LOGFILE logfile1,*.log
LOGFILE c:\logs\logfile2
The LOGFILE commands are cumulative, except that any logfiles on the command line or in user-specified configuration files override any in the default configuration file, and are themselves overridden by any in the mandatory configuration file.
The reason for the "sometimes" in the previous paragraph is as follows. The Microsoft and Netpresenz formats are extremely badly designed in that the date can occur in either of the forms date/month/year or month/date/year, and they don't say which they're using. Analog will detect them automatically if it can tell which date format is being used (e.g., 13/2/98 or 2/13/98), but if it can't, it will tell you to use one of the LOGFORMAT strings below. Also, the NCSA browser log can only be detected if it includes the date.
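The date ambiguity described above can be illustrated with a minimal Python sketch. This is not analog's own code: the function name `detect_date_order` and the return values "INT" and "NA" are assumptions made for illustration. It only shows why a value above 12 settles the question while a date such as 2/3/98 does not.

```python
def detect_date_order(dates):
    """Guess the date order from a list of 'a/b/year' strings.

    Returns "INT" for date/month/year, "NA" for month/date/year,
    or None if every sample is ambiguous.
    (Hypothetical helper, not part of analog.)
    """
    for text in dates:
        first, second, _year = (int(part) for part in text.split("/"))
        if first > 12 and second <= 12:
            return "INT"   # first number cannot be a month
        if second > 12 and first <= 12:
            return "NA"    # second number cannot be a month
    return None            # both numbers are at most 12 in every sample


print(detect_date_order(["13/2/98"]))
print(detect_date_order(["2/13/98"]))
print(detect_date_order(["2/3/98", "4/5/98"]))
```

When the result is None, the format has to be stated explicitly, which is what the -NA and -INT format words are for.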
When you start up analog, all logfiles have the default logfile format. This is normally automatic detection, as explained above, but you can change it if your logfiles are always in a format which analog doesn't know about. You do this by means of the command
DEFAULTLOGFORMAT format
We'll discuss what the formats can be in a minute.
Sometimes you might want to analyse several logfiles with different formats. For this you need the LOGFORMAT command. This command only applies to future logfiles in the same configuration file. So if you change the format with a command like
LOGFORMAT format
then any logfiles you select with a LOGFILE command later in the same configuration file will get the new format.
The possible formats for use with the DEFAULTLOGFORMAT and LOGFORMAT commands are of two types. First there are some symbolic words, and then there are log format strings. We'll look at the words first.
There are format words for all the built-in formats analog knows about. For example, COMMON will select common format; you can also have COMBINED, REFERRER, BROWSER, EXTENDED, MICROSOFT-NA (North American date format), MICROSOFT-INT (international date format), NETSCAPE, WEBSTAR, NETPRESENZ-NA (North American) or NETPRESENZ-INT (international). There are also the words AUTO for automatic detection and DEFAULT for whatever the default log format is.
If your logfile is not in one of the recognised formats, you can tell analog about your format using a log format string. You only ever need this if your logfile has lines which are not in one of the standard formats. The format string consists of a template for the logfile line, with the various fields and special characters replaced by codes as follows.
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
can be represented by the LOGFORMAT command
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
including two items, host and file. (The parentheses are needed because the argument contains spaces.)
Logfiles often contain lines in several different formats, so you can specify several log formats one after the other and they will accumulate. For example, the definition of common format should also include the line
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
to handle lines where the HTTP/1.0 part of the request is absent. Or you might use
LOGFORMAT COMMON
LOGFORMAT COMBINED
to represent a logfile which had lines in both those formats. Analog tries to match the line to the first format first, then if that fails the next, and so on, so the order of the formats is important. Usually you want to specify the most common one first, to minimise the time spent trying to match lines to inappropriate formats.
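The first-match rule can be sketched in Python as follows. The regular expressions are simplified, hypothetical stand-ins for two accumulated log formats, not analog's real matcher, and the names `FORMATS` and `first_matching_format` are invented for this example.

```python
import re

# Hypothetical stand-ins for two accumulated formats, in the order given.
FORMATS = [
    ("with-protocol",
     re.compile(r'^(\S+) - (\S+) \[[^\]]*\] "(\S+) (\S+) (\S+)" (\d+) (\d+)$')),
    ("without-protocol",
     re.compile(r'^(\S+) - (\S+) \[[^\]]*\] "(\S+) (\S+)" (\d+) (\d+)$')),
]


def first_matching_format(line):
    """Try each format in order and return the name of the first that matches."""
    for name, pattern in FORMATS:
        if pattern.match(line):
            return name
    return None  # no format matched: the line would be counted as corrupt


full = 'jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243'
short = 'jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/" 200 1243'
print(first_matching_format(full))
print(first_matching_format(short))
```

Every line that only matches the second format costs one failed attempt against the first, which is why the most frequent format is best listed first.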
The log formats which analog can handle are those which are known as instantaneously decipherable: this means that the character which terminates a string can never occur in the string. In the above example, if the hostname ever contained a space, the line would be marked as corrupt, because analog terminates the host at the first space, not at the first occurrence of space-dash-space, and then the rest of the line wouldn't match. Of course, hostnames should never contain spaces, so this shouldn't be a problem. There are a couple of other restrictions. If there is any date or time information, then the year, month, date, hour and minute must all be present. Also, the same information may not occur twice in the format (so you can't have both %m and %M, for example).
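The "terminate at the first occurrence" behaviour can be illustrated with a small Python sketch. It is an assumption-laden toy, not analog's parser: the template representation (a list of literal and field items) and the function name `match_template` are invented here, and the sketch assumes every field is followed by a literal.

```python
def match_template(template, line):
    """Match `line` against a list of ("lit", text) and ("field", name) items.

    A field ends at the FIRST occurrence of the first character of the
    literal that follows it. Returns a dict of fields, or None on mismatch.
    (Toy model: assumes a field is never directly followed by another field.)
    """
    pos = 0
    fields = {}
    for i, (kind, value) in enumerate(template):
        if kind == "lit":
            if not line.startswith(value, pos):
                return None            # literal text is not where it should be
            pos += len(value)
        else:
            if i + 1 < len(template):
                stop_char = template[i + 1][1][0]
                end = line.find(stop_char, pos)
                if end == -1:
                    return None        # terminating character never appears
            else:
                end = len(line)        # last field runs to the end of the line
            fields[value] = line[pos:end]
            pos = end
    return fields if pos == len(line) else None


TEMPLATE = [("field", "host"), ("lit", " - "), ("field", "user"),
            ("lit", " ["), ("field", "date"), ("lit", "]")]

print(match_template(TEMPLATE, "jay.bird.com - fred [14/Mar/1996]"))
print(match_template(TEMPLATE, "jay bird.com - fred [14/Mar/1996]"))
```

In the second call the host is cut at the first space, the literal " - " is then not found at that position, and the whole line fails to match.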
Sometimes you need to read one of the fields in a logfile, but not analyse it. For example, if you have a separate common log and referrer log, the referrer log might look like
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
But the requests for /~sret1/analog/ would already have been counted when reading the main logfile, so you don't want to count them again now. You get round this by specifying a * in that item in the format string, like this:
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
Any of the seven items can be treated in this way.
Here are the exact rules about which logfile gets which log formats. The default logfile format starts off at AUTO. You can change it with a DEFAULTLOGFORMAT command, and then the default format accumulates unless you specify DEFAULTLOGFORMAT AUTO to return to automatic detection.
The current logfile format starts off at DEFAULT. You can change it with a LOGFORMAT command, and then the current format accumulates until a LOGFILE command intervenes; then it restarts at the next LOGFORMAT command. It also restarts if you specify LOGFORMAT AUTO or LOGFORMAT DEFAULT; or when the current format is reset to DEFAULT automatically, which happens at the end of the command line, and of every configuration file, and whenever a LOGFILE none command is encountered.
The default logfile selected at compilation time always gets the default format (although exactly what the default format is can still be changed with a DEFAULTLOGFORMAT command). Any logfile declared later, in a configuration file for example, gets the current log format at the time it is selected. If you specify several logfiles, they will all use the same format, unless there's a LOGFORMAT command or an implicit return to DEFAULT format between them.
LOGFILE log1,log2 http://www.%v.mydomain.com
would translate a filename /file.html with virtual host spam in log1 or log2 to http://www.spam.mydomain.com/file.html. If you are using the second argument to the LOGFILE command, you will probably want to use the SUBDIR command as well.
If %v is included in the argument and the line doesn't have a virtual host, that line will be marked as corrupt. If VHOSTLOWMEM 3 is specified, the %v's will not be translated and will just appear as %v in the output.
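The translation described above can be sketched in Python. This is an illustration only: the function name `translate_filename` and its `low_memory` flag (standing in for VHOSTLOWMEM 3) are hypothetical, and the sketch returns None where analog would mark the line as corrupt.

```python
def translate_filename(filename, vhost, prefix, low_memory=False):
    """Prepend the LOGFILE second argument to a filename.

    `prefix` may contain %v, which is replaced by the line's virtual host.
    Returns None when %v is required but the line has no virtual host.
    `low_memory` is a hypothetical stand-in for VHOSTLOWMEM 3.
    """
    if "%v" in prefix:
        if not vhost:
            return None                      # would be marked as corrupt
        if not low_memory:
            prefix = prefix.replace("%v", vhost)
    return prefix + filename


PREFIX = "http://www.%v.mydomain.com"
print(translate_filename("/file.html", "spam", PREFIX))
print(translate_filename("/file.html", None, PREFIX))
print(translate_filename("/file.html", "spam", PREFIX, low_memory=True))
```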
LOGTIMEOFFSET -300
LOGFILE summer*.log
LOGTIMEOFFSET -360
LOGFILE winter*.log
While we're on the subject of time offsets, there is one other similar command, which is not directly to do with logfiles. You can specify a TIMEOFFSET command to say how much analog should offset the time of the computer on which it is running, to get your local time.
UNCOMPRESS *.gz,*.Z /usr/bin/gzcat
whereas on Windows NT, you might use
UNCOMPRESS *.gz "c:\Program Files\gzip\gzip -cd"
and on VMS, it could be
UNCOMPRESS *.LOG-GZ;* "gunzip -c"
This would be a suitable command to include in the default configuration file.
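The way a filename pattern selects an uncompression command can be sketched in Python. This is a rough model under stated assumptions: the rule table and the function `uncompress_command` are invented for illustration, and Python's `fnmatchcase` wildcard rules may differ from the matching analog actually performs.

```python
from fnmatch import fnmatchcase

# Hypothetical rule table mirroring: UNCOMPRESS *.gz,*.Z /usr/bin/gzcat
UNCOMPRESS_RULES = [
    (["*.gz", "*.Z"], "/usr/bin/gzcat"),
]


def uncompress_command(filename, rules=UNCOMPRESS_RULES):
    """Return the command for the first rule whose pattern matches, else None."""
    for patterns, command in rules:
        if any(fnmatchcase(filename, pattern) for pattern in patterns):
            return command
    return None  # no rule applies: the file is read as it is


print(uncompress_command("access.log.gz"))
print(uncompress_command("access.log"))
```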
If analog determines, when it starts to uncompress a logfile, that the file isn't wanted for the analysis, one of two undesirable things can happen. Either the program might pause until the logfile is fully uncompressed, or there might be a "broken pipe" error reported. This is system dependent, and out of analog's control.
The common logfile format is written by most servers. Its lines look like
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243
Specifying LOGFORMAT COMMON is the same as specifying the three commands
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j" %c %b)
[14/Mar/1996:17:48:10] http://guide-p.infoseek.com/Titles -> /~sret1/analog/
and the browser (or agent) log looks like
[14/Mar/1996:17:45:08] Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)
The respective LOGFORMAT commands are
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %B)
In both of these logfiles the date can be omitted. However, if the date is omitted in the browser log, analog will not be able to detect the log format automatically. (It doesn't contain enough clues, so there is too much danger of confusing other log formats with it; just use LOGFORMAT %B.)
jay.bird.com - fred [14/Mar/1996:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200 1243 "http://www.statslab.cam.ac.uk/" "Mozilla/2.0 (X11; I; HP-UX A.09.05 9000/735)"
except all on one line. If you are using the Apache server, you can generate this with the mod_log_config module, using the command
LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-Agent}i\""
The corresponding LOGFORMAT commands are
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j %r" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j" %c %b "%f" "%B")
It is usually better to use the combined log than separate logs, because it stores more information in less space.
The extended log is described at http://www.w3.org/TR/WD-logfile.html. Its header line looks like
#Fields: date time cs-uri
In the rest of the logfile, the fields can be separated by spaces or tabs. The WebSTAR file has a header line like
!!LOG_FORMAT DATE TIME RESULT URL BYTES_SENT HOSTNAME
In the rest of the logfile, the fields are separated by tabs. Some other Mac servers also use the WebSTAR format, or something looking like it. Analog will understand these too. Finally, the Netscape header line looks like
format=%Ses->client.ip% [%SYSDATE%] "%Req->reqpb.clf-request%" %Req->srvhdrs.clf-status% %Req->srvhdrs.content-length%
Sometimes these three logfile formats can contain header lines which refer to the same item in two different ways. Analog doesn't know which one you want to count, so such header lines will generate a "corrupt format line" warning. You can then use a LOGFORMAT command to specify the format more precisely.
192.64.25.41, -, 21/02/97, 00:03:46, W3SVC, SPIDER, 192.16.225.10, 30, 303, 1455, 200, 0, GET, /siege.htm, -,
(except all on one line) or
LOGFORMAT (%S, %u, %d/%m/%y, %h:%n:%j, W3SVC, %j, %v, %j, %j, %b, %c, %j, %j, %r, %j,)
However, the format is extremely badly designed, in that the date follows local conventions: in other words, in North America the above example would have the date 02/21/97 instead. Analog will diagnose which form the logfile is in if possible: but if both the date and the month are at most 12, there is no way to tell which format it is. In this case, you need to use the LOGFORMAT command MICROSOFT-NA for North American date format, or MICROSOFT-INT for international date format. It may even be that the date is in neither of these formats, in which case you need to use a LOGFORMAT command of your own.
There are also various third-party extensions to the Microsoft format to include, for example, the browser and referrer. Analog can't automatically diagnose these: you need to write a LOGFORMAT string for them.
5:54 pm 14/11/96 134.87.19.110 HTTP get file Research.html Web:Research:Research.html Referer: http://guide-p.infoseek.com/Titles
The fields are separated by tabs. It is equivalent to four LOGFORMAT commands:
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R\nReferer: %f)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%j\t\n%R)
LOGFORMAT (%h:%n %aM\t%m/%d/%y\t%S\tHTTP\t\t%C\t%R)
LOGFORMAT (%j)
Again, the Netpresenz format uses local conventions for the date and time. Analog will diagnose it where it can: otherwise, you will have to use
LOGFORMAT NETPRESENZ-NA # dates like 9:14 AM 3/23/98 (upper case AM)
or
LOGFORMAT NETPRESENZ-INT # dates like 9:14 am 23/3/98 (lower case am)
Again, it can be that the date and time is in neither of these forms, in which case you will have to enter your own LOGFORMAT string.