
Readme for analog 4.04

Specifying a log format

This section is about how to tell analog the format of your logfile. I'll assume that you've read the previous section, and have decided that you need to specify the log format explicitly, because analog can't detect the format of your logfile itself for some reason.

The basic command to specify a log format looks like

LOGFORMAT format
-- we'll discuss what the formats can be in a minute. Or if you are using the Apache server, you will probably find it more convenient to use
APACHELOGFORMAT format
instead.

The LOGFORMAT and APACHELOGFORMAT commands only apply to logfiles specified with a LOGFILE command later in the same configuration file. So you must put the LOGFORMAT above the LOGFILE to which it refers. This way, different logfiles can have different formats, like this:

LOGFILE log0
LOGFORMAT format1
LOGFILE log1
LOGFORMAT format2
LOGFILE log2
LOGFILE log3
In this example, log1 is in format1, log2 and log3 are in format2, and log0 isn't in either format -- analog will try and detect which format it's in.

The APACHELOGFORMAT command is followed by the LogFormat from your Apache httpd.conf file. For example, common format could be represented by
APACHELOGFORMAT (%h %l %u %t \"%r\" %s %b)
(The parentheses are needed because the argument contains spaces.) Analog understands all Apache log formats, with the exception that it won't parse Apache's "%...{format}t" construction for customised times: if you have this construction, you will have to use ordinary LOGFORMAT instead.

The possible formats for use with the LOGFORMAT command are of two types. First there are some symbolic words, and then there are log format strings. We'll look at the words first.

There are format words for all the built-in formats analog knows about. You might need one of these words if your logfile is in a standard format, but analog can't detect which format it's in for some reason; for example, maybe the first line is corrupt; or maybe analog can't tell whether you're using North American or international dates. So for example

LOGFORMAT COMMON
will select common format; you can also have COMBINED, REFERRER, BROWSER, EXTENDED, MICROSOFT-NA (North American date format), MICROSOFT-INT (international date format), WEBSITE-NA, WEBSITE-INT, MS-EXTENDED (Microsoft's attempt at extended format), MS-COMMON (a buggy version of common format in some versions of Microsoft software), NETSCAPE or WEBSTAR. All these formats were defined at the end of the previous section. You can also use the special word AUTO to return to automatic detection.

If your logfile is not in one of the recognised formats, you can tell analog about your format using a log format string. You only ever need this if your logfile has lines which are not in one of the standard formats. (And even if it isn't in a standard format, if you're using the Apache web server, you will find APACHELOGFORMAT easier.)

The format string consists of a template for the logfile line, with the various fields and special characters replaced by codes as follows. Please note that these codes are case sensitive -- for example, %b is completely different from %B!

%S    host (computer making the request)
%r    file requested
%B    browser
%A    browser with +'s instead of spaces
%f    referrer (URL referring to the file)
%u    user (tip: a cookie can usefully be defined as %u too)
%v    virtual host (also called virtual domain)
%d    day of the month
%m    month in digits
%M    month, three letter English abbreviation
%y    year, last two digits
%Y    year, four digits
%h    hour of the day
%n    minute of the hour
%a    a or A for am, or p or P for pm, if %h is in the 12-hour clock. (So to match "am" you need %am and to match "AM" you need %aM)
%U    "Unix time" (seconds since beginning of 1970, GMT)
%b    number of bytes transferred
%t    processing time in seconds
%T    processing time in milliseconds
%c    HTTP status code
%q    query string (part of filename after ?, if recorded in a separate field)
%j    junk: ignore this field (field can be empty too)
%w    white space: spaces or tabs
%W    optional white space
%%    % sign
\n    new line
\t    tab stop
\\    single backslash

So for example, the common log format, which looks like
jay.bird.com - fred [25/Dec/1998:17:45:35 +0000]
"GET /~sret1/ HTTP/1.0" 200 1243
(except all on one line) could be represented by the LOGFORMAT command
LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j %j] "%j %r %j" %c %b)
In other words, it's just the sample line but with the hostname replaced by %S, the username by %u etc. (The parentheses are needed because the argument contains spaces.) Or take another example: if you had lines which looked like
Fri 25/12/98 5:45pm, /~sret1/, jay.bird.com, 200, 1243,
http://www.site.com, Mozilla/2.0 (X11; I; HP-UX A.09.05)
(all on one line again), you could use the format
LOGFORMAT (%j %d/%m/%y %h:%n%am, %r, %S, %c, %b, %f, %B)
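
To make the idea of a template concrete, here is a rough Python sketch of how a format string of this kind could be turned into a matcher. It is only an illustration and not analog's own code: the names compile_format and parse_line, the field labels, and the use of regular expressions are all my assumptions, and it handles only a handful of the codes listed above. The one rule it tries to capture is that a text field runs up to the next literal character in the template.

```python
import re

# Illustrative subset of the codes above; the labels are hypothetical names.
NUMERIC = {"d": "day", "m": "month", "y": "year2", "Y": "year4",
           "h": "hour", "n": "minute", "b": "bytes", "c": "status"}
TEXT = {"S": "host", "r": "file", "u": "user", "f": "referrer", "B": "browser"}


def compile_format(fmt):
    """Translate a format string into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            parts.append(re.escape(fmt[i]))  # literal character from the template
            i += 1
            continue
        code = fmt[i + 1]
        i += 2
        if code == "%":
            parts.append("%")
        elif code in NUMERIC:
            parts.append(f"(?P<{NUMERIC[code]}>[0-9]+)")
        elif code == "M":
            parts.append("(?P<month_name>[A-Za-z]{3})")
        elif code == "a":
            parts.append("(?P<ampm>[aApP])")
        elif code in TEXT or code == "j":
            if i < len(fmt) and fmt[i] == "%":
                raise ValueError("this sketch needs a literal character after a text field")
            # A text field stops at the next literal character in the template,
            # or runs to the end of the line if it is the last item.
            if i < len(fmt):
                body = "[^" + re.escape(fmt[i]) + "]*"
            else:
                body = ".*"
            # %j is matched but not captured; other text fields are named groups.
            parts.append(body if code == "j" else f"(?P<{TEXT[code]}>{body})")
        else:
            raise ValueError(f"code %{code} is not handled by this sketch")
    return re.compile("".join(parts))


def parse_line(fmt, line):
    """Return the named fields of line, or None if the whole line doesn't match."""
    match = compile_format(fmt).fullmatch(line)
    return match.groupdict() if match else None


fields = parse_line('%S - %u [%d/%M/%Y:%h:%n:%j %j] "%j %r %j" %c %b',
                    'jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] '
                    '"GET /~sret1/ HTTP/1.0" 200 1243')
print(fields["host"], fields["file"], fields["status"])
```

In this sketch a line that doesn't fit the template simply gives None, which corresponds roughly to a line being counted as corrupt.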

A logfile can sometimes have lines in several different formats. So you can specify several LOGFORMAT commands in a row, and they will all apply to the next logfile. This is also useful if the format of your logfile changes half way through. So in this example:
LOGFORMAT COMMON
LOGFORMAT COMBINED
LOGFILE log1
LOGFORMAT (%j %d/%m/%y %h:%n%am, %r, %S, %c, %b, %f, %B)
LOGFILE log2
LOGFILE log3
log1 has lines in both common and combined format, whereas log2 and log3 have lines just in the format in the previous example.

If you specify several formats, analog tries to match each line to the first format first, then if that fails the next, and so on, so the order of the formats is important. Usually you want to specify the most common one first, to minimise the time spent trying to match lines to inappropriate formats.
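
The effect of the ordering can be seen in a small Python sketch. The two patterns below are simplified stand-ins that I've made up for the purpose (they are not analog's built-in definitions), and classify is a hypothetical helper; the point is only that each line is tried against the formats in order, so putting the most frequent format first means fewer wasted attempts.

```python
import re

# Stand-in matchers (assumption): simplified patterns, not analog's definitions.
COMMON = re.compile(r'\S+ \S+ \S+ \[[^\]]*\] "[^"]*" \d+ \d+')
COMBINED = re.compile(r'\S+ \S+ \S+ \[[^\]]*\] "[^"]*" \d+ \d+ "[^"]*" "[^"]*"')


def classify(line, formats):
    """Try the formats in order; return (name of first match, attempts made)."""
    for attempts, (name, pattern) in enumerate(formats, start=1):
        if pattern.fullmatch(line):
            return name, attempts
    return None, len(formats)  # no format matched: the line counts as corrupt


common_line = ('jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] '
               '"GET /~sret1/ HTTP/1.0" 200 1243')
combined_line = (common_line +
                 ' "http://www.site.com/" "Mozilla/2.0 (X11; I; HP-UX A.09.05)"')

# A made-up logfile in which common-format lines are the majority.
log = [common_line, common_line, common_line, combined_line]


def total_attempts(formats):
    """Count how many match attempts the whole log needs with this ordering."""
    return sum(classify(line, formats)[1] for line in log)


common_first = [("COMMON", COMMON), ("COMBINED", COMBINED)]
combined_first = [("COMBINED", COMBINED), ("COMMON", COMMON)]
print(total_attempts(common_first), total_attempts(combined_first))
```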


I suggested above that any logfile which doesn't have a LOGFORMAT command earlier in the same configuration file is auto-detected. But this isn't quite true. Actually such logfiles get a special format called the default log format. The default format starts off as auto-detection, but you can change it if you want with the DEFAULTLOGFORMAT command. This command works exactly the same as the LOGFORMAT command -- it understands the same formats, and if you have several DEFAULTLOGFORMAT commands, they accumulate in the same way. The difference is that they don't need to be put in any particular place. (There is also APACHEDEFAULTLOGFORMAT, which has the same effect but uses the Apache LogFormat strings.)

So let's go back to the first example:

LOGFILE log0
LOGFORMAT format1
LOGFILE log1
LOGFORMAT format2
LOGFILE log2
LOGFILE log3
Here log0 actually gets the default log format. If there are no DEFAULTLOGFORMAT commands, the default will be auto-detection. But if there are DEFAULTLOGFORMAT commands, even in another configuration file, that will be the format of log0.
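
The rules described so far can be summarised in a short Python sketch. It is my own reading of the text above rather than analog's implementation: resolve_formats is a hypothetical function, the commands are given as simple (name, argument) pairs, and I have assumed that every DEFAULTLOGFORMAT command contributes to the default wherever it appears.

```python
def resolve_formats(commands):
    """Map each logfile to the list of formats that would apply to it.

    commands is a list of (name, argument) pairs in configuration order.
    """
    # Assumption: all DEFAULTLOGFORMAT commands accumulate, regardless of
    # position; with none given, the default is auto-detection.
    defaults = [arg for name, arg in commands if name == "DEFAULTLOGFORMAT"]
    if not defaults:
        defaults = ["AUTO"]

    result = {}
    current = None    # formats from LOGFORMAT commands; None until the first one
    start_new = True  # the next LOGFORMAT begins a fresh list
    for name, arg in commands:
        if name == "LOGFORMAT":
            if start_new:
                current = []
                start_new = False
            current.append(arg)  # consecutive LOGFORMATs accumulate
        elif name == "LOGFILE":
            result[arg] = list(defaults if current is None else current)
            start_new = True     # a later LOGFORMAT replaces the current list
    return result


example = [
    ("LOGFILE", "log0"),
    ("LOGFORMAT", "format1"),
    ("LOGFILE", "log1"),
    ("LOGFORMAT", "format2"),
    ("LOGFILE", "log2"),
    ("LOGFILE", "log3"),
]
for logfile, formats in resolve_formats(example).items():
    print(logfile, formats)
```

Adding a ("DEFAULTLOGFORMAT", "COMMON") pair anywhere in the list changes only the entry for log0.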

You need DEFAULTLOGFORMAT rather than LOGFORMAT when you want to change the format of logfiles which aren't given in a LOGFILE command -- for example, ones specified on the command line, or dragged onto the program icon on a Mac, or compiled in. DEFAULTLOGFORMAT is also useful if your logfiles are always in the same format, so that you don't have to worry about putting in enough LOGFORMATs in the right places.


A couple more technical details and tips about LOGFORMAT commands.

The "Unix time", %U, is always recorded in GMT. So you will probably need to use a LOGTIMEOFFSET command to convert to your local timezone. Also, it's just the integer part of the time, so if you have decimals you will have to use %U.%j.
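
As an illustration of what that means for a single field, here is a small Python sketch. The function name and the offset_minutes parameter are mine, and expressing the offset in minutes is just a choice made for the sketch; it is not a statement about the arguments LOGTIMEOFFSET accepts.

```python
from datetime import datetime, timedelta, timezone


def local_time_from_unix_field(field, offset_minutes=0):
    """Convert a Unix-time field to a local time, discarding any decimals."""
    # Keep only the integer part, as %U.%j would.
    seconds = int(field.split(".", 1)[0])
    gmt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # Shift from GMT to local time and return a plain (naive) datetime.
    return (gmt + timedelta(minutes=offset_minutes)).replace(tzinfo=None)


print(local_time_from_unix_field("914607935.250"))
print(local_time_from_unix_field("914607935.250", offset_minutes=60))
```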

The log formats which analog can handle are those which are known as instantaneously decipherable: in practice, this means that the character which terminates a string can never occur in the string. So for example, in common format, which looks like

LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j %j] "%j %r %j" %c %b)
if the hostname ever contained a space, the line would be marked as corrupt, because analog terminates the host at the first space, not at the first occurrence of space-dash-space, and then the rest of the line wouldn't match. Of course, hostnames should never contain spaces, so this shouldn't be a problem.

There are a couple of other restrictions. If there is any date or time information, then the year, month, day, hour and minute must all be present. And the same information may not occur twice in the format (so you can't have both %m and %M, for example, because these both represent the month; make one of them a %j to have it ignored).
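
The date and time restrictions are easy to check mechanically before trying a format string out. The following Python sketch is a hypothetical helper of my own (check_format is not part of analog); it looks only at the date and time codes, and leaves %U and all the other fields out of consideration.

```python
import re

# Codes that carry the same piece of date or time information.
GROUPS = {
    "year": {"y", "Y"},
    "month": {"m", "M"},
    "day": {"d"},
    "hour": {"h"},
    "minute": {"n"},
}


def check_format(fmt):
    """Return a list of problems with the date and time codes in fmt."""
    # Pick out the letter after each %, skipping an optional * as in %*r.
    codes = re.findall(r"%\*?(.)", fmt)
    problems = []
    counts = {}
    for name, members in GROUPS.items():
        counts[name] = sum(1 for code in codes if code in members)
        if counts[name] > 1:
            problems.append(f"{name} is given {counts[name]} times")
    # Either no date/time information at all, or all five parts.
    if any(counts.values()) and not all(counts.values()):
        missing = [name for name, count in counts.items() if count == 0]
        problems.append("missing date/time parts: " + ", ".join(missing))
    return problems


print(check_format('%S - %u [%d/%M/%Y:%h:%n:%j %j] "%j %r %j" %c %b'))
print(check_format("%d/%m/%M/%y %h:%n %r"))
print(check_format("%d/%m/%y %r"))
```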

Sometimes you need to read one of the fields in a logfile, but not analyse it. For example, if you have a separate common log and referrer log, the referrer log might look like

http://guide-p.infoseek.com/Titles -> /~sret1/analog/
But the requests for /~sret1/analog/ would already have been counted when reading the main logfile, so you don't want to count them again now. You get round this by specifying a * in that item in the format string, like this:
LOGFORMAT (%f -> %*r)

A tip: sometimes it is more efficient to specify two or more adjacent fields to ignore with a single %j, as long as the whole group ends with a recognisable character. So common format is more efficiently specified as

LOGFORMAT (%S - %u [%d/%M/%Y:%h:%n:%j] "%j %r %j" %c %b)
-- in the date and time [25/Dec/1998:17:45:35 +0000], the seconds and the timezone can be ignored with a single %j, extending until the close-bracket.

Another tip: %j can also be used to ignore whole lines, rather than just fields analog doesn't use. For example, the extended log format ignores lines beginning with # by using

LOGFORMAT #%j
and the Microsoft format ignores lines corresponding to FTP requests with
LOGFORMAT (%*S, %*u, %m/%d/%y, %h:%n:%j, %j)
If those formats had not been used, the lines would have been incorrectly marked as corrupt.

Finally, both for reference and as examples, here is a list of all the fixed formats that analog understands, together with the example lines from the previous section and their built-in definitions (split over two lines where necessary).
Common format, LOGFORMAT COMMON
jay.bird.com - fred [25/Dec/1998:17:45:35 +0000]
      "GET /~sret1/ HTTP/1.0" 200 1243
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r%wHTTP%j" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r" %c %b)
Microsoft common format, LOGFORMAT MS-COMMON
jay.bird.com - fred [25/Dec/1998:17:45:35 +0000]
      "GET /~sret1/ "HTTP/1.0" 200 1243
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r%w"HTTP%j" %c %b)
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r" %c %b)
Combined log, LOGFORMAT COMBINED
jay.bird.com - fred [25/Dec/1998:17:45:35 +0000] "GET /~sret1/ HTTP/1.0" 200
      1243 "http://www.site.com/" "Mozilla/2.0 (X11; I; HP-UX A.09.05)"
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r%wHTTP%j" %c %b "%f" "%B")
LOGFORMAT (%S %j %u [%d/%M/%Y:%h:%n:%j] "%j%w%r" %c %b "%f" "%B")
Referrer log, LOGFORMAT REFERRER
[25/Dec/1998:17:45:35] http://www.site.com/ -> /~sret1/
or http://www.site.com/ -> /~sret1/
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %f -> %*r)
LOGFORMAT (%f -> %*r)
Browser log, LOGFORMAT BROWSER
[25/Dec/1998:17:45:35] Mozilla/2.0 (X11; I; HP-UX A.09.05)
LOGFORMAT ([%d/%M/%Y:%h:%n:%j] %B)
Microsoft log, North American dates, LOGFORMAT MICROSOFT-NA
192.64.25.41, -, 12/25/98, 17:45:35, W3SVC1, HOST1, 192.16.225.10,
      2178, 303, 1243, 200, 0, GET, /~sret1/, -,
LOGFORMAT (%S, %u, %m/%d/%y, %h:%n:%j, W3SVC%j, %j, %v,
      %T, %j, %b, %c, %j, %j, %r, %q,)
LOGFORMAT (%*S, %*u, %m/%d/%y, %h:%n:%j, %j)
Microsoft log, international dates, LOGFORMAT MICROSOFT-INT
192.64.25.41, -, 25/12/98, 17:45:35, W3SVC1, HOST1, 192.16.225.10,
      2178, 303, 1243, 200, 0, GET, /~sret1/, -,
LOGFORMAT (%S, %u, %d/%m/%y, %h:%n:%j, W3SVC%j, %j, %v,
      %T, %j, %b, %c, %j, %j, %r, %q,)
LOGFORMAT (%*S, %*u, %d/%m/%y, %h:%n:%j, %j)
WebSite log, North American dates, LOGFORMAT WEBSITE-NA
12/25/98 17:45:35  jay.bird.com  host1  Server  fred  GET  /~sret1/
   http://www.site.com/    Mozilla/2.0 (X11; I; HP-UX A.09.05)  200  1243  2178
LOGFORMAT (%m/%d/%y %h:%n:%j\t%S\t%v\t%j\t%u\t%j\t%r\t%f\t%j\t%B\t%c\t%b\t%T)
WebSite log, international dates, LOGFORMAT WEBSITE-INT
25/12/98 17:45:35  jay.bird.com  host1  Server  fred  GET  /~sret1/
   http://www.site.com/    Mozilla/2.0 (X11; I; HP-UX A.09.05)  200  1243  2178
LOGFORMAT (%d/%m/%y %h:%n:%j\t%S\t%v\t%j\t%u\t%j\t%r\t%f\t%j\t%B\t%c\t%b\t%T)
The extended log, Netscape log and WebSTAR log don't have any built-in formats: analog constructs their formats from their header lines.
Stephen Turner
Need help with analog? Subscribe to the analog-help mailing list
