
Analog 5.92beta1: Cache files


Analog has the ability to archive some of the data in your logfile into a cache file so that the logfile can be thrown away without losing the most important data. (This is sometimes known as incremental processing.)

For most people, the cache file will not be needed: compressing the logfile using a standard compression utility such as gzip will be sufficient. Compressing a logfile is very efficient owing to the large number of repeated strings: I find about 12 times compression in practice. That in itself may solve your filespace problems, without needing to throw away any information.
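
As a rough sketch of that approach, the configuration below reads gzip-compressed logfiles directly through the UNCOMPRESS mechanism. The logfile path is hypothetical, and the example assumes that gzip is installed and on the path.

```
# Hypothetical path: compressed logfiles left behind by log rotation
LOGFILE /var/log/httpd/access_log.*.gz
# Pipe files matching these patterns through gzip before analysing them
UNCOMPRESS *.gz,*.Z "gzip -cd"
```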

The cache file is also not the best format for post-processing the data or feeding it into a spreadsheet. For that you should use the computer-readable output style.

Many people have trouble using the cache file, and end up accidentally recording corrupt data. You do need to understand what you're doing before you throw away your logfiles. See the discussion on Procedures below.

If you are going to use the cache file feature, it is also very important that you understand what is and what is not recorded. The summary is that all INCLUDE and EXCLUDE commands, including FROM and TO, and any ALIASes and LOGTIMEOFFSETs, must be applied when you create the cache file, not when you read it later. If you want different sets of options, you must create several cache files from the same logfile.
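
For instance, to keep one view of all the traffic and another without internal hosts, one might run analog twice over the same logfile with two configurations along these lines. The filenames and the host pattern are hypothetical.

```
# cache-all.cfg (hypothetical, used for the first run): everything in the logfile
LOGFILE access.log.1
CACHEOUTFILE cache-all.dat

# cache-external.cfg (hypothetical, used for a second, separate run):
# the same logfile with internal hosts excluded at creation time
LOGFILE access.log.1
HOSTEXCLUDE *.internal.example.com
CACHEOUTFILE cache-external.dat
```

Each resulting cache file then represents one fixed set of options.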

The reason for this is that it is not possible to reconstruct everything of interest in the logfile from the cache file. The cache file does contain information about the total number of requests for each host and each file, but not about, for example, which files were read by which hosts. (To do so would take up as much disk space as the compressed logfile.) So you cannot later look at only one file and see which hosts read that file.

Another way to look at this: if you do, for example, a HOSTEXCLUDE when reading the cache file, you are not doing a genuine HOSTEXCLUDE, because the files which that host read will still be included. You are only excluding those hosts from the Host Report, Organisation Report and Domain Report. This is why you must do all the inclusions and exclusions you want when you create the cache file.
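
A sketch of that pitfall, with a hypothetical cache filename and host pattern:

```
# Reading a cache file that was created WITHOUT the exclusion
CACHEFILE cache-all.dat
# This only removes matching hosts from the Host Report, Organisation Report
# and Domain Report; the files those hosts read are still included
HOSTEXCLUDE *.internal.example.com
```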

When analog reads in a cache file, it does not apply any more aliases to the items. This is to avoid double-aliasing. So you must do any aliases you want at the time you create the cache file. Similarly, it does not obey the LOGTIMEOFFSET variable, to avoid double-offsetting, so any offset you want must be applied at cache-creation time too.
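
A minimal sketch of applying both at creation time. The filenames, the alias and the offset value are made up for illustration, and the offset is assumed to be given in minutes, so check the LOGTIMEOFFSET documentation before relying on it.

```
LOGFILE access.log.1
# Alias applied now, because none will be applied when the cache file is read
FILEALIAS /index.html /
# Offset applied now, for the same reason (assumed to be in minutes)
LOGTIMEOFFSET +60
CACHEOUTFILE cache-2004-10.dat
```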

Also, the cache file does not contain data about the number of requests for each item in the last seven days: it can't, because the figures will be different by the time they are wanted.

Finally, times are only recorded to five-minute resolution.


You can create a cache file by setting the CACHEOUTFILE to be the file you want the cache to live in. Set
CACHEOUTFILE none
to turn it off again. You will still get the regular output as well as the cache output, unless you request OUTPUT NONE. To avoid overwriting, you cannot set the CACHEOUTFILE to be a file which already exists. (Disclaimer: on some systems, race conditions may very occasionally thwart this check. Also, on a few systems, making the file writeable but not readable will allow it to be overwritten.) You can include the date in the name of the CACHEFILE and CACHEOUTFILE in the same way as described earlier for the LOGFILE and OUTFILE.
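
As an illustration of a dated cache filename, the line below assumes that %Y (four-digit year) and %M (month number) are among the date codes described in the LOGFILE and OUTFILE sections; check those sections for the codes your version supports and for which date they refer to. The filename itself is hypothetical.

```
# Assumed date codes: %Y = four-digit year, %M = month number
CACHEOUTFILE cache-%Y%M.dat
```

Because an existing file cannot be used as the CACHEOUTFILE, a dated name of this kind gives each period its own file.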

You can read in a previously-made cache file with the CACHEFILE command, or with the +U command line option. This works exactly the same as the LOGFILE command, so you can use commas and wildcards to read in several cache files, and read compressed cache files using the UNCOMPRESS mechanism.
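
Two sketches of reading several cache files, with hypothetical filenames. Use one form or the other: since these commands are cumulative, naming the same cache file in both would read it twice.

```
# Either: several named cache files, separated by commas
CACHEFILE cache-2004-08.dat,cache-2004-09.dat

# Or: every cache file matching a wildcard
CACHEFILE cache-2004-*.dat
```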

If the name of the CACHEFILE or the CACHEOUTFILE doesn't include a directory, it will be looked for, or written to, wherever analog expects to find its cache files. (This location is built in when the program is compiled.) For example, on Windows it would be in the same folder as the analog executable. But a cache file specified in a +U command line option is looked for in the current directory.

It is possible (and useful) to analyse some CACHEFILEs and some LOGFILEs together. LOGFILE and CACHEFILE commands are basically cumulative, with one exception: those from a higher-priority source override those from lower-priority sources instead of adding to them. In decreasing order of priority, the sources are: the mandatory configuration file (and any configuration files loaded from it); the command line (and any configuration files specified on it); the default configuration file (and any configuration files loaded from it); and the compile-time options. Usually you don't need to worry about this, and it will do what you expect.

Sometimes you don't want to record all the types of item in the cache file. You might want to forget about which hosts had accessed your web site, for example, and only remember how many times each file was requested. You can choose not to include one type of item in the cache file by setting its LOWMEM to 3; for example, specify

HOSTLOWMEM 3
to exclude hosts from the cache file. Because this is a serious step, analog will produce a warning if you do this. You can even set all six LOWMEMs to 3 if you just want to remember the pattern of requests over time, not even which files were requested.
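
A sketch of that last case, assuming the six LOWMEM commands are the ones for hosts, files, browsers, referrers, users and virtual hosts (check the LOWMEM documentation for the exact list). The cache filename is hypothetical.

```
# Record only the pattern of requests over time; expect a warning for each
HOSTLOWMEM 3
FILELOWMEM 3
BROWLOWMEM 3
REFLOWMEM 3
USERLOWMEM 3
VHOSTLOWMEM 3
CACHEOUTFILE cache-times-only.dat
```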

Procedures

Many people have trouble when they try to use cache files, and end up omitting data or double-counting. You have to be careful to make sure that each piece of data is recorded in exactly one cache file. One very common mistake is to use all the old cache files when making each new cache file. Because each piece of data is then in all of the cache files, when you make a new cache file, it will record each piece of data several times over. If analog gives you a "double-counting" warning when you create a cache file, you have probably made a mistake of this sort.

Here is one way to use the cache files correctly. It's not the only correct way, but I think it's conceptually the simplest. The idea is that whenever you start a new logfile, you make a cache file out of the old logfile. So each cache file contains all the data from one, and only one, logfile. You never use old cache files to make new ones, so you never have a CACHEFILE and a CACHEOUTFILE command in the same configuration file.

Here is the procedure.

  1. Rotate your logs: that is, archive the old logfile, and restart the server with a fresh logfile. (There are several standard tools to do this; otherwise, see your server documentation.)
  2. Make both a cache file and an ordinary report from the old logfile. You can do this simultaneously by using one LOGFILE command, one OUTFILE command, and one CACHEOUTFILE command.
  3. Make a test report from the cache file (using CACHEFILE and OUTFILE but no LOGFILE) and compare it against the report from the logfile to check it works. (This step really is worth doing!)
  4. Now you can throw away the old logfile, if you've really understood what data you're losing by doing so. (But please remember that I can take no responsibility if something goes wrong: see the licence.)
  5. When you want to make the main report, you can now use all your cache files and the current (not-yet-cached) logfile.
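
Steps 2, 3 and 5 above might be sketched as three separate configuration files, one per run of analog. All filenames here are hypothetical. Note that no file contains both a CACHEFILE and a CACHEOUTFILE command.

```
# Step 2, make-cache.cfg (hypothetical): report and cache file from the rotated logfile
LOGFILE access.log.1
OUTFILE report-old.html
CACHEOUTFILE cache-2004-10.dat

# Step 3, test-cache.cfg (hypothetical): the same report from the cache file alone
CACHEFILE cache-2004-10.dat
OUTFILE report-test.html

# Step 5, main.cfg (hypothetical): all cache files plus the current logfile
CACHEFILE cache-2004-*.dat
LOGFILE access.log
OUTFILE report.html
```

When comparing the two reports in step 3, bear in mind that the cache file holds no last-seven-days figures and records times only to five-minute resolution, so those parts may legitimately differ.
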
As explained above, all INCLUDE and EXCLUDE commands, including FROM and TO, and any ALIASes and LOGTIMEOFFSETs, must be applied when you create the cache file, not when you read it later. So you may want to create several cache files from each logfile with different sets of options. Of course, in this case, you mustn't later mix cache files made with different options.

Stephen Turner
14 November 2004

Need help with analog? Use the analog-help mailing list.
