NovoSial.org: Log analysis with swatch

The swatch utility assists with logfile analysis in two ways: it can send immediate notification when a log entry matches a regular expression, and it can be used to review logfiles for unknown data.

Consider also Simple Event Correlator (SEC), a “free and platform independent event correlation tool”.

Immediate Notification

Run swatch against logfiles to report disk failures and other problems that need an immediate response. One way to do this is to send all logs to a combined /var/log/everything file that swatch reads from. Administrators can also follow this combined file with tail or other interactive utilities for live review.

Setup

syslogd

For stock Unix syslogd, use *.* to match all logs and send them to /var/log/everything. If possible, change the configuration to include the facility and priority information in the logfile, for example with the -v -v options to the FreeBSD syslogd.

*.* /var/log/everything

syslog-ng

Assuming a source statement called local, the following routes all logs sent to syslog-ng into /var/log/everything. It also records the facility and priority information, which stock syslogd may discard by default. Including this information makes log analysis easier.

destination everything {
    file("/var/log/everything"
        template("$DATE <$FACILITY.$PRIORITY> $HOST $MSG\n") template_escape(no)
    );
};
log { source(local); destination(everything); };

To keep the size of the everything file down, truncate it periodically. Long term logs should be stored in other files or a database. The following logrotate configuration shows daily rotation of the everything file, along with the restart of a custom swatch service that runs swatch against the file.

/var/log/everything {
    daily
    copytruncate
    rotate 1
    postrotate
        /sbin/service swatch restart
    endscript
}

Configuration

The swatch configuration will need to be updated periodically, especially as new hardware and software are added to the network, or when a new disaster reveals log records not seen before. For example, to provide e-mail notification when a 3ware RAID array is no longer fault tolerant:

# 3ware logs
watchfor /(?i)3w-xxxx.+no longer fault tolerant/
    mail=root,subject=LW warn: disk 3ware RAID not fault tolerant
    throttle 1:00:00,use=regex

Use a consistent subject to allow easy filtering by mail clients. I use a prefix of LW on all such messages, followed by an info, note, or warn severity indicator. This in turn is followed by a category such as disk or network, then finally a description of the problem that triggered the rule.
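
The subject convention is easy to get slightly wrong when rules are written by hand, so it may help to generate the string. The following is only a minimal sketch: lw_subject is a hypothetical helper name, not part of swatch, and the accepted severities simply mirror the three mentioned above.

```shell
#!/bin/sh
# lw_subject: hypothetical helper that prints a subject line following the
# "LW <severity>: <category> <description>" convention described above.
# Usage: lw_subject SEVERITY CATEGORY DESCRIPTION...
lw_subject() {
    if [ "$#" -lt 3 ]; then
        echo "usage: lw_subject SEVERITY CATEGORY DESCRIPTION..." >&2
        return 2
    fi
    severity=$1
    category=$2
    shift 2
    case $severity in
        info|note|warn) ;;
        *)
            echo "lw_subject: unknown severity: $severity" >&2
            return 1
            ;;
    esac
    # Remaining arguments are joined with spaces to form the description.
    printf 'LW %s: %s %s\n' "$severity" "$category" "$*"
}
```

For example, lw_subject warn disk 3ware RAID not fault tolerant prints the subject used in the 3ware rule, and an unrecognized severity is rejected rather than silently mailed under a label no filter expects.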

Running swatch against the everything file also allows notification to be set up for non-critical or testing events.

Job completion

Completion of long running jobs can be logged and reported on via the use of a custom tag, such as custom-notify:.

watchfor /custom-notify: /
    mail=root,subject=LW info: misc custom notify message
    throttle 15:00,use=regex

Then, when starting a long running job, follow it with a log message. The message could also reference a ticket number, so that other administrators, perhaps on a different shift, can consult a trouble ticket system and complete the work.

$ long-running-job ; logger "custom-notify: job done: ticket=1234"

To allow better timing and tracking of jobs, wrap the command in two log entries: the first records when the job starts, along with other metadata, and the second triggers a notification and records when the job completed. For instance, via a Makefile entry:

index-rebuild:
	@logger -it index-rebuild device=/dev/nst0
	@scanner -i -v /dev/nst0
	@logger -it index-rebuild custom-notify: job complete

Even if the server running the scanner process crashes or needs to be rebooted, the central loghost should have a record of index-rebuild entries that can be consulted to see whether the scanner completed and when. This method works well when a list of tasks is being worked through in turn.

while read -r animal; do
    logger -it animal-notify "name=$animal"
    do_something "$animal"
    logger -it animal-notify "name=$animal, custom-notify: do_something complete"
done <<'EOF'
cat
dog
fish
EOF
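
The start and completion pair used in the Makefile entry and the loop above can also be wrapped in a shell function, so that each job needs only one line. This is a sketch under assumptions: run_logged is a hypothetical name, and the "start:" and "exit=" fields are my own additions to the message format rather than anything swatch or logger requires.

```shell
#!/bin/sh
# run_logged: hypothetical wrapper, not part of swatch or logger.
# Usage: run_logged TAG COMMAND [ARGS...]
run_logged() {
    tag=$1
    shift
    # First entry: record the start time and the command being run.
    logger -i -t "$tag" "start: $*"
    # Run the job, remembering its exit status instead of aborting.
    status=0
    "$@" || status=$?
    # Second entry: carries the custom-notify: tag so the swatch rule fires.
    logger -i -t "$tag" "custom-notify: job complete: exit=$status"
    return "$status"
}
```

A call such as run_logged index-rebuild scanner -i -v /dev/nst0 would then replace the three Makefile lines. One difference worth noting: make stops at the first failing command, so the Makefile version never logs completion when the scanner fails, whereas this wrapper logs the exit status either way.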

Logwatching service restarts

After restarting log watching services, have them log a message that triggers the following rule. That way, if the expected e-mail does not arrive following a manual restart, the logging system is probably not working correctly.

watchfor /logwatch restart/
    mail=root,subject=LW info: log logwatch restart
    throttle 15:00,use=regex
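
How the message gets logged depends on how the services are started. As a rough sketch, a small function could run the restart command and then emit the message; restart_and_announce and the logwatch tag are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# restart_and_announce: hypothetical helper. Runs the restart command given
# as arguments, then logs the message the rule above watches for.
# The "logwatch" tag is an assumption; any tag works as long as the
# message text contains "logwatch restart".
restart_and_announce() {
    # Do not announce a restart that failed.
    "$@" || return $?
    logger -t logwatch "logwatch restart"
}
```

With the logrotate example earlier, this might be called as restart_and_announce /sbin/service swatch restart. If swatch has not begun following the file by the time the message is written, the rule may not fire, so a short sleep before the logger call may be needed.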

Ongoing Review

Using swatch to look for unknown log data is harder and more time consuming. The point of looking for unknown log data is to spot new hardware problems or security issues. However, the wide variety of software, each issuing many different types of logs, makes this review difficult.

With swatch, the configuration ignores known patterns, then reports anything left over. This leads to many ignore rules that winnow out insignificant data, followed by a catch-all reporting rule. Facility and priority information in the logs helps, as all info and debug priorities can be excluded for well-behaved applications. Some applications make no use, or bizarre use, of the facility and priority information, and require many ignore statements.

ignore /\.(?:debug|info)> \S+ clamd(?:\[\d+\])?: /
ignore /\.notice> \S+ clamd(?:\[\d+\])?: clamd (startup|shutdown) succeeded/

watchfor /clamd\[\d+\]:.+Unable to open file or directory/
    echo
    throttle 1:00:00,use=regex

watchfor /./
    echo
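
The ignore-then-report logic can be tried against an old logfile before a ruleset is trusted. The following sketch approximates it with grep rather than swatch itself. Both report_unknown and the pattern file are hypothetical, and the patterns must be POSIX extended regular expressions, so Perl-only syntax such as (?:...), \S, or \d from the rules above needs rewriting first.

```shell
#!/bin/sh
# report_unknown: hypothetical stand-in for a list of swatch ignore rules
# followed by a catch-all "watchfor /./".
# Usage: report_unknown PATTERN_FILE < LOGFILE
# PATTERN_FILE holds one POSIX extended regular expression per line.
# Avoid blank lines in it: an empty pattern matches every line.
report_unknown() {
    # -v inverts the match, so only lines matching no pattern are printed.
    grep -E -v -f "$1"
}
```

The first ignore rule above could be written for this purpose roughly as \.(debug|info)> [^ ]+ clamd(\[[0-9]+\])?: in the pattern file. Anything the function prints is a line that no ignore pattern accounted for.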

Reporting on problems with this sort of swatch configuration is difficult, as the throttle and threshold statements do not go very far to exclude or summarize repeated logs, such as permission denied errors.
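
One possible workaround, outside swatch itself, is to collapse repeated lines into counts before reading them. The sketch below assumes the log format produced by the syslog-ng template shown earlier (a date followed by the <facility.priority> field). The name summarize_repeats and the [PID] placeholder are arbitrary choices for illustration.

```shell
#!/bin/sh
# summarize_repeats: hypothetical helper. Reads log lines on stdin, drops
# everything before the <facility.priority> field (the date), masks process
# IDs, and prints each distinct line with its count, most frequent first.
summarize_repeats() {
    sed -e 's/^[^<]*//' -e 's/\[[0-9][0-9]*\]/[PID]/g' |
        sort | uniq -c | sort -rn
}
```

Piping the output of the reporting rule, or the everything file itself, through summarize_repeats turns a flood of identical permission denied errors into a single counted line. The cost is that the timestamps are lost, so the original file is still needed to see when each event occurred.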

The following example configuration files may serve as a useful reference.

macosx-everything.conf - the configuration I use on my OS X laptop, which runs a custom syslog-ng logging daemon and other tools.