Log analysis with swatch

Immediate Notification | Ongoing Review

The swatch utility can assist with logfile analysis in two ways. It can provide immediate notification when log entries matching a regular expression appear, or it can be used to review logfiles for unknown data.

Also consider the Simple Event Correlator (SEC), a “free and platform independent event correlation tool”.

Immediate Notification

Run swatch against logfiles to report disk failures and other problems that need an immediate response. One approach is to send all logs to a combined /var/log/everything file that swatch reads from. Administrators can also watch this combined file with tail or other interactive utilities for live review.
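As a rough illustration of what running against a growing file involves, here is a minimal Python sketch that follows a file the way `tail -f` does and yields each newly appended line. It is not how swatch itself is implemented; the function name `follow` and the polling interval are assumptions for illustration.

```python
import os
import time


def follow(path, poll_interval=1.0):
    """Yield lines appended to `path`, roughly like `tail -f`.

    Hypothetical helper for illustration only. Reading starts at the
    current end of the file, so existing lines are skipped. Truncation
    (for example by logrotate) is not detected.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)  # skip what is already in the file
        while True:
            line = f.readline()
            if line:
                yield line.decode("utf-8", errors="replace")
            else:
                time.sleep(poll_interval)  # no new data yet; poll again
```

A line caught half-written would be yielded as a fragment, and a truncated file would leave the reader waiting past the new end; a real tool has to handle both, which is one reason the setup below simply restarts swatch after rotation.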

Setup

To keep the everything file small, truncate it periodically. Long-term logs should be stored in other files or a database. The following logrotate configuration rotates the everything file daily and then restarts a site-specific swatch service that runs swatch against the file.

/var/log/everything {
    daily
    # copy the log aside, then truncate the original in place
    copytruncate
    # keep only one old copy
    rotate 1
    postrotate
        /sbin/service swatch restart
    endscript
}
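To make the effect of `copytruncate` with `rotate 1` concrete, the following Python sketch models it: copy the live file to a single `.1` backup, then truncate the original in place. The function name and the `.1` suffix are assumptions for illustration; logrotate's actual naming depends on its other options. Because the original is truncated rather than renamed, a logger that keeps the file open in append mode simply continues writing at the new, empty end.

```python
import shutil


def copytruncate(path):
    """Rough model of logrotate's `copytruncate` with `rotate 1`.

    Hypothetical helper: copies the live log to `path + ".1"`, replacing
    any previous copy, then empties the original without replacing it.
    """
    shutil.copyfile(path, path + ".1")
    with open(path, "r+b") as f:
        f.truncate(0)
```

Lines written in the short window between the copy and the truncate can be lost, a limitation the logrotate manual also mentions.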

Configuration

The swatch configuration will need periodic updates, especially as new hardware and software are added to the network, or after a new failure reveals log messages that no existing rule matches. For example, to send e-mail notification when a 3ware RAID array is no longer fault tolerant:

# 3ware logs
watchfor /(?i)3w-xxxx.+no longer fault tolerant/
mail=root,subject=LW warn: disk 3ware RAID not fault tolerant
throttle 1:00:00,use=regex
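The `throttle 1:00:00,use=regex` line limits how often the rule fires. With `use=regex`, repeats are keyed on the rule's regular expression rather than on the exact message text, so different lines matching the same pattern share one timer. The Python sketch below models that behaviour under those assumptions; the class name is hypothetical and this is not swatch's implementation. Python's `re` accepts the same leading `(?i)` inline flag as Perl, so the pattern matches case-insensitively here too.

```python
import re
import time


class RegexThrottle:
    """Rough model of a swatch rule with `throttle H:MM:SS,use=regex`.

    Hypothetical class, not swatch's implementation: once the rule fires,
    any further line matching the same regex is suppressed until
    `period_seconds` have passed, whatever the exact message text.
    """

    def __init__(self, pattern, period_seconds):
        self.regex = re.compile(pattern)
        self.period = period_seconds
        self.last_fired = None  # time of the last notification, if any

    def should_notify(self, line, now=None):
        """Return True if this line should trigger the rule's actions."""
        if not self.regex.search(line):
            return False
        now = time.time() if now is None else now
        if self.last_fired is not None and now - self.last_fired < self.period:
            return False  # still inside the throttle window
        self.last_fired = now
        return True


# Throttle matching the 3ware rule: at most one notification per hour.
raid_rule = RegexThrottle(r"(?i)3w-xxxx.+no longer fault tolerant", 60 * 60)
```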

Use a consistent subject so mail clients can filter messages easily. I use a prefix of LW on all such messages, followed by an info, note, or warn severity indicator. Next comes a category such as disk or network, and finally a description of the problem that triggered the message.
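The subject convention is regular enough to generate and parse mechanically, for example when a script sends its own notifications or when sorting mail outside a mail client. This Python sketch builds and parses subjects of the form `LW <severity>: <category> <description>`; the helper names are hypothetical, and it assumes the category is a single word such as disk or network.

```python
import re

SEVERITIES = ("info", "note", "warn")
# Assumes the category is a single word, e.g. "disk" or "network".
SUBJECT_RE = re.compile(r"^LW (info|note|warn): (\S+) (.+)$")


def lw_subject(severity, category, description):
    """Build a subject such as 'LW warn: disk 3ware RAID not fault tolerant'."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    return f"LW {severity}: {category} {description}"


def parse_lw_subject(subject):
    """Return (severity, category, description), or None if not an LW subject."""
    m = SUBJECT_RE.match(subject)
    return m.groups() if m else None
```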

Running swatch on the everything file also makes it easy to set up notification for non-critical or test events.

Ongoing Review

Using swatch to look for unknown log data is harder and more time-consuming. The goal is to spot new hardware problems or security issues that no existing rule anticipates. However, the wide variety of software, each program logging in its own way, makes this review difficult.

For this kind of review, the swatch configuration ignores known patterns and then reports anything left over. This leads to many ignore rules that winnow out insignificant data, followed by a final catch-all reporting rule. Facility and priority information in the logs helps, as the common info and debug priorities of well-behaved applications can be excluded wholesale. Some applications make no use, or bizarre use, of facility and priority, and require many ignore statements.

ignore /\.(?:debug|info)> \S+ clamd(?:\[\d+\])?: /
ignore /\.notice> \S+ clamd(?:\[\d+\])?: clamd (startup|shutdown) succeeded/

watchfor /clamd\[\d+\]:.+Unable to open file or directory/
echo
throttle 1:00:00,use=regex

watchfor /./
echo
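The ignore-then-report configuration above can be modeled in a few lines of Python to show why rule order matters: rules are tried in file order and the first match decides what happens to a line, which is swatch's behaviour unless a `continue` action is used. The patterns are copied from the configuration; the labels and function names are hypothetical, and the throttle on the clamd rule is not modeled. The sketch assumes each line contains `facility.priority>` followed by the host name, which is the log format these patterns imply.

```python
import re

# (label, pattern) pairs in the same order as the swatch configuration.
# Labels are hypothetical; swatch only distinguishes ignore from watchfor.
RULES = [
    ("ignore", re.compile(r"\.(?:debug|info)> \S+ clamd(?:\[\d+\])?: ")),
    ("ignore", re.compile(
        r"\.notice> \S+ clamd(?:\[\d+\])?: clamd (startup|shutdown) succeeded")),
    ("known problem", re.compile(
        r"clamd\[\d+\]:.+Unable to open file or directory")),
    ("unknown", re.compile(r".")),  # catch-all: anything not matched above
]


def classify(line):
    """Return the label of the first rule that matches, or None."""
    for label, regex in RULES:
        if regex.search(line):
            return label
    return None  # only empty (or newline-only) lines get here


def report(lines):
    """Return the lines that would be echoed, in order."""
    return [line for line in lines if classify(line) not in ("ignore", None)]
```

Moving the catch-all rule above the ignore rules would report every line, which is why it must come last.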

Reporting on problems with this sort of swatch configuration is difficult, as the throttle and threshold actions do little to exclude or summarize repeated log entries, such as permission denied errors.

The following example configuration files may serve as a useful reference.