Some ideas about monitoring

TODO: Update this document with edited ones.

Ad-hoc monitoring

A monitoring system that accepts new types of metrics without reconfiguration lets an operator write a quick-and-dirty script (run from cron or a shell while loop) to closely monitor some object while debugging it. The operator can then use a dashboard to correlate the ad-hoc metric with the ones collected systematically.
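As a rough illustration, such a quick-and-dirty script could look like the sketch below. The message fields ("name", "value", "time"), the metric name and the use of stdout as the transport are all assumptions made for this example; a real setup would send the line to whatever input the monitoring system exposes.

```python
import json
import os
import sys
import tempfile
import time


def make_sample(name, value, timestamp=None):
    """Build one self-describing metric message (hypothetical field names)."""
    return {
        "name": name,
        "value": value,
        "time": int(time.time()) if timestamp is None else timestamp,
    }


def emit(sample, out=sys.stdout):
    """Write the sample as one JSON line; stdout stands in for a real transport."""
    out.write(json.dumps(sample, sort_keys=True) + "\n")


if __name__ == "__main__":
    # Ad-hoc metric: number of entries in the temp directory, one sample per run.
    entries = len(os.listdir(tempfile.gettempdir()))
    emit(make_sample("debug.tmp_entries", entries))
```

Run from cron or a shell loop, each invocation produces one sample. Nothing about the metric has to be declared up front, because the message carries its own name.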

Crisis monitoring

This is a variant of ad-hoc monitoring. When a crisis strikes, the operator can easily add whatever metrics are needed. In post-mortem analysis, an archive of metrics (both systematic and ad-hoc) can be useful, too.

Custom metric processing

Threshold checking

Static thresholds are too rigid and often don't model a metric's behaviour well. There are several ways of checking whether a metric's value is OK or should be investigated by an operator.

Average on a period

Using the history of a metric's values, one could build a model of how the metric changes. One way of building such a model is to decide on a period (e.g. one day or one week), take a history a few times longer than that, and calculate an average for every data point of the period, that is, over the values found at the same offset within each period. The resulting function is used as the model of change.
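A minimal sketch of this averaging, assuming the history is a plain list of evenly spaced samples and the period is given as a number of samples. The function names are made up for the example.

```python
def periodic_average(history, period):
    """Average the values found at the same offset within each period."""
    if period <= 0 or len(history) < period:
        raise ValueError("history must cover at least one full period")
    sums = [0.0] * period
    counts = [0] * period
    for i, value in enumerate(history):
        sums[i % period] += value
        counts[i % period] += 1
    return [s / c for s, c in zip(sums, counts)]


def expected(model, index):
    """Value the model predicts for the sample at the given index."""
    return model[index % len(model)]


# Three periods of three samples each, with a slow upward drift.
model = periodic_average([1, 2, 3, 3, 4, 5, 5, 6, 7], period=3)
```

A check would then compare each incoming value with expected(model, index) and flag it when the difference exceeds some tolerance. How to choose that tolerance is left open here.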

DFT

The value of a metric can be treated as the amplitude of a signal. This opens up a whole range of digital signal processing techniques, beginning with the discrete Fourier transform (DFT).

A model of a metric's behaviour can be calculated by taking a historical track of changes and passing it through the DFT, then dropping frequencies whose amplitude is below a threshold and passing the data back through the inverse DFT.

This model has the property of finding the periods of change by itself, without expecting them to be of a certain length.
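The DFT-based model can be sketched as follows, using a naive O(n^2) transform from the standard library so the example stays self-contained. A real implementation would more likely use an FFT library. Comparing the threshold against |X_k| / n is a normalization chosen for this example, not something the notes prescribe.

```python
import cmath


def dft(samples):
    """Naive discrete Fourier transform."""
    n = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples))
        for k in range(n)
    ]


def inverse_dft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * t / n) for k, c in enumerate(spectrum)) / n
        for t in range(n)
    ]


def dft_model(samples, min_amplitude):
    """Keep only frequency bins with |X_k| / n >= min_amplitude, then invert."""
    n = len(samples)
    spectrum = dft(samples)
    kept = [c if abs(c) / n >= min_amplitude else 0 for c in spectrum]
    # For real input the bins come in conjugate pairs of equal magnitude,
    # so the result is real up to rounding; the imaginary part is discarded.
    return [v.real for v in inverse_dft(kept)]
```

With this normalization, a sine component of amplitude A shows up as two bins of size A / 2 each, so the threshold has to be picked with that factor in mind.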

Anomaly detection

Anomalies are not limited to deviations from a model of behaviour. An operator may also want to monitor peaks in a metric. There are some ways of doing this based on observing a short window of values.
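One possible window-based check, given here only as an illustration, flags a value as a peak when it exceeds the mean of the preceding window by more than a few standard deviations. The window size, the number of standard deviations and the min_spread floor (which guards against a perfectly flat window) are arbitrary parameters of this sketch.

```python
from collections import deque
from statistics import mean, pstdev


def find_peaks(values, window=5, sigmas=3.0, min_spread=1e-9):
    """Return indices of values far above the mean of the preceding window."""
    recent = deque(maxlen=window)
    peaks = []
    for i, value in enumerate(values):
        if len(recent) == window:
            spread = max(pstdev(recent), min_spread)
            if value > mean(recent) + sigmas * spread:
                peaks.append(i)
        # The window always slides, so a peak also inflates the spread
        # that the next few values are compared against.
        recent.append(value)
    return peaks


peaks = find_peaks([10, 11, 10, 9, 10, 11, 10, 50, 10, 11])
```

Only the last few values are kept, so the check needs no long history and can run directly on the incoming stream.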

PID controller

See the Wikipedia article on PID controllers.

This is a very rough idea, as monitoring typically only receives data, so there's no obvious feedback channel.

Event correlation

Event correlation is usually applied to logs, but events from a monitoring system, including the ones generated from metrics, extend this idea.

BYOD: Bring Your Own Dashboard

With a proper data bus, a user can direct an event stream to their own, locally installed dashboard. In the same way, a user can set up custom alerting, or even create it with a simple script. The administrator doesn't need to be involved here.
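The "simple script" for custom alerting could be as small as the sketch below. It assumes the event stream arrives as one JSON object per line with "name" and "value" fields. That format, the metric name and the alert text are all hypothetical.

```python
import json


def alerts(lines, name, limit):
    """Yield an alert message for each event of the given metric above the limit."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the stream
        event = json.loads(line)
        if event.get("name") == name and event.get("value", 0) > limit:
            yield "ALERT {} = {}".format(name, event["value"])


# Example with an in-memory stream; a real script would iterate over
# sys.stdin or a socket subscribed to the data bus instead.
stream = [
    '{"name": "disk.used_percent", "value": 95}',
    '{"name": "disk.used_percent", "value": 50}',
    '{"name": "load", "value": 99}',
]
messages = list(alerts(stream, "disk.used_percent", 90))
```

Because the function only consumes an iterable of lines, the user can point it at any stream they are allowed to subscribe to, without changes on the server side.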

Monitoring as Code

A monitoring agent running on a monitored server can be as smart as it needs to be: its configuration can be code, so the probe detects whole classes of services and aspects to monitor instead of having them configured one by one.
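To make the idea more concrete, the sketch below treats the agent's configuration as a list of detector functions. Each detector looks at what is present on the host (here just a set of listening ports) and returns the checks to run. The detector names, port lists and check descriptions are invented for the example and do not refer to any existing agent.

```python
def detect_http(listening_ports):
    """Any port commonly used for HTTP gets a response check (assumed port list)."""
    return [
        {"check": "http_response", "port": port}
        for port in sorted(listening_ports)
        if port in (80, 443, 8080)
    ]


def detect_postgres(listening_ports):
    """A listener on 5432 is assumed to be PostgreSQL."""
    if 5432 in listening_ports:
        return [{"check": "postgres_connections", "port": 5432}]
    return []


# The "configuration" is code: adding a detector covers a whole class of services.
DETECTORS = [detect_http, detect_postgres]


def plan_checks(listening_ports):
    """Run every detector and collect the checks it proposes."""
    checks = []
    for detect in DETECTORS:
        checks.extend(detect(listening_ports))
    return checks


plan = plan_checks({22, 80, 5432})
```

When a new service appears on the host, the next call to plan_checks picks it up without anyone editing a per-service entry.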