Monitoring data comes in a variety of forms—some systems pour out data continuously and others only produce data when rare events occur. Some data is most useful for identifying problems; some is primarily valuable for investigating problems. This post covers which data to collect, and how to classify that data so that you can:
- Receive meaningful, automated alerts for potential problems
- Quickly investigate and get to the bottom of performance issues
Metrics capture a value pertaining to your systems at a specific point in time—for example, the number of users currently logged in to a web application. Therefore, metrics are usually collected once per second, one per minute, or at another regular interval to monitor a system over time. There are two important categories of metrics in our framework: work metrics and resource metrics. For each system that is part of your software infrastructure, consider which work metrics and resource metrics are reasonably available, and collect them all.
Work metrics indicate the top-level health of your system by measuring its useful output. When considering your work metrics, it’s often helpful to break them down into four subtypes:
- throughput is the amount of work the system is doing per unit time. Throughput is usually recorded as an absolute number.
- success metrics represent the percentage of work that was executed successfully.
- error metrics capture the number of erroneous results, usually expressed as a rate of errors per unit time or normalized by the throughput to yield errors per unit of work. Error metrics are often captured separately from success metrics when there are several potential sources of error, some of which are more serious or actionable than others.
- performance metrics quantify how efficiently a component is doing its work. The most common performance metric is latency, which represents the time required to complete a unit of work. Latency can be expressed as an average or as a percentile, such as “99% of requests returned within 0.1s”.
Most components of your software infrastructure serve as a resource to other systems. Some resources are low-level—for instance, a server’s resources include such physical components as CPU, memory, disks, and network interfaces. But a higher-level component, such as a database or a geolocation microservice, can also be considered a resource if another system requires that component to produce work.
Resource metrics are especially valuable for investigation and diagnosis of problems. For each resource in your system, try to collect metrics that cover four key areas:
- utilization is the percentage of time that the resource is busy, or the percentage of the resource’s capacity that is in use.
- saturation is a measure of the amount of requested work that the resource cannot yet service, often queued.
- errors represent internal errors that may not be observable in the work the resource produces.
- availability represents the percentage of time that the resource responded to requests. This metric is only well-defined for resources that can be actively and regularly checked for availability.
There are a few other types of metrics that are neither work nor resource metrics, but that nonetheless may come in handy in diagnosing causes of problems. Common examples include counts of cache hits or database locks. When in doubt, capture the data.
In addition to metrics, which are collected more or less continuously, some monitoring systems can also capture events: discrete, infrequent occurrences that can provide crucial context for understanding what changed in your system’s behavior. Some examples:
- Changes: Internal code releases, builds, and build failures
- Alerts: Internally generated alerts or third-party notifications
- Scaling events: Adding or subtracting hosts
An event usually carries enough information that it can be interpreted on its own, unlike a single metric data point, which is generally only meaningful in context. Events capture what happened, at a point in time, with optional additional information.Events are sometimes used to generate alerts—someone should be notified of events such as the third example in the table above, which indicates that critical work has failed. But more often they are used to investigate issues and correlate across systems. In general, think of events like metrics—they are valuable data to be collected wherever it is feasible.
Conclusion: Collect them all
- Instrument everything and collect as many work metrics, resource metrics, and events as you reasonably can.
- Collect metrics with sufficient granularity to make important spikes and dips visible. The specific granularity depends on the system you are measuring, the cost of measuring and a typical duration between changes in metrics—seconds for memory or CPU metrics, minutes for energy consumption, and so on.
- To maximize the value of your data, tag metrics and events with several scopes, and retain them at full granularity for at least a year.