Good News: Your Monitoring Is All Wrong

This is the first in a series of articles on server monitoring techniques. If you’re responsible for a production service, this series is for you.

In this post, I’ll describe a technique for writing alerting rules. The idea is simple: alert when normal, desirable user actions suddenly drop. This may sound mundane, but it’s a surprisingly effective way to increase your alerting coverage (the percentage of problems for which you’re promptly notified) while minimizing false alarms.

Most alerting rules work on the opposite principle — they trigger on a spike in undesired events, such as server errors. But as we’ll see, the usual approach has serious limitations. The good news: by adding a few alerts based on the new technique, you can greatly improve your alerting coverage with minimal effort.

Why is this important? Ask Microsoft and Oracle.

With almost any monitoring tool, it’s pretty easy to set up some basic alerts. But it’s surprisingly hard to do a good and thorough job, and even the pros can get it wrong. One embarrassing instance: on October 13, 2012, for about an hour, Oracle’s home page consisted entirely of the words “Hello, World”. Oracle hasn’t said anything about the incident, but the fact that it went uncorrected for so long suggests that it took a while for the ops team to find out anything was wrong. That’s a pretty serious alerting failure.

A more famous example occurred on February 28th, 2012. Microsoft’s Azure service suffered a prolonged failure during which no new VMs could be launched. We know from Microsoft’s postmortem that it took 75 minutes for the first alert to trigger. For 75 minutes, no Azure VM could launch anywhere in the world, and the Azure operations team had no idea anything was wrong. (The whole incident was quite fascinating. I dissected it in an earlier post; you can find Microsoft’s postmortem linked there.)

This is just the tip of the iceberg; alerting failures happen all the time. Also common are false positives — noisy alerts that disrupt productivity or teach operators the dangerous habit of ignoring “routine” alerts.

If you see something, say something

If you’ve ridden the New York City subway in recent years, you’ve seen this slogan. Riders are encouraged to alert the authorities if they see something that looks wrong — say, a suspicious package left unattended. Most monitoring alerts are built on a similar premise: if the monitoring system sees something bad — say, a high rate of server errors, elevated latency, or a disk filling up — it generates a notification.

This is sensible enough, but it’s hard to get it right. Some challenges:

  1. It’s hard to anticipate every possible problem. For each type of problem you want to detect, you have to find a way to measure instances of that problem, and then specify an alert on that measurement.
  2. Users can wake you up at 3:00 AM for no reason. It’s hard to define metrics that distinguish between a problem with your service, and a user (or their equipment) doing something odd. For instance, a single buggy sync client, frantically retrying an invalid operation, can generate a stream of server “errors” and trigger an alert.
  3. Operations that never complete, never complain. If a problem causes operations to never complete at all, those operations may not show up in the log, and your alerts won’t see anything wrong. (Eventually, a timeout or queue limit might kick in and start recording errors that your alerting system can see… but this is chancy, and might not happen until the problem has been going on for a while.) This was a factor in the Azure outage.
  4. Detecting incorrect content. It’s easy to notice when your server trips an exception and returns a 500 status. It’s a lot harder to detect when the server returns a 200 status, but due to a bug, the page is malformed or is missing data. This was presumably why Oracle didn’t spot their “Hello, World” homepage.

Let your users be your guide

The challenges with traditional server log alerting can be summarized as: servers are complicated, and it’s hard to distinguish desired from undesired behavior. Fortunately, you have a large pool of human experts who can make this determination. They’re called “users”.

Users are great at noticing problems on your site. They’ll notice if it’s slow to load. They’ll notice error pages. They’ll also notice subtler things — nonsensical responses, incomplete or incorrect data, actions that never complete.

Someone might come right out and tell you you have a problem, but you don’t want to rely on that — it might take a long time for the message to work its way through to your operations team. Fortunately, there’s another approach: watch for a dropoff in normal operations. If users can’t get to your site, or can’t read the page, or their data isn’t showing up, they’ll react by not doing things they normally do — and that’s something you can easily detect.

This might seem simple, but it’s a remarkably robust way of detecting problems. An insufficient-activity alert can detect a broad spectrum of problems, including incorrect content and operations that never complete. Furthermore, it won’t be thrown off by a handful of users doing something strange, so there will be few false alarms.

Consider the real-world incidents mentioned above. In Oracle’s case, actions that would normally occur as a result of users clicking through from the home page would have come to a screeching halt. In Microsoft’s case, the rate of successful VM launches dropped straight to zero the moment the incident began.

Red alert: insufficient dollars per second

When I worked at Google, internal lore held that the most important alert in the entire operations system checked for a drop in “advertising dollars earned per second”. This is a great metric to watch, because it rolls up all sorts of behavior. Anything from a data center connectivity problem, to a code bug, to mistuning in the AdWords placement algorithms would show up here. And as a direct measurement of a critical business metric, it’s relatively immune to false alarms. Can you think of a scenario where Google’s incoming cash takes a sudden drop, and the operations team wouldn’t want to know about it?

Concrete advice

Alongside your traditional “too many bad things” alerts, you should have some “not enough good things” alerts. The specifics will depend on your application, but you might look for a dropoff in page loads, or invocations of important actions. It’s a good idea to cover a variety of actions. For instance, if you were in charge of operations for Twitter, you might start by monitoring the rate of new tweets, replies, clickthroughs on search results, accounts created, and successful logins. Think about each important subsystem you’re running, and make sure that you’re monitoring at least one user action which depends on that subsystem.
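To make that concrete, here’s a minimal sketch of the kind of coverage map you might maintain. It’s written in Python, and the subsystem and action names are hypothetical placeholders for whatever your own service actually logs:

    # Hypothetical coverage map for a Twitter-like service: each important
    # subsystem is paired with at least one user action whose rate we
    # monitor for sudden drops.
    MONITORED_ACTIONS = {
        "write path": ["new_tweet", "reply"],
        "search": ["search_result_clickthrough"],
        "signup": ["account_created"],
        "authentication": ["successful_login"],
    }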

It’s often best to look for a sudden drop, rather than comparing to a fixed threshold. You might alert if the rate of events over the last 5 minutes is 30% lower than the average over the preceding half hour. Because the baseline adapts to your actual traffic, this avoids the false positives and false negatives that a fixed threshold would generate as usage rises and falls over the day.
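Here’s what that rule might look like in code: a minimal Python sketch, assuming a hypothetical fetch_event_count helper that stands in for however you query your metrics store or logs. The 5-minute window, 30-minute baseline, and 30% threshold come straight from the rule above:

    from datetime import datetime, timedelta

    def fetch_event_count(action, start, end):
        # Hypothetical helper: return how many times `action` occurred
        # between `start` and `end`. Wire this to your metrics store.
        raise NotImplementedError

    def dropped_off(action, now, drop_fraction=0.30):
        # True if the event rate over the last 5 minutes is more than
        # drop_fraction below the average over the preceding half hour.
        recent = timedelta(minutes=5)
        baseline = timedelta(minutes=30)
        recent_rate = fetch_event_count(action, now - recent, now) / recent.total_seconds()
        baseline_end = now - recent
        baseline_rate = fetch_event_count(
            action, baseline_end - baseline, baseline_end) / baseline.total_seconds()
        return recent_rate < (1.0 - drop_fraction) * baseline_rate

    # Run this periodically for every monitored action; the list would
    # come from a coverage map like the one sketched earlier.
    for action in ["new_tweet", "reply", "successful_login"]:
        if dropped_off(action, datetime.now()):
            print(f"ALERT: rate of {action} has dropped sharply")

One refinement worth considering: when an action’s baseline rate is very low (say, overnight), a 30% swing may be statistical noise, so you may want to require a minimum event count before the comparison is allowed to fire.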

Note that dropoff alerts are a complement to the usual “too many bad things” alerts, not a replacement. Each approach can find problems that will be missed by the other.

A quick plug

If you liked this article, you’ll probably like our hosted monitoring service. Scalyr is a comprehensive DevOps tool, combining server monitoring, log analysis, alerts, and dashboards into a single easy-to-use service. Built by experienced DevOps engineers, it’s designed with the practical, straightforward, get-the-job-done approach shown here.

It’s easy to create usage-based alerts in Scalyr. Suppose you want to alert if the rate of POSTs to the URL “/upload” over the last five minutes drops 30% below the average rate over the past half hour. The following expression will do the trick:

countPerSecond:5m(POST '/upload') < 0.7 * countPerSecond:30m(POST '/upload')

To check for drops in some other event, just change the query inside the two pairs of parentheses.
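For example, assuming search clickthroughs show up in your access log as GETs to “/search” (a hypothetical URL; substitute whatever matches your own traffic), the same pattern becomes:

countPerSecond:5m(GET '/search') < 0.7 * countPerSecond:30m(GET '/search')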

Further resources

Last year, I gave a talk on server monitoring which touched on this technique and a variety of others. You can watch it at youtube.com/watch?v=6NVapYun0Xc. Or stay tuned to this blog for more articles in this series.