Support Driven Development: Listen now so you don’t hear it later

Here at Scalyr, we’re big fans of Complaint-Driven Development, which I’ll summarize as “focus engineering effort on fixing the things users actually complain about.” We especially focus on issues that generate support requests, with such success that, as CEO, I’m still able to personally handle the majority of frontline support – even as we head toward eight-digit annual revenue.

An important consideration is that support requests cost money even if they aren’t your (product’s) fault. In this post, I’ll explore five common sources of support requests relating to the first piece of Scalyr software most users touch – our log collection agent – and how we’ve sometimes had to think outside the box to address them. None of these were bugs, exactly. (We’ve had those as well, but you don’t need to read a blog post to know it’s a good idea to fix bugs.)

Arguably, none of these issues were “our fault.” But they generated a significant fraction of our support tickets. By eliminating them, we’ve reduced support costs significantly. Even more important, we’ve increased the probability that a user’s first experience with Scalyr is positive, especially for those users (a majority!) who will bounce off of a new product at the first sign of trouble, without bothering to ask for help.

Great Support Decreases Costs

Complaint-Driven Development is just a variation on “listen to your users,” and there are a lot of ways to do that – both active (interviews, usability studies) and passive (analytics tools, support tickets). Support tickets are an important channel.

When a user takes the trouble to reach out to your support line, you know they’re having a problem that matters to them. (As a rule of thumb, for every user who complains about an issue, another 100 may be suffering silently.) Also, problems that generate support tickets drive up your costs. Here’s a startling statistic: 100% of issues that motivate users to reach out to support result in a support ticket. If the user made a configuration error… if the problem is a limitation of their OS… if they’re trying to do something your product is clearly documented as not doing… they’ve still reached out to you, and you still have to put time and energy into responding.

So “Support-Driven Development” isn’t just the part of “Complaint-Driven Development” where you mine support tickets for ideas for improving your product. It’s the systematic attempt to reduce support costs through changes to your product, documentation, and messaging. If you want to provide great support (we do!), and you want to keep support costs low (who doesn’t?), you have to constantly whittle away at the issues that generate support requests. Here’s how we tackled ours.

Java Installation Fail

Our original log collection agent was written in Java. This was handy for us (we’re mainly a Java shop), but not so handy for our users. Some people had philosophical or practical objections to using Java on their servers, often due to memory usage or security requirements. But more often, it simply didn’t work. The user didn’t notice that they needed to install Java (even though our instructions had a large notice to that effect), or they didn’t know how to install it, or the installation was broken.

Or they were just plain pissed that they had to manually install a dependency. And rightfully so; but different platforms use different packages or methods for installing Java, so declaring a package dependency wasn’t feasible.

We could have whittled away at these issues, but it would have been impossible to eliminate them entirely. Ultimately, we decided to rewrite the entire agent from scratch, in Python. This was a complete success: installation problems plummeted to approximately zero, memory usage plunged, and users became much happier.

Incorrect API Keys

We provide each user with a unique API key to authenticate their agent to our servers. When installing the agent, users must fill in this API key. Not infrequently, they forget to do so, or enter it incorrectly.

Without a valid API key, the agent can’t communicate with our servers, so it can’t tell us about the problem. If we don’t know there’s a problem, we can’t notify the user. And, because the agent runs in the background, it can’t notify the user directly, either. From the user’s point of view, Scalyr “just doesn’t work”: their logs don’t appear on our server, and there’s no explanation. That’s not good.

We tackled this problem from two directions. First, we made the agent tolerant of some common problems, such as extra spaces before or after the API key (a common copy / paste issue). Second, we now attempt a test connection when the user starts the agent. If this fails, we output an immediate message on the command line, where the user can see it.
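Here’s a minimal sketch of both ideas, not the agent’s actual code. The names normalize_api_key and startup_check are invented for illustration, and test_connection stands in for whatever call really contacts the server; it is assumed to raise an exception on failure.

```python
def normalize_api_key(raw):
    """Trim whitespace that copy / paste tends to add around a key."""
    return raw.strip()


def startup_check(raw_key, test_connection):
    """Validate the key at startup and return (ok, message).

    `test_connection` is a hypothetical callable that takes the cleaned
    key and raises an exception if the server rejects it or is unreachable.
    """
    key = normalize_api_key(raw_key)
    if not key:
        return False, "No API key is configured; please add one and try again."
    try:
        test_connection(key)
    except Exception as exc:  # report any failure while the user is watching
        return False, f"Could not verify the API key with the server: {exc}"
    return True, "Test connection succeeded."
```

The point of returning a message rather than only logging it is that a start command can print it straight to the terminal, before the agent drops into the background.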

Configuration Confusion

Our original agent, like many tools, required the user to execute a restart command after changing the configuration. This turned out to be a force multiplier for other problems. If a user is fiddling with their configuration, and they need to remember to restart the agent after each fiddle, then all sorts of things tend to go wrong. For example:

  • The user tries several configurations, but forgets to restart the agent after the one that happens to be correct. They conclude that it was actually incorrect, and proceed to flail around with increasingly incorrect configurations.
  • They enable a new integration (say, retrieving MySQL performance metrics), but forget to restart, and conclude that the integration doesn’t work.

This was an easy fix. We now monitor the configuration file, and automatically restart the agent whenever it changes.
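One simple way to watch a file is to poll its modification time and size. The sketch below assumes that approach; ConfigWatcher and its on_change callback are hypothetical names, and the real agent may detect changes differently.

```python
import os


class ConfigWatcher:
    """Polls a config file and calls `on_change` when it is modified."""

    def __init__(self, path, on_change):
        self.path = path
        self.on_change = on_change  # e.g. a function that restarts the agent
        self._last = self._signature()

    def _signature(self):
        # Modification time plus size is a cheap fingerprint of the file.
        try:
            st = os.stat(self.path)
        except OSError:
            return None  # file is missing or unreadable right now
        return (st.st_mtime_ns, st.st_size)

    def poll(self):
        """Check once; return True if a change was seen and handled."""
        current = self._signature()
        if current is None or current == self._last:
            return False
        self._last = current
        self.on_change()
        return True
```

A background loop would call poll() every few seconds. Skipping the callback while the file is missing avoids restarting in the middle of an editor’s save, when the file may briefly not exist.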

Back-and-Forth Troubleshooting

Often, to diagnose problems, we need to collect information from the user. Each time we have to ask for more information, we’ve annoyed them at best; at worst, we’ve lost them. To minimize the number of round trips, we’d like to collect as much information as possible up front; but we don’t want to give users a long list of chores. Also, some of the information we might need is inherently difficult for users to access.

To address this, we added a “status -v” command to our agent. This retrieves almost everything we’d ever want in a single go. Now, when users ask why a particular log file isn’t uploading, or the agent “just doesn’t work,” we can simply tell them: please execute scalyr-agent-2 status -v and send us the output.
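To give a flavor of what a single diagnostic snapshot might gather, here is a small sketch. It is not the real output of our status command; collect_status and every field name in it are made up for illustration.

```python
import os
import platform
import time


def collect_status(config_path, log_paths):
    """Gather basic diagnostics into one dictionary (illustrative fields only)."""
    report = {
        "current_time_utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "python_version": platform.python_version(),
        "platform": platform.platform(),
        "config_path": config_path,
        "config_exists": os.path.isfile(config_path),
        "logs": {},
    }
    for path in log_paths:
        # Missing files and permission problems are two of the most common
        # reasons a log never shows up, so record both explicitly.
        report["logs"][path] = {
            "exists": os.path.exists(path),
            "readable": os.access(path, os.R_OK),
        }
    return report
```

Everything here is something the user would otherwise have to look up by hand, one email at a time.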

Broken Clocks

Some servers just can’t tell time. We’ve seen clocks that are fast or slow by a few minutes (drift), a few hours (incorrect time zone), or more (who knows). This can cause a remarkable variety of problems – if logs are timestamped in the past or future, all sorts of analysis goes wrong. (For instance: if you have an alerting rule based on the number of errors in the last five minutes, and your log timestamps are based on a clock that’s seven minutes slow, the alert can never trigger.)
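The alerting example is easy to check with a few lines of Python. This is only a toy model of a “count events in the last five minutes” rule; count_recent and the sample times are invented for the illustration.

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)


def count_recent(timestamps, now, window=WINDOW):
    """Count events whose timestamp falls within the last `window`."""
    return sum(1 for ts in timestamps if now - window <= ts <= now)


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
# Two errors really happened one and two minutes ago...
true_times = [now - timedelta(minutes=1), now - timedelta(minutes=2)]
# ...but the host's clock is seven minutes slow, so they are stamped earlier.
skew = timedelta(minutes=7)
stamped = [t - skew for t in true_times]

print(count_recent(true_times, now))  # 2
print(count_recent(stamped, now))     # 0
```

With the slow clock, every event is already older than the window by the time it is recorded, so the count stays at zero no matter how many errors occur.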

To help identify when an issue is due to a misconfigured clock, we include the current time in the “status -v” output. We can’t always tell if the clock is off by a few minutes, but it has helped us find issues with timezone settings.

To hammer this problem into the ground, we’re going to extend the agent to report the current time to our servers. This will allow us to detect incorrect clocks and notify the user of any hosts that need correcting.
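A server-side check along these lines could be as simple as comparing the reported time with the server’s own clock. The sketch below is an assumption about how that might look, not a description of what we have built; classify_skew, its labels, and the 30-second tolerance are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def classify_skew(reported, server_now, tolerance=timedelta(seconds=30)):
    """Label the difference between a host's reported time and server time.

    Returns "ok", "timezone?" (offset is close to a whole number of hours),
    or "drift". The tolerance also has to absorb network delay, so it
    should not be set too tight.
    """
    seconds = abs((reported - server_now).total_seconds())
    limit = tolerance.total_seconds()
    if seconds <= limit:
        return "ok"
    # Distance from the nearest whole hour; a small value hints at a
    # time zone mistake rather than a drifting clock.
    remainder = seconds % 3600
    off_hour = min(remainder, 3600 - remainder)
    if seconds >= 1800 and off_hour <= limit:
        return "timezone?"
    return "drift"
```

This simple version would miss time zones with half-hour offsets, which it would report as drift. Either way the host gets flagged, which is the part that matters for notifying the user.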

If the Customer Thinks It’s Broken, It’s Broken

As you can see, none of these support requests were true bugs. But they were stumbling blocks for many users, and added up to a major source of customer dissatisfaction and a major contributor to support time. By resolving them, we’ve made our existing customers happier, and made our new customers blissfully unaware of their predecessors’ struggles.