Incident Management

Software companies have outages, be it the entire product or parts of it. Since this is the reality of the situation, here are some concepts we employ at EverTrue that have helped us along the way.

Incident management success factors

When your company is going through a production incident, these are three strategies you can use to minimize customer impact and internal confusion:

Identification

It may seem obvious, but engineering and operations are sometimes not the first to notice issues in production. There is nothing worse than learning about a production issue from a customer. It’s embarrassing.

An engineering team that can frequently answer “Yeah, we’re already on it” when a customer calls is a team that promotes confidence and earns the trust of the other teams inside the company.

No doubt, all startups will take some shortcuts when building MVPs and V1s. However, every engineer has a responsibility to facilitate introspection into the health and functionality of the services they create. You can get a sense of this in your own team by asking any random engineer a question like, “Hey, how much capacity do we have left in service X?”
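
To make that concrete, here is a minimal sketch of the kind of introspection endpoint that makes the capacity question answerable. It’s written in Python with Flask, and the metric names and numbers are placeholders rather than a description of EverTrue’s actual services; the point is that every service should expose something like it.

    # Minimal sketch of a health/introspection endpoint (hypothetical names).
    # Assumes Flask is installed; the capacity numbers would come from your own
    # connection pools, queue depths, disk usage, etc.
    from flask import Flask, jsonify

    app = Flask(__name__)

    def current_capacity():
        # Placeholder values; in a real service these would be measured,
        # e.g. from a DB connection pool or a worker queue.
        used_connections = 42
        max_connections = 100
        return {
            "db_connections_used": used_connections,
            "db_connections_max": max_connections,
            "headroom_pct": round(100 * (1 - used_connections / max_connections), 1),
        }

    @app.route("/health")
    def health():
        # Answers "how much capacity do we have left?" without anyone
        # having to SSH into a box during an incident.
        return jsonify(status="ok", capacity=current_capacity())

    if __name__ == "__main__":
        app.run(port=8080)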

Communication

They say communication is key, and owning up to a single, clear story about an incident is no exception. This benefits the users of your system, but it also helps other teams within the company.

I’m consistently amazed at the reactions of our users when we are able to respond to an issue with a clear explanation of what happened, and what we plan to do to resolve it. It’s calm and collected vs. confused or even accusatory. The TL;DR is that users will often cut you a ton of slack if they understand what’s really going on.

There’s very little benefit to hiding production issues. Owning up to broken software is part of being a leading software company. We accomplish this by documenting all outages via StatusPage.io, which is a SaaS product that runs on redundant infrastructure independent from our own.
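
As a rough illustration, opening an incident can be done programmatically, which makes status updates part of the incident runbook rather than an afterthought. The sketch below uses Python with the requests library; the endpoint, auth scheme, and payload shape are assumptions based on StatusPage’s public v1 API, so verify them against the current documentation. The page ID and API key are placeholders.

    # Hedged sketch: opening an incident on StatusPage via its REST API.
    # The endpoint and payload shape are assumptions based on StatusPage's
    # v1 API; check their docs before relying on this.
    import requests

    PAGE_ID = "your-page-id"   # placeholder
    API_KEY = "your-api-key"   # placeholder

    def open_incident(name, message, status="investigating"):
        resp = requests.post(
            f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
            headers={"Authorization": f"OAuth {API_KEY}"},
            json={"incident": {"name": name, "status": status, "body": message}},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        open_incident(
            "Degraded search performance",
            "We are investigating slow search responses. Updates to follow.",
        )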

The severity of the outage should dictate the frequency of outbound communication. If your application’s core functionality is down, hourly updates are a good choice. For less essential functionality, we’ve found two to three times a day to be acceptable.

Ensure that the wording of your updates is broad enough to enable maximum flexibility when troubleshooting and addressing the issue. If you can avoid it, refrain from discussing specific resolutions or timelines unless you are absolutely sure that you will be able to use them. On previous occasions we have given specific timelines only to discover that third party APIs didn’t behave as expected. Indicating you are actively working on the issue is usually sufficient while you collect all of the facts.

Postmortems

Postmortems have proved to be an extremely valuable tool at EverTrue. They afford your team the opportunity to carefully examine what happened and analyze a root cause. Additionally, you can (and should) analyze how you responded to the incident.

The most important requirement of a good postmortem is to have everyone check their egos and the desire to blame at the door. We make a strict point of this at EverTrue, not just because blaming makes you look like an asshole, but because it adds no value when analyzing the root cause of the current issue or improving responses to future issues. Finger-pointing is virtually guaranteed to dilute the value of the process, so take special care not to indulge in it.

Our methodology is usually to write a postmortem within a day or two of an issue’s resolution and send it out to the entire company. Postmortems are mainly intended for internal use; however, externally facing postmortems can be a great idea as well and are usually appropriate for long outages of the entire product. Apologizing to your customers is one thing; demonstrating that you took special care to deeply understand the problem and address it in the future shows a level of professionalism that will help ease the minds of wary customers.

While the process is regimented, it’s also fairly lightweight. Typically, the engineers involved in resolving (not causing) the incident meet and start with the question, “What was the root cause here?” From there, we assess the scope and user-facing impact of the incident and build a timeline to illustrate how engineering responded to the issue and what decisions were made along the way. Once sufficiently complete, all postmortems are posted to the internal company wiki.
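
A postmortem doesn’t need heavy tooling; it boils down to a handful of fields. The sketch below is one hypothetical way to represent that structure in Python, with illustrative field names rather than a prescribed EverTrue schema.

    # Illustrative sketch of the fields a postmortem captures; names are
    # hypothetical, not a prescribed schema.
    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List

    @dataclass
    class TimelineEntry:
        at: datetime
        note: str                # what was observed or decided at this point

    @dataclass
    class Postmortem:
        title: str
        root_cause: str          # the answer to "what was the root cause here?"
        scope: str               # which services / features were affected
        customer_impact: str     # what users actually experienced
        timeline: List[TimelineEntry] = field(default_factory=list)
        follow_ups: List[str] = field(default_factory=list)  # concrete actions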

A major goal of the postmortem process is to never write the exact same one twice. You may encounter the same root cause, but at least one aspect of the incident should be better, whether it’s a reduction in severity/scope, time to identification, or time to recovery. If everything is exactly the same, improvements to the postmortem process are needed.

Efforts around prevention

There are times when prevention is an appropriate place to spend your time. But there are a lot of times when it is not.

As a contrived example, do spend lots of effort around not getting hacked. Do not spend lots of effort on preventing a really remote bug from being introduced into your system.

Why? Well, you can’t possibly address or even inventory all the corner cases that can bring your system down partially or entirely. As you accumulate incidents in production, blindly following the prevention path will likely lead to additional, esoteric process that, over time, slows your development cycle down. Death by a thousand cuts certainly applies here. Being comfortable with failure is an asset that successful software companies embrace. The flip side, of course, is that regressions are hard to ignore.

Again, focus these efforts where it makes sense. In areas where your business could suffer severe negative impact, sure, put the effort in. But be aware that chasing 100% uptime is a very costly endeavor and, more often than not, will catch up with you as a loss of agility. The postmortem process, mentioned above, is a great time to collect your thoughts and truly evaluate how severe the incident really was.

Time to identification & time to recovery

Two things that production incidents usually make apparent are the amount of time it took to identify the problem itself and the time it took to fix it. Chances are, this was a lot longer than you would have thought reasonable.

We’ve certainly been on the receiving end of that one-two punch, but that’s what motivates us to make the next one-two punch hurt a lot less.

This is the area where we typically spend the greatest amount of effort. If it took four hours to recover, it should take one hour the next time, thirty minutes after that, and so on. That type of mantra makes you incrementally faster in the long run because you become more confident taking risks. Risk management is just as much about the severity of a failure as it is about the probability of it happening in the first place. Severity is reduced when you are in control of the timeline. Err on the side of over-investing in this area when you can.
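
If you want to hold yourself to that mantra, it helps to actually record the two durations for each incident. The sketch below (Python, with made-up field names and example data) shows one simple way to track time to identification and time to recovery so the trend stays visible.

    # Minimal sketch (hypothetical field names, made-up example data) for
    # tracking time to identification and time to recovery across incidents.
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import List

    @dataclass
    class Incident:
        started: datetime     # when the problem actually began
        identified: datetime  # when engineering knew about it
        recovered: datetime   # when customer impact ended

        @property
        def time_to_identification(self) -> timedelta:
            return self.identified - self.started

        @property
        def time_to_recovery(self) -> timedelta:
            return self.recovered - self.started

    def mean(deltas: List[timedelta]) -> timedelta:
        return sum(deltas, timedelta()) / len(deltas)

    incidents = [
        Incident(datetime(2015, 3, 1, 9, 0), datetime(2015, 3, 1, 9, 40), datetime(2015, 3, 1, 13, 0)),
        Incident(datetime(2015, 4, 2, 14, 0), datetime(2015, 4, 2, 14, 10), datetime(2015, 4, 2, 15, 0)),
    ]

    print("mean time to identification:", mean([i.time_to_identification for i in incidents]))
    print("mean time to recovery:", mean([i.time_to_recovery for i in incidents]))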
