Kaizen in infrastructure: Writing RCAs to improve system reliability and build customer trust

April 17th, 2018

I shall endeavor to convince you today that your company should regularly write Root Cause Analyses (RCAs), not just for internal use but also as a tool to build trust with your customers. The subject of RCAs can be a bit dry, so allow me to motivate it with an example of how a poorly approached RCA can easily become a hot-button issue that costs you customers.

As the CTO of Lottery.com, I’m responsible for overseeing roughly 50 vendor relationships. For the vast majority of these vendors, Lottery.com (LDC) is a regular customer that doesn’t use their system much differently than any other customer, and things run smoothly. They charge a credit card every month, and we get a service that fulfills some business need. Yet for a small handful of these vendors, “business as usual” is marred by persistent issues.

Allow me to tell you of an incident that occurred recently with a vendor; let’s call them WebCorp. WebCorp offers a service that we depend on for one of our services to be up. If WebCorp’s service is up, then our service is up. If WebCorp’s service goes down, we go down. In these situations, it’s in everyone’s interest for WebCorp to be reliable. So we codify that dependency in a contract called a service level agreement (SLA). SLAs are measured quantitatively, in terms of uptime percentage.

A side note on SLAs: A high-quality service measures its uptime in “nines,” as in: three nines is 99.9% uptime. That may sound like a lot, but over the course of a year, three nines of uptime translates to nearly nine hours of downtime, or about 1.5 minutes of downtime per day. With WebCorp, Lottery.com has a five-nines SLA, which translates to roughly five minutes of downtime per year.
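The arithmetic behind those numbers is worth seeing once. A quick back-of-the-envelope sketch (the helper name is mine, not any standard formula):

```python
# Downtime budget implied by an SLA's "nines" of uptime.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_budget_minutes(uptime_pct: float) -> float:
    """Minutes of allowed downtime per year at a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - uptime_pct / 100)

for label, pct in [("three nines", 99.9), ("four nines", 99.99), ("five nines", 99.999)]:
    budget = downtime_budget_minutes(pct)
    print(f"{label} ({pct}%): {budget:.1f} min/year, {budget / 365:.2f} min/day")
```

Three nines allows about 526 minutes (nearly nine hours) of downtime per year; five nines allows only about 5.3.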

OK, back to the story: Lottery.com got an alert around noon PDT that our service was down. Within about five minutes we had determined, conclusively, that the cause of the downtime was a service failure at WebCorp. I emailed WebCorp’s emergency helpline and, to their credit, within a few minutes they acknowledged the issue and indicated they were looking into it. About an hour later they had resolved the issue and our service was back online. Total downtime was about 64 minutes.

When a vendor has an outage, it is my standard practice, once the issue is resolved, to write in and ask what went wrong and whether mitigation steps are in place to prevent future outages. In this case, WebCorp’s response was:

“It would appear that the cache flushed right before the system tried to restart. That flush wiped the contents of a Varnish file, which caused Varnish to restart with an error. That probably doesn’t mean much to someone on your end of things. Essentially, it was a really unusual conflict of a couple of automatic jobs happening on the server, so we’re fairly sure it’s not something you’ll be able to reproduce from your end of things, intentionally or unintentionally. Hope that clarifies a bit!”

While I appreciate the effort to lift the curtain a little bit on some of the technical details, this response doesn’t actually tell me how WebCorp is going to prevent the issue from happening again. So I asked them what they planned to do to prevent such outages in the future.

WebCorp’s response:

“We try our very best to prevent these things from happening. In order to be better prepared for a situation like this in the future, we’ve added extra monitoring […]. Now, our emergency support team will be immediately alerted whenever any downtime happens […].

Since the issue [your service] encountered is one that we have no record of having seen (either before or since), it might be premature to alter our Varnish caching processes at this time. If the issue proves to be reproduce-able and / or widespread, then we may indeed make an adjustment to our infrastructure to correct for it. For now, though, it appears to be an isolated incident.

While you do have a 99.999% SLA with us, it is actually for a different […] service!  The SLA agreement is tied to [Service2] and not [Service1]. However, you may be pleased to hear that the uptime of [Service1] has been at 99.87% over the last month!

Again, I apologize for the downtime yesterday. I hope this answers the questions you had for me and the rest of my team. If not, please feel free to reach out again so we can continue the conversation. I’m always happy to help!”

Again to WebCorp’s credit, this is an undeniably polite and professionally written response. The substance, however, did little to reassure me on a technical level.

What I read in the response, substantively, is:

We’re doing our best and will add more “monitoring”. In fact, our support team will now actually find out when downtime occurs. But this specific issue has never happened before, so it’s not in our interest to change business practices. Oh, and as a reminder, the 99.999% SLA we have for your service doesn’t technically apply here, and this service has been at 99.87%. Isn’t that great?

By signing and paying for a five-nines SLA, my expectation as a customer is to get as close to 99.999% uptime as possible across WebCorp’s services. The fact that WebCorp’s response seems to present 99.87% as a good uptime percentage dramatically reduces my trust in WebCorp’s future reliability. A far more reassuring response would indicate that they take all downtime seriously, that their team is investigating ways to improve the robustness of the system so that no customer experiences these outages again, and that they would reply in a few days, once they understood exactly what went wrong in their procedures and how they would improve.
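The gap between the contractual target and the reported figure is easy to quantify. A minimal sketch, assuming a 30-day month (the helper name is mine, not from WebCorp’s report):

```python
# Monthly downtime implied by two uptime percentages: the SLA target vs. the reported figure.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes in a 30-day month

def monthly_downtime_minutes(uptime_pct: float) -> float:
    """Minutes of downtime per 30-day month at a given uptime percentage."""
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100)

sla_budget = monthly_downtime_minutes(99.999)  # the five-nines target
reported = monthly_downtime_minutes(99.87)     # the figure WebCorp cited
print(f"Five-nines budget: {sla_budget:.2f} min/month")
print(f"99.87% actual:     {reported:.1f} min/month")
```

At 99.87%, the service was down roughly 56 minutes that month, over a hundred times the five-nines budget of about 0.4 minutes, and on the same order as the single 64-minute outage described above.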

In summary:

1) It is important that the vendor and customer have aligned expectations for service reliability.
2) If the vendor offers a contractual SLA, the customer’s expectation is that the vendor will make good faith best efforts to meet that SLA, and take any breaches seriously.

RCAs: The Right Way

By not performing and being transparent about a detailed RCA, it’s easy for a customer to lose faith in a company’s efforts to provide a highly-reliable service. The goal of the RCA is therefore twofold:

1) Document the failure and potential mitigations to improve service quality and reliability.
2) Provide a mechanism for being transparent about failures to build confidence, and trust, with customers.

A good RCA has a template roughly as follows:

Incident Start Time/Date:
Incident Received Time/Date:
Complete Incident Timeline:
Root cause(s):
Did we engage the right people at the right time?
Could we have avoided this?
Could we have resolved this incident faster?
Can we alert on this faster?
Identified issues for future prevention:

In this template are prompts for the pieces of information one needs to understand what happened, what was learned, and why it won’t happen again.  There are many great examples of RCAs out there:

https://blog.github.com/2012-12-26-downtime-last-saturday/
https://medium.com/netflix-techblog/lessons-netflix-learned-from-the-aws-outage-deefe5fd0c04
http://www.information-age.com/3-lessons-learned-amazons-4-hour-outage-123464916/
https://slackhq.com/this-was-not-normal-really-230c2fd23bdc
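If your team tracks incidents programmatically, the template above can also be expressed as a structured record. A hypothetical sketch (the class and field names are mine; they simply mirror the prompts in the template):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RCA:
    """A minimal Root Cause Analysis record mirroring the template above."""
    incident_start: str                 # time/date the incident began
    incident_received: str              # time/date the alert was received
    timeline: List[str]                 # complete incident timeline, one entry per event
    root_causes: List[str]              # identified root cause(s)
    engaged_right_people: bool          # did we engage the right people at the right time?
    avoidable: bool                     # could we have avoided this?
    faster_resolution: str              # could we have resolved this incident faster?
    faster_alerting: str                # can we alert on this faster?
    prevention_items: List[str] = field(default_factory=list)  # issues identified for future prevention
```

Keeping RCAs in a structured form like this makes it easy to review prevention items across incidents, rather than letting each write-up disappear into a document graveyard.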

Kaizen, the Japanese term for “continuous improvement,” is an ethos often cited in industry. Building technology is hard; humans are imperfect, and therefore technology often is as well. That’s expected. The only way we get past our imperfect ways is to continuously work to get better: to own up to our mistakes, learn from them, and ensure we (and our technology) don’t make the same mistake twice.