—  BEN BLOCK  —



Case Study
 

Root Cause Analysis and The 5 Whys

The Problem

Stuff happens with web-based apps. It's inevitable. Even major sites like Facebook or Twitter have outages or release embarrassing bugs. Someone I know likes to say that humans have been building bridges for millennia, while our field (Internet technology) has been around for less than 50 years, so it's not surprising our products fall over more frequently than bridges do.

So we're not going to catch every single issue, and we may not be able to scale in every scenario, but the key is not to let the same mistake happen twice. I tell all my teams: I will understand an outage or a live bug that bites us once, but if the bug reappears post-fix, or the underlying issue causes a second outage, that, in the immortal words of Vito Corleone, "is something that I do not forgive."

The Solution

So how do we assure ourselves that the same issue never occurs twice? Easy: identify the root cause and fix it. It is surprising to me how difficult this simple solution is for so many.

Software bugs are generally straightforward: the root cause is almost always that some change introduced in a recent deploy broke a feature that wasn't covered by a regression test. If there had been a regression test, the bug would have been caught. Simple enough: add a test for that feature to your pre-deploy regression suite, and as long as you run the suite diligently before each deploy, you should not face the issue again.
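
To make this concrete, here is a minimal sketch of what such a regression test might look like, written in Python with pytest and requests. The host, endpoint, and payload are hypothetical placeholders rather than details from any project described here; the point is simply that the exact behavior that broke gets pinned down in a test that runs before every deploy.

  # Hypothetical regression test for a feature that a recent deploy broke.
  # Run with pytest as part of the pre-deploy regression suite.
  import requests

  BASE_URL = "https://staging.example.com"  # placeholder staging host

  def test_checkout_returns_order_id():
      # The broken behavior: checkout must accept a valid cart and
      # respond with an order id instead of a 500 error.
      resp = requests.post(f"{BASE_URL}/checkout", json={"cart_id": "test-cart-1"})
      assert resp.status_code == 200
      assert "order_id" in resp.json()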

Getting to the root cause of infrastructure issues is more complicated; enter the 5-whys. I've practiced the 5-whys for years, frankly even before I learned that it is a known and articulated technique for root cause analysis. The approach is straightforward enough: keep asking "why?" until you get to the root cause. Recently, a site I launched went down and I applied the 5-whys until I got to the bottom of it. The exchange below illustrates the approach.

  1. The website is down, why?
  2. Because the PHP application was throwing an error. Why was it throwing an error?
  3. Because the DB was rejecting all queries. Why was the DB rejecting all queries?
  4. Because the DB servers had fallen out of sync and our cluster technology requires them to be in sync*. Why had the DBs fallen out of sync?
  5. Because AWS had a network outage that broke connectivity between the boxes and caused replication to fail, thus throwing them out of sync.

* We were using Galera Cluster, which runs master-master, and the two servers must be in sync.
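
For readers unfamiliar with Galera, each node exposes wsrep status variables that show whether it is still part of the primary, synced component of the cluster. Below is a small sketch of how that state could be checked from Python with the PyMySQL client; the hostnames and credentials are placeholders, not our actual setup.

  # Sketch: check whether each Galera node is still Primary and Synced.
  # Hostnames and credentials are placeholders.
  import pymysql

  def galera_status(host):
      conn = pymysql.connect(host=host, user="monitor", password="secret", database="mysql")
      try:
          with conn.cursor() as cur:
              cur.execute(
                  "SHOW GLOBAL STATUS WHERE Variable_name IN "
                  "('wsrep_cluster_status', 'wsrep_local_state_comment', 'wsrep_ready')"
              )
              return dict(cur.fetchall())
      finally:
          conn.close()

  for node in ("db1.internal", "db2.internal"):  # placeholder hostnames
      # A healthy node reports Primary / Synced / ON; anything else means it
      # has dropped out of the primary component and will reject queries.
      print(node, galera_status(node))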

So the root cause was a network hiccup at AWS. In this case, we knew we had no control over the AWS network, and it was unrealistic to believe there would never be another hiccup in the future. Therefore, we knew we needed to make our DB cluster more resilient to network outages. We updated Galera (a hint we found on Google) and ran a test simulating a network outage between the two machines, writing to one and not to the other, thus causing them to fall out of sync. The cluster took queries while the DBs were out of sync and the site stayed up, and the writes were replicated across once network connectivity was restored.
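
The sketch below shows the shape of that resilience test, under a few assumptions: the nodes are reachable over SSH, iptables is used to cut Galera's replication ports, and a throwaway table records a marker row. Hostnames, credentials, and the outage_test table are illustrative placeholders, not our actual scripts.

  # Sketch of the resilience test: partition the two Galera nodes, write to
  # one during the partition, then heal it and confirm the write replicated.
  import subprocess
  import time
  import pymysql

  NODE_A, NODE_B = "db1.internal", "db2.internal"   # placeholder hostnames
  GALERA_PORTS = "4567,4568,4444"                   # group comms, IST, SST

  def ssh(host, command):
      subprocess.run(["ssh", host, command], check=True)

  def run_query(host, sql, params=()):
      conn = pymysql.connect(host=host, user="app", password="secret", database="appdb")
      try:
          with conn.cursor() as cur:
              cur.execute(sql, params)
              rows = cur.fetchall()
          conn.commit()
          return rows
      finally:
          conn.close()

  block = f"iptables -A OUTPUT -d {NODE_B} -p tcp -m multiport --dports {GALERA_PORTS} -j DROP"
  unblock = block.replace("-A OUTPUT", "-D OUTPUT")

  # 1. Break replication traffic from node A to node B.
  ssh(NODE_A, block)

  # 2. Write to node A while the nodes are partitioned; the site should stay up.
  run_query(NODE_A, "INSERT INTO outage_test (marker) VALUES (%s)", ("during-partition",))

  # 3. Heal the partition and give replication time to catch up.
  ssh(NODE_A, unblock)
  time.sleep(30)

  # 4. The write made during the partition should now be visible on node B.
  rows = run_query(NODE_B, "SELECT COUNT(*) FROM outage_test WHERE marker = %s", ("during-partition",))
  assert rows[0][0] == 1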

To some this may seem like a simple scenario, but it perfectly illustrates how the 5-whys can help you keep digging until you get to the bottom of any issue.

The Result

Identifying the root cause is half the battle. Applying the fix and testing it is of course necessary to get the outcome we all want. However, without knowing the root cause, you will never put yourself in a position to fix it and you will feel the pain again and again. In the example above, we have yet to experience any issues with our DB due to network hiccups at AWS.

Get in Touch

info@benblock.com

New York, New York

linkedin.com/in/bblock