
Testing or Monitoring? MTBF or MTTR? Make your choice!

What is more important: testing or monitoring? Should you optimize for mean time between failures (MTBF) or mean time to repair (MTTR)?

Image attribution: Baby, You Knock Me Out, Family Guy

Let’s imagine a fictitious world where evil and clueless managers are the norm. In such a world, so far removed from our reality, managers devour news aggregation sites, since they only have time to read news headlines, not content.

The title of a recent trending article, "Testing and monitoring: two sides of the same coin", made it clear (to your manager, at least) that developers were again wasting BigCorp's money on redundant technical activities. So in your next project kickoff meeting, he picks up a card and writes the following story:

As a Manager
I want to have monitoring or automated testing, but not both
So that we save a ton of money

As the echo of the manager’s maniacal laugh dies away, you comfort yourself with the thought that, at least, he filled in the “so that” part.

Your team is torn by the choice.

Half of your teammates vote for a fully automated test suite, the other half for having good monitoring in production.

You have the decisive vote. What will be your choice?

Give yourself a minute to decide.

Test Suite

If you choose the test suite, you are aiming to have no bugs in production.

You will have your unit and acceptance tests to know that you are building the right thing, your integration and system tests to check that components can talk to each other, your performance and soak tests to know that you have enough capacity, and your resilience tests to make sure your system can cope with failure.

You will be full of confidence during development.

But then you deploy to production and... what? Did you forget some test case? Was the test data representative? What is the latency? Did you misunderstand the business? Social security numbers are not unique?? Did you test the login form in IE6? Did Amazon S3 just disappear???? Our site is on Hacker News…!

Too many permutations to test, too many possibilities to take into account, too many assumptions made, too many unknown unknowns.

A dark room full of terrors.

Monitoring

If you choose monitoring, you are trying to find any issue in production as soon as possible. Your alerts and dashboards will let you know when something goes wrong.

But without the testing safety net, you know that bugs and issues are going to happen more often. Deploying to production is going to be scary, but at least you will know the state of your system.

With testing, we hope we haven't deployed any bugs to production; with monitoring, we know whether we did.

Also, not having tests will encourage you to reduce the risk by following some of the continuous delivery practices: small changes, feature toggles and canary releases.

Basically, you will let your clients do the testing in a safe and controlled way. Your “tests” will run with production data, on production servers, and with production traffic. No assumptions to make, no permutations left uncovered.

Lastly, once you have this capability, it will allow you to do something that no automated test suite will ever give you: testing business ideas through A/B testing.
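For illustration only, the same mechanism can back feature toggles, canary releases and A/B tests; a hypothetical percentage-based toggle (the names are mine, not from any particular library) can be as small as:

    import hashlib

    def is_enabled(feature, user_id, rollout_percentage):
        # Deterministically bucket each user by hashing the feature name
        # and the user id, so a given user always gets the same variant
        # while the feature ramps up from 1% to 100% of the traffic.
        digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < rollout_percentage

    if is_enabled("new-checkout", user_id="42", rollout_percentage=1):
        pass  # canary / variant A
    else:
        pass  # current behaviour / variant B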

So monitoring will give you the ability to peek into a room, and if it is full of terrors, close the door.

Back to the real world

Thankfully, we do not live in such a world of gratuitous requirements, so we do not need to make a hard choice.

In the real world, we will be doing both testing and monitoring, but as we don’t have an infinite amount of time, we will need to prioritize our work.

If we prioritize testing, we are trying to increase our MTBF (mean time between failures) while prioritizing monitoring reduces our MTTR (mean time to repair).

Both Chad Fowler and John Allspaw think that for most business and failure types, optimizing for MTTR is better than optimizing for MTBF, but where is the balance?

If we look at the availability formula:

Availability = Uptime / (Uptime + Downtime)

Which, averaging over many failure cycles (uptime becomes the mean time between failures, downtime the mean time to repair), can be translated to:

Availability = MTBF / (MTBF + MTTR)

Which allows us to calculate the MTBF or MTTR required for a given availability target:

MTBF = Availability * MTTR / (1 - Availability)

MTTR = (MTBF / Availability) - MTBF
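As a minimal sketch, here is how those two formulas translate into Python (the function names are mine, just for illustration):

    def mtbf_for(availability, mttr):
        # Minimum time between failures needed to hit an availability
        # target (a fraction, e.g. 0.999), given how long each failure
        # takes to repair. Result is in the same time unit as mttr.
        return availability * mttr / (1 - availability)

    def mttr_for(availability, mtbf):
        # Maximum time to repair allowed by an availability target,
        # given how often failures happen. Same time unit as mtbf.
        return mtbf / availability - mtbf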

If you know how long it usually takes you to detect and fix an issue, that is, you know your MTTR, you can see in the following table how often you can afford a production issue for your desired SLA (numbers are rounded):

Availability \ MTTR | 1h     | 4h     | 24h
--------------------|--------|--------|------
90%                 | 9h     | 1d 12h | 9d
95%                 | 19h    | 3d 4h  | 19d
99%                 | 4d 3h  | 2w     | 3.5mo
99.9%               | 6w     | 6mo    | 2.8y
99.99%              | 1y 10d | 4.5y   | 28y
99.999%             | 11.5y  | 45y    | 280y

So if your MTTR is one hour, the system can break every nine hours and still achieve 90% availability, but if you are aiming for 99.99% availability, you cannot have more than one issue a year!
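For example, with the sketch above, using hours as the unit:

    mtbf_for(0.90, 1)     # => 9.0 hours between failures
    mtbf_for(0.9999, 1)   # => 9999.0 hours: roughly one failure a year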

Similarly, if you know how often you usually deploy bugs into production or how often things go bad, you can calculate what your MTTR should be for the target availability:

Availability \ MTBF | 7d   | 1mo    | 3mo    | 6mo
--------------------|------|--------|--------|-------
90%                 | 19h  | 3d     | 1w 2d  | 2w 6d
95%                 | 9h   | 1d 12h | 4d 10h | 1w 2d
99%                 | 1.5h | 7h     | 20h    | 1d 20h
99.9%               | 10m  | 40m    | 2h     | 4.5h
99.99%              | 1m   | 4m     | 12m    | 26m
99.999%             | 7s   | 24s    | 1m     | 2m

So if you have one issue per month, you need to be able to detect and fix it in less than 40 minutes to achieve 99.9% availability.
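Again with the sketch above, one failure every 30 days, in minutes:

    mttr_for(0.999, 30 * 24 * 60)   # => ~43 minutes to detect and fix (rounded to 40m in the table)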

From both tables above, you can see that to achieve high availability, a very low MTTR is mandatory.

Also, such a low MTTR requires removing human intervention: there is no time to code and deploy a bugfix, or to jump into boxes to restart things. Automatic detection is key, hence monitoring is key.

Note that canary releases will give you more time to detect an issue: if you are sending just 1% of the traffic to the new version, your time to detect increases by roughly 100x. Of course, this assumes that the 1% of traffic is representative of the overall traffic and able to trigger the issue.
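To put a number on it, a back-of-the-envelope sketch (assuming the availability impact scales linearly with the fraction of traffic affected, and reusing the 40-minute budget from above):

    def canary_repair_budget(mttr_budget, canary_fraction):
        # If only canary_fraction of the traffic hits the new version,
        # each minute of breakage costs only canary_fraction of a minute
        # of full downtime, so the detect-and-fix budget stretches.
        return mttr_budget / canary_fraction

    canary_repair_budget(40, 0.01)   # => 4000 minutes: 40m becomes ~2.8 days at 1% traffic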

Even if you never deploy any bugs, your production environment will change: your servers will need to be updated, your network cables will be chewed, humans will make mistakes, and your users’ behaviour will change over time.

Murphy’s law will make sure that your systems will fail. And because failure is inevitable, you need to learn to cope with it.

Conclusion

Hopefully, at this point you agree that monitoring is as important as an automated test suite.

But if you look back at your recent projects, how much time and energy did you spend thinking about, building, and coding your monitoring, versus the time you spent on your tests?

Is it proportional to its importance?

Monitoring should not be an afterthought that happens just before we deploy things to production.

Monitoring should be part of your definition of done, a core part of your development process.

And of course, don’t forget to have a very good test harness for your monitoring!

If you found this post interesting, there is an alternative and better way of calculating availability. See Error Budgets: Google’s solution for innovating at a sustainable pace.



Tagged in: Architecture, good practices, testing