Error budget: Google's solution for innovating at a sustainable pace

New features always trump technical work. Can we objectively measure and decide when they should not?

Wouldn’t it be nice to spend the next sprint or two paying off some of that technical debt that your project has accrued? Wouldn’t it be nice to improve the logging to ease support? Or add some additional integration or end-to-end tests? Or maybe take those first steps to enable blue/green deployments?

But, when was the last time that your product owner willingly added any of those technical stories to the next sprint?

There is always a tension between building or improving business features and those technical items - some of them technical debt, some technical improvements - that don’t bring a very obvious business benefit to the product owner.

Features always seem to trump any technical work.

There is a minority of technical work that does have some obvious business benefit. For example, if we improve performance by 20%, we could reduce the AWS bill by 20%.

But most of the time, the benefit of the technical work is as obvious to the product owner as that upgrade from a 286 to a blazing fast 386 was to your parents. I never managed to make that sale.

So for the vast majority of technical work, what else can we do instead of relying on our poor negotiation skills?

Google’s Site Reliability Engineering (SRE) book comes with a possible answer: error budgets.

What are error budgets?

Each service should have a service level objective (SLO), which is like a soft SLA: missing it carries no $$$ penalty and no lawyers get involved.

The SLO of a service depends on the impact of the service becoming unavailable. The SLO should be defined by the business, as they should have an idea of what downtime costs in reputation or money.

The usual way of calculating the availability of a service is by looking at its uptime vs the unplanned downtime:

Availability = Uptime / (Uptime + Downtime)

But in the SRE book, they propose a different way of calculating the availability:

Availability = successful requests / (successful requests + failed requests)
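To make the request-based formula concrete, here is a minimal Python sketch. The function name and the choice to raise an error when there is no traffic are my own assumptions, not something the SRE book prescribes.

```python
def availability(successful: int, failed: int) -> float:
    """Request-based availability: successful / (successful + failed)."""
    total = successful + failed
    if total == 0:
        # Assumption: with no traffic there is nothing to measure,
        # so we refuse to return a number rather than guess one.
        raise ValueError("no requests observed")
    return successful / total


if __name__ == "__main__":
    # 999,000 good requests and 1,000 bad ones in the measurement window.
    print(availability(999_000, 1_000))
```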

A failed request can be:

A 500 response, due to some bug.
No response, due to the service being down.
A slow response: if the client gives up before the response is available, it is as good as no response.
Incorrect data, due to some bug.

It is interesting to note that outages in common infrastructure count towards those failures, the idea being that the availability of that common infrastructure is everybody’s responsibility.
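The failure categories above could be checked per request with something like the following sketch. The `RequestOutcome` record, its field names and the 2-second client timeout are all hypothetical; use whatever your clients and monitoring actually report.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: clients give up after 2 seconds. Pick your real client timeout.
CLIENT_TIMEOUT_S = 2.0


@dataclass
class RequestOutcome:
    """Hypothetical record of one request, as seen from the client side."""
    status: Optional[int]        # HTTP status code, None if no response arrived
    latency_s: Optional[float]   # seconds until the response, None if none arrived
    data_correct: bool = True    # result of whatever correctness check you have


def is_failed(outcome: RequestOutcome, timeout_s: float = CLIENT_TIMEOUT_S) -> bool:
    """Return True if the request counts against the error budget."""
    if outcome.status is None or outcome.latency_s is None:
        return True   # no response: the service (or shared infrastructure) was down
    if outcome.status >= 500:
        return True   # server error, e.g. a 500 caused by a bug
    if outcome.latency_s > timeout_s:
        return True   # too slow: the client already gave up
    if not outcome.data_correct:
        return True   # a response arrived, but with incorrect data
    return False
```

Note that a 200 response that arrives after the client timeout still counts as a failure, which is exactly the point of measuring from the client's perspective.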

With this definition of availability, we can define the error budget as:

Error budget = (1 - availability) = failed requests / (successful requests + failed requests)

So if a service SLO is 99.9%, it has a 0.1% error budget. If the service serves one million requests per quarter, the error budget tells us that it can fail up to one thousand times.
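The arithmetic is easy to get wrong by a factor of ten, so here is a small sketch that does it with exact fractions instead of floats. The function names are mine, and passing the SLO as a decimal string such as "0.999" is just a convenience I assumed to avoid rounding surprises.

```python
import math
from fractions import Fraction


def error_budget(slo: str) -> Fraction:
    """Error budget as a fraction of requests, e.g. "0.999" -> 1/1000."""
    target = Fraction(slo)
    if not 0 <= target <= 1:
        raise ValueError("SLO must be between 0 and 1")
    return 1 - target


def allowed_failures(slo: str, total_requests: int) -> int:
    """How many requests may fail in the period without breaching the SLO."""
    return math.floor(error_budget(slo) * total_requests)


if __name__ == "__main__":
    # 99.9% SLO, one million requests in the quarter.
    print(float(error_budget("0.999")) * 100, "% of requests may fail")
    print(allowed_failures("0.999", 1_000_000), "failed requests allowed")
```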

Google’s use of error budgets

The SRE book explains the tension between product teams, which are evaluated on how fast they deliver new features, and the SRE teams, which are evaluated on the reliability of the systems.

Now that we know what the error budget of a service is, the premise is that once the product team has used up the error budget, they can no longer make any new release without first spending time improving the reliability of the service.

Also, if the product team is close to using up their error budget, they will naturally be more cautious about what they release and how often they do, as they will want to reduce the risk of breaching the budget.

So error budgets act as a measurable, objective and self-policing mechanism to balance innovation versus reliability.
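As an illustration of that self-policing idea, the sketch below turns request counts into a release decision. It is my own simplification, not Google's actual policy: the three outcomes, the 25% caution threshold and the choice to measure the budget against the requests seen so far in the window are all assumptions.

```python
from fractions import Fraction

# Assumption: get cautious when a quarter or less of the budget is left.
CAUTION_THRESHOLD = Fraction(1, 4)


def budget_remaining(slo: str, successful: int, failed: int) -> Fraction:
    """Fraction of the error budget still unspent (0 = exhausted, 1 = untouched)."""
    total = successful + failed
    allowed = (1 - Fraction(slo)) * total   # failures the SLO tolerates so far
    if allowed == 0:
        # No traffic yet, or a 100% SLO: any failure exhausts the budget.
        return Fraction(1) if failed == 0 else Fraction(0)
    return max(Fraction(0), 1 - Fraction(failed) / allowed)


def release_decision(slo: str, successful: int, failed: int) -> str:
    """Map the remaining budget to 'release', 'cautious' or 'freeze'."""
    remaining = budget_remaining(slo, successful, failed)
    if remaining == 0:
        return "freeze"      # budget spent: reliability work only
    if remaining <= CAUTION_THRESHOLD:
        return "cautious"    # budget nearly spent: fewer, safer releases
    return "release"         # plenty of budget left: ship features
```

With a 99.9% SLO and one million requests so far, 400 failures still leaves 60% of the budget, 900 failures leaves 10%, and 1,000 or more leaves nothing.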

Negotiating technical work

I have never worked in a company with SRE teams, so in my experience, the tension described in the SRE book is really between the development team, who wants the time to do things properly and is accountable for the reliability of their services, and the product owner, who usually just wants to deliver more and more features.

So why not calculate your own error budget and use it to decide when the team needs to slow down and pay more attention to that technical work that keeps the project in good shape?

When the team has used up the budget, or is close to using it up, the team must spend its time improving the reliability of the service: improving the test suite and the monitoring capabilities of the system, implementing some resiliency patterns, automating more of the deployment pipeline or increasing the performance of the system.

But to have a reliable system we also have to have a simple and clean codebase, and a simple and clean architecture, because as Dijkstra said:

“Simplicity is prerequisite for reliability.” Edsger W. Dijkstra, “How do we tell truths that might hurt?”

So big refactorings and re-architectures will naturally fall into this reliability bucket.

But we can imagine some product owners pushing for more features even when the error budget has been depleted.

And that should be fine. The product owner owns the SLO, so she may choose to lower it, choosing features over availability, but at least the decision and the trade-off are clear.

So in a way, error budgets also help to determine the actual required availability of a service, making the trade-off more obvious, the decision clear and the product owner accountable.

So do your math, look at your logs and find if you have a compelling argument to schedule some technical improvements.
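If your only data source is a web server access log, a rough first count could look like the sketch below. It assumes log lines in the common/combined access log layout, where the status code follows the quoted request line, and it only sees 5xx responses. Requests that got no response, were too slow or returned wrong data will not show up this way, so treat the result as a lower bound on failures.

```python
import re
from typing import Iterable, Tuple

# Assumption: common/combined log format, e.g.
#   1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.1" 200 2326
# The status code is the three digits right after the quoted request line.
STATUS_RE = re.compile(r'"\s(\d{3})\s')


def count_requests(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (successful, failed), counting only 5xx responses as failed."""
    successful = failed = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue   # not a request line we understand: skip it
        if int(match.group(1)) >= 500:
            failed += 1
        else:
            successful += 1
    return successful, failed


if __name__ == "__main__":
    sample = [
        '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET / HTTP/1.1" 200 2326',
        '1.2.3.4 - - [10/Oct/2000:13:55:37 -0700] "GET /cart HTTP/1.1" 500 120',
    ]
    print(count_requests(sample))
```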

Error budgets are just one of the many useful insights in the SRE book. It even details a quasi-outage very similar in cause and solution to the Amazon S3 outage of February 2017.

If interested, the book is available for free at https://landing.google.com/sre/book.html