Testing Performance

A test that answers “How will this code perform?” is one of the most difficult tests to write.

Throughout my career, some of the worst outages have been due to performance issues. Something that used to take hundreds of milliseconds now takes several seconds, causing a cascade of failures.

As I’ve grown as a software engineer and as a human that would like to sleep, I’ve become more interested in the predictive power of writing tests. Can I do something today that will help me tomorrow? I’ve gone as far as recording a 12-part video series on the topic.

But I have yet to find a test setup that precisely predicts production performance. This blog post dives into why, and what we can do about it.

It’s the System, Man

Any web application is a capital-S System. You’ve got at least a web server and a database. From there, things get more and more complicated as your customer base and team grow. Your app also requires human operation and oversight if it’s worth anything, which pulls humans into the System, too.

Because even the smallest applications can be seen as complex systems, they play by a set of sometimes logic-defying rules. “Why did this code change do that?” Surprise reigns in a complex system.

And because our applications are complex systems, replicating them gets more expensive the closer the replica comes to the real thing. Every piece (the code, the SRE, the customer service rep, the infrastructure, and the customers) is part of the production system.

Performance is one of the most difficult things to predict, because it usually hinges on the scarcity of resources in the production system. How big is your database? How many background processes do you have? How many idle web servers do you have? Replicating and simulating this scarcity is difficult.

Beyond that, your production system is usually the most expensive piece of infrastructure. Duplicating your production setup might be too high a cost to pay. There might also be privacy concerns with duplicating your infrastructure. Will this second system be as secure as your production system?

Furthermore, small deviations in performance can have catastrophic effects. If your system is much closer to a redline than you intended, a 5% degradation in performance could bring the whole thing down. You likely won’t know about this bottleneck until you observe it.
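To make the redline effect concrete, here is a minimal sketch that borrows the textbook M/M/1 queueing formula. The formula, the single-server assumption, and every number below are my own illustration rather than a model of any real system.

```python
def mm1_latency_ms(service_ms: float, arrivals_per_s: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - utilization)."""
    utilization = arrivals_per_s * service_ms / 1000.0
    if utilization >= 1.0:
        return float("inf")  # past the redline: the queue grows without bound
    return service_ms / (1.0 - utilization)


# Hypothetical server: 100 ms of work per request at 9.5 requests per second,
# which puts it at 95% utilization.
baseline = mm1_latency_ms(service_ms=100.0, arrivals_per_s=9.5)

# The same traffic after a 5% slowdown in service time (100 ms -> 105 ms).
degraded = mm1_latency_ms(service_ms=105.0, arrivals_per_s=9.5)

print(f"baseline: {baseline:.0f} ms, degraded: {degraded:.0f} ms")
```

Under these assumptions, the 5% slowdown moves mean latency from roughly 2 seconds to roughly 42 seconds. Real systems rarely match the model, but the shape of the curve is the point: near saturation, small changes are amplified.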

What Do We Do?

But the cause is not totally lost. We can still engage in a few activities to predict the performance changes of our code changes. We need to do the following:

Know that we can only predict—not guarantee—performance changes.
Become comfortable with qualitative assessments in our tests.
Test in production.

We Can Only Predict, Not Guarantee

The exercise of writing a test before shipping code feels silly. To some, it can feel like duplicated work.

And yet, it’s something most mature shops do. Why? For the same reason that accountants engage in double-entry bookkeeping, we write tests as software engineers. They provide balance to our logic. They are our co-pilot. They provide redundancy to human error.

But with this comes the zen of testing. No system can be fully tested. 100% test coverage is a lie. Tests might sometimes be red but production is just fine. How many times have we seen tests green while there is a bug in production?

Thus, we need to keep in mind that we can only predict performance changes but not guarantee performance changes.

Qualitative Measurement

The best approach that I have found to preventing performance regressions is to employ qualitative assessments of the code.

With performance, we tend to reach for the quantitative: “Will this change make my app 10% faster or slower?” These measurements will be highly dependent upon how our test is written and where it is running. Any sort of traditional benchmarking might forget that:

Our code will run differently on different architectures.
Our code might have noisy neighbors in virtualized environments.
Our code might run uninterrupted in test but could be interrupted in production.

This will make our quantitative performance tests flaky or unpredictive; they may not accurately tell us what is about to happen. Furthermore, they require arbitrary decisions. What is considered a regression? 1% slower? 5%? 10%?

Instead of pursuing these quantitative assessments, I’ve become comfortable with more qualitative readouts. In a web application, here are the questions I encode into my tests:

How many database queries does this web request issue?
In an end-to-end test, how many web requests does this flow issue?
In a background job, how many network calls do I make?
In a web request, do I communicate with any third-party systems?
Are there any N+1 queries in my web request or background job?
When creating a response, is the response paginated?
When processing large batches, how and when do we fan in or out?
When issuing a database query, does our query use an index?
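Two of these questions (how many queries a request issues, and whether a query uses an index) can be encoded with nothing more than Python’s built-in sqlite3 module. This is a minimal sketch, not a recommendation for any particular stack: the schema, the `load_user_page` handler, and the budget of two queries are all hypothetical, and your own database or ORM will have its own hooks for the same idea.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def count_queries(conn):
    """Collect every SQL statement the connection executes inside the block."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        yield statements
    finally:
        conn.set_trace_callback(None)


def uses_index(conn, sql, params=()):
    """Return True if SQLite's query plan for this query mentions an index."""
    plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each plan row is a human-readable detail string.
    return any("USING INDEX" in row[-1] or "USING COVERING INDEX" in row[-1]
               for row in plan)


def load_user_page(conn, email):
    """Hypothetical request handler: look up one user, then their orders."""
    user = conn.execute(
        "SELECT id, name FROM users WHERE email = ?", (email,)).fetchone()
    orders = conn.execute(
        "SELECT id, total FROM orders WHERE user_id = ?", (user[0],)).fetchall()
    return user, orders


# Hypothetical schema and data, set up before any counting starts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    CREATE INDEX idx_users_email ON users (email);
    INSERT INTO users VALUES (1, 'Ada', 'ada@example.com');
    INSERT INTO orders VALUES (1, 1, 9.99), (2, 1, 20.00);
""")

with count_queries(conn) as statements:
    load_user_page(conn, "ada@example.com")

# Qualitative checks: a query budget and an index check, with no timings.
assert len(statements) <= 2, f"query budget exceeded: {statements}"
assert uses_index(conn, "SELECT id, name FROM users WHERE email = ?", ("x",))
```

Neither assertion depends on how fast the machine running the test is, which is what keeps a check like this from becoming flaky.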

Some of these questions have numeric components to their answer, but they trend toward boolean answers. They are also much more boring.

Yet these questions tend to be the most predictive when it comes to changes in performance. If I introduce an N+1 query, I should not be surprised to see the performance degrade. If I remove an N+1 query, I would expect to see performance improve (assuming that that was the bottleneck). Exactly how much performance changes, I cannot precisely predict.
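As a rough illustration of what an N+1 looks like to a test, the sketch below counts the statements issued by two versions of the same lookup. The `authors` and `posts` tables and both functions are made up for the example, and SQLite’s trace callback stands in for whatever query log your stack provides.

```python
import sqlite3


def count_statements(conn, fn):
    """Run fn(conn) and return how many SQL statements the connection executed."""
    seen = []
    conn.set_trace_callback(seen.append)
    try:
        fn(conn)
    finally:
        conn.set_trace_callback(None)
    return len(seen)


def titles_n_plus_one(conn):
    """One query for the authors, then one more per author: the N+1 shape."""
    result = {}
    authors = conn.execute("SELECT id, name FROM authors ORDER BY id").fetchall()
    for author_id, name in authors:
        rows = conn.execute(
            "SELECT title FROM posts WHERE author_id = ? ORDER BY id",
            (author_id,)).fetchall()
        result[name] = [title for (title,) in rows]
    return result


def titles_joined(conn):
    """The same data fetched with a single JOIN."""
    result = {}
    rows = conn.execute(
        "SELECT a.name, p.title FROM authors a "
        "LEFT JOIN posts p ON p.author_id = a.id ORDER BY a.id, p.id").fetchall()
    for name, title in rows:
        result.setdefault(name, [])
        if title is not None:  # authors without posts yield a NULL title
            result[name].append(title)
    return result


# Hypothetical data: three authors, one of whom has no posts.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
    INSERT INTO posts VALUES (1, 1, 'A1'), (2, 1, 'A2'), (3, 2, 'G1');
""")

n_plus_one_count = count_statements(conn, titles_n_plus_one)
joined_count = count_statements(conn, titles_joined)

# The N+1 version grows with the number of authors; the JOIN does not.
assert n_plus_one_count == 1 + 3
assert joined_count == 1
```

The test says nothing about milliseconds. It only says that one version issues a query per author and the other does not, which is the part I can actually predict.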

To our engineering brains, this can be a let-down. We cannot reliably say, “This will slow down this endpoint in production by 53ms.”

Test in Production

She says it a bit tongue in cheek, but Charity Majors highlights that testing in production is required. How code runs in production is the only Truth.

This is part of the zen of testing: Our tests can only predict but cannot guarantee how things will run in production. Given the choice and a low enough cost, we’ll happily make our test environment as similar to production as makes sense.

Where tests might fail us, observability can give us more confidence. Monitors and alerts allow us to respond quickly when things do go wrong. If performance is important, we should set up monitors and alerts for the things we care about. These will be the only truth we have when it comes to answering the question, “How are things today in production?”
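In practice a monitoring product does this work for you, but the underlying idea is small enough to sketch. The class below is a hypothetical, in-memory stand-in: the name `LatencyMonitor`, the window size, and the 500 ms threshold are assumptions for illustration only.

```python
import math
from collections import deque


class LatencyMonitor:
    """Keeps a sliding window of request latencies and flags a high p95."""

    def __init__(self, threshold_ms, window=100):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # old samples fall off the front

    def record(self, latency_ms):
        self.samples.append(latency_ms)

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(len(ordered) * 95 / 100))
        return ordered[rank - 1]

    def should_alert(self):
        return self.p95() > self.threshold_ms


monitor = LatencyMonitor(threshold_ms=500)

# A healthy period: every request takes about 120 ms.
for _ in range(100):
    monitor.record(120)
healthy = monitor.should_alert()

# A regression ships: the newest 10% of requests take 3 seconds.
for _ in range(10):
    monitor.record(3000)
regressed = monitor.should_alert()

print(f"alert while healthy: {healthy}, alert after regression: {regressed}")
```

The threshold here is still an arbitrary number, but it is applied to real production traffic rather than to a benchmark that only approximates it.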

There comes a time when we exhaust the benefits of testing and need to refocus our efforts on production observability.

This requires taking on more risk. We need to get comfortable with letting things fail, and then committing to resolving failures quickly. This may require a culture change on your team.

Conclusion

This blog post is a bit one-sided, based on my personal experience. It shuts the door on precise quantitative performance tests by effectively saying “Don’t bother.” I would love to be proven wrong here, but I have yet to see a situation in my career where the benefit of quantitative performance testing outweighed the cost of setup or the dangers of being close-enough-to-production-but-not-quite.

Rather than encouraging teams to set up near-production-like environments, I’ve argued that we should instead optimize for two things:

Develop qualitative indicators in tests to predict and prevent performance regressions. Run these in CI.
Introduce monitors and alerts for our production environment. Observe and respond to code changes as they relate to performance.

Special thanks to Toni Rib, Hector Virgen, Stephan Zaria, and Mandy Mitchell for reading early drafts of this post and providing feedback.