Flaky tests and computation caching

Flaky tests are as common as they are annoying. If you have some, you should fix them; however, that is easier said than done. Especially when working in a monorepo and making changes across many or all packages, luck and math are not on your side. I want to suggest using computation caching with tools like Turborepo to work around and mitigate the challenges of flaky tests in a monorepo.

(I'm writing this from the perspective of a JS developer who has mainly worked with Turborepo, but other tools like Nx and Bazel have good support for computation caching as well.)

Rolling a dice, 6 dice, 100 dice

What's your chance of not rolling a 1 when rolling a single dice? Pretty high!

What's your chance of not rolling a 1 when rolling 6 dice instead? Still pretty high, but I wouldn't bet my life on it.

What's your chance of not rolling a 1 when rolling 100 dice? Yeah right, this is not very likely.

Now consider every dice is a package in a monorepo, and rolling a 1 is a test failure. Your chances of getting a green build are pretty good when you only change 1 package, but if you need to get a green build across all 100 packages it will be a painful experience.

Just try running the tests again

When working in a bigger monorepo, it seems quite common to have to re-run the CI pipeline a few times to get a successful build.

I've probably spent days retrying builds that failed on a flaky end-to-end test, and let me assure you, it is a miserable experience. It's annoying when you want to release a new feature, but it becomes a real problem when you're trying to get a hotfix out for an incident at 3am on a weekend. But what are the probabilities when re-running tests, and how can we improve them?

To stick with our example, if we roll 100 dice it's pretty likely that one of them is a 1. If we roll 100 dice again, how likely is it to not roll a 1? Still pretty damn unlikely - it's exactly the same probability as on your first run.

Changing the game

We can adjust the rules of our game slightly to make rolling every dice without getting a 1 feasible - by re-rolling only the dice that showed a 1 on each iteration. For example, we roll 100 dice and get three 1s. How likely is it that we've cleared all the 1s after three such rounds? Pretty good actually: each round only the remaining 1s get re-rolled, so the number of dice still in play shrinks quickly.

Computation caching is exactly this change to the game: it means storing the output of a computation somewhere, and when we run the computation again with the same input, we can take the value from our cache instead. When talking about CI pipelines, you can treat your code as the input. The whole concept is explained well in the Turborepo docs.

In our illustration, computation caching doesn't exactly take away the dice, but the outcome is equivalent. It makes sure that each of your dice, once it has received a value different from 1, gets that same outcome whenever you roll it again. Remember, rolling a 1 is our analogy for a test failure - our monorepo tooling will save the outcome of a task to a cache, unless there was a failure. On each run of a task, it will take the result from the cache if possible.
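To make this a bit more concrete, here is a minimal TypeScript sketch of the rule described above - not how Turborepo is actually implemented, just the principle: hash a task's inputs into a cache key, replay a stored result on a hit, and only store the results of successful runs. The `runTask` callback, the `hashInputs` helper, and the in-memory `Map` are stand-ins I made up for your real task runner and cache.

```ts
import { createHash } from "node:crypto";

type TaskResult = { ok: boolean; output: string };

// In-memory stand-in for a local or remote cache, keyed by an input hash.
const cache = new Map<string, TaskResult>();

// Hash whatever counts as the task's input (source files, dependencies,
// environment variables) into a single cache key.
function hashInputs(inputs: string[]): string {
  return createHash("sha256").update(inputs.join("\n")).digest("hex");
}

async function runWithCache(
  inputs: string[],
  runTask: () => Promise<TaskResult>
): Promise<TaskResult> {
  const key = hashInputs(inputs);

  // Cache hit: the same inputs already produced a passing result once,
  // so we replay it instead of rolling the dice again.
  const cached = cache.get(key);
  if (cached) return cached;

  const result = await runTask();

  // Only successful runs go into the cache; failures are always retried.
  if (result.ok) cache.set(key, result);
  return result;
}
```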

How high is "pretty high" really?

I hope this is not a mistake, but to really make my point I think it is important to actually evaluate the probabilities, with values more closely resembling what you'd see for test failures in a monorepo. Let's try and calculate them.

With a fair, 6-sided dice you have a chance of 5 out of 6 that it does not roll a 1, or about 83%.

With six dice, it's (5/6)^6, or about 33%.

With 100 dice, it's (5/6)^100, or about 0.000001%. I told you, pretty damn unlikely (I'm actually astonished how unlikely this is).

I actually have no idea how likely a failure is for a typical test suite, but I would hope that it is much lower than 5/6. Using the same calculation as above, in a monorepo with 100 packages where each package has a 1% chance of failing, you have a probability of only 36% (0.99^100) to get a green build.

If the failure probability per package is only 0.1%, you still only have a 90% chance (0.999^100) of getting a green build.
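If you want to check these numbers yourself, the calculation is a one-liner. The snippet below just reproduces the figures above; `allPass` is a name I made up, not anything from a library.

```ts
// Probability that none of `n` independent packages (or dice) fail,
// given that each one fails with probability `pFail`.
const allPass = (n: number, pFail: number) => Math.pow(1 - pFail, n);

console.log(allPass(1, 1 / 6));    // ≈ 0.83    (one die)
console.log(allPass(6, 1 / 6));    // ≈ 0.33    (six dice)
console.log(allPass(100, 1 / 6));  // ≈ 1.2e-8  (one hundred dice)

console.log(allPass(100, 0.01));   // ≈ 0.366   (100 packages, 1% failure chance each)
console.log(allPass(100, 0.001));  // ≈ 0.905   (100 packages, 0.1% failure chance each)
```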

It's easy to get annoyed at another team's flaky tests, but keep the probabilities from the previous paragraphs in mind: for a team working on a small slice of a monorepo, it's fairly easy to introduce a flaky test without noticing. After all, it will only fail on every 100th or 1000th test run!

You'll only really experience the pain of flaky tests once you need to integrate a change that requires a successful build in every single package - then, depending on how flaky things are, anywhere from every tenth build to two out of three builds fails.

What are the probabilities with caching?

🤷 my math stops right about here, but to illustrate what difference the caching makes, I have prepared this small interactive component. You can select the number of packages and the chance of failure, and it will simulate test runs until all packages have passed their tests. Each rectangle represents one test run; light green rectangles are cached test results. If you're lucky and don't see a big difference, try setting the number of packages to 500.
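In case the interactive component doesn't load where you're reading this, here is a rough, non-interactive TypeScript sketch of the same simulation. It assumes every package fails independently with the same probability on every run, which is of course a simplification of a real test suite.

```ts
// Simulate how many CI runs it takes until every package has passed.
// On every run, each package fails independently with probability `pFail`.
// With caching, packages that already passed are skipped (a cache hit);
// without caching, every package rolls the dice again on every run.
function runsUntilGreen(packages: number, pFail: number, useCache: boolean): number {
  let passed = new Array<boolean>(packages).fill(false);
  let runs = 0;

  while (passed.some((p) => !p)) {
    runs++;
    passed = passed.map(
      (alreadyPassed) => (useCache && alreadyPassed) || Math.random() >= pFail
    );
  }
  return runs;
}

// 100 packages, 1% chance of a flaky failure per package and run:
console.log("with caching:   ", runsUntilGreen(100, 0.01, true));  // usually 1-2 runs
console.log("without caching:", runsUntilGreen(100, 0.01, false)); // ~3 runs on average, with a long tail
```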




Did you just tell me flaky tests are fine?

No, I did not, and I think it's important that you're aware of the tradeoffs when using computation caching.

You're generally trading the sensitivity of your tests against the ability to integrate changes quickly. By sensitivity I mean that when you pull a test result from a cache, you're not actually testing anything in that moment. Especially with end-to-end tests, it's not unlikely that you're skipping a test run that would have notified you of an error in production caused by a service you depend on.

In my experience, especially when working on a large-scale application, this can be a good tradeoff though.

I don't have flaky tests

I'm happy for you! You might still want to look into computation caching for different reasons, mainly to speed up your builds and simplify pipelines.

Tobias Kloht
Berlin, 2022