Reasons to Repeat Tests
by James Bach
(with help from colleagues Doug Hoffman, Michael Bolton, Ken Pugh, Cem Kaner, Bret Pettichord, Jim Batterson, Geoff Sutton, plus numerous students who have participated in the "Minefield Debate" as part of my testing class. The minefield analogy as I talk about it was inspired by Brian Marick's talk Classic Testing Mistakes.)
Testing to find bugs is like searching a minefield for mines. If you just travel the same path through the field again and again, you won't find a lot of mines. Actually, that's a great way to avoid mines. The space represented by a modern software product is hugely more complex than a minefield, so it's even more of a problem to assume that some small number of "paths" (say, a hundred, a thousand, or a million), when endlessly repeated, will find every important bug. Even all the tests that a team of testers can physically perform in a few weeks or months are still not that many compared to all the things that can happen to a product in the field.
The minefield analogy is really just another way of saying that testing is a sampling process, and we probably want a larger sample, rather than a tiny sample repeated over and over again. Hence the minefield heuristic: do different tests instead of repeating the same tests.
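To make the sampling point concrete, here is a minimal sketch of the analogy as a toy model. Everything in it is an assumption made for illustration: the field size, the number of mines, the path length, and the helper name `mines_found` are all hypothetical, and a real product's test space is nothing like this tidy. It only shows that walking one path fifty times covers exactly what walking it once covers, while fifty different paths cover more ground.

```python
import random

def mines_found(paths, mines):
    """Return the set of mines touched by at least one of the given paths."""
    visited = set()
    for path in paths:
        visited.update(path)
    return visited & mines

# Assumed toy numbers: 10,000 cells, 200 mines placed at random.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
FIELD_SIZE = 10_000
mines = set(rng.sample(range(FIELD_SIZE), 200))

# One path of 100 cells, walked 50 times.
same_path = list(range(100))
repeated = [same_path] * 50

# Fifty different paths of 100 cells each (the first is the same path as above).
varied = [list(range(i * 100, (i + 1) * 100)) for i in range(50)]

found_repeated = mines_found(repeated, mines)
found_varied = mines_found(varied, mines)

print("repeated path:", len(found_repeated), "mines")
print("varied paths: ", len(found_varied), "mines")
```

The repeated walk costs as much effort as the varied one, yet it can never find a mine that its first pass did not already find.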
But what do I mean by repeat the same test? It's easy to see that no test can be repeated exactly, any more than you can exactly retrace your footsteps. You can get close, but you will always be a tiny bit off. Does repeating a test mean that the second time you run the test you have to make sure that sunlight is shining at the same angle onto your mousepad? Maybe. Don't laugh. I did experience a bug, once, that was triggered by sunlight hitting an optical sensor inside a mouse. You just can't say for sure what factors are going to affect a test. However, when you test you have a certain goal and a certain theory of the system. You may very well be able to repeat a test with respect to that goal and theory in every respect that A) you know about and B) you care about and C) isn't too expensive to repeat. Nothing is necessarily intractable about that.
Therefore, by a repeated test, I mean a test that includes elements already known to be covered in other tests. To repeat a test is to repeat some aspect of a previous test. The minefield heuristic is saying that it's better to try to do something you haven't yet done than to do something you already have done.
If you disagree with this idea, or if you agree with it, please read further. Because...
...this analysis is too simplistic! In fact, even though diversity in testing is important and powerful, and even though the argument against repetition is generally valid, I do know of ten exceptions. There are ten specific reasons why, in some particular situation, it is not unreasonable to repeat tests. It may even be important to repeat some tests.
For technical reasons you might rationally repeat tests...
- Recharge: if there is a substantial probability of a new problem or a recurring old problem that would be caught by a particular existing test, or if an old test is applied to a new code base. This includes re-running a test to verify a fix, or repeating a test on successively earlier builds as you try to discover when a particular problem or behavior was introduced. This also includes running an old test on the same software that is running on a new O/S. In other words, a tired old test can be "recharged" by changes to the technology under test. Note that the recharge effect doesn't necessarily mean you should run the same old tests, only that it isn't necessarily irrational to do so.
- Intermittence: if you suspect that the discovery of a bug is not guaranteed by one correct run of a test, perhaps due to important variables involved that you can't control in your tests. Performing a test that is, to you, exactly the same as a test you've performed before, may result in discovery of a bug that was always there but not revealed until the uncontrolled variables line up in a certain way. This is the same reason that a gambler at a slot machine plays again after losing the first time.
- Retry: if you aren't sure that the test was run correctly the other time(s) it was performed. A variant of this is having several testers follow the same instructions and check to see that they all get the same result.
- Mutation: if you are changing an important part of the test while keeping another part constant. Even though you are repeating some elements of the test, the test as a whole is new, and may reveal new behavior. I mutate a test because although I have covered something before, I haven't yet covered it well enough. A common form of mutation is to operate the product the same way while using different data. The key difference between mutating a test and intermittence or retry is that with mutation the change is directly under your control. Mutation is intentional, intermittence results from incidental factors, and you retry a test because of accidental factors.
- Benchmark: if the repeated tests comprise a performance standard that gets its value by comparison with previous executions of the same exact tests. When historical test data is used as an oracle, then you must take care that the tests you perform are comparable to the historical data. Holding tests constant may not be the only way to make results comparable, but it might be the best choice available.
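The distinction drawn above between mutation and intermittence can be sketched in code. This is only an illustration under stated assumptions: `parse_percentage` and `flaky_lookup` are hypothetical stand-ins for a product under test, and the 1-in-20 failure rate is invented. In the mutation case the tester deliberately changes the data; in the intermittence case the test is identical each time and only an uncontrolled factor varies.

```python
import random

def parse_percentage(text):
    """Hypothetical function under test: turns '42%' into 0.42."""
    return int(text.rstrip("%")) / 100

def flaky_lookup(draw):
    """Hypothetical operation that fails when an uncontrolled factor lines up.

    `draw` is a zero-argument callable returning a float in [0, 1); it stands
    in for whatever the tester cannot control (timing, load, and so on).
    """
    return "timeout" if draw() < 0.05 else "ok"

def run_mutated(inputs):
    """Mutation: operate the product the same way while varying the data."""
    failures = []
    for text in inputs:
        try:
            parse_percentage(text)
        except ValueError:
            failures.append(text)
    return failures

def run_repeated(times, draw):
    """Intermittence: repeat what is, to the tester, exactly the same test."""
    return sum(1 for _ in range(times) if flaky_lookup(draw) != "ok")

# Mutation: the change is under the tester's control.
print(run_mutated(["42%", "0%", "4.5%", ""]))

# Intermittence: nothing the tester does changes between runs.
rng = random.Random(0)
print(run_repeated(100, rng.random), "failures in 100 identical runs")
```

Passing `draw` in as a parameter is a design choice for the sketch: it keeps the uncontrolled factor visible and lets the repeated run be checked with a fixed stand-in.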
For business reasons you might rationally repeat tests...
- Inexpensive: if they have some value and are sufficiently inexpensive compared to the cost of new and different tests. These tests may not be enough to justify confidence in the product, however.
- Importance: if a problem that could be discovered by those tests is likely to have substantially more importance than problems detectable by other tests. The distribution of the importance of product behavior is not necessarily uniform. Sometimes a particular problem may be considered intolerable just because it's already impacted an important user once (a "never let it happen again" situation). This doesn't necessarily mean that you must run the same exact test, just something that is sufficiently similar to catch the problem (see Mutation). Be careful not to confuse the importance of a problem with the importance of a test. A test might be important for many reasons, even if the problems it detects are not critical ones. Also, don't make the mistake of spending so much effort on one test that looks for an important bug that you neglect other tests that might be just as good or better at finding that kind of problem.
- Enough: if the tests you repeat represent the only tests that seem worth doing. This is the virus scanner argument: maybe a repeated virus scan is okay for an ordinary user, instead of constantly changing virus tests. However, we may introduce variation because we don't know which tests truly are worth doing, or we are unable to achieve enoughness via repeated tests.
- Mandated: if, due to contract, management edict, or regulation, you are forced to run the same exact tests. However, even in these situations, it is often not necessary that the mandated tests be the only tests you perform. You may be able to run new tests without violating the mandate.
- Indifference/Avoidance: if the "tests" are being run for some reason other than finding bugs, such as for training purposes, demo purposes (such as an acceptance test that you desperately hope will pass when the customer is watching), or to put the system into a certain state. If one of your goals in running a test is to avoid bugs, then the principal argument for variation disappears.
I have collected these reasons in the course of probably a hundred hours of debate with testing students and colleagues. Many of my colleagues prefer different words or a different breakdown of reasons. There's nothing particularly sacred about my way of doing it (except that some breakdowns would lead to long lists of very similar items). The important thing is that when I hear a reason that seems not to fit within the ones I already have, I add that reason to this list. I started with two reasons, in 1997. I added the tenth one in late 2004.
Applying the Minefield: An Example
Ward Cunningham wrote "I believe the automation required of TDD [Test Driven Design] (and Fit) is exempt from the analogy because the searching we are doing is for the best expression of a program in the presence of tests, not the best tests."
Here's how I think it applies:
Your unit tests might pass or they might fail. You write them so that they will fail in the event that some interesting expectation is violated. So, you call them tests and they seem to be tests.
We introduce the minefield criticism the first time you run any given test in your unit test suite. The first time you run it, it fails, right? Of course, since otherwise it wouldn't be TDD. The questions below are inspired by the Minefield heuristic "vary your tests instead of repeating them."
Question: Why run it again?
Answer: Exception #1, "recharge." You run it again because you have added code to make the test pass, so running the test again is not merely redundant; the value of the test has been recharged by the code changing around it.
Question: During the course of development, but after the first time the test passes, why not delete it? Why bother to run it again?
Answer: Several reasons. Recharge still applies a little bit, since you may accidentally break the product during development, but it could be argued that most of those unit tests most of the time don't fail, and some of them are extremely unlikely to fail even if you change the code quite a bit. But here you have the second reason: exception #6, "inexpensive." It's so cheap to create these tests and to run them and to keep them running, while at the same time they do have some value, even if not a lot. And you have a third reason for some of the tests: exception #7, "importance." For a good many of the unit tests, failure would indicate a very serious problem, were it to occur. If you are testing something that is particularly complex, or involves many interacting sub-systems, you may also want to repeat because of exception #2, "intermittence." Perhaps something will fail after the forty-third run because of probabilistic factors in the test. Finally, there's #3, the "retry" exception, which reminds us that we might not have run the test correctly before. As you once said, Ward, something might give off a bad smell only after you've seen the test run a hundred times or so. In other words, as a result of running a test many times, you might come to an insight about the product that reveals a failure that was there all along but was never noticed.
Question: Let's say that I'm a really good developer and though I write good tests, they just don't fail because I just don't put bugs into my code. I have a whole lot of tests and they don't fail. What was the sense in investing in such tests?
Answer: Two potential reasons. Exception #10, "avoidance/indifference." You may create the tests as a form of documentation for future developers, and you would like them to be exactly the same in order to minimize the chance that they will fail (and thus be less useful as documentation). Or maybe you want to impress a customer with your great software, and they won't be as impressed if the tests don't pass. A second reason is exception #9, "mandated." You may work this way because your peer group or your manager requires you to. This is a little like avoidance except that with a mandate you do, in fact, want to find bugs. You are searching for them; you just are required to use a certain technique to do so.
We therefore see that the fairly simple, often repeated unit tests of TDD may indeed be exempt from the minefield-based argument in favor of varying tests, inasmuch as the reasons I cited apply. But TDD is not exempt from this kind of heuristic analysis. It is always reasonable to question the value of repeated tests, and that's what the minefield invites us to do.