Reasons to Repeat Tests

(With help from colleagues Doug Hoffman, Michael Bolton, Ken Pugh, Cem Kaner, Bret Pettichord, Jim Batterson, Geoff Sutton, plus numerous students who have participated in the “Minefield Debate” as part of my testing class. The minefield analogy as I talk about it was inspired by Brian Marick’s talk “Classic Testing Mistakes.”)

Testing to find bugs is like searching a minefield for mines. If you only travel the same path through the field again and again, you won’t find a lot of mines. Actually, that’s a great way to avoid mines. The space represented by a modern software product is hugely more complex than a minefield, so it’s even more of a problem to assume that some small number of “paths”, say, a hundred, thousand, or million, when endlessly repeated, will find every important bug. As many tests as a team of testers can physically perform in a few weeks or months is still not that many tests compared to all the things that can happen to a product in the field.

The minefield analogy is really just another way of saying that testing is a sampling process, and we probably want a larger sample, with good variety, rather than a tiny, idiosyncratic sample repeated over and over again. Hence the basic minefield heuristic is do different tests instead of repeating the same tests.

But what do I mean by repeat the same test? It’s easy to see that no test can be repeated exactly, any more than you can exactly retrace your footsteps down to a micrometer level. You can get close, but you will always be a tiny bit off. Does repeating a test mean that the second time you run the test you have to make sure that sunlight is shining at the same angle onto your mousepad? Maybe. Don’t laugh. I did experience a “bug,” once, that was triggered by sunlight hitting an optical sensor inside a mouse. You just can’t say for sure what factors are going to affect a test. However, when you test you have a certain goal and a certain theory of the system. You may very well be able to repeat a test with respect to that goal and theory in every respect that A) you know about and B) you care about and C) isn’t too expensive to repeat. Nothing is necessarily intractable about that.

Therefore, by “repeat a test” I mean “repeat some part of a test that matters.” I think that’s really what people are talking about with repetition. Repeating something is possible. Repeating everything is not. Nevertheless, the Minefield Heuristic suggests that you not repeat tests even by that definition.

If you disagree with this idea, or if you agree with it, please read further. Because…

…this analysis is too simplistic. Even though diversity in testing is important and powerful, and even though the argument against repetition is generally helpful, I do know of ten exceptions. There are ten specific reasons why, in some particular situation, it is not unreasonable to repeat tests. It may even be important to repeat some tests.

Technical reasons you might rationally repeat tests…

Regression: if there has been a change to the product being tested (in any of its layers, including the underlying platform), such that there is a substantial probability of a new problem or a recurring old problem that would be caught by a particular existing test. This includes re-running a test to verify a fix, or repeating a test on successively earlier builds as you try to discover when a particular problem or behavior was introduced. This also includes running an old test on the same software that is running on a new O/S. In other words, a tired old test can be recharged by changes to the technology under test. Note that the regression argument doesn’t necessarily mean you should run the same old tests, only that it isn’t necessarily irrational to do so.
Intermittence: if you suspect that the discovery of a bug is not guaranteed by one correct run of a test, perhaps due to important variables involved that you can’t control in your tests. Performing a test that is, to you, exactly the same as a test you’ve performed before, may result in discovery of a bug that was always there but not revealed until the uncontrolled variables line up in a certain way. This is the same reason that a gambler at a slot machine plays again after losing the first time.
Retry: if you aren’t sure that the test was run correctly the other time(s) it was performed, or if information that could have been gathered the first time through was overlooked. This is why it can be a good idea to have several testers follow the same instructions and check to see that they all get the same result. It’s also a common reason why a developer might want to reproduce a bug that’s been reported.
Mutation: if you are changing an important part of the test while repeating another part. Even though you are repeating some elements of the test, the test as a whole is new, and may reveal new behavior. I mutate a test because although I have covered something before, I haven’t yet covered it well enough. A common form of mutation is to operate the product the same way while using different data. The key difference between mutating a test and intermittence or retry is that with mutation the change is directly under your control. Mutation is intentional, intermittence results from incidental factors, and you retry a test mainly because of accidental factors.
Benchmark: if the output of the tests comprises a standard that gets its value by comparison with previous executions of the same exact tests. The most obvious example of this is a performance benchmark. When historical test data is used as an oracle, then you must take care that the tests you perform are comparable to the historical data. Holding tests constant may not be the only way to make results comparable, but it might be the best choice available.
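
To make the difference between Intermittence and Mutation concrete, here is a minimal Python sketch. Everything in it is hypothetical and invented for illustration: `make_flaky_check` stands in for a check whose outcome depends on variables the tester cannot control (simulated here by a call counter), and `is_even_buggy` stands in for product code with a bug that only different data will expose. It is not taken from any real product or test framework.

```python
import itertools


def repeat_for_intermittence(check, runs):
    """Run the identical check `runs` times; return 0-based indices of failing runs."""
    return [i for i in range(runs) if not check()]


def mutate_inputs(check, inputs):
    """Run the same operation against different data; return the inputs that failed."""
    return [value for value in inputs if not check(value)]


def make_flaky_check(fail_every):
    """Hypothetical check that fails on every `fail_every`-th call.

    The call counter stands in for uncontrolled variables lining up.
    """
    calls = itertools.count(1)

    def check():
        return next(calls) % fail_every != 0

    return check


def is_even_buggy(n):
    """Hypothetical product code: gives the wrong answer for negative even numbers."""
    return n >= 0 and n % 2 == 0


def evenness_check(n):
    """Compare the product's answer with a hand-written oracle for a small sample."""
    expected = n in (-4, -2, 0, 2, 4)
    return is_even_buggy(n) == expected


if __name__ == "__main__":
    # Intermittence: nothing the tester controls changes between runs.
    print(repeat_for_intermittence(make_flaky_check(7), 10))         # [6]
    # Mutation: the tester deliberately changes the data.
    print(mutate_inputs(evenness_check, [0, 1, 2, 3, -1, -2, -4]))   # [-2, -4]
```

In the first call the tester changes nothing and the failure shows up only on the seventh run. In the second call the operation stays the same and the data changes on purpose, which is what exposes the negative-number bug.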

Business reasons you might rationally repeat tests…

(these reasons don’t operate on their own, but combine with the technical reasons to make them better)

Importance: if a problem that could be discovered by those tests is likely to have substantially more importance than problems detectable by other tests. The distribution of the importance of product behavior is not necessarily uniform. Sometimes a particular problem may be considered intolerable just because it’s already impacted an important user once (a “never let it happen again” situation). This doesn’t necessarily mean that you must run the same exact test, just something that is sufficiently similar to catch the problem (see Mutation). Be careful not to confuse the importance of a problem with the importance of a test. It might be that there are better tests for detecting a particular important problem, and perhaps you should not be repeating this one.
Inexpensive: if they have some value and are sufficiently inexpensive compared to the cost of new and different tests. These tests may not be enough to cover the product, however.
Enough: if the tests you repeat represent the only tests that seem worth doing. However, we usually have no good reason to think we have the right test set. We may introduce variation because we don’t know which tests truly are worth doing.
Mandated: if, due to contract, management edict, or regulation, you are forced to run the same exact tests. However, even in these situations, it is often not necessary that the mandated tests be the only tests you perform. You may be able to run new tests without violating the mandate.
Indifference/Avoidance: if the “tests” are being run for some reason other than finding bugs, such as for training purposes, demo purposes (such as an acceptance test that you desperately hope will pass when the customer is watching), or to put the system into a certain state. If one of your goals in running a test is to avoid bugs, then the principal argument for variation disappears.

I have collected these reasons in the course of probably a hundred hours of debate with testing students and colleagues. Many of my colleagues prefer different words or a different breakdown of reasons. There’s nothing particularly sacred about my way of doing it (except that some breakdowns would lead to long lists of very similar items). The important thing is that when I hear a reason that seems not to fit within the ones I already have, I add that reason to this list. I started with two reasons, in 1997. I added the tenth one in late 2004.

Applying the Minefield: An Example

Ward Cunningham wrote “I believe the automation required of TDD [Test Driven Design] (and Fit) is exempt from the analogy because the searching we are doing is for the best expression of a program in the presence of tests, not the best tests.”

Here’s how I think it applies:

First, I don’t call the code you wrote “tests.” A test is an event, an instance of testing. Testing is a human process. Like any craftsman who wants to do a good job, I need clarity in my language and thought, in this case so that we don’t lose sight of the tester’s role. What you are referring to is a set of output checks. The “test” here is your process of designing or re-designing the whole set of checks, performing them, and evaluating the output. You write them so that they will fail in the event that some interesting expectation is violated.

Are you repeating this test? You design the set of checks once, but you perform the checks many times and you have to evaluate them every time you do so. There is repetition. Your evaluation process will change a bit as you change, and you also may decide to purposefully modify this test as checks fail for reasons other than finding a product bug. But a lot of it will be repeated from run to run.

We introduce the minefield criticism the first time you run any given check in your unit checking suite. The first time you run it, it fails, right? Of course, otherwise it wouldn’t be TDD. Now, let’s examine the situation.

Question: Why run it again?

Answer: Regression. You run it again because you have added code to make the check pass, therefore running the test again is not merely redundant, the value of the test has been recharged by the product change.
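
As a rough illustration of that first repetition, here is a small Python sketch. The names `run_check`, `add_before`, and `add_after` are hypothetical, made up for this example; they are not from Fit or any particular TDD tool. The same check is performed twice, and the only thing that changes between runs is the product.

```python
def run_check(product_add):
    """A single output check: True when the expectation holds, False otherwise."""
    try:
        return product_add(2, 3) == 5
    except NotImplementedError:
        return False


def add_before(a, b):
    """Hypothetical product code before the feature exists (the check comes first)."""
    raise NotImplementedError


def add_after(a, b):
    """Hypothetical product code after the change meant to make the check pass."""
    return a + b


first_run = run_check(add_before)   # fails, as expected in TDD
second_run = run_check(add_after)   # same check, changed product

if __name__ == "__main__":
    print(first_run, second_run)    # False True
```

The check itself is identical in both runs. What makes the second run worth doing is that the thing being checked has changed.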

Question: During the course of development, but after the first time the test passes, why not delete it? Why bother to run it again?

Answer: Several reasons. Regression still applies, since you may accidentally break the product during development, but it could be argued that most of those unit checks most of the time don’t fail, and some of them are extremely unlikely to fail even if you change the code quite a bit. But there is a second reason: Inexpensiveness. It’s so cheap to create these checks and to run them and to keep them running, while at the same time they do have some value, even if not a lot. And you have a third reason for some of the checks: Importance. For a good many of the unit tests, failure would indicate a very serious problem. If you are testing something that is particularly complex, or involves many interacting sub-systems, you may also want to repeat because of Intermittence. Perhaps something will fail after the forty-third run because of probabilistic factors in the test. Finally, there’s the Retry reason, which reminds us that we might not have run the test correctly before. As you once said, Ward, something might bother you only after you’ve performed the test a hundred times or so.

Question: Let’s say that I’m a really good developer and though I write good checks, they don’t fail because I just don’t put bugs into my code. I have a whole lot of tests and they don’t fail. What was the sense in investing in such tests?

Answer: Two potential reasons. One is Avoidance/Indifference. You may create the checks as a form of documentation for future developers and you like them to be exactly the same in order to minimize the chance that they will fail (and thus be less useful as documentation). Or maybe you want to impress a customer with your great software, and they won’t be as impressed if the checks don’t pass. A second reason is Mandated: you may work this way because your peer group or your manager requires you to. This is a little like avoidance except that with a mandate you do, in fact, want to find bugs. You are searching for them, you just are required to use a certain technique to do so.

We therefore see that the fairly simple, often repeated unit checks of TDD may indeed be exempt from the minefield-based argument in favor of varying a test, inasmuch as the reasons I cited apply. But TDD is not exempt from this kind of heuristic analysis. It is always reasonable to question the value of repeated tests, and that’s what the minefield invites us to do.
