To Repeat Tests or Not to Repeat

One of the serious social diseases of the testing craft is the obsession with repetition. Is that test repeatable? Is that test process repeatable? Have we repeated those tests? These questions are often asked in a tone of worry or accusation, sometimes accompanied by rhetorical quips about the importance of a disciplined process, without any explanation of how discipline requires repetition.

(Before you go on, I urge you to carefully re-read the previous paragraph, and notice that I used the word obsession. I am not arguing against repeatability, as such. Just as one can argue against an addiction to food without being against eating, what I’m trying to do is wipe out obsession. Please help me.)

There is one really good reason not to repeat a test: the value of a new test is greater than the value of an old test (all other things being equal). It’s greater because a new test can find problems that have always been in the product but have not yet been found, while an old test has a non-zero likelihood of revealing the same old thing it revealed the last time you performed it. New tests always provide new information. Old tests sometimes do.

This one powerful reason to run new tests is based on the idea that testing is a sampling process, and that running a single test, whatever the test, is to collect a tiny sample of behavior from a very large population of potential behaviors. More tests means a bigger sample. Re-running tests belabors the same sample, over and over.
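To make the sampling argument concrete, here is a small, purely illustrative simulation. The numbers (a product with 10,000 distinct behaviors, 100 of them buggy, each "test" sampling one behavior) are invented for the sketch and stand in for no real product:

```python
import random

# Purely illustrative numbers: a product with 10,000 distinct behaviors,
# 100 of which are buggy. Each "test" samples one behavior.
rng = random.Random(1)
POPULATION = 10_000
BUGGY = set(rng.sample(range(POPULATION), 100))

# A fixed suite of 100 tests, re-run ten times over.
suite = rng.sample(range(POPULATION), 100)
found_by_repeats = {t for t in suite * 10 if t in BUGGY}

# The same effort spent on 1,000 distinct tests instead.
fresh = rng.sample(range(POPULATION), 1000)
found_by_fresh = {t for t in fresh if t in BUGGY}

print(len(found_by_repeats), len(found_by_fresh))
```

In this deterministic model, re-running the suite ten times can never find a bug the first pass missed; the thousand fresh tests at least sample new ground.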

Test repetition is often justified based on arguments that sound like blatant discrimination against the unborn test, as if manifested tests have some kind of special citizenship denied to mere potential tests. One reason for this bias may be a lack of appreciation for the vastness of testing possibilities. If you believe that your tests already comprise all the tests that matter, you won’t have much urgency about making new ones.

Another reason may be an inappropriate analogy to scientific experiments. We were all told in 5th grade science class about the importance of the controlled, repeatable experiment to the proper conduct of science. But what we weren’t told is that a huge amount of less controlled and less easily repeated exploratory work precedes the typical controlled experiment. Otherwise, an amazing amount of time would be wasted on well controlled, but uninteresting experiments. Science embraces exploratory as well as confirmatory research.

One thought experiment I find useful is to take the arguments for repetition to their logical extreme and suppose that we have just one and only one test for a complex product. We run that test again and again. The absurdity of that image helps me see reasons to run more tests. No complex product with a high quality standard can be considered well tested unless a wide variety of tests have been performed against it.

(You probably have noticed that it’s important to consider what I mean by “test” and “run that test again and again”. Depending on how you think of it, it may well be that one test would be enough, but then it would have to be an extremely complex test, or one that incorporates within itself an extreme amount of variation.)

The Product is a Minefield

In order to replace obsession with informed choice, we need a way to consider a situation and decide if repetition is warranted, and how much repetition. I have found that the analogy of a minefield helps me work through those considerations.

The minefield is an evocative analogy that expresses the sampling argument: if you want to avoid stepping on a mine, walk in the footsteps of the last successful person to traverse the minefield. Repetition avoids finding a mine by limiting new contact between your feet and the ground. By the same principle, variation will increase the possibility of finding a mine.

I like this analogy because it is a meaningful and valid argument that also has important flaws that help us argue in favor of repetition. The analogy helps us explore both sides of the issue.

In my classes, I make the minefield argument and then challenge students to find problems in it. Each problem is then essentially a reason why, in a certain context, repetition might be better than variation.

I won’t force you to go through that exercise. Although before you click on the link below, you may want to think it through for yourself.

I know of nine interestingly distinct reasons to repeat tests. How many can you think of?

Click this link when you are ready to see my list and how the argument applies to test-driven design: Ten Reasons to Repeat Tests

Test Messy with Microbehaviors

James Lyndsay sent me a little Flash app once that was written to be a testing brainteaser. He challenged me to test it and I had great fun. I found a few bugs, and have since used it in my testing class. “More, more!” I told him. So, he recently sent me a new version of that app. But get this: he fixed the bugs in it.

In a testing class, a product that has known bugs in it makes a much better working example than a product that has only unknown bugs. The imperfections are part of its value, so that testing students have something to find, and the instructor has something to talk about if they fail to find them.

So, Lyndsay’s new version is not, for me, an improvement.

This has a lot to do with a syndrome in test automation: automation is too clean. Now, unit tests can be very clean, and there’s no sin in that. Simple tests that do a few things exactly the same way every time can have value. They can serve the purposes of change detection during refactoring. No, I’m talking about system-level, industrial strength please-find-bugs-fast test automation.

It’s too clean.

It’s been oversimplified, filed down, normalized. In short, the microbehaviors have been removed.

The testing done by a human user interacting in real time is messy. I use a web site, and I press the “back” button occasionally. I mis-type things. I click on the wrong link and try to find my way back. I open additional windows, then minimize them and forget them. I stop in the middle of something and go to lunch, letting my session expire. I do some of this on purpose, but a lot of it is by accident. My very infirmity is a test tool.

I call the consequences of my human infirmity “microbehaviors”: those little tics and skips and idiosyncrasies that will be different in the behavior of any two people using a product, even if they are trying to do the same exact things.

Test automation can have microbehavior, too, I suppose. It would come from subtle differences in timing and memory use due to other processes running on the computer, interactions with peripherals, or network latency. But nothing like the gross variations inherent in human interaction, such as:

  • Variations in the order of apparently order-independent actions, such as selecting several check boxes before clicking OK on a dialog box. (But maybe there is some kind of order dependence or timing relationship that isn’t apparent to the user.)
  • The exact path of the mouse, which triggers mouse over events.
  • The exact timing and sequence of keyboard input, which occurs in patterns that change relative to the typing skill and physical state of the user.
  • Entering then erasing data.
  • Doing something, then undoing it.
  • Navigating the UI without “doing” anything other than viewing windows and objects. Most users assume this does not at all affect the state of an application.
  • Clicking on the wrong link or button, then backing out.
  • Leaving an application sitting in any state for hours on end. (My son leaves his video games sitting for days; I hope they are tested that way.)
  • Experiencing error messages, dismissing them (or not dismissing them) and trying the same thing again (or something different).
  • Navigating with the keyboard instead of the mouse, or vice versa.
  • Losing track of the application, assuming it is closed, then opening another instance of it.
  • Selecting the help links or the customer service links before returning to complete an activity.
  • Changing browser or O/S configuration settings in the middle of an operation.
  • Dropping things on the keyboard by accident.
  • Inadvertently going into hibernation mode while using the product, because the batteries ran out on the laptop.
  • Losing network contact at the coffee shop. Regaining it. Losing it again…
  • Accidentally double-clicking instead of single-clicking.
  • Pressing enter too many times.
  • Running other applications at the same time, such as anti-virus scanners, that may pop up over the application under test and take focus.

What makes a microbehavior truly micro is that it’s not supposed to make a difference, or that the difference it makes is easily recoverable. That’s why they are so often left out of automated tests. They are optimized away as irrelevant. And yet part of the point of testing is to challenge ideas about what might be relevant.

In a study done at Florida Tech, Pat McGee discovered that automated regression tests for one very complex product found more problems when the order of the tests was varied. Everything else was kept exactly the same. And, anecdotally, every tester with a little experience can probably cite a case where some inadvertent motion or apparently irrelevant variation uncovered a bug.
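McGee's result suggests a cheap tactic anyone can borrow: shuffle the execution order of an existing suite on each cycle, recording the seed so a failing order can be replayed. A minimal sketch of the idea follows; the test functions and their shared state are invented for illustration, not taken from the study:

```python
import random

def run_suite_shuffled(tests, seed):
    """Run the same tests in a randomized order; the recorded seed
    makes any failing order exactly reproducible."""
    rng = random.Random(seed)
    order = list(tests)
    rng.shuffle(order)
    return [(t.__name__, t()) for t in order]

# Invented example of an order-dependent bug: test_b only passes
# if test_a has already run and primed the shared state.
state = {"primed": False}

def test_a():
    state["primed"] = True
    return True

def test_b():
    return state["primed"]

for seed in range(4):
    state["primed"] = False            # fresh state each cycle
    print(seed, run_suite_shuffled([test_a, test_b], seed))
```

Everything else stays the same from cycle to cycle; only the ordering varies, which is exactly the dimension the study exercised.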

Even a test suite with hundreds of simple procedural scripts in it cannot hope to flush out all, or probably even most, of the bugs that matter in any complex product. Well, you could hope, but your hope would be naive.

So, that’s why I strive to put microbehaviors into my automation. Among the simplest measures is to vary timing and ordering of actions. I also inject idempotent actions (meaning that they end in the same apparent state they started with) on a random basis. These measures are usually very cheap to implement, and I believe they greatly improve my chances of finding certain state-related or timing-related bugs, as well as bugs in exception handling code.
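As a sketch of what injecting microbehaviors can look like, here is one way to weave random idempotent detours between scripted steps. The driver object and its methods are placeholders I invented for the sketch, not any real automation API:

```python
import random

class FakeDriver:
    """Stand-in for a real UI driver; it just records what was done."""
    def __init__(self):
        self.log = []
    def click(self, target):
        self.log.append(("click", target))
    def press_key(self, key):
        self.log.append(("key", key))
    def wait(self, seconds):
        self.log.append(("wait", seconds))

def idempotent_noise(driver, rng):
    """Detours intended to leave the app in the same apparent state."""
    choice = rng.choice(["hover", "tab", "pause"])
    if choice == "hover":
        driver.click("help_menu")      # open something irrelevant...
        driver.press_key("Escape")     # ...and back out of it
    elif choice == "tab":
        driver.press_key("Tab")        # wander the focus order
        driver.press_key("Shift+Tab")
    else:
        driver.wait(rng.uniform(0.1, 2.0))  # vary timing

def run_with_microbehaviors(driver, steps, seed=None, noise_rate=0.3):
    rng = random.Random(seed)          # record the seed to replay a failure
    for step in steps:
        if rng.random() < noise_rate:  # random detour between real steps
            idempotent_noise(driver, rng)
        step(driver)

steps = [lambda d: d.click("login"), lambda d: d.click("submit")]
driver = FakeDriver()
run_with_microbehaviors(driver, steps, seed=42)
print(driver.log)
```

With noise_rate set to zero the run collapses back to the clean script, so the same harness serves both the tidy and the messy style.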

What about those Flash applications that Mr. Lyndsay sent me? He might legitimately assert that his purpose was not to write a buggy Flash app for testers, but a nice clean brainteaser. That’s fine, but the “mistakes” he made in execution turned into bonus brainteasers for me, so I got the original, plus more. And that’s the same with testing.

I want to test on purpose AND by accident, at the same time.