Round Earth Test Strategy

The “test automation pyramid” (for examples, see here, here, and here) is a popular idea, but I see serious problems with it. I suggest in this article an alternative way of thinking that preserves what’s useful about the pyramid, while minimizing those problems:

  1. Instead of a pyramid, model the situation as concentric spheres, because the “outer surface” of a complex system generally has “more area” to worry about;
  2. ground it by referencing a particular sphere called “Earth” which is familiar to all of us because we live on its friendly, hospitable surface;
  3. illustrate it with an upside-down pyramid shape in order to suggest that our attention and concern are ultimately with the surface of the product, “where the people live,” and also to indicate opposition to the pyramid shape of the Test Automation Pyramid (which suggests that user experience deserves little attention);
  4. incorporate dynamic as well as static elements into the analogy (i.e. data, not just code);
  5. acknowledge that we probably can’t or won’t directly test the lowest levels of our technology (i.e. Chrome, or Node.js, or Android OS). In fact, we are often encouraged to trust it, since there is little we can do about it;
  6. use this geophysical analogy to explain more intuitively why a good tooling strategy can access and test the product on a subterranean level, though not necessarily at a level below that of the platforms we rely upon.

Good analogies afford deep reasoning.

The original pyramid (really a triangle) was a context-free geometric analogy. It was essentially saying: “Just as a triangle has more area in its lower part than its upper part, so you should make more automated tests on lower levels than higher levels.” This is not an argument; this is not reasoning. Nothing in the nature of a triangle tells us how it relates to technology problems. It’s simply a shape that matches an assertion that the authors wanted to make. It’s semiotics with weak semantics.

It is not wrong to use semantically arbitrary shapes to communicate, of course (the shapes of a “W” and an “M” are opposites, in a sense, and yet nobody cares that what they represent are not opposites). But at best, it’s a weak form of communication. A stronger form is to use shapes that afford useful reasoning about the subject at hand.

The Round Earth model tries to do that. By thinking of technology as concentric spheres, you understand that the volume of possibilities– the state space of the product– tends to increase dramatically with each layer. Of course, that is not necessarily the case, because a lot of complexity may be locked away from the higher levels by the lower levels. Nevertheless that is a real and present danger with each layer you heap upon your technology stack. An example of this risk in action is the recent discovery that HTML emails defeat the security of PGP email. Whoops. The more bells, whistles, and layers you have, the more likely some abstraction will be fatally leaky. (One example of a leaky abstraction is the concept of “solid ground,” which can both literally and figuratively leak when hot lava pours out of it. Software is built out of things that are more abstract and generally much more leaky than solid ground.)

When I tell people about the Round Earth model they often start speaking of caves, sinkholes, landslides, and making jokes about volcanoes and how their company must live over a “hot spot” on that Round Earth. These aren’t just jokes, they are evidence that the analogy is helpful, and relates to real issues in technology.

Note: If you want to consider what factors make for a good analogy, Michael Bolton wrote a nice essay about that (Note: he calls it metaphor, but I think he’s referring to analogies).

The Round Earth model shows testing problems at multiple levels.

The original pyramid has unit testing at the bottom. At the bottom of the Round Earth model is the application framework, operating environment, and development environment– in other words, the Platform-That-You-Don’t-Test. Maybe someone else tests it, maybe they don’t. But you don’t know and probably don’t even think about it. I once wrote Assembler code to make video games in 16,384 bytes of memory. I needed to manage every byte of memory. Those days are long gone. Now I write Perl code and I hardly think about memory. Magic elves do that work, for all I know.

Practically speaking, all development rests on a “bedrock” of assumptions. These assumptions are usually safe, but sometimes, just as hot lava or radon gas or toxified groundwater breaks through bedrock, we can also find that lower levels of technology undermine our designs. We must be aware of that general risk, but we probably won’t test our platforms outright.

At a higher level, we can test the units of code that we ourselves write. More specifically, developers can do that. While it’s possible for non-developers to do unit-level checks, it’s a much easier task for the devs themselves. But, realize that the developers are working “underground” as they test on a low level. Think of the users as living up at the top, in the light, whereas the developers are comparatively buried in the details of their work. They have trouble seeing the product from the user’s point of view. This is called “the curse of expertise:”

“Although it may be expected that experts’ superior knowledge and experience should lead them to be better predictors of novice task completion times compared with those with less expertise, the findings in this study suggest otherwise. The results reported here suggest that experts’ superior knowledge actually interferes with their ability to predict novice task performance times.”

[Hinds, P. J. (1999). The curse of expertise: The effects of expertise and debiasing methods on prediction of novice performance. Journal of Experimental Psychology: Applied, 5(2), 205–221. doi:10.1037/1076-898x.5.2.205]

While geophysics can be catastrophic, it can also be more tranquil than a stormy surface world. Unit level checking generally allows for complete control over inputs, and there usually aren’t many inputs to worry about. Stepping up to a higher level– interacting sub-systems– still means testing via a controlled API, or command-line, rather than a graphical interface designed for creatures with hands and eyes and hand-eye coordination. This is a level where tools shine. I think of my test tools as submarines gliding underneath the storm and foam, because I avoid using tools that work through a GUI.
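The “submarine” idea can be sketched in code. Here is a minimal, hypothetical example (the Cart class and its API are invented for illustration) of driving a sub-system directly through its programmatic interface, with complete control over inputs and no GUI in the way:

```python
# Sketch: testing a sub-system through its API instead of a GUI.
# "Cart" is a toy stand-in for any sub-system with a programmatic interface.

class Cart:
    """Toy shopping-cart sub-system exposing a controllable API."""
    def __init__(self):
        self.items = {}

    def add(self, sku, qty=1):
        if qty <= 0:
            raise ValueError("quantity must be positive")
        self.items[sku] = self.items.get(sku, 0) + qty

    def total_quantity(self):
        return sum(self.items.values())

def check_cart_accumulates():
    # Full control over inputs: no clicking, no rendering, no timing issues.
    cart = Cart()
    cart.add("ABC-1", 2)
    cart.add("ABC-1", 3)
    assert cart.total_quantity() == 5

check_cart_accumulates()
```

At this level the tool glides beneath the “storm and foam” of the GUI: the check is fast, deterministic, and indifferent to cosmetic changes up on the surface.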

The Round Earth model reminds us about data.

Data shows up in this model, metaphorically, as the flow of energy. Energy flows on the surface (sunlight, wind and water) and also under the surface (ground water, magma, earthquakes). Data is important. When we test, we must deal with data that exists in databases and on the other side of micro-services, somewhere out in the cloud. There is data built into the code, itself. So, data is not merely what users type in or how they click. I find that unit-level and sub-system-level testing often neglects the data dimension, so I feature it prominently in the Round Earth concept.
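A tiny, entirely hypothetical sketch of how the data dimension gets neglected: the function below has data built into the code (a default region), so a check against clean data passes while realistic data quietly exposes the assumption.

```python
# Sketch: the same check passes or fails depending on data it never controls.
# The "database" rows and the report function are invented for illustration.

def monthly_report(db, month):
    # Data built into the code: the function silently assumes a default region.
    rows = [r for r in db if r["month"] == month and r.get("region", "US") == "US"]
    return sum(r["amount"] for r in rows)

clean_db = [{"month": "Jan", "amount": 100}]
assert monthly_report(clean_db, "Jan") == 100  # looks fine...

# ...but data from "the other side" of the system exposes the assumption:
messy_db = [{"month": "Jan", "amount": 100, "region": "EU"}]
assert monthly_report(messy_db, "Jan") == 0    # EU sales vanish from the report
```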

The Round Earth model reminds us about testability.

Complex products can be designed with testing in mind. A testable product is, among other things, one that can be decomposed (taken apart and tested in pieces), and that is observable and controllable in its behaviors. This usually involves giving testers access to the deeper parts of the product via command-line interfaces (or some sort of API) and comprehensive logging.
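As a rough sketch (the pricing component here is made up, not a prescription), decomposability, controllability, and observability can all show up in a few lines: the policy is injected rather than hard-coded, and every decision is logged.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("pricing")

# Hypothetical component designed for testability:
# - controllable: the discount policy is injected, not hard-coded
# - observable: each pricing decision is logged
def price(base, discount_policy=lambda total: 0.0):
    discount = discount_policy(base)
    log.debug("base=%s discount=%s", base, discount)
    return round(base - discount, 2)

# A tester can decompose the behavior: exercise the policy on its own...
assert (lambda total: total * 0.10)(200) == 20.0
# ...and then the component with a fully controlled policy.
assert price(200, lambda total: total * 0.10) == 180.0
```

The same principle scales up: command-line hooks and comprehensive logs are just these seams, writ large.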


  • Quality above requires quality below.
  • Quality above reduces dependence on expensive high-level testing.
  • Inexpensive low-level testing reduces dependence on expensive high-level testing.
  • Risk grows toward the user.



The Next Step In “Test Automation” is Pure Bullshitting

I defy any responsible, sober technical professional to visit this website and discover what the “MABL” tool is and does without reaching out to the company to beg for actual details. It has an introduction video, for instance, that conveys no information whatsoever about the product. Yes, it is teeming with sentences that definitely contain words. But the words represent irresponsible, hyperbolic summarizing that could be applied, equally irresponsibly and hyperbolically, to lots of different tools.

My favorite moments in the video:

0:33 “write tests… just like a really smart QA engineer would.” Huh. I would like to see a QA engineer go on the video and say “I’m a really smart QA engineer, and MABL does just what I do.” I would like to interview such a person.

0:44 “She uses machine intelligence to…” Yes, the talking man is using the female pronoun to imply that “MABL” has the tacit knowledge of a female human engineer. Isn’t that nice? He speaks with a straight face and an even tone. He must have a lot of respect for this imaginary woman he is marketing. (Note: no human women speak on the video, but there is one in a non-speaking role for about a half-second.)

Ultimately, I am left not knowing what specific functionalities their tool has that they are lying about. Yes, lying. Because their claims cannot possibly be true, and they cannot possibly believe they are true– kind of like one of those infomercials about 18-year-old girls in your area that would love to talk to you. Except in this case, her name is MABL and she wants to test your product.

What is really going on?

Apparently the industry has reached a point where testing services can be sold the same way miracle weight loss programs or anti-aging face creams (with micro-beads!) are sold. This can only happen in an industry that holds testing craftsmanship in utter contempt. The testing industry is like a failed state ruled by roving gangs.

Maybe this MABL tool does something interesting, but it seems they don’t want us to worry our pretty little heads about it. And that is something that should worry us all.

Regression Test Tool for Trash Walking

My recent flirtation with trash-pickup-as-physical-exercise has led me down a familiar path. Even though it is not my responsibility to clean a public road in the first place, once I do it, I find that I feel irrational ownership of it. I want it to stay clean. But since I’ve adopted about 9 miles of road so far, it takes too long to walk the whole route in a day (remember I have to make one pass for each side of the road, or else I am going to miss a lot of trash). Regression trash walking takes too much effort!

I want automation!

I can travel faster in a car, but there are few places I can safely stop the car. I was thinking maybe I should get a motor-scooter instead; a Vespa or something. But that defeats the primary purpose of my trash walking– which is supposed to be exercise. So, now I’m thinking maybe a bike will be the ticket. I could combine this with the Steel Grip grabber tool to quickly nab the trash and get back on the road.

Just as with software testing, a big problem with introducing tools to a human process is that it can change the process to make it less sensitive (or far too sensitive). In this case, any vehicle that moves fast will cause me to miss some trash. On the other hand, I will still catch a lot of the trash. It’s probably a good enough solution.

On the whole I think it is a good idea to use a bicycle. The remaining problem is that my wife is terrified I will be hit by a car.

The Unnecessary Tool

My wife bought a Steel Grip 36in Lightweight Aluminum Pick Up Tool.

I saw it on our combination dining room/craft/office table and asked her what it was for.

“My eye pillow fell behind the bed and I can’t reach it,” she told me. (This led to some confusion for me at first because I thought she was referring to an iPillow, presumably an Apple product I had never heard of.)

“I can easily get that for you,” I eventually replied while reaching behind the bed and retrieving her iPillow.

That seemed to end the conversation. But I was still surprised that she bought an entire new gadget to accomplish something that is pretty easy to solve with ordinary human effort– such as asking her husband. I couldn’t resist teasing her about it as I discovered that the squeaky gripper was also a good tool for annoying my dogs. Lenore is usually the epitome of sensible practicality. She’s usually the one restraining me from buying unnecessary things. So, it felt good to see her have a little lapse, for once.

In testing, I see a lot of that: introducing tools that aren’t needed and mostly just clutter up the place. All over the industry, technocrats seem to turn to tools at the slightest excuse. Tools will save us! More tools. Never mind the maintenance costs. Never mind what we lose by distancing ourselves from our problems. Automation!

(Please don’t bother commenting about your useful tool kit. I’m not talking about useful tools, here. I’m talking about a tool that was purchased specifically to solve a problem that was already easily solved without it. I am talking about an unnecessary tool.)

So then what happens…?

A few weeks later, I am getting bored with my walks. Well, let me back up: I am at the age where physical fitness is no longer about looking sharp, or even feeling good. It’s becoming a matter of do I want to keep living or what? The answer is yes I want to live, Clarence. That means I must exercise. This year I have been walking intensively.

But it’s boring. I can’t get anything done when I’m walking. I don’t like listening to music, and anyway I feel uncomfortable being cut off from the sounds of my surroundings. Therefore, I trudge along: bored.

One day I realized I can have more fun walking if I picked up garbage along my way. That way I would be making the world better as I walked. At first I carried a little trash sack at my waist, but my ambitions soon grew, and within days I decided it was time to walk the main road into town with a 50-gallon industrial trash bag and a high viz vest.

As I was leaving on my first mission, Lenore handed me the gripper.

It was the perfect tool.

It was exactly what I needed.

It would save my back and knees.

My gripper gets a lot of use, now. I’m wondering if I need to upgrade to a titanium and carbon fiber version. I’m thinking of crafting a holster for it.

Is There a Moral Here? Yes.

One of the paradoxes of Context-Driven testing is that on the one hand, you must use the right solution for the situation; while, on the other hand, you can only know what the right solution can be if you have already learned about it, and therefore used it, BEFORE you needed it. In other words, to be good problem solvers, we also need to dabble with and be curious about potential solutions even in the absence of a problem.

The gripper spent a few weeks lying around our home until suddenly it became my indispensable friend.

I guess what that means is that it’s good to have some tolerance and playfulness about experimenting with tools. Even useless ones.


We. Use. Tools.

Context-Driven testers use tools to help ourselves test better. But, there is no such thing as test automation.

Want details? Here’s the 10,000-word explanation that Michael Bolton and I have been working on for months.

Editor’s Note: I have just posted version 1.03 of this article. This is the third revision we have made due to typos. Isn’t it interesting how hard it is to find typos in your own work before you ship an article? We used automation to help us with spelling, of course, but most of the typos are down to properly spelled words that are in the wrong context. Spelling tools can’t help us with that. Also, Word spell-checker still thinks there are dozens of misspelled words in our article, because of all the proper nouns, terms of art, and neologisms. Of course there are the grammar checking tools, too, right? Yeah… not really. The false positive rate is very high with those tools. I just did a sweep through every grammar problem the tool reported. Out of the five it thinks it found, only one, a missing hyphen, is plausibly a problem. The rest are essentially matters of writing style.

One of the lines it complained about is this: “The more people who use a tool, the more free support will be available…” The grammar checker thinks we should not say “more free” but rather “freer.” This may be correct, in general, but we are using parallelism, a rhetorical style that we feel outweighs the general rule about comparatives. Only humans can make these judgments, because the rules of grammar are sometimes fluid.

Behavior-Driven Development vs. Testing

The difference between Behavior-Driven Development and testing:

This is a BDD scenario (from Dan North, a man I respect and admire):

Scenario 1: Account is in credit
Given the account is in credit
And the card is valid
And the dispenser contains cash
When the customer requests cash
Then ensure the account is debited
And ensure cash is dispensed
And ensure the card is returned

This is that BDD scenario turned into testing:

Scenario 1: Account is in credit
Given the account is in credit
And the card is valid
And the dispenser contains cash
When the customer requests cash
Then check that the account is debited
And check that cash is dispensed
And check that the card is returned
And check that nothing happens that shouldn’t happen and everything else happens that should happen for all variations of this scenario and all possible states of the ATM and all possible states of the customer’s account and all possible states of the rest of the database and all possible states of the system as a whole, and anything happening in the cloud that should not matter but might matter.

Do I need to spell it out for you more explicitly? This check is impossible to perform. To get close to it, though, we need human testers. Their sapience turns this impossible check into plausible testing. Testing is a quest within a vast, complex, changing space. We seek bugs. It is not the process of demonstrating that the product CAN work, but exploring if it WILL.
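To make the difference concrete, here is the scenario encoded as an automated check against a toy ATM. Everything here (the Atm class, the balances) is invented for illustration; the point is precisely what the check does NOT cover.

```python
# The BDD scenario, encoded as a check against a hypothetical toy ATM.

class Atm:
    def __init__(self, cash):
        self.cash = cash
        self.card_returned = False

    def request_cash(self, account, amount):
        dispensed = 0
        if account["balance"] >= amount and self.cash >= amount:
            account["balance"] -= amount
            self.cash -= amount
            dispensed = amount
        self.card_returned = True
        return dispensed

def check_scenario_1():
    account = {"balance": 100}                 # Given the account is in credit
    atm = Atm(cash=500)                        # And the dispenser contains cash
    dispensed = atm.request_cash(account, 40)  # When the customer requests cash
    assert account["balance"] == 60            # Then check the account is debited
    assert dispensed == 40                     # And check cash is dispensed
    assert atm.card_returned                   # And check the card is returned
    # The check stops here. It says nothing about any other amount, any other
    # account state, or anything else that should (or shouldn't) have happened.

check_scenario_1()
```

The check verifies exactly three assertions about one path through one state. Everything outside those assertions is invisible to it, which is where the tester’s quest begins.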

I think Dan understands this. I sometimes worry about other people who promote tools like Cucumber or jBehave.

I’m not opposed to such tools (although I continue to suspect that Cucumber is an elaborate ploy to spend a lot of time on things that don’t matter at all) but in the face of them we must keep a clear head about what testing is.

Avoiding My Curse on Tool Vendors

Adam Goucher noticed that I recently laid a curse upon commercial test tool vendors (with the exception of Hexawise, Blueberry Consultants, and Atlassian). He wondered to me how a tool vendor might avoid my curse.

First, I’m flattered that he would even care who I curse. But, it’s a good question. Here’s my answer:

Test tool vendors that bug me:

  • Any vendor who wants me to pay for every machine I use their tool upon. Guys, the nature of testing is that I need to work with a lot of machines. Sell me the tool for whatever you want to charge, but you are harming my testing by putting obstacles between me and my test lab.
  • Any vendor that sells tools conceived and designed by a goddamn developer who hates to goddamn test. How do I know about the developer of a test tool? Well, when I’m looking at a tool and I find myself asking “Have these vendor bozos ever actually had to test something in their lives? Did they actually want a tool like this to help them? I bet this tool will triple the amount of time and energy I have to put into testing, and make me hate every minute of it” then I begin to suspect there are no great lovers of testing in the house. This was my experience when I worked with Rational Test Manager, in 2001. I met the designer of that tool: a kid barely out of MIT with no testing or test management experience who informed me that I, a Silicon Valley test management veteran, wasn’t qualified to criticize his design.
  • Any vendor selling me the opportunity, at great cost, to simulate a dim-witted test executioner. Most tool vendors don’t understand the difference between testing and checking, and they think what I want is a way to “test while I sleep.” Yes, I do want the ability to extend my power as a tester, but that doesn’t mean I’m happy to continually tweak and maintain a brittle set of checks that have weak oracles and weak coverage.
  • Any vendor who designs tools by guessing what will impress top managers in large companies who know nothing about testing. In other words: tools to support ceremonial software testing. Cem and I once got a breathless briefing about a “risk-based test management” tool from Compuware. Cem left the meeting early, in disgust. I lingered and tried to tell them why their tool was worthless. (Have you ever said that to someone, and they reacted by saying “I know it’s not perfect” and you replied by saying “Yes, it’s not perfect. I said it’s worthless, therefore it would follow that it’s also not perfect. You could not pay me to use this tool. This tool further erodes my faith in the American public education system, and by extension the American experiment itself. I’m saying that you just ruined America with your stupid stupid tool. So yeah, it’s not perfect.”) I think what bugged Cem and me the most is that these guys were happy to get our endorsement, if we wanted to give it, but they were not at all interested in our advice about how the tool could be re-designed into being a genuine risk-based testing tool. Ugh, marketers.
  • Vendors who want to sell me a tool that I can code up in Perl in a day. I don’t see the value of Cucumber. I don’t need FIT (although to his credit, the creator of FIT also doesn’t see the big deal of FIT). But if I did want something like that, it’s no big deal to write a tool in Perl. And both of those tools require that you write code, anyway. They are not tools that take coding out of our hands. So why not DIY?

Tool vendors I like:

  • Vendors who care what working testers think of their tools and make changes to impress them. Blueberry, Hexawise, and Sirius Software have all done that.
  • Vendors who have tools that give me vast new powers. I love the idea of virtual test labs. VMWare, for instance.
  • Vendors who don’t shackle me to restrictive licenses. I love ActivePerl, which I can use all over the place. And I happily pay for things like their development kit.
  • Vendors who enjoy testing. Justin Hunter, of Hexawise, is like that. He’s the only vendor speaking at CAST, this year, you know.

We Need Better Testing Bloggers

I don’t understand the mentality of bloggers like this guy. His view of the history of testing is a fantasy that seems designed to insult people who study testing. It applies at most to certain companies, not to the field itself.

He says we need a better way to test. Those of us who are serious testers have actually been developing and demonstrating better ways to test for decades, as we keep up with technology. Where have you been, Steve? Get out much do ya?

He thinks automation is the answer. What a surprise that a programmer would say that. But the same thing was said in 1972 at the Chapel Hill Symposium. We’ve tried that already. Many many times we’ve tried it.

We know why automation is not the grand solution to the testing problem.

As a board member of AST, I should mention the upcoming CAST Conference— the most advanced practitioner’s testing conference I know. Go to CAST, Steve, and tell Jerry Weinberg to his face (the programmer who started the first independent test group, made up of programmers) all about your theory of testing history.

Also, Jerry’s new book Perfect Software and Other Illusions About Testing, will be available soon. It addresses misconceptions like “Just automate the testing!” along with many others. Jerry is not just an old man of testing. He’s the oldest among us.

The Future Will Need Us to Reboot It

I’ve been reading a bit about the Technological Singularity. It’s an interesting and chilling idea conceived by people who aren’t testers. It goes like this: the progress of technology is increasing exponentially. Eventually the A.I. technology will exist that will be capable of surpassing human intelligence and increasing its own intelligence. At that point, called the Singularity, the future will not need us… Transhumanity will be born… A new era of evolution will begin.

I think a tester was not involved in this particular project plan. For one thing, we aren’t even able to define intelligence, except as the ability to perform rather narrow and banal tasks super-fast, so how do we get from there to something human-like? It seems to me that the efforts to create machines that will fool humans into believing that they are smart are equivalent to carving a Ferrari out of wax. Sure you could fool someone, but it’s still not a Ferrari. Wishing and believing doesn’t make it a Ferrari.

Because we know how a Ferrari works, it’s easy to understand that a wax Ferrari is very different from a real one. Since we don’t know what intelligence really is, even smart people easily will confuse wax intelligence for real intelligence. In testing terms, however, I have to ask “What are the features of artificial intelligence? How would you test them? How would you know they are reliable? And most importantly, how would you know that human intelligence doesn’t possess secret and subtle features that have not yet been identified?” Being beaten in chess by a chess computer is no evidence that such a computer can help you with your taxes, or advise you on your troubles with girls. Impressive feats of “intelligence” simply do not encompass intelligence in all the forms that we routinely experience it.

The Google Grid

One example is the so-called Google Grid. I saw a video, the other day, called Epic 2014. It’s about the rise of a collection of tools from Google that create an artificial mass intelligence. One of the features of this fantasy is an “algorithm” that automatically writes news stories by cobbling pieces from other news stories. The problem with that idea is that it seems to know nothing about writing. Writing is not merely text manipulation. Writing is not snipping and remixing. Writing requires modeling a world, modeling a reader’s world, conceiving of a communication goal, and finding a solution to achieve that goal. To write is to express a point of view. What the creators of Epic 2014 seemed to be imagining is a system capable of really really bad writing. We already have that. It’s called Racter. It came out years ago. The Google people are thinking of creating a better Racter, essentially. The chilling thing about that is that it will fool a lot of people, whose lives will be a little less rich for it.

I think the only way we can get to an interesting artificial intelligence is to create conditions for certain interesting phenomena of intelligence to emerge and self-organize in some sort of highly connectionist networked soup of neuron-like agents. We won’t know if it really is “human-like”, except perhaps after a long period of testing, but growing it will have to be a delicate and buggy process, for the same reason that complex software development is complex and buggy. Just like Hal in 2001, maybe it’s really smart, or maybe it’s really crazy and tells lies. Call in the testers, please.

(When Hal claimed in the movie that no 9000 series computers had ever made an error, I was ready to reboot him right then.)

No, you say? You will assemble the intelligence out of trillions of identical simple components and let nature and data stimulation build the intelligence automatically? Well, that’s how evolution works, and look how buggy THAT is! Look how long it takes. Look at how narrow the intelligences are that it has created. And if we turn a narrow and simplistic intelligence to the task of redesigning itself, why suppose that it is more likely to do a good job than a terrible job?

Although humans have written programs, no program yet has written a human. There’s a reason for that. Humans are oodles more sophisticated than programs. So, the master program that threatens to take over humanity would require an even more masterful program to debug itself with. But there can’t be one, because THAT program would require a program to debug itself… and so on.

The Complexity Barrier

So, I predict that the singularity will be drowned and defeated by what might be called the Complexity Barrier. The more complex the technology, the more prone to breakdown. In fact much of the “progress” of technology seems to be accompanied by a process of training humans to accept increasingly fragile technology. I predict that we will discover that the amount of energy and resources needed to surmount the complexity barrier will approach infinity.

In the future, technology will be like weather. We will be able to predict it somewhat, but things will go mysteriously wrong on a regular basis. Things fall apart; the CPU will not hold.

Until I see a workable test plan for the Singularity, I can’t take it seriously.

Confused Methodology Talk #1

This posting by Corey Goldberg illustrates an interesting and all too common kind of confusion people get into when discussing methods and practices. It’s worth pondering.

On SQAForums, someone stated:

“ISEB defines automated tested as useful only in mature testing environments and where functionality is not changing i.e. at regression testing.”

to which Corey replied:

“…and ISEB would be completely wrong on that point. web services testing should be fully automated, as there is no UI, just an API.”

Let’s analyze these statements. The first writer seems to be under the sway of ISEB, which immediately induces a heavy sigh in the pit of my soul.

(There are now thousands of people who might be called “certification zombies” lurching around in an ISEB or ISTQB-induced fog, trying to apply what they learned in a few days of memorizing to the complex reality of testing.)

When the first writer says that ISEB “defines” automation as useful only in a certain context, that’s a perfect example of the inability to separate context and method. To think clearly about methodology, you must be able to sift these things apart. Best practice thinking can’t help you do this, and in fact discourages you from trying.

I don’t know if ISEB actually defines or discusses test automation in that way, but if it does, I can tell you what ISEB is probably thinking.

(BTW, one of the big problems with certification programs is the depersonalization of convictions. I say “ISEB” when what I want to say is Dorothy Graham or one of those people who support and edit the ISEB syllabus. You can’t argue with a document. Only people can have a point of view. To argue with ISEB itself is to argue with an anonymous sock puppet. But that’s the way they want it. Certificationists quite purposefully create a bureaucratic buffer of paper between themselves and any dissenters. To pick someone whom I believe advocates the ISEB way, I will choose Dorothy Graham.)

If Dot advocates that belief, then she is probably thinking about GUI-level automation of some aspects of test execution; a set of detailed scripted actions programmed into a software agent to exercise a system under test. If so then it is indeed likely that modifying the system under test in certain ways will break the test automation. This often leads to a situation where you are constantly trying to fix the automation instead of enjoying the benefits of it. This is especially a problem when the testing is happening via a GUI, because little changes that don’t bother a human will instantly disable a script.
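The brittleness pattern can be sketched in a few lines. The “GUI driver,” screens, and labels below are all invented; what matters is that the script is keyed to surface details no user cares about.

```python
# Sketch of why GUI-level scripts are brittle. The element lookup and
# the screen maps are hypothetical stand-ins for a real GUI driver.

def find_element(screen, label):
    """Stand-in for a GUI automation tool's element lookup."""
    if label not in screen:
        raise LookupError(f"no element labelled {label!r}")
    return screen[label]

# The script was recorded against version 1 of the UI...
v1_screen = {"Submit": "button#42", "Name": "field#7"}
assert find_element(v1_screen, "Submit") == "button#42"

# ...then a harmless relabel ("Submit" -> "Send") ships in version 2,
# and the script dies even though no human user would be bothered.
v2_screen = {"Send": "button#42", "Name": "field#7"}
try:
    find_element(v2_screen, "Submit")
    broke = False
except LookupError:
    broke = True
assert broke
```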

So, even though the first writer appears to be reading off the ISEB script, there is some validity to his claim, in some context.

Now look at Corey’s reply. Corey is not under the sway of ISEB, but I worry that he may be under the sway of a typical affliction common among programmers who talk about testing: the reification fallacy. This is the tendency to think of an abstraction or an emergent process as if it were a fixed concrete thing. Hence if a programmer sees me punch a few keys in the course of my testing, and writes a program that punches those same keys in the same order, he might announce that he has “automated the test”, as if the test were nothing more than a pattern of input and output. Certainly, it is possible to automate some aspects of testing, but the aspect of it that requires human reflection cannot be automated. In fact, it can’t even be precisely duplicated by another human. It is an emergent phenomenon.

(Some would say that I am splitting hairs too finely, and that imprecise duplication may be close enough. I agree that it may be close enough in certain contexts. What I caution against is taking the attitude that most of what is valuable about testing, most of the time, is easy to automate. When I have seen that attitude in practice, the resulting automation has generally been too expensive and too shallow. Rich, interesting, cost-effective test automation, in my experience, is a constructive partnership between human thinkers and their tools. I believe, based on my knowledge of Corey, that he actually is interacting constructively with his tools. But in this case, he’s not talking that way.)

What Corey can do is use tools to interact with a system under test. He uses his un-automatable human mind to program those tools to provide certain input and look for certain output. His tools will be able to reveal certain bugs. His tools, in conjunction with un-automatable human assistance during and after execution, and un-automatable human re-programming of the tests as needed, will reveal many more bugs.

The reification fallacy leads to certain absurdities when you consider different frames of reference. Corey points out that a web service has no “user interface”, and therefore is accessible only via a tool, and anything that is accessible only by a tool must therefore require “fully automated” testing. By that reasoning, we can say that all testing is always fully automated because in all cases there is some kind of hardware or software that mediates our access to the object of our test. Therefore, the fact that I am using a keyboard to type this blog posting and a screen to view it, by Corey’s logic, must be fully automated writing! I wonder what will be written next by my magic keyboard?

From one frame of reference, a web service has no user interface. From another frame of reference we can say that it does have a user interface, just not a human interface– its user is another program. How we test such a thing is to write or employ a program that does have a human interface to manipulate the web service. We can operate this interface in batch mode: write a program to submit data, run it, review the results, and re-write the program as needed. Or we can operate the interface interactively: write a program to submit data, present results, then wait for us to type in a new query.
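A batch-mode harness of the kind just described can be sketched in a few lines of Python. Here `query_service` is a hypothetical stand-in for the real web-service call, and the oracle is whatever check you program into it; both names are my inventions for illustration:

```python
# Sketch of a batch-mode harness for testing a web service.
# query_service is a hypothetical stand-in for the real call.
def query_service(query):
    # A real harness would make an HTTP request here.
    return {"echo": query.upper()}

def run_batch(queries, oracle):
    """Submit each query, apply the oracle, and collect suspicious results for human review."""
    suspects = []
    for query in queries:
        result = query_service(query)
        if not oracle(query, result):
            suspects.append((query, result))
    return suspects

# Oracle: we expect the service to echo the query in upper case.
print(run_batch(["alpha", "beta"], lambda q, r: r.get("echo") == q.upper()))  # → []
```

The interactive mode is the same loop with a human inside it: read a query from the console, submit it, show the result, repeat. Either way, the re-writing step stays with the human.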

Corey and the first writer are not in a helpful dialog, because they are talking about different things. I would tell the first writer to treat ISEB as having no authority or wisdom, and to instead learn to reason for himself. The relevant reasoning here, I think, is to wonder what kind of tool we could find or write that would allow us to interact with the web service. At the same time, we need to consider how the web service interface might change. We might stick to highly interactive testing for a while, instead of investing in a batching system with lots of automatic oracles, if we feel that the interface and functionality are changing too fast. On the other hand, one of the nice things about testing through an API is that it is often rather inexpensive to script sequences and batches and simple oracles; and consequently inexpensive to fix them when the system under test changes. I suspect that belief informed Corey’s response, although I wish he would make that belief more apparent to people who are used to thinking of testing as a human-driven process.

As a programmer, I am aware of the urge, sometimes, to say “I didn’t do it, my program did.” In testing this naturally turns into “I didn’t test that, my program I wrote to test that did.” The crucial difficulty with this way of speaking, when it comes to testing, is the way it obscures the many, many choices the programmer made while designing the program, as if the program itself made those choices, or as if there were no choices to be made. The thing is, I don’t care, for a regular program, how many other ways it could have been written, or how many other things it could have done. But these are vital concerns when the program is meant to test another program.

Manual Tests Cannot Be Automated (DEPRECATED)

[Note: This post is here only to serve as a historical example of how I used to speak about “automated testing.” My language has evolved. The sentiment of this post is still valid, but I have become more careful– and I think more professional– in my use of terms.]

I enjoy using tools to support my testing. As a former production coder, I find automated tests can be a refreshing respite from the relatively imponderable world of product analysis and heuristic test design (I solve sudoku puzzles for the same reason). You know, the first tests I ever wrote were automated. I didn’t even distinguish between automated and manual tests for the first couple of years of my career.

Also for the first six years, or so, I had no way to articulate the role of skill in testing. Looking back, I remember making a lot of notes, reading a lot of books, and having a feeling of struggling to wake up. Not until 1993 did my eyes start to open.

My understanding of cognitive skills of testing and my understanding of test automation are linked, so it was some years before I came to understand what I now propose as the first rule of test automation:

Test Automation Rule #1: A good manual test cannot be automated.

No good manual test has ever been automated, nor ever will be, unless and until the technology to duplicate human brains becomes available. Well, wait, let me check the Wired magazine newsfeed… Nope, still no human brain scanner/emulators.

(Please, before you all write comments about the importance and power of automated testing, read a little bit further.)

It is certainly possible to create a powerful and useful automated test. That test, however, will never have been a good manual test. If you then read and hand-execute the code– if you do exactly what it tells you– then congratulations, you will have performed a poor manual test.

Automation rule #1 is based on the fact that humans have the ability to do things, notice things, and analyze things that computers cannot. This is true even of “unskilled” testers. We all know this, but just in case, I sprinkle exercises to demonstrate this fact throughout my testing classes. I give students products to test that have no specifications. They are able to report many interesting bugs in these products without any instructions from me, or any other “programmer.”

A classic approach to process improvement is to dumb down humans to make them behave like machines. This is done because process improvement people generally don’t have the training or inclination to observe, describe, or evaluate what people actually do. Human behavior is frightening to such process specialists, whereas machines are predictable and lawful. Someone more comfortable with machines sees manual tests as just badly written algorithms performed ineptly by sugar-carbon blobs wearing contractor badges who drift about like slightly-more-motivated-than-average jellyfish.

Rather than banishing human qualities, another approach to process improvement is to harness them. I train testers to take control of their mental models and devise powerful questions to probe the technology in front of them. This is a process of self-programming. In this way of working, test automation is seen as an extension of the human mind, not a substitute.

A quick image of this paradigm might be the Mars Rover program. Note that the Mars Rovers are completely automated, in the sense that no human is on Mars. Yet they are completely directed by humans. Another example would be a deep sea research submarine. Without the submarine, we couldn’t explore the deep ocean. But without humans, the submarines wouldn’t be exploring at all.

I love test automation, but I rarely approach it by looking at manual tests and asking myself “how can I make the computer do that?” Instead, I ask myself how I can use tools to augment and improve the human testing activity. I also consider what things the computers can do without humans around, but again, that is not automating good manual tests, it is creating something new.

I have seen bad manual tests be automated. This is depressingly common, in my experience. Just let me suggest some corollaries to Rule #1:

Rule #1B: If you can truly automate a manual test, it couldn’t have been a good manual test.

Rule #1C: If you have a great automated test, it’s not the same as the manual test that you believe you were automating.

My fellow sugar blobs, reclaim your heritage and rejoice in your nature. You can conceive of questions; ask them. You are wonderfully distractible creatures; let yourselves be distracted by unexpected bugs. Your fingers are fumbly; press the wrong keys once in a while. Your minds have the capacity to notice hundreds of patterns at once; turn the many eyes of your minds toward the computer screen and evaluate what you see.

Quick Oracle: Blink Testing


  1. In testing, an “oracle” is a way to recognize a problem that appears during testing. This contrasts with “coverage”, which has to do with getting a problem to appear. All tests cover a product in some way. All tests must include an oracle of some kind or else you would call it just a tour rather than a test. (You might also call it a test idea, but not a complete test.)
  2. A book called Blink: The Power of Thinking Without Thinking has recently been published on the subject of snap decisions. I took one look at it, flipped quickly through it, and got the point. Since the book is about making decisions based on little information, I can’t believe the author, Malcolm Gladwell, seriously expected me to sit down and read every word.

“Blink testing” represents an oracle heuristic I find quite helpful, quite often. (I used to call it “grokking”, but Michael Bolton convinced me that blink is better. The instant he suggested the name change, I felt he was right.)

What you do in blink testing is plunge yourself into an ocean of data– far too much data to comprehend. And then you comprehend it. Don’t know how to do that? Yes you do. But you may not realize that you know how.

You can do it. I can prove this to you in less than one minute. You will get “blink” in a wink.

Imagine an application that adds two numbers together. Imagine that it has two fields, one for each number, and it has a button that selects random numbers to be added. The numbers chosen are in the range -99 to 99.

Watch this application in action by looking at this movie (which is an interactive EXE packaged in a ZIP file) and ask yourself if you see any bugs. Once you think you have it, click here for my answer.

  • How many test cases do you think that was?
  • Did it seem like a lot of data to process?
  • How did you detect the problem(s)?
  • Isn’t it great to have a brain that notices patterns automatically?
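If the movie isn’t handy, you can improvise a similar exercise yourself. This Python sketch (the output format and the seeded error rate are my own assumptions, not taken from the app in the movie) floods the screen with sums for the eye to scan:

```python
import random

# Improvised blink exercise: print a flood of sums, a few of them deliberately
# wrong, and scan the output for the odd ones out.
def blink_sums(n=200, error_rate=0.02, seed=1):
    rng = random.Random(seed)
    lines = []
    for _ in range(n):
        a, b = rng.randint(-99, 99), rng.randint(-99, 99)
        total = a + b
        if rng.random() < error_rate:
            total += rng.choice([-10, 1, 10])  # seeded bug for the eye to catch
        lines.append(f"{a:+4d} {b:+4d} = {total:+5d}")
    return lines

print("\n".join(blink_sums()))
```

Scroll through the output quickly; the wrong answers tend to pop out of the pattern without any deliberate checking.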

There are many examples of blink testing, including:

  • Page through a long file super rapidly (holding your thumb on the Page Down key), notice the pattern of blurry text on the screen, and look for strange variations in that pattern.
  • Take a 60,000 line log file, paste it into Excel, and set the zoom level to 8%. Scroll down and notice the pattern of line lengths. You can also use conditional formatting in Excel to turn lines red if they meet certain criteria, then notice the pattern of red flecks in the gray lines of text, as you scroll.
  • Flip back and forth rapidly between two similar bitmaps. What catches your eye? Astronomers once did this routinely to detect comets.
  • Take a five hundred page printout (it could be technical documentation, database records, or anything) and flip quickly through it. Ask yourself what draws your attention most about it. Ask yourself to identify three interesting patterns in it.
  • Convert a huge mass of data to sound in some way. Listen for unusual patterns amidst the noise.

All of these involve pattern recognition on a grand scale. Our brains love to do this; our brains are designed to do this. Yes, you will miss some things; no, you shouldn’t care that you are missing some things. This is just one technique, and you use other techniques to find those other problems. We already have test techniques that focus on trees; it also helps to look at the forest.
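The zoomed-out-file idea can be sketched in a few lines of Python: compress each log line into a bar whose length mirrors the line’s length, so the eye scans a silhouette instead of text (the scale factor here is an arbitrary choice):

```python
# Compress each line of a big log into a bar whose length mirrors the
# line's length; scan the resulting silhouette for odd shapes.
def silhouette(lines, scale=10, bar="#"):
    return [bar * (len(line) // scale) for line in lines]

log = ["INFO ok"] * 5 + ["ERROR something very long and unusual happened here"] + ["INFO ok"] * 5
print("\n".join(silhouette(log)))  # one long bar stands out in the middle
```

It is the same trick as the 8% Excel zoom: throw away the words, keep the shape, and let the pattern-recognition hardware do the work.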

Test Messy with Microbehaviors

James Lyndsay sent me a little Flash app once that was written to be a testing brainteaser. He challenged me to test it and I had great fun. I found a few bugs, and have since used it in my testing class. “More, more!” I told him. So, he recently sent me a new version of that app. But get this: he fixed the bugs in it.

In a testing class, a product that has known bugs in it makes a much better working example than a product that has only unknown bugs. The imperfections are part of its value, so that testing students have something to find, and the instructor has something to talk about if they fail to find them.

So, Lyndsay’s new version is not, for me, an improvement.

This has a lot to do with a syndrome in test automation: automation is too clean. Now, unit tests can be very clean, and there’s no sin in that. Simple tests that do a few things exactly the same way every time can have value. They can serve the purposes of change detection during refactoring. No, I’m talking about system-level, industrial strength please-find-bugs-fast test automation.

It’s too clean.

It’s been oversimplified, filed down, normalized. In short, the microbehaviors have been removed.

The testing done by a human user interacting in real time is messy. I use a web site, and I press the “back” button occasionally. I mis-type things. I click on the wrong link and try to find my way back. I open additional windows, then minimize them and forget them. I stop in the middle of something and go to lunch, letting my session expire. I do some of this on purpose, but a lot of it is by accident. My very infirmity is a test tool.

I call the consequences of my human infirmity “microbehaviors”, those little tics and skips and idiosyncrasies that will be different in the behavior of any two people using a product even if they are trying to do the same exact things.

Test automation can have microbehavior, too, I suppose. It would come from subtle differences in timing and memory use due to other processes running on the computer, interactions with peripherals, or network latency. But nothing like the gross variations inherent in human interaction, such as:

  • Variations in the order of apparently order-independent actions, such as selecting several check boxes before clicking OK on a dialog box. (But maybe there is some kind of order dependence or timing relationship that isn’t apparent to the user.)
  • The exact path of the mouse, which triggers mouse over events.
  • The exact timing and sequence of keyboard input, which occurs in patterns that change relative to the typing skill and physical state of the user.
  • Entering then erasing data.
  • Doing something, then undoing it.
  • Navigating the UI without “doing” anything other than viewing windows and objects. Most users assume this does not affect the state of an application at all.
  • Clicking on the wrong link or button, then backing out.
  • Leaving an application sitting in any state for hours on end. (My son leaves his video games sitting for days; I hope they are tested that way.)
  • Experiencing error messages, dismissing them (or not dismissing them) and trying the same thing again (or something different).
  • Navigating with the keyboard instead of the mouse, or vice versa.
  • Losing track of the application, assuming it is closed, then opening another instance of it.
  • Selecting the help links or the customer service links before returning to complete an activity.
  • Changing browser or O/S configuration settings in the middle of an operation.
  • Dropping things on the keyboard by accident.
  • Inadvertently going into hibernation mode while using the product, because the batteries ran out on the laptop.
  • Losing network contact at the coffee shop. Regaining it. Losing it again…
  • Accidentally double-clicking instead of single-clicking.
  • Pressing enter too many times.
  • Running other applications at the same time, such as anti-virus scanners, that may pop up over the application under test and take focus.

What makes a microbehavior truly micro is that it’s not supposed to make a difference, or that the difference it makes is easily recoverable. That’s why they are so often left out of automated tests. They are optimized away as irrelevant. And yet part of the point of testing is to challenge ideas about what might be relevant.

In a study done at Florida Tech, Pat McGee discovered that automated regression tests for one very complex product found more problems when the order of the tests was varied. Everything else was kept exactly the same. And, anecdotally, every tester with a little experience can probably cite a case where some inadvertent motion or apparently irrelevant variation uncovered a bug.

Even a test suite with hundreds of simple procedural scripts in it cannot hope to flush out all, or probably even most, of the bugs that matter in any complex product. Well, you could hope, but your hope would be naive.

So, that’s why I strive to put microbehaviors into my automation. Among the simplest measures is to vary timing and ordering of actions. I also inject idempotent actions (meaning that they end in the same apparent state they started with) on a random basis. These measures are usually very cheap to implement, and I believe they greatly improve my chances of finding certain state-related or timing-related bugs, as well as bugs in exception handling code.
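A minimal sketch of those measures in Python: shuffle the apparently order-independent steps and randomly sprinkle in idempotent actions. The action names here are hypothetical placeholders for whatever would drive the real product:

```python
import random

# Inject microbehaviors into a scripted test: vary the order of apparently
# order-independent steps and sprinkle in idempotent "shouldn't matter" actions.
def with_microbehaviors(steps, idempotent_noise, rng, noise_rate=0.3):
    plan = list(steps)
    rng.shuffle(plan)  # vary the order of apparently independent steps
    result = []
    for step in plan:
        if rng.random() < noise_rate:
            result.append(rng.choice(idempotent_noise))
        result.append(step)
    return result

rng = random.Random()
boxes = ["check_box_a", "check_box_b", "check_box_c"]  # hypothetical actions
noise = ["open_help_and_close_it", "type_and_erase", "resize_window"]
print(with_microbehaviors(boxes, noise, rng) + ["click_ok"])
```

Each run produces a slightly different plan, which is the point: the variation costs almost nothing and occasionally stumbles into a state- or timing-related bug that the clean script would step around forever.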

What about those Flash applications that Mr. Lyndsay sent me? He might legitimately assert that his purpose was not to write a buggy Flash app for testers, but a nice clean brainteaser. That’s fine, but the “mistakes” he made in execution turned into bonus brainteasers for me, so I got the original, plus more. And that’s the same with testing.

I want to test on purpose AND by accident, at the same time.

Counterstrings: Self-Describing Test Data

I was at a conference some months ago when Danny Faught showed me a Perl package for manipulating the Windows clipboard. I turned it into a little tool for helping me test text fields.

It’s called PerlClip. Feel free to download it. You don’t need Perl to run it.

One of the things PerlClip does is allow you to produce what I call “counterstrings”. A counterstring is a graduated string of arbitrary length. No matter where you are in the string, you always know the character position. This comes in handy when you are pasting huge strings into fields and they get truncated at a certain point. You want to know how many characters that is.

Here is a 35 character counterstring:

2*4*6*8*11*14*17*20*23*26*29*32*35*

Each asterisk in the string occurs at a position specified by the immediately preceding number. Thus, the asterisk following the 29 is the 29th character in that string. So, you can chop the end of the string anywhere, and you know exactly where it was cut. Without having to count, you know that the string “2*4*6*8*11*14*17*2” has exactly 18 characters in it. This saves some effort when you’re dealing with a half million characters. I pasted a 4000 character counterstring into the address field of Explorer and it was truncated at “2045*20”, meaning that 2047 characters were pasted.
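PerlClip is the real tool, but the principle fits in a few lines of Python. This sketch builds the string from the tail end, so each marker lands exactly at the position its digits name:

```python
# Generate a counterstring: each marker sits at the position named by the
# digits just before it, so any truncation point is self-describing.
def counterstring(length, marker="*"):
    chunks = []
    pos = length
    while pos > 0:
        chunk = str(pos) + marker   # e.g. "35*": the marker lands at position 35
        if len(chunk) > pos:
            chunk = chunk[-pos:]    # trim leading digits if we run out of room
        chunks.append(chunk)
        pos -= len(chunk)
    return "".join(reversed(chunks))

print(counterstring(35))  # → 2*4*6*8*11*14*17*20*23*26*29*32*35*
```

Generate a half-million-character counterstring the same way, paste it into the field under test, and read the position of the cut straight off the tail of whatever survives.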

I realize this may not be a very interesting sort of testing, except perhaps for security purposes or when you’re first getting to know the app. But security is an increasingly important issue in our field, and sometimes when no one tells you the limits and dynamics of text fields, this can come in handy.

Testability Through Audibility

I was working with a client today who complained that there were hidden errors buried in a log file produced by the product he was testing. So, I wrote him a tool that continuously monitors any text file, such as a server log (as long as it is accessible through the file system, as in the case of a test server running locally) and plays WAV files whenever certain string patterns appear in the stream.

With this little tool, a streaming verbose log can be rendered as a stream of clicks and whirrs, if you want, or you can just have it yell “ERROR!” when an error pops up in the log. All this in real time without taking your eyes off the application. Using this, I found a bug in a browser based app whereby perfectly ordinary looking HTML displayed on the screen coincided with a Java null pointer exception in the log.

I released this bit of code with the GPL 2.0 license and you can find it here:
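The pattern-matching core of such a tool is simple; here is a sketch in Python. (The real tool tails the growing file and plays WAV files; in this sketch the alert is just a callback, which could beep, speak, or print.)

```python
import re

# Core of a log-listening tool: scan lines for named patterns and fire an
# alert for each hit. A real tool would feed this lines tailed from the
# growing log and make the alert play a sound.
def scan_lines(lines, patterns, alert):
    compiled = [(name, re.compile(rx)) for name, rx in patterns]
    for line in lines:
        for name, rx in compiled:
            if rx.search(line):
                alert(name, line)  # e.g. play "error.wav", or just yell

hits = []
scan_lines(
    ["GET /index.html 200", "java.lang.NullPointerException", "GET /img.png 404"],
    [("ERROR!", r"Exception|ERROR"), ("whirr", r" 404\b")],
    lambda name, line: hits.append(name),
)
print(hits)  # → ['ERROR!', 'whirr']
```

Mapping different patterns to different sounds is what turns a verbose log into that stream of clicks and whirrs you can monitor without taking your eyes off the application.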

By the way, this is an example of what I call agile test tooling. I paired with a tester. I heard a complaint. I offered a tool idea. The tester said “yes, please.” I delivered the tool the next day. As we were playing with it, I added a couple of features. I don’t believe you have to be a programmer to be a great tester, but it helps to have a programmer or two on the testing staff. It’s nice work for programmers like me, who get bored with long term production coding.