Rethinking Equivalence Class Partitioning, Part 1

Wikipedia’s article on equivalence class partitioning (ECP) is a great example of the poor thinking and teaching and writing that often passes for wisdom in the testing field. It’s narrow and misleading, serving to imply that testing is some little game we play with our software, rather than an open investigation of a complex phenomenon.

(No, I’m not going to edit that article. I don’t find it fun or rewarding to offer my expertise in return for arguments with anonymous amateurs. Wikipedia is important because it serves as a nearly universal reference point when criticizing popular knowledge, but just like popular knowledge itself, it is not fixable. The populace will always prevail, and the populace is not very thoughtful.)

In this article I will comment on the Wikipedia post. In a subsequent post I will describe ECP my way, and you can decide for yourself if that is better than Wikipedia.

“Equivalence partitioning or equivalence class partitioning (ECP)[1] is a software testing technique that divides the input data of a software unit into partitions of equivalent data from which test cases can be derived.”

Not exactly. There’s no reason why ECP should be limited to “input data” as such. The ECP thought process may be applied to output, or even to versions of products, test environments, or test cases themselves. ECP applies to anything you might consider doing that involves variations which may influence the outcome of a test.

Yes, ECP is a technique, but a better word for it is “heuristic.” A heuristic is a fallible method of solving a problem. ECP is extremely fallible, and yet useful.

“In principle, test cases are designed to cover each partition at least once. This technique tries to define test cases that uncover classes of errors, thereby reducing the total number of test cases that must be developed.”

This text is pretty good. Note the phrase “In principle” and the use of the word “tries.” These are softening words, which are important because ECP is a heuristic, not an algorithm.

Speaking in terms of “test cases that must be developed,” however, is a misleading way to discuss testing. Testing is not about creating test cases. It is for damn sure not about the number of test cases you create. Testing is about performing experiments. And the totality of experimentation goes far beyond such questions as “what test case should I develop next?” The text should instead say “reducing test effort.”

“An advantage of this approach is reduction in the time required for testing a software due to lesser number of test cases.”

Sorry, no. The advantage of ECP is not in reducing the number of test cases. Nor is it even about reducing test effort, as such (even though it is true that ECP is “trying” to reduce test effort). ECP is just a way to systematically guess where the bigger bugs probably are, which helps you focus your efforts. ECP is a prioritization technique. It also helps you explain and defend those choices. Better prioritization does not, by itself, allow you to test with less effort, but we do want to stumble into the big bugs sooner rather than later. And we want to stumble into them with more purpose and less stumbling. And if we do that well, we will feel comfortable spending less effort on the testing. Reducing effort is really a side effect of ECP.

“Equivalence partitioning is typically applied to the inputs of a tested component, but may be applied to the outputs in rare cases. The equivalence partitions are usually derived from the requirements specification for input attributes that influence the processing of the test object.”

Typically? Usually? Has this writer done any sort of research that would substantiate that? No.

ECP is a process that we all do informally, not only in testing but in our daily lives. When you push open a door, do you consciously decide to push on a specific square centimeter of the metal push plate? No, you don’t. You know that for most doors it doesn’t matter where you push. All pushable places are more or less equivalent. That is ECP! We apply ECP to anything that we interact with.

Yes, we apply it to output. And yes, we can think of equivalence classes based on specifications, but we also think of them based on all other learning we do about the software. We perform ECP based on all that we know. If what we know is wrong (for instance if there are unexpected bugs) then our equivalence classes will also be wrong. But that’s okay, if you understand that ECP is a heuristic and not a golden ticket to perfect testing.

“The fundamental concept of ECP comes from equivalence class which in turn comes from equivalence relation. A software system is in effect a computable function implemented as an algorithm in some implementation programming language. Given an input test vector some instructions of that algorithm get covered, ( see code coverage for details ) others do not…”

At this point the article becomes Computer Science propaganda. This is why we can’t have nice things in testing: as soon as the CS people get hold of it, they turn it into a little logic game for gifted kids, rather than a pursuit worthy of adults charged with discovering important problems in technology before it’s too late.

The fundamental concept of ECP has nothing to do with computer science or computability. It has to do with logic. Logic predates computers. An equivalence class is simply a set. It is a set of things that share some property. The property of interest in ECP is utility for exploring a particular product risk. In other words, an equivalence class in testing is an assertion that any member of that particular group of things would be more or less equally able to reveal a particular kind of bug if it were employed in a particular kind of test.

If I define a “test condition” as something about a product or its environment that could be examined in a test, then I can define equivalence classes like this: An equivalence class is a set of tests or test conditions that are equivalent with respect to a particular product risk, in a particular context. 

This implies that two inputs which are not equivalent for the purposes of one kind of bug may be equivalent for finding another kind of bug. It also implies that if we model a product incorrectly, we will also be unable to know the true equivalence classes. Actually, considering that bugs come in all shapes and sizes, to have the perfectly correct set of equivalence classes would be the same as knowing, without having tested, where all the bugs in the product are. This is because ECP is based on guessing what kind of bugs are in the product.
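If you like to see the idea as code, here is a toy sketch in Python (the risk, the inputs, and the classification are all hypothetical, not from any real product): group candidate inputs by a property we believe matters to one particular risk, then pick one representative per class.

```python
def partition(inputs, signature):
    """Group inputs into equivalence classes, keyed by a 'signature'
    function capturing the property we believe matters to the risk."""
    classes = {}
    for value in inputs:
        classes.setdefault(signature(value), []).append(value)
    return classes

# Hypothetical risk: the product mishandles sign or zero in a numeric field.
candidates = [-50, -1, 0, 1, 7, 99]
classes = partition(
    candidates,
    lambda n: "neg" if n < 0 else ("zero" if n == 0 else "pos"),
)

# One representative per class is "good enough" -- for THIS risk only.
representatives = [members[0] for members in classes.values()]
print(sorted(classes))  # ['neg', 'pos', 'zero']
print(representatives)  # [-50, 0, 1]
```

Notice that the classes dissolve the moment we consider a different risk: a classification built for sign bugs says nothing about, say, display-overflow bugs.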

If you read the technical stuff about Computer Science in the Wikipedia article, you will see that the author has decided that two inputs which cover the same code are therefore equivalent for bug finding purposes. But this is not remotely true! This is a fantasy propagated by people who I suspect have never tested anything that mattered. Off the top of my head, code-coverage-as-gold-standard ignores performance bugs, requirements bugs, usability bugs, data type bugs, security bugs, and integration bugs. Imagine two tests that cover the same code, and both involve input that is displayed on the screen, except that one includes an input which is so long that when it prints it goes off the edge of the screen. This is a bug that the short input didn’t find, even though both inputs are “valid” and “do the same thing” functionally.
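Here is a contrived Python sketch of that example (everything in it is hypothetical): both inputs execute exactly the same statements, so code coverage is identical, yet only one of them can reveal the rendering problem.

```python
SCREEN_WIDTH = 40  # hypothetical display constraint

def format_label(text):
    # A single straight-line code path: every input gets identical coverage.
    return "Name: " + text

short_label = format_label("Ann")
long_label = format_label("A" * 100)

# The bug lives in a relationship with the screen, not in a code path.
print(len(short_label) <= SCREEN_WIDTH)  # True  -> fits on the screen
print(len(long_label) <= SCREEN_WIDTH)   # False -> runs off the edge
```

No coverage metric distinguishes these two tests, but a tester who models the display as part of the product will.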

The Fundamental Problem With Most Testing Advice Is…

The problem with most testing advice is that it is either uncritical folklore that falls apart as soon as you examine it, or else it is misplaced formalism that doesn’t apply to realistic open-ended problems. Testing advice is better when it is grounded in a general systems perspective as well as a social science perspective. Both of these perspectives understand and use heuristics. ECP is a powerful, ubiquitous, and rather simple heuristic, whose utility comes from and is limited by your mental model of the product. In my next post, I will walk through an example of how I use it in real life.

We. Use. Tools.

Context-Driven testers use tools to help ourselves test better. But, there is no such thing as test automation.

Want details? Here’s the 10,000 word explanation that Michael Bolton and I have been working on for months.

Editor’s Note: I have just posted version 1.03 of this article. This is the third revision we have made due to typos. Isn’t it interesting how hard it is to find typos in your own work before you ship an article? We used automation to help us with spelling, of course, but most of the typos are down to properly spelled words that are in the wrong context. Spelling tools can’t help us with that. Also, Word spell-checker still thinks there are dozens of misspelled words in our article, because of all the proper nouns, terms of art, and neologisms. Of course there are the grammar checking tools, too, right? Yeah… not really. The false positive rate is very high with those tools. I just did a sweep through every grammar problem the tool reported. Out of the five it thinks it found, only one, a missing hyphen, is plausibly a problem. The rest are essentially matters of writing style.

One of the lines it complained about is this: “The more people who use a tool, the more free support will be available…” The grammar checker thinks we should not say “more free” but rather “freer.” This may be correct, in general, but we are using parallelism, a rhetorical style that we feel outweighs the general rule about comparatives. Only humans can make these judgments, because the rules of grammar are sometimes fluid.

Reinventing Testing: What is Integration Testing? (part 2)

These thoughts have become better because of these specific commenters on part 1: Jeff Nyman, James Huggett, Sean McErlean, Liza Ivinskaia, Jokin Aspiazu, Maxim Mikhailov, Anita Gujarathi, Mike Talks, Amit Wertheimer, Simon Morley, Dimitar Dimitrov, John Stevenson. Additionally, thank you Michael Bolton, and thanks to the student whose productive confusion helped me discover a blind spot in my work, Anita Gujarathi.

Integration testing is a term I don’t use much– not because it doesn’t matter, but because it is so fundamental that it is already baked into many of the other working concepts and techniques of testing. Still, in the past week, I decided to upgrade my ability to quickly explain integration, integration risk, and integration testing. This is part of a process I recommend for all serious testers. I call it: reinventing testing. Each of us may reinvent testing concepts for ourselves, and engage in vigorous debates about them (see the comments on part 1, which is now the most commented of any post I have ever done).

For those of you interested in getting to a common language for testing, this is what I believe is the best way we have available to us. As each of us works to clarify his own thinking, a de facto consensus about reasonable testing ontology will form over time, community by community.

So here we go…

There are several kinds of testing that involve, overlap with, or may even be synonymous with integration testing, including: regression testing, system testing, field testing, interoperability testing, compatibility testing, platform testing, and risk-based testing. Most testing, in fact, no matter what it’s called, is also integration testing.

Here is my definition of integration testing, based on my own analysis, conversations with RST instructors (mainly Michael Bolton), and stimulated by the many commenters from part 1. All of my assertions and definitions are true within the Rapid Software Testing methodology namespace, which means that you don’t have to agree with me unless you claim to be using RST.

What is integration testing?

Integration testing is:
1. Testing motivated by potential risk related to integration.
2. Tests designed specifically to assess risk related to integration.


1. “Motivated by” and “designed specifically to” overlap but are not the same. For instance, if you know that a dangerous criminal is on the loose in your neighborhood you may behave in a generally cautious or vigilant way even if you don’t know where the criminal is or what he looks like. But if you know what he looks like, what he is wearing, how he behaves or where he is, you can take more specific measures to find him or avoid him. Similarly, a newly integrated product may create a situation where any kind of testing may be worth doing, even if that testing is not specifically aimed at uncovering integration bugs, as such; OR you can perform tests aimed at exposing just the sort of bugs that integration typically causes, such as by performing operations that maximize the interaction of components.

The phrase “integration testing” may therefore represent ANY testing performed specifically in an “integration context”, or applying a specific “integration test technique” in ANY context.

This is a special case of the difference between risk-based test management and risk-based test design. The former assigns resources to places where there is potential risk but does not dictate the testing to be performed; whereas the latter crafts specific tests to examine the product for specific kinds of problems.

2. “Potential risk” is not the same as “risk.” Risk is the danger of something bad happening, and it can be viewed from at least three perspectives: probability of a bad event occurring, the impact of that event if it occurs, and our uncertainty about either of those things. A potential risk is a risk about which there is substantial uncertainty (in other words, you don’t know how likely the bug is to be in the product or you don’t know how bad it could be if it were present). The main point of testing is to eliminate uncertainty about risk, so this often begins with guessing about potential risk (in other words, making wild guesses, educated guesses, or highly informed analyses about where bugs are likely to be).

Example: I am testing something for the first time. I don’t know how it will deal with stressful input, but stress often causes failure, so that’s a potential risk. If I were to perform stress testing, I would learn a lot about how the product really handles stress, and the potential risk would be transformed into a high risk (if I found serious bugs related to stress) or a low risk (if the product handled stress in a consistently graceful way).
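That transformation can be sketched in Python (the product, its hidden limit, and the probe are all made up for illustration): hammer the thing with escalating input sizes and record where grace ends.

```python
def handle_request(payload):
    # Stand-in for the product under test, with a hidden limit.
    if len(payload) > 10_000:
        raise MemoryError("overloaded")
    return len(payload)

def probe_stress(sizes):
    """Turn a potential risk into information: graceful or not, per size."""
    results = {}
    for size in sizes:
        try:
            handle_request("x" * size)
            results[size] = "graceful"
        except Exception as exc:
            results[size] = "failed: " + type(exc).__name__
    return results

print(probe_stress([10, 1_000, 100_000]))
# {10: 'graceful', 1000: 'graceful', 100000: 'failed: MemoryError'}
```

Before the probe, stress was a potential risk; after it, we know where the product degrades and how, which is what lets us call the risk high or low.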

What is integration?

General definition from the Oxford English Dictionary: “The making up or composition of a whole by adding together or combining the separate parts or elements; combination into an integral whole: a making whole or entire.”

Based on this, we can make a simple technical definition related to products:

Integration is:
v. the process of constructing a product from parts.
n. a product constructed from parts.

Now, based on General Systems Theory, we make these assertions:

An integration, in some way and to some degree:

  1. Is composed of parts:
  • …that come from differing sources.
  • …that were produced for differing purposes.
  • …that were produced at different times.
  • …that have differing attributes.
  2. Creates or represents an internal environment for its parts:
  • …in which its parts interact among themselves.
  • …in which its parts depend on each other.
  • …in which its parts interact with or depend on an external environment.
  • …in which these things are not visible from the outside.
  3. Possesses attributes relative to its parts:
  • …that depend on them.
  • …that differ from them.

Therefore, you might not be able to discern everything you want to know about an integration just by looking at its parts.

This is why integration risk exists. In complex or important systems, integration testing will be critically important, especially after changes have been made.

It may be possible to gain enough knowledge about an integration to characterize the risk (or to speak more plainly: it may be possible to find all the important integration bugs) without doing integration testing. You might be able to do it with unit testing. However, that process, although possible in some cases, might be impractical. This is the case partly because the parts may have been produced by different people with different assumptions, because it is difficult to simulate the environment of an integration prior to actual integration, or because unit testing tends to focus on what the units CAN do and not on what they ACTUALLY NEED to do. (If you unit test a calculator, that’s a lot of work. But if that calculator will only ever be asked to add numbers under 50, you don’t need to do all that work.)
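The calculator point, as a Python sketch (the limitation and the “needed” range are invented): the unit has a latent flaw, but this particular integration never exercises it.

```python
def add(a, b):
    # Stand-in for the calculator unit; it has a latent limitation.
    if a > 1_000_000 or b > 1_000_000:
        raise OverflowError("unsupported magnitude")
    return a + b

# What the unit CAN do is a huge domain. What this product NEEDS is tiny:
# it will only ever add numbers under 50.
for a, b in [(0, 0), (49, 49), (25, 10)]:
    assert add(a, b) == a + b

# The OverflowError above never matters in THIS integration, which is why
# testing the part in its real context can be far cheaper than unit
# testing everything the part can conceivably do.
```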

Integration testing, although in some senses complex, may actually simplify your testing, since some parts mask the behavior of other parts and maybe all you need to care about is the final outputs.


1. “In some way and to some degree” means that these assertions are to be interpreted heuristically. In any specific situation, these assertions are highly likely to apply in some interesting or important way, but might not. An obvious example is where I wrote above that the parts “interact among themselves.” The stricter truth is that the parts within an integration probably do not EACH directly interact with ALL the other ones, and probably do not interact to the same degree and in the same ways. To think of it heuristically, interpret it as a gentle warning such as “if you integrate something, make it your business to know how the parts might interact or depend on each other, because that knowledge is probably important.”

By using the phrase “in some way and to some degree” as a blanket qualifier, I can simplify the rest of the text, since I don’t have to embed other qualifiers.

2. “Constructing from parts” does not necessarily mean that the parts pre-existed the product, or have a separate existence outside the product, or are unchanged by the process of integration. It just means that we can think productively about pieces of the product and how they interact with other pieces.

3. A product may possess attributes that none of its parts possess, or that differ from them in unanticipated or unknown ways. A simple example is the stability of a tripod, which is not found in any of its individual legs, but in all the legs working together.

4. Disintegration also creates integration risk. When you take things away, or take things apart, you end up with a new integration, and that is subject to much the same risk as putting them together.

5. The attributes of a product and all its behaviors obviously depend largely on the parts that comprise it, but also on other factors such as the state of those parts, the configurations and states of external and internal environments, and the underlying rules by which those things operate (ultimately, physics, but more immediately, the communication and processing protocols of the computing environment).

6. Environment refers to the outside of some object (an object being a product or a part of a product), comprising factors that may interact with that object. A particular environment might be internal in some respects or external in other respects, at the same time.

  • An internal environment is an environment controlled by the product and accessible only to its parts. It is inside the product, but from the vantage point of some of its parts, it’s outside of them. For instance, to a spark plug the inside of an engine cylinder is an environment, but since it is not outside the car as a whole, it’s an internal environment. Technology often consists of deeply nested environments.
  • An external environment is an environment inhabited but not controlled by the product.
  • Control is not an all-or-nothing thing. There are different levels and types of control. For this reason it is not always possible to strictly identify the exact scope of a product or its various and possibly overlapping environments. This fact is much of what makes testing– and especially security testing– such a challenging problem. A lot of malicious hacking is based on the discovery that something that the developers thought was outside the product is sometimes inside it.

7. An interaction occurs when one thing influences another thing. (A “thing” can be a part, an environment, a whole product, or anything else.)

8. A dependency occurs when one thing requires another thing to perform an action or possess an attribute (or not to) in order for the first thing to behave in a certain way or fulfill a certain requirement. See connascence and coupling.

9. Integration is not all or nothing– there are differing degrees and kinds. A product may be accidentally integrated, in that it works using parts that no one realizes that it has. It may be loosely integrated, such as a gecko that can jettison its tail, or a browser with a plugin. It may be tightly integrated, such as when we take the code from one product and add it to another product in different places, editing as we go. (Or when you digest food.) It may preserve the existing interfaces of its parts or violate them or re-design them or eliminate them. The integration definition and assertions, above, form a heuristic pattern– a sort of lens– by which we can make better sense of the product and how it might fail. Different people may identify different things as parts, environments or products. That’s okay. We are free to move the lens around and try out different perspectives, too.

Example of an Integration Problem


This diagram shows a classic integration bug: dueling dependencies. In the top two panels, two components are happy to work within their own environments. Neither is aware of the other while they work on, let’s say, separate computers.

But when they are installed together on the same machine, it may turn out that each depends on factors that exclude the other, even though the components themselves don’t clash (the blue A box and the blue B boxes don’t overlap). Often such dependencies are poorly documented, and may be entirely unknown to the developer before integration time.
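A toy model of dueling dependencies in Python (all names invented): each component’s requirements are satisfiable alone, but not together.

```python
component_a = {"name": "A", "needs": {"runtime": "v1"}}
component_b = {"name": "B", "needs": {"runtime": "v2"}}

def can_coexist(*components):
    """True if a single environment can satisfy every component's needs."""
    merged = {}
    for component in components:
        for factor, required in component["needs"].items():
            if factor in merged and merged[factor] != required:
                return False  # the environment cannot satisfy both at once
            merged[factor] = required
    return True

print(can_coexist(component_a))               # True: happy alone
print(can_coexist(component_b))               # True: happy alone
print(can_coexist(component_a, component_b))  # False: dueling dependencies
```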

It is possible to discover this through unit testing… but so much easier and probably cheaper just to try to integrate sooner rather than later and test in that context.


Re-Inventing Testing: What is Integration Testing? (Part 1)

(Thank you, Anne-Marie Charrett, for reviewing my work and helping with this post.)

One of the reasons I obsessively coach other testers is that they help me test my own expertise. Here is a particularly nice case of that, while working with a particularly bright and resilient student, Anita Gujarathi (whose full name I am using here with her permission).

The topic was integration testing. I chose it from a list of skills Anita made for herself. It stood out because integration testing is one of those labels that everyone uses, yet few can define. Part of what I do with testers is help them become aware of things that they might think they know, yet may have only a vague intuition about. Once we identify those things, we can study and deepen that knowledge together.

Here is the start of our conversation (with minor edits for grammar and punctuation, and commentary in brackets):

What do you mean by integration testing?
[As I ask her this question I am simultaneously asking myself the same question. This is part of a process known as transpection. Also, I am not looking for “one right answer” but rather am exploring and exercising her thought processes, which is called the Socratic Method.]

Integration test is the test conducted when we are integrating two or more systems.
[This is not a wrong answer, but it is shallow, so I will press for more details.

By shallow, I mean that it leaves out a lot of detail and nuance. A shallow answer may be fine in a lot of situations, but in coaching it is a black box that I must open.]

What do you mean by integrated?

That means kind of joining two systems such that they give and take data.
[This is a good answer but again it is shallow. She said “kind of” which I take as a signal that she may be not quite sure what words to use. I am wondering if she understands the technical aspects of how components are joined together during integration. For instance, when two systems share an operating space, they may have conflicting dependencies which may be discovered only in certain situations. I want to push for a more detailed answer in order to see what she knows about that sort of thing.]

What does it mean to join two systems?
[This process is called “driving to detail” or “drilling down”. I just keep asking for more depth in the answer by picking key ideas and asking what they mean. Sometimes I do this by asking for an example.]

For example, there is an application called WorldMate which processes the itineraries of the travellers and generates an XML file, and there is another application which creates the trip in its own format to track the travellers using that XML.
[Students will frequently give me an example when they don’t know how to explain a concept. They are usually hoping I will “get it” and thus release them from having to explain anything more. Examples are helpful, of course, but I’m not going to let her off the hook. I want to know how well she understands the concept of joining systems.

The interesting thing about this example is that it illustrates a weak form of integration– so weak that if she doesn’t understand the concept of integration well enough, I might be able to convince her that no integration is illustrated here.

What makes her example a case of weak integration is that the only point of contact between the two programs is a file that uses a standardized format. No other dependencies or mode of interaction is mentioned. This is exactly what designers do when they want to minimize interaction between components and eliminate risks due to integration.]

I still don’t know what it means to join two systems.
[This is because an example is not an explanation, and can never be an explanation. If someone asks what a flower is and you hold up a rose, they still know nothing about what a flower is, because you could hold up a rose in response to a hundred other such questions: what is a plant? what is a living thing? what is botany? what is a cell? what is red? what is carbon? what is a proton? what is your favorite thing? what is advertising? what is danger? Each time the rose is an answer to some specific aspect of the question, but not all aspects. How do you know what the example of a rose actually refers to? Without an explanation, you are just guessing.]

I am coming to that. So, here we are joining WorldMate (which is third-party application) to my product so that when a traveller books a ticket from a service and receives the itinerary confirmation email, it then goes to WorldMate which generates XML to give it to my product. Thus, we have joined or created the communication between WorldMate and my application.
[It’s nice that Anita asserts herself, here. She sounds confident.

What she refers to is indeed communication, although not a very interesting form of communication in the context of integration risk. It’s not the sort of communication that necessarily requires integration testing, because the whole point of using XML structures is to cleanly separate two systems so that you don’t have to do anything special or difficult to integrate them.]

I still don’t see the answer to my question. I could just as easily say the two systems are not joined, but rather independent. What does join really mean?
[I am pretending not to see the answer in order to pressure her for more clarity. I won’t use this tactic as a coach unless I feel that the student is reasonably confident.]

Okay, basically when I say join I mean that we are creating the communication between the two systems.
[This is the beginning of a good answer, but her example shows only a weak sort of communication.]

I don’t see any communication here. One creates an XML, the other reads it. Neither knows about the other.
[It was wrong of me to say I don’t see any communication. I should have said it was simplistic communication. What I was trying to do is provoke her to argue with me, but I regret saying it so strongly.]

It is a one-way communication.
[I agree it is one-way. That’s part of why I say it is a weak form of integration.]

Is Google integrated with Bing?
[One major tactic of the Socratic method is to find examples that seem to fit the student’s idea and yet refute what they were trying to prove. I am trying to test what Anita thinks is the difference between two things that are integrated and two things that are simply “nearby.”]

Ah no?

According to you, they are! Because I can Google something, then I can take the output and feed it to Bing, and Bing will do a search on that. I can Google for a business name and then paste the name into Bing and learn about the business. The example you gave is just an example of two independent programs that happen to deal with the same file.


So, if I test the two independent programs, haven’t I done all the testing that needs to be done? How is integration testing anything more or different or special?

At this point, Anita seems confused. This would be a good time to switch into lecture mode and help her get clarity. Or I could send her away to research the matter. But what I realized in that moment is that I was not satisfied with my own ideas about integration. When I asked myself “what would I say if I were her?” my answers sounded not much deeper than hers. I decided I needed to do some offline thinking about integration testing.

Lots of things in our world are slightly integrated. Some things are very integrated. This seems intuitively obvious, but what exactly is that difference? I’ve thought it through and I have answers now. Before I blog about it, what do you think?

TestInsane’s Mindmaps Are Crazy Cool

Most testing companies offer nothing to the community or the field of testing. They all seem to say they hire only the best experts, but only a very few of them are willing to back that up with evidence. Testing companies, by and large, are all the same, and the sameness is one of mediocrity and mendacity.

But there are a few exceptions. One of them is TestInsane, founded by ex-Moolya co-founder Santosh Tuppad. This is a company to watch.

The wonderful thing about TestInsane is their mindmaps. More than 100 of them. What lovelies! Check them out. They are a fantastic public contribution! Each mindmap tackles some testing-related subject and lists many useful ideas that will help you test in that area.

I am working on a guide to bug reporting, and I found three maps on their site that are helping me cover all the issues that matter. Thank you TestInsane!

I challenge other testing companies to contribute to the craft, as well.

Note: Santosh offered me money to help promote his company. That is a reasonable request, but I don’t do those kinds of deals. If I did that even once I would lose vital credibility. I tell everyone the same thing: I am happy to work for you if you pay me, but I cannot promote you unless I believe in you, and if I believe in you I will promote you for free. As of this writing, I have not done any work for TestInsane, paid or otherwise, but it could happen in the future.

I have done paid work for Moolya, and Per Scholas, both of which I gush about on a regular basis. I believe in those guys. Neither of them pays me to say good things about them, but remember, anyone who works for a company will never say bad things. There are some other testing companies I have worked for that I don’t feel comfortable endorsing, but neither will I complain about them in public (usually… mostly).

Justifying Real Acceptance Testing

This post is not about the sort of testing people talk about when nearing a release and deciding whether it’s done. I have another word for that. I call it “testing,” or sometimes final testing or release testing. Many projects perform that testing in such a perfunctory way that it is better described as checking, according to the distinction between testing and checking that I have previously written about on this blog. As Michael Bolton points out, that checking may be better described as rejection checking, since a “fail” supposedly establishes a basis for saying the product is not done, whereas no amount of “passes” can show that it is done.

Acceptance testing can be defined in various ways. This post is about what I consider real acceptance testing, which I define as testing by a potential acceptor (a customer), performed for the purpose of informing a decision to accept (to purchase or rely upon) a product.

Do we need acceptance testing?

Whenever a business decides to purchase and rely upon a component or service, there is a danger that the product will fail and the business will suffer. One approach to dealing with that problem is to adopt the herd solution: follow the thickest part of the swarm; choose a popular product that is advertised or reputed to do what you want it to do and you will probably be okay. I have done that with smartphones, ruggedized laptops, file-sharing services, etc. with good results, though sometimes I am disappointed.

My business is small. I am nimble compared to almost every other company in the world. My acceptance testing usually takes the form of getting a trial subscription to a service, or downloading the “basic” version of a product. Then I do some work with it and see how I feel. In this way I learned to love Dropbox, despite its troubling security situation (I can’t lock up my Dropbox files) and the significant chance that it will corrupt very large files. (I no longer trust it with anything over half a gig.)

But what if I were advising a large company about whether to adopt a service or product that it will rely upon across dozens or hundreds or thousands of employees? What if the product has been customized or custom built specifically for them? That’s when acceptance testing becomes important.

Doesn’t the Service Level Agreement guarantee that the product will work?

There are a couple of problems with relying on vendor promises. First, the vendor probably isn’t promising total satisfaction. The service “levels” in the contract are probably narrowly and specifically drawn. That means if you don’t think of everything that matters and put that in the contract, it’s not covered. Testing is a process that helps reveal the dimensions of the service that matter.

Second, there’s an issue with timing. By the time you discover a problem with the vendor’s product, you may already be relying on it. You may already have deployed it widely. It may be too late to back out or switch to a different solution. Perhaps your company negotiated remedies in that case, but there are practical limitations to any remedy. If your vendor is very small, they may not be able to afford to fix their product quickly. If your vendor is very large, they may be able to afford to drag their feet on the fixes.

Acceptance testing protects you and makes the vendor take quality more seriously.

Acceptance testing should never be handled by the vendor. I was once hired by a vendor to do penetration testing on their product in order to appease a customer. But the vendor had no incentive to help me succeed in my assignment, nor to faithfully report the vulnerabilities I discovered. It would have been far better if the customer had hired me.

Only the accepting party has the incentive to test well. Acceptance testing should not be pre-agreed or pre-planned in any detail; otherwise the vendor will make sure that the product passes those specific tests. It should be unpredictable, so that the vendor has an incentive to make the product truly meet its requirements in a general sense. It should be adaptive (exploratory) so that any weakness you find can be examined and exploited.

The vendor wants your money. If your company is large enough, and the vendor is hungry, they will move mountains to make the product work well if they know you are paying attention. Acceptance testing, done creatively by skilled testers on a mission, keeps the vendor on its toes.

By explicitly testing in advance of your decision to accept the product, you have a fighting chance to avoid the disaster of discovering too late that the product is a lemon.

My management doesn’t think acceptance testing matters. What do I do?

1. Make the argument, above.
2. Communicate with management, formally, about this. (In writing, so that there is a record.)
It is up to management to make decisions about business risk. They may feel the risk is not worth worrying about. In that case, you must wait and watch. People are usually more persuaded by vivid experiences than by abstract principles, so:
1. Collect specific examples of the problems you are talking about. What bugs have you experienced in vendor products?
2. Collect news reports about bugs in products of your vendors (or other vendors) that have been disruptive.
3. In the event you get to do even a little acceptance testing, make a record of the problems you find and be ready to remind management of that history.


Programmer Pairing with a Tester

My sister, Erica, is not a programmer. Normally she’s not a tester, either. But recently she paired with me, playing a tester role, and spotted bugs while I wrote in Perl. In the process, it became clear to me that testers do not need to become programmers in order to help programmers write programs in real-time.

The Context

While working on the report for the Rapid Testing Intensive, recently, I needed a usable archive of the materials. That meant taking all of the pages, comments, and attachments out of my Confluence site and putting them in a form easier to shuffle, subdivide, organize, refer to, and re-distribute. It would be great if that were a feature of Confluence, but the closest I can get to that is either manually downloading each item or downloading an entire archive and dealing with a big abstract blob of XML and cryptically named files with no extensions.

(Note to Atlassian: Please enhance Confluence to include an archivist-friendly (as opposed to system-administrator-friendly) archive function that separates pages, attachments, and comments into discrete viewable units with reasonable names.)

The Deflection

While Erica catalogued the names of all the attachments representing student work and the person or persons who created them, I was supposed to write a program to extract the corresponding material from the archive. Instead, I procrastinated. I think I checked email, but I admit it’s possible I was playing Ghost Recon or watching episode 13 of Arang and the Magistrate on Hulu. So, when she was ready with the spreadsheet, I hadn’t even started my program.

To cover my laziness, I thought I’d invite her to sit with me while I wrote it… you know, as if I had been waiting for her on purpose to show her the power of code or whatever. I expected her to decline, since like many computer power users, she has no interest in programming, and no knowledge of it.

The Surprising Outcome

She did not decline. She sat with me and we wrote the program together. She found six or seven important bugs while I typed, and many other little ones. The programming was more fun and social for me. I was more energized and focused. We followed up by writing a second, bigger program together. She told me she wants to do more of this kind of work. We both want to do more.

A Question

How does someone who knows nothing about Perl programming, and isn’t even a tester, productively find bugs almost immediately by looking at Perl code?

That’s kind of a misleading question, because that’s not what really happened. She didn’t just look at my code. She looked at my code in the context of me talking to her about what I was trying to do as I was trying to do it. The process unfolded bit by bit, and she followed the logic as it evolved. It doesn’t take any specific personal quality on the part of the “coding companion,” just general ones like alertness, curiosity, and basic symbolic intelligence. It doesn’t take any particular knowledge, although it can help a lot.

Perhaps this would not work well for all kinds of coding. We weren’t working on something that required heaps of fiddly design, or hours of doodling in a notebook to conceive of some obscure algorithm.

My Claim

A completely non-technical but eager and curious companion can help me write code in real-time by virtue of three things:

  1. The dynamic and interactive legibility of the coding process. I narrate what I’m doing as it comes together, step by step. The companion doesn’t eat the whole elephant in one bite; the companion encounters the software mediated by my continuous process of interpretation. I tell him what and why and how. I do this repeatedly, and answer his questions along the way. This makes the process accessible (or, in the parlance I like to use, “legible,” because that word specifically means the accessibility of information). The legibility is not that of a static piece of code, sitting there, but rather a property of something that is changing within a social environment. It’s the same experience as watching a teacher fill a whiteboard with equations. If you came in at the end of the class, it would look bewildering, but if you watched it in process, it would look sensible.
  2. The conceptual simplicity of many bugs. Some bugs are truly devious and subtle, but many have a simple essence or an easily recognized form. As I fix my own bugs and narrate that process, my coding companion begins to pick up on regularities and consistency relationships that must be preserved. The companion programs himself to find bugs, as I go.
  3. The sensemaking faculties of a programmer seeking to resolve the confusion of a companion. When my dogs bark, I want to know why they are barking. I don’t know if there’s a good reason or a bad reason, but I want to resolve the mystery. In the course of doing that, I may learn something important (like “the UPS guy is here”). Similarly, when my coding companion says “I don’t understand why you put the dollar sign there and there, but not over there” my mind is directed to that situation and I need to make sense of it. It may be a bug or not a bug, but that process helps me be clear about what I’m doing, no matter what.

And Therefore…

A tester of any kind can contribute early in a development process, and become better able to test, by pairing with a programmer regardless of his own ability to code.

Behavior-Driven Development vs. Testing

The difference between Behavior-Driven Development and testing:

This is a BDD scenario (from Dan North, a man I respect and admire):

Scenario 1: Account is in credit
Given the account is in credit
And the card is valid
And the dispenser contains cash
When the customer requests cash
Then ensure the account is debited
And ensure cash is dispensed
And ensure the card is returned

This is that BDD scenario turned into testing:

Scenario 1: Account is in credit
Given the account is in credit
And the card is valid
And the dispenser contains cash
When the customer requests cash
Then check that the account is debited
And check that cash is dispensed
And check that the card is returned
And check that nothing happens that shouldn’t happen and everything else happens that should happen for all variations of this scenario and all possible states of the ATM and all possible states of the customer’s account and all possible states of the rest of the database and all possible states of the system as a whole, and anything happening in the cloud that should not matter but might matter.

Do I need to spell it out for you more explicitly? This check is impossible to perform. To get close to it, though, we need human testers. Their sapience turns this impossible check into plausible testing. Testing is a quest within a vast, complex, changing space. We seek bugs. It is not the process of demonstrating that the product CAN work, but of exploring whether it WILL.
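The difference is easy to see if you imagine actually automating the check. Here is a minimal sketch, using a toy ATM model I invented for illustration (none of these class or function names come from Cucumber, jBehave, or any real framework), of what Scenario 1 verifies when it runs as a check:

```python
# A toy ATM model, invented purely to illustrate the scenario;
# nothing here is from any real BDD framework.

class Account:
    def __init__(self, balance):
        self.balance = balance

class ATM:
    def __init__(self, cash_on_hand):
        self.cash_on_hand = cash_on_hand
        self.card_returned = False

    def request_cash(self, account, card_valid, amount):
        # The happy path the scenario describes.
        if card_valid and account.balance >= amount and self.cash_on_hand >= amount:
            account.balance -= amount    # Then: the account is debited
            self.cash_on_hand -= amount  # And: cash is dispensed
            self.card_returned = True    # And: the card is returned
            return amount
        return 0

def check_scenario_1():
    # Given the account is in credit, the card is valid, the dispenser has cash.
    account = Account(balance=100)
    atm = ATM(cash_on_hand=500)
    # When the customer requests cash.
    dispensed = atm.request_cash(account, card_valid=True, amount=40)
    # The three checks from the scenario, and ONLY these three.
    assert account.balance == 60  # account is debited
    assert dispensed == 40        # cash is dispensed
    assert atm.card_returned      # card is returned
    return True
```

The three assertions can pass while anything not asserted (other accounts, the rest of the database, anything in the cloud) is broken. That gap between what is asserted and what matters is exactly what human testers explore.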

I think Dan understands this. I sometimes worry about other people who promote tools like Cucumber or jBehave.

I’m not opposed to such tools (although I continue to suspect that Cucumber is an elaborate ploy to spend a lot of time on things that don’t matter at all) but in the face of them we must keep a clear head about what testing is.

Technique: Paired Exploratory Survey

I named a technique the other day. It’s another one of those things I’ve been doing for a while, but only now has come crisply into focus as a distinct heuristic of testing: the Paired Exploratory Survey (PES).

Definition: A paired exploratory survey is a process whereby two testers confront one product at the same time for the purpose of learning the product, preparing for formal testing, and/or characterizing its quality as rapidly as possible. One tester (the “driver”) is responsible for open-ended play and all direct interaction with the product, while the other tester (the “navigator” or “leader”) acts as documentarian, mission-minder, and co-test-designer.

Here’s a story about it.

Last week, I was on my way home from the CAST conference with my 17-year-old son, Oliver, when a client called me with an emergency assignment: “Get down to L.A. and test our product right away!” I didn’t have time to take Oliver home, so we bought some clean clothes, had Oliver’s ID flown in from Orcas Island by bush plane, and headed to SeaTac.

(I love emergencies. They’re exciting. It’s like James Bond, except that my Miss Moneypenny is named Lenore. I got to the airport and two first class tickets were waiting for us. However, a gentle note to potential clients: making me run around like a secret agent can be expensive.)

This was the first time I had Oliver with me while doing professional testing, so I decided to make use of him as an unpaid intern. Basically, this is the situation any tester is in when he employs a non-tester, such as a domain expert, as a partner. In such situations, the professional tester must ensure that the non-tester is strongly engaged and having good fun. That’s why I like to make that “honorary tester” drive. I get them twiddling the knobs, punching the buttons, and looking for trouble. Then they’ll say “testing is fun” and help me the next time I ask.

(Oliver is a very experienced video gamer. He has played all the major offline games since he was 3 or 4, and the online ones for the last 5 years. I know from playing with him what this means: he can be relentless once he decides to figure out how a system works. I was hoping his gamer instinct would kick in for this, but I was also prepared for him to get bored and wander off. You shouldn’t set your expectations too high with teenagers.)

The client gave us a briefing about how the device is used. I had already studied up on this, but it was new for Oliver. The scene reminded me of that part in the movie Inception where Leonardo DiCaprio explains the dynamics of dream invasion. We have a workstation that controls a power unit and connects to a probe which is connected to a pump. It all looks Frankenstein-y.

(I can’t tell you much about the device, in this case. Let’s just say it zaps the patient with “healing energy” and has nothing whatsoever to do with weaponized subconscious projections.)

I set up a camera so that all the testing would be filmed.

(Video is becoming an indispensable tool in my work. My traveling kit consists of a little solid state Sony cam that plugs into the wall so I don’t have to worry about battery life, a micro-tripod so I can pose the camera at any desired angle, and a terabyte hard drive which stores all the work.)

Then, I began the testing just to demonstrate to Oliver the sort of thing I wanted to do. We would begin with a sanity check of the major functions and flows, while letting ourselves deviate as needed to pursue follow-up testing on anything we found that was anomalous. After about 15 minutes, Oliver became the driver, I became the navigator, and that’s how we worked for the next 6 or 7 hours.

Oliver quickly distinguished himself as a remarkable observer. He noticed flickers on the screen, small changes over time, quirks in the sound the device made. He had a good memory for what he had just been doing, and quickly constructed a mental model of the product.

From the transcript:

“What?!…That could be a problem…check this out…dad…look, right now…settings, unclickable…start…suddenly clickable, during operation…it’s possible to switch its entire mode to something else, when it should be locked!”

and later

“alright… you can’t see the error message every single time because it’s corrupted… but the error message… the error message is exactly what we were seeing before with the sequence bug… the error message comes up for a brief moment and then BOOM, it’s all gone… it’s like… it makes the bug we found with the sequence thing (that just makes it freeze) destructive and takes down the whole system… actually I think that’s really interesting. It’s like this bug is slightly more evolved…”

(You have to read this while imagining the voice of a triumphant teenager who’s just found an easter egg in HALO3. From his point of view, he’s finding ways to “beat the boss of the level.”)

At the start, I frequently took control of the process in order to reproduce the bugs, but as I saw Oliver’s natural enthusiasm and inquisitiveness blossom, I gave him room to run. I explained bug isolation and bug risk and challenged him to find the simplest, yet most compelling form of each problem he uncovered.

Meanwhile, I worked on my notes and recorded time stamps of interesting events. As we moved along, I would redirect him occasionally to collect more evidence regarding specific aspects of the evolving testing story.

How is this different from ordinary paired testing?

Paired testing simply means two testers testing one product on the same system at the same time. A PES is a kind of paired testing.

Exploratory testing means an approach to testing whereby learning, test design, and test execution are mutually supportive activities that run in parallel. A PES is exploratory testing, too.

A “survey session,” in the lingo of Session-Based Test Management, is a test session devoted to learning a product and characterizing the general risks and challenges of testing it, while at the same time noticing problems. A survey session contrasts with analysis sessions, deep coverage sessions, and closure sessions, among possible others that aren’t yet identified as a category. A PES is a survey test session.

It’s all of those things, plus one more thing: the senior tester is the one who takes the notes and makes sure that the right areas are touched and the right general information comes out. The senior tester is in charge of developing a compelling testing story. The senior tester does that so that his partner can get more engaged in the hunt for vital information. This “hunt” is a kind of play. A delicious dance of curiosity and analysis.

There are lots of ways to do paired testing. A PES is one interesting way.

Hey, I’ve done this before!

While testing with my son, I flashed back to 1997, in one of my first court cases, in which I worked with my brother Jon (who is now a director of testing at eBay, but was then a cub tester). Our job was to apply my Good Enough model of quality analysis to a specific product, and I let Jon drive that time, too. I didn’t think to give a name to that process, at the time, other than ET. The concept of paired testing hadn’t even been named in our community until Cem Kaner suggested that we experiment with it at the first Workshop on Heuristic and Exploratory Techniques in 2001.

I have seen different flavors of a PES, too. I once saw a test lead who stepped to the keyboard specifically because he wanted his intern to design the tests. He felt that letting the kid lean back in his chair and talk ideas to the ceiling (as he was doing when I walked in) would be the best way to harness certain technical knowledge the intern had which the test lead did not have. In this way, the intern was actually the driver.

I’m feeling good about the name Paired Exploratory Survey. I think it may have legs. Time will tell.

Here’s the report I filed with the client (all specific details changed, but you can see what the report looks like, anyway).

When Does a Test End?

The short answer is: you never know for sure that a test has ended.

Case in point. The license plate on my car is “tester.” It looks like this:

On December 20th, I received this notice in the mail:

As you see, it seems that the city of Everett, which is located between Orcas Island (where I live) and Seattle (which I occasionally visit), felt that I owed them for a parking violation. This is strange because I have never before parked in Everett, much less received a ticket there. A second reason this is strange is that the case number, apparently, is “111111111”.

At first I thought this was a hoax, but the phone number and address are real. The envelope was sent from Livonia, Michigan, and that turns out to be where Alliance One Receivables Management, Inc. is based. They collect money on behalf of many local governments, so that makes sense. It all looked legitimate, except that I’m not guilty, and the case number is weird.

Then it occurred to me that this may have been a TEST! Imagine a tester checking out the system. He might type “tester” for a license plate, not realizing (or not caring) that someone in Washington actually has that plate. He keys in a fake case number of “111111111” because that’s easy to type, and then he forgets to remove that test data from the database.

Praise the Humans

I called the county clerk’s office to ask about this. At first I was worried, because they used an automated phone service. But I quickly got through to a competent human female. What can humans do? Troubleshoot. She told me that there indeed was a record in their system that I owed them money, but that the case number did not refer to a real case. In fact, she said that the number was incorrectly formatted: all their case numbers start with a “10.”

“This can’t be right,” she said.

“Could it be test data? Are you just starting to use Alliance One?” I asked.

“We’ve been using Alliance One for years. Oh, but we’re just starting to use their electronic ticketing system.”

She told me I was probably right about it being a test, but that she would investigate and get back to me.

A few days later I received this notice:

So, there you have it. Someone ran a test on November 9th that did not conclude until December 23rd, when it was stopped via a court order! Thank you, Judge Timothy B. Odell.

I’m sure this will appear on an episode of Law and Order: Clerical Intent one of these days.

Just imagine if this hadn’t been a parking ticket program, but rather something that told the FBI to go and break down my door…

Morals of the Story

  1. Beware of testing on the production system.
  2. Always give the humans a way to correct the automation when it goes out of control. (Hear that, Skynet?)
  3. You never know when your test is over.
  4. If your name is “tester” or “test” or “testing”, eventually you will show up as test data in somebody’s project. Beware also if your name is “12345”, “asdf”, “qwerty”, “foobar”, or “999999999999999999999999.”
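Moral #2 can be made concrete. Here is a hedged sketch of the kind of sanity check that might have intercepted that notice before it was mailed. The only rule taken from the story is the clerk’s observation that real case numbers start with “10”; the list of placeholder values is my own guess at what test data tends to look like, and all the names here are hypothetical.

```python
# A sanity filter for outgoing notices (hypothetical; the "starts with 10"
# rule comes from the clerk's statement in the story, the rest is my guess).

SUSPICIOUS_VALUES = {"111111111", "12345", "asdf", "qwerty", "foobar", "test", "tester"}

def looks_like_real_case(case_number: str, plate: str) -> bool:
    # Reject obvious placeholder values in either field.
    if case_number in SUSPICIOUS_VALUES or plate.lower() in SUSPICIOUS_VALUES:
        return False
    # Clerk's rule: all real case numbers start with "10".
    if not case_number.startswith("10"):
        return False
    # A number made of one repeated digit is classic keyboard filler.
    if len(set(case_number)) == 1:
        return False
    return True
```

A filter like this doesn’t prevent testing on production, but it gives the automation a tripwire: anything that looks like keyboard filler gets routed to a human instead of a collections agency.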

How Challenging Each Other Helps the Craft

Regular readers know that I’m dissatisfied with the state of the testing industry. It’s a shambles, and will continue to be as long as middle managers in big companies continue to be fat juicy targets for scam-artists (large tool vendors, consulting firms, and certain “professional” organizations) and well-meaning cargo cultists (such as those who think learning testing is the same as memorizing definitions of words and filling in templates).

What I can do about it is to develop my personal excellence, and associate myself with others who wish to do likewise. Someday, perhaps we will attain a critical mass. Perhaps the studious will inherit the Earth.

In that spirit, I’m constantly looking for colleagues, and bouncing ideas off of them to make us all better. I challenge people, and to me this is a virtue. It’s how I separate those who will help the craft from those who probably won’t. Sometimes people don’t react well to my challenges. Sometimes that’s because they are bad people (in my estimation); sometimes it’s because they are good people having a bad day; sometimes it’s because I’m having a bad day; or maybe it’s because I’m a bad person (in their estimation).

Nevertheless, this is a big part of what I do, and I will continue to do it. You have been warned. Also, you have been cheerfully invited to participate.

An Example Challenge and What Came of It

Lanette Creamer, unlike me, is not a werewolf (though she describes me by the politically correct term “hairy”). If you read her tweets and her blog then you also know she’s, uh, what’s the opposite of brutal? Anyway, I bet she owns at least one calendar featuring pictures of kittens in hilarious costumes.

I met Lanette a few years ago, but, as I do with most people, I forgot about her (fun fact: I suffer from a mild case of associative prosopagnosia which, for instance, is why I didn’t recognize my own wife consistently until a few months after we were married). Then I met Lanette again at PNSQC last year, where she made an impression on me as someone easy to talk to. I checked out her blog and liked what I saw.

2009-11-12 11:51:49 jamesmarcusbach: @lanettecream I’ll go look at your blog.

2009-11-12 12:16:23 jamesmarcusbach: Another must read blog for testers. This one by Lanette Creamer (@lanettecream). It’s the attack of the tester ladies.

2009-11-12 12:19:22 jamesmarcusbach: I think I encouraged @lanettecream to blog a couple of years ago, and then forgot to follow-up. Guess that worked out.

One thing I liked is that she identified herself, on her blog, as being a member of the Context-Driven School of testing. It means that I can reasonably expect such a person to be self-critical and to accept a challenge from me, a leader in that school.

A couple of days later I happened to see a paper Lanette wrote about “reducing test case bloat.” It was sitting on the desk of Matt Osborn while I was visiting him at Microsoft. I flipped through it and found a definition of “test case” that bugged me.

“Clinically defined a test case is an input and an expected result. For my purposes it doesn’t matter if a test case is automated or manual so long as it is a planned test. For the purpose of reducing test case bloat, I’d go further and say that it is a test you plan to execute a minimum of once in the product lifecycle.”

Lanette was referencing the IEEE with her definition. I hate the IEEE definition of test case. If I ever meet the guy who wrote it, I will bite him on the nose. It’s a narrow-minded, supercilious idea of test cases, straight from Colonel Blimp. I prefer a broad definition that encompasses the actual field use of the term. For instance: “An instance or variation of a test or test idea.” By my definition, you can point to a list of test ideas, in bullet form, and call them test cases, just like real people at real companies already do, and not be committing a crime against terminology. Also, my definition does not attempt to enforce a specific organization’s notion of what a test must look like. It has to have inputs or it’s not a test? It has to have a specific planned expected result? Not when I test, buddy.

I also didn’t like that Lanette presented it as if it were a universally accepted definition. That’s an appeal to authority, which we in the Context-Driven community do our best to avoid.

From Twitter:

2009-11-14 19:40:16 jamesmarcusbach: @lanettecream Yes, listening to the IEEE is fine if you’re not a true student of testing. But people like us ARE the IEEE (or better).

2009-11-14 19:41:10 jamesmarcusbach: @lanettecream I followed the IEEE, too, for a few years, and then realized that whoever came up with those defs wasn’t very thoughtful.

2009-11-14 19:41:44 jamesmarcusbach: @lanettecream Welcome to software testing leadership, where there is no appeal to authority allowed.

2009-11-14 19:43:22 jamesmarcusbach: @lanettecream The reason I bring it up is that I’ve generalized it myself, and I’m curious if your analysis will reveal something new.

2009-11-14 19:45:18 jamesmarcusbach: @lanettecream One way to frame the question: What exactly do you mean by “input?” What exactly do you mean by “expectation?”

2009-11-14 19:46:50 jamesmarcusbach: @lanettecream I think it’s shallow. I think you can do a lot better. Anyway, I’d be interested to see your analysis of that definition.

2009-11-14 19:48:34 jamesmarcusbach: @lanettecream Another way to say it: maybe that definition is okay– but what does it MEAN? Do you know? Have you really thought it through?

2009-11-14 19:50:49 jamesmarcusbach: @lanettecream IEEE is not a person we can cross-examine. It doesn’t think anything. But for the record, it’s totally wrong about planning!

2009-11-14 19:51:22 jamesmarcusbach: @lanettecream That planning stuff is just propaganda. Ask yourself “what does planning MEAN?”

2009-11-14 19:51:57 jamesmarcusbach: @lanettecream They throw around a lot of words without really thinking about them, it seems to me.

2009-11-14 19:52:41 jamesmarcusbach: @lanettecream I can tell you my opinions of all this. But I’d really love to see you blog about it, first. I’m following your blog now.

[sadly, I cannot obtain Lanette’s side of the conversation because Twitter sucks in that particular way…]

I did worry a little bit that Lanette would freak out and think I was attacking her. I’m a little nervous when engaging women this way, especially, since I have more concern about being seen as a big bully. (A man might see me that way, too, but as a fellow man I would have little sympathy. He just has to learn to cope.)

Dialog with Michael Bolton

While waiting to see what Lanette would come up with, I decided to transpect with Michael Bolton on the same topic in the hope that our good natured arguing would help Lanette feel better about the challenge.

James Bach: hey, to help Lanette, could we transpect through IM?
Michael Bolton: Heh.
Michael Bolton: Sure.
James Bach: then I can show her the transcript
Michael Bolton: If you like.  Pray, proceed.

Michael subsequently edited and published the transcript of that conversation.

During that interaction I came up with a thought experiment with which to question the Lanette/IEEE view of test cases. Can you test a clock when you can’t give it input?

The Clock Problem

I have since used this scenario to help explain to my students what I mean by a test.

Lanette’s Response

To my surprise, she wrote two entries. The first one worried me: What Did I Say a Test Case Was?

I went into damage control mode on Twitter…

2009-11-15 03:27:16 jamesmarcusbach: @lanettecream Your post seems a bit defensive. I wasn’t attacking you, I was trying to find out what you meant by what you said.

2009-11-15 03:30:44 jamesmarcusbach: @lanettecream I want to help real testers, too, and when I seek clarity in myself and other testers, it’s because that helps us avoid waste.

2009-11-15 05:09:55 jamesmarcusbach: @lanettecream I feel better hearing that. Questioning you is, from me, a sign of respect. But I don’t mean to push too hard.

2009-11-15 05:22:46 jamesmarcusbach: @lanettecream From your blog, I can tell you are talented. I’m eager to help your talent blossom. One requirement for that is confidence.

2009-11-15 05:25:55 jamesmarcusbach: @lanettecream One source of confidence is to practice working through ideas with your colleagues.

Lanette tried again. Her second post embraced the spirit of my challenge: What is a Test?

Notice how her second post is in the classic form of an exploratory essay. That’s perfect! I wasn’t asking for an ultimate argument and perfect analysis. I was looking for inquiry, insight, and self-examination.

Why should anyone put up with my challenges?

Well, how about career advancement? This can happen in a couple of ways. First, by publicly accepting and responding to my challenge, she improved her reputation for all to see. She shows that she is someone to be taken seriously, because she takes her own learning seriously. Second, she gained the first level of my gratitude and respect, and these things can be redeemed for professional favors of various kinds. When you are part of a community in good standing, you may holler for support and your fellow citizens will turn out in force to help you. When Lanette puts out a question on Twitter, lots of people will try to answer. It’s a great feeling to know you aren’t alone in the industry.

Lanette was later interviewed by uTest, partly because she impressed some of us on Twitter. I also profiled her in my talk on “buccaneer-testers” at Star East.

I hear that Lanette and my brother are collaborating on something together. I’m eager to see what comes out of that.

Another reason people should put up with challenges is that it makes the industry better. We practice our rhetoric and rapid learning. We grow. I’ve said it many times: the major reason all the terrible misconceptions about testing persist after all these years is that there is a worldwide conspiracy among testing writers and consultants not to debate with each other. Live and let live. Don’t rock the boat that feeds you, etc. Yech.

Finally, there’s personal pride. You feel good about yourself when you can take the heat.

When People Run Away

I don’t mind when people say no to a challenge, unless they are claiming to be expert testers. When a consultant or writer in the field won’t engage me, then I have to dismiss him. I can’t take him seriously, just as I would not expect to be taken seriously if I held myself above the duty of defending my ideas in public. There’s a pretty substantial list of well-known people who are professionally non-existent to me, but I don’t know how else to deal with them. We have to have intellectual standards or we can’t get anywhere.

(I know of a couple of exceptions to that rule, both women, whom I won’t name here. They are people who have strong aversions to debate (at least to debating me) and yet have great ideas and have contributed lots of good to the field. I can never be a close colleague of people like that, but I’m glad that they’re out there.)

Remembering Anna Allison

All this reminds me of Anna Allison. She was a rising star in 2001. I had dinner with her after she approached me at a conference and begged for a conversation (anyone can talk to me, at any time, if they give me food). At dinner, she mentioned that she was a bug metrics expert. I rolled my eyes and drew a bug metrics graph, daring her to tell me what it meant. What followed was a tour de force of questioning and analysis. She uncovered every trap that I had put into the graph. I told her she should write an article about our conversation and she did!

Tragically, she was on one of the planes that went into the Twin Towers on 9/11, on her way to a consulting gig in LA. This affected me more than I expected it to, because while I didn’t know her well, personally, professionally she was one of the few people I’ve known for whom debate was great fun. The Context-Driven community lost a happy tigress in her. We need more leaders like that. We really couldn’t spare her when she left us, and no one like her has yet stepped up: a non-threatening personality who is a role model for debate. I think that may be why I have high hopes for Lanette. (Also for Meeta Prakash, BTW.)

What happened yesterday?

Yesterday I issued a challenge to new blogger Michael Alexander. He responded promptly and in admirable fashion.

Lanette subsequently did a video blog about why she reacted to my challenge so constructively.

These events inspired me to explain all this. And so, I call upon all testers to challenge me, challenge yourselves, and challenge each other. Let’s blow out the cobwebs. Let’s be testers, not followers.

Three New Testing Heuristics

A lot of what I do is give names to testing behaviors and patterns that have been around a long time but that people are not systematically studying or using. I’m not seeking to create a standard language, but simply by applying some kind of terminology, I want to make these patterns easier to apply and to study.

This is a quick note about three testing heuristics I named this week:

Steeplechase Heuristic (of exploratory boundary testing)

When you are exploring boundaries, think of your data as having to get to the boundary and then having to go other places down the line. Picture it as one big obstacle course with the boundary you are testing right in the middle.

Then consider that very large, long, extreme data that the boundary is designed to stop might founder on some obstacle before it ever gets to the boundary you want to test. In other words, a limit of 1,000 characters on a field might work fine unless you paste 1,000,000 characters in, in which case it may crash the program instantly before the boundary check ever gets a chance to reject the data.

But also look downstream, and consider that extreme data which barely gets by your boundary may get mangled on another boundary down the road. So don’t just stop testing when you see one boundary is handled properly. Take that data all around to the other functions that process it.
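
The steeplechase pattern above can be sketched in code. Everything here is hypothetical (the pipeline stages, the limits, the names); the point is only the shape of the problem: extreme data can die upstream before reaching the boundary you care about, or slip past it and get mangled downstream.

```python
# Hypothetical sketch of the Steeplechase Heuristic: the boundary under
# test sits in the middle of a pipeline, so test data must survive
# upstream obstacles and must also be followed downstream.

MAX_NAME_LEN = 1000  # the boundary we actually want to test

def receive(raw: str) -> str:
    # Upstream obstacle: a transport layer with its own hidden limit.
    if len(raw) > 100_000:
        raise MemoryError("payload too large for receive buffer")
    return raw

def validate_name(name: str) -> str:
    # The boundary check we intend to exercise.
    if len(name) > MAX_NAME_LEN:
        raise ValueError("name exceeds 1000 characters")
    return name

def store(name: str) -> str:
    # Downstream obstacle: a column that silently truncates at 255 chars.
    return name[:255]

def pipeline(raw: str) -> str:
    return store(validate_name(receive(raw)))

# A 1,000,000-character input never reaches the boundary check:
try:
    pipeline("x" * 1_000_000)
except MemoryError:
    print("foundered upstream, before the boundary")

# A 999-character input passes the boundary but is mangled downstream:
result = pipeline("x" * 999)
print(len(result))  # 255, not 999: truncated after the boundary
```

Stopping at `validate_name` would have missed both bugs, which is exactly the heuristic’s warning.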

Galumphing (style of test execution)

Galumphing means doing something in a deliberately over-elaborate way. I’ve been doing this for a long time in my test execution. I add lots of unnecessary but inert actions that are inexpensive and shouldn’t (in theory) affect the test outcome. The idea is that sometimes (surprise!) they do affect it, and I get a free bug out of it.

An example is how I frequently click on background areas of windows while moving my mouse pointer to the button I intend to push. Clicking on blank space shouldn’t matter, right? Doesn’t hurt, right?

I actually learned the term from the book “Free Play” by Stephen Nachmanovitch, who pointed out that it is justified by the Law of Requisite Variety. But I didn’t connect it with my test execution practice until jogged by a student in my recent Sydney testing class, Ted Morris Dawson.
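
Here is one toy way the pattern might look in code. The `Cart` class and its methods are invented for illustration; the point is the interleaving of cheap, theoretically inert actions between the steps the test actually cares about.

```python
# A sketch of galumphing: between the steps we actually care about,
# sprinkle inexpensive actions that in theory change nothing. If one
# of them ever does change the outcome, that is a free bug.

class Cart:
    def __init__(self):
        self.items = {}

    def add(self, name, qty):
        self.items[name] = self.items.get(name, 0) + qty

    def remove(self, name, qty):
        self.items[name] = self.items.get(name, 0) - qty
        if self.items[name] <= 0:
            del self.items[name]

    def total_quantity(self):
        return sum(self.items.values())

def galumph(cart):
    # Inert actions: the add/remove pair should cancel out, and the
    # query is read-only. None of this should affect the test.
    cart.add("decoy", 1)
    cart.remove("decoy", 1)
    cart.total_quantity()

def test_add_three_items():
    cart = Cart()
    for name in ["apple", "banana", "cherry"]:
        galumph(cart)          # over-elaborate detour before each step
        cart.add(name, 1)
        galumph(cart)          # ...and after it
    assert cart.total_quantity() == 3, cart.items

test_add_three_items()
```

Clicking on blank window space while mousing toward a button is the manual-testing equivalent of `galumph()` here.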

Creep & Leap (for pattern investigation)

If you think you understand the pattern of how a function works, try performing some tests that just barely violate that pattern (expecting an error or some different behavior), and try some tests that boldly take that behavior to an extreme without violating it. The former I call creeping; the latter is leaping.

The point here is that we are likely to learn a little more from a mildly violating test than from a hugely violating test because the mildly violating test is much more likely to surprise us, and the surprise will be easier to sort out.

Meanwhile, stretching legal input and expectations as far as they can reasonably go also can teach us a lot.

Creep & Leap is useful for investigating boundaries, of course, but it also works in situations without classic boundaries, such as when we creep by feeding a function a type of data it is supposed to reject.
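
A minimal sketch of Creep & Leap, using an invented validator that accepts integer percentages from 0 to 100. The creeping inputs barely violate the expected pattern (off-by-one values, a float, a string, a bool), while the leaping inputs push legal values to their extremes.

```python
# Hypothetical validator: accepts integers 0..100 inclusive.
def parse_percentage(value):
    # bool is a subclass of int in Python, so screen it out explicitly.
    if not isinstance(value, int) or isinstance(value, bool):
        raise TypeError("percentage must be an integer")
    if not 0 <= value <= 100:
        raise ValueError("percentage out of range")
    return value

# Creep: barely violate the pattern we think we understand, expecting
# rejection. A surprise here is small and easy to diagnose.
for bad in [-1, 101, 50.0, "50", True]:
    try:
        parse_percentage(bad)
        print(f"SURPRISE: {bad!r} was accepted")
    except (TypeError, ValueError):
        pass  # the pattern held

# Leap: take legal input to its extremes without violating the pattern.
for legal in [0, 100]:
    assert parse_percentage(legal) == legal
```

Note the `True` case: a naive `isinstance(value, int)` check would accept it, which is exactly the kind of mild surprise creeping is designed to flush out.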

Advice to Lawyers Suing Toyota

A press release by Toyota recently stated:

Toyota’s electronic systems have multiple fail-safe mechanisms to shut off or reduce engine power in the event of a system failure. Extensive testing of this system by Toyota has not found any sign of a malfunction that could lead to unintended acceleration.

Here are some notes for the lawyers suing Toyota. Here is what your testing experts should be telling you:

  • Whoever wrote this, even if he is being perfectly honest, is not in a position to know the status of the testing of Toyota’s acceleration, braking, or fault handling systems. The press release was certainly not written by the lead tester on the project. Toyota would be crazy to let the lead tester anywhere near a keyboard or a microphone.
  • Complete testing of complex hardware/software systems is not possible. But it is possible to do a thorough and responsible job of testing, in conjunction with hazard analysis, risk mitigation, and post-market surveillance. It is also quite expensive, difficult, and time-consuming. So it is normal for management in large companies to put terrible pressure on the technical staff to cut corners. The more management levels between the testers and the CEO, the more likely this is to occur.
  • “Extensive testing” has no fixed meaning. To management, and to anyone not versed in testing, ALL testing LOOKS extensive. This is because testing bores the hell out of most people, and even a little of it seems like a lot. You need to find out exactly what the testing was. Look at the production-time testing but focus on the design-time testing. That’s where you’ll be most likely to find the trouble.
  • Even if testing is extensive in general, you need to find out the design history of the software and hardware, because the testing that was done may have been limited to older versions of the product. Inadequate retesting is a common problem in the industry.
  • If Toyota is found to have used an automated “regression suite” of tests, then you need to look for the problem of inadequate sampling. What happens is that the tests are only covering a tiny fraction of the operational space of the product (a fraction of the states it can be in), and then they just run those over and over. It looks like a lot of testing, but it’s really just the same test again and again. Excellent testing requires active inquiry at all times, not just recycling old actions.
  • If Toyota is found not to have used test automation at all, look for a different kind of sampling problem: limited human resources not being able to retest very extensively.
  • Most testers are not very ambitious and not well trained in testing. No university teaches a comprehensive testing curriculum. Testing is an intellectually demanding craft. In some respects it is an art. Examine the training and background of the testing staff.
  • Examine the culture of testing, too. If the corporate environment is one in which initiative is discouraged or all actions are expected to be explicitly justified (especially using metrics such as test case counts, pass/fail rates, cyclomatic complexity, or anything numerical), then testing will suffer. During discovery, subpoena the actual test reports and test documentation and evaluate that.
  • Any argument Toyota makes about extensiveness of testing that is based on numbers can be easily refuted. Numbers are a smoke-screen.
  • Examine the internal defect tracking systems and specifically look to see how intermittent bugs were handled. A lack of intermittent bug reports certainly would indicate something fishy going on.
  • Examine how the design team handled reports from the field of unintended acceleration. Were they systematically reviewed and researched?
  • Depositions of the testers will be critical (especially testers who left the company). It is typical in large organizations for testers to feel intimidated into silence on critical quality matters. It is typical for them to be cut off from the development team. You want to specifically look for the “normalization of risk” problem that was identified in both the Columbia and Challenger shuttle disasters.
  • If the depositions or documentation show that no one raised any concerns about the acceleration or braking systems, that is a potential smoking gun. What you expect in a healthy organization is a lot of concerns being raised and then dealt with forthrightly.
  • Find out what specific organizational mechanisms were used for “bug triage”, which is the process of examining reported problems and deciding what to do about them. If there was no triage process, that is either a lie or a gross form of negligence.
  • If Toyota claims to have used “proofs of correctness” in their development of the software controllers, that means nothing. First, obviously they would have to have correctly used proofs of correctness. But secondly, proofs of correctness are simply the modern Maginot line of software safety: defects drive right around them. Imagine that the makers of the Titanic provided “proof” that water cannot penetrate steel plates, and therefore the Titanic cannot sink. Yes, steel isn’t porous, but so what? It’s the same with proofs of correctness. They rely on confusing a very specific kind of correctness with the general notion of “things done right.”
  • The anecdotal evidence surrounding unintended acceleration is that it does not only involve acceleration, but also a failure of braking. Furthermore, it’s a very rare occurrence, therefore it’s probably a combination of factors that work together to cause the problem. It’s not surprising at ALL that internal testing under controlled conditions would not reproduce the problem. Look at the history of the crash of USAir Flight 427, which for years went unsolved until the transient mechanism of thermal shock was discovered.
  • You need to get hold of their code and have it independently inspected. Look at the comments in the code, and examine any associated design documentation.
  • Look at how the engineering team was constituted. Were there dedicated full-time testers? Were they co-located with the development team or stuffed off in another location? How often did the testers and developers speak?
  • What were the change control and configuration management processes? How was the code and design modified over time? Were components of it outsourced? Is it possible that no one was responsible for testing all the systems as a whole?
  • What about testability? Was the system designed with testing in mind? Because, if it wasn’t, the expense and difficulty of comprehensive testing would have been much, much higher. Ask if simulators, log files, or any other testability interfaces were used.
  • How did their testing process relate to applicable standards? Was the technical team aware of any such standards?
  • In medical device development, manufacturers are required to do “single-fault condition” testing, where specific individual faults are introduced into the product, and then the product is tested. Did Toyota do this?
  • What specific test techniques and tools did Toyota employ? Compare that to the corpus of commonly known techniques.
  • Toyota cars have “black box” logs that record crucial information. Find out what those logs contain, how to read them, and then subpoena the logs from all cars that may have experienced this problem. Compare with logs from similar unaffected cars.
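
The single-fault-condition idea mentioned above can be sketched simply. Everything here is invented (the controller, the fault names, the fail-safe threshold); the pattern is what matters: inject exactly one fault at a time, then verify that the fail-safe actually engages.

```python
# A minimal sketch of single-fault-condition testing against a purely
# hypothetical throttle controller with one fail-safe: on a sensor
# fault, engine power must drop to a safe level.

class ThrottleController:
    SAFE_POWER = 10  # assumed fail-safe power ceiling, for illustration

    def __init__(self):
        self.sensor_ok = True
        self.engine_power = 100

    def inject_fault(self, fault):
        # Introduce exactly one specific fault into the system.
        if fault == "pedal_sensor_failure":
            self.sensor_ok = False

    def tick(self):
        # Fail-safe logic: reduce power when the sensor reads bad.
        if not self.sensor_ok:
            self.engine_power = min(self.engine_power, self.SAFE_POWER)

# One test per individual fault, each starting from a fresh system.
for fault in ["pedal_sensor_failure"]:
    controller = ThrottleController()
    controller.inject_fault(fault)
    controller.tick()
    assert controller.engine_power <= ThrottleController.SAFE_POWER, fault
```

In a real safety-critical program the fault list would be derived from hazard analysis, and the faults would be injected into hardware or simulators, not a toy class; the discipline of one fault per test run is the point.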

The best thing would be to reproduce the problem in an unmodified Toyota vehicle, of course. In order to do that, you not only need an automotive engineer and an electrical engineer and a software engineer, you need someone who thinks like a tester.

The unfortunate fact of technological progress is that companies are gleefully plunging ahead with technologies that they can’t possibly understand or fully control. They hope they understand them, of course, but only a few people in the whole company are even competent to decide if that understanding is adequate for the task at hand. Look at the crash of Swissair Flight 111, for instance: a modern aircraft brought down by its onboard entertainment system, killing all aboard. The pilots had no idea it was even possible for an electrical fire to occur in the entertainment system. Nothing on their checklists warned them of it, and they had no way in the cockpit to disable it even if they’d had the notion to. This was a failure of design; a failure of imagination.

Toyota’s future depends on how seriously they take the possibility of novel, multivariate failure modes, and how aggressively they update their ideas of safe design and good testing. Sue them. Sue their pants off. This is how they will take these problems seriously. Let’s hope other companies learn from no-pants Toyota.

Question: How to Rapidly Test Maintenance Releases?

A correspondent writes:

“I have a test management problem. We have a maintenance project. It contains about 20 different applications. Three of them are bigger in terms of features and also the specs that are available. I am told that these applications had more than 1-2 testers on each of these applications. But in this maintenance project we are only 6-7 testers who are responsible to do the required testing. There will be a maintenance release every month and what it will deliver is a few bug fixes and a few CRs. What those bugs and CRs would be is not known in advance. Could you please suggest how to go about managing such kind of assignment?”

My first-order (direct) answer to this question runs about 3,000 words. It’s a slightly improved version of what was originally published in the TASSQ Quarterly, 9/06. It’s a bit too long for comfortable reading in blog form, so here it is as a PDF.

An Aside About Context-Driven Methodology

I also want to say that this PDF is an example of context-driven methodology. Responsible context-driven advice begins by inquiring about the context of the reader, or by specifying the relevant aspects of context in which the presented methods are believed to be helpful. The advice is then presented in a heuristic tone (“this might help”) rather than an imperative tone (“you should do this”). I can use an imperative tone only when I am taking responsibility for the quality of the work; when I am the boss.

Part of the heuristic way of giving advice is to help the reader see reasons, causes, caveats, and alternatives. As a context-driven methodologist, I know my method ideas are tools, not facts. They must be interpreted and applied in specific situations by specific people. For the same reason, I see all methodology as embedded in a dialog. If I tell you “it is helpful to minimize documentation, because documentation is expensive”, I anticipate that you may reply with a question or a challenge, I try to make it easy for you to do that, and I try to be ready and able to answer you. Through the dialog, we all learn and develop better ways of working here and now on this project. In philosophical lingo, context-driven methodologists understand that each practitioner constructs (in a sense, invents) the craft for himself while immersed in a rich world of signs and signals that may guide us.

If you ever see me straying away from the context-driven path of methodology, I hope you will bring that to my attention. It is through the help of my colleagues and clients that I learn to do this better.

Designing Experiments

I experience intellectual work, such as testing, as a web of interconnected activities. If I were to suggest what is at the center of the testing web, on my short list would be: designing experiments. A good test is, ultimately, an experiment.

I’ve been looking around online for some good references about how to design experiments (since most testers I talk to have a lot of trouble with it). Here is a good one.

If you know of any other straightforward description of the logic of experiments, please let me know. I have some good books. I just need more online material.