The CAST Testing Competition

I sponsored the testing competition at CAST, last week, awarding $1,426.00 of my own money to the winners.

My game, my rules, of course, but I tried to be fair and give out the prizes to deserving winners.

There was some controversy…

We set it up with simple rules, and put the onus on the contestants to sort themselves out. The way it worked is that teams signed up during the day (a team could be one tester or many), then at 6pm they received a link to the software. They had to download it, test it, report bugs, and write a report in 4 hours. We set up a website for them to submit reports and receive updates. The developer of the product was sitting in the same ballroom as the contestants, available to anyone who wished to speak with him.

I left the scoring algorithm unexplained, because I wanted the teams to use their testing skills to discover it (that’s how real life works, anyway). A few teams investigated the victory conditions. Most seemed to guess at them. No one associated with the conference organizers could compete for a prize.

During the competition, I made several rounds with my notebook, asking each team what they were doing and challenging them to justify their strategy. Most teams were not particularly crisp or informative in their answers (this is expected, since most testers do not practice their stand-up reporting skills). A few impressed me. When I felt good about an answer, I wrote another star in my notebook next to their name. My objective was partly to help me decide the winner, and partly to make myself available in case a team had any questions.

David reviewed the 350 bug reports, while I analyzed the final test reports. We created a multi-dimensional ordinal scale to aid in scoring.

Awards:

  • Worst Bug Report: Happy Purples ($26)
  • Best Bug Report: In 1st Place ($400)
  • Developer’s Choice Award: Springaby ($200)
  • Best Test Report: Springaby ($800)

These rankings don’t follow any algorithm. We used a heuristic approach. We translated the raw experience data into 1-5 scales (where 5 is OMG and 1 is WTF). David and I discussed and agreed to each assessment, then we looked at the aggregates and decided who would get the awards. My final orderings for best test report (where report means the overall test report, not just the written summary report) are on the left.
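For the curious, here is a minimal sketch (in Perl) of the kind of tabulation we were eyeballing. The dimension names and numbers below are placeholders for illustration, not our actual assessments, and the awards themselves came from judgment while staring at a table like this, not from running a formula.

    #!/usr/bin/perl
    # Sketch: tabulate 1-5 ratings (1 = WTF, 5 = OMG) across a few dimensions
    # per team, then print totals and averages for two humans to argue over.
    use strict;
    use warnings;

    # Placeholder dimensions and ratings -- NOT the actual competition data.
    my %ratings = (
        'Team A' => { bug_reports => 4, test_report => 3, strategy => 4 },
        'Team B' => { bug_reports => 3, test_report => 5, strategy => 4 },
    );

    for my $team (sort keys %ratings) {
        my @scores = values %{ $ratings{$team} };
        my $total  = 0;
        $total += $_ for @scores;
        printf "%-8s total %2d  avg %.1f\n", $team, $total, $total / @scores;
    }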

Note: I don’t have all the names of the testers involved in these teams (I’ll add them if they are sent to me).

Now for the special notes and controversies.

Happy Purples

Happy Purples won $26 for the worst bug report (an award that actually covered two of their reports: one complaining that it was too slow to download the software at the start of the competition, and one complaining that a tooltip was inconsistent with a button title merely because it wasn’t a duplicate of that title).

The Happies were not a very experienced team, and that showed in their developer relations. I thought their overall bug list was not terrible, although it wasn’t very deep, either. They earned the ire of the developer because they tried to defend the weird bug reports mentioned above, and that so offended David that he flipped the bozo bit on them as a team. Hey, that’s realistic. Developers do that. So be careful, testers.

TestMuse

Keith Stobie was the solo tester known as TestMuse. He was a good explainer when I stopped by to challenge him on what he was doing, but I don’t think he took his written test report very seriously. I had a hard time judging from that what he did and why he did it. I know Keith well enough that I think he’s capable of writing a good report, so maybe he didn’t realize it was a major part of the score.

In 1st Place

They didn’t report many bugs (9, I think). But the ones they reported were just the kind the developer was looking for. I don’t remember which report David told me was the bug that won them the Best Bug Report award, but each bug on their list was a solid functionality problem, rather than a nitpicky UI thing. We called these guys the sniper testers, because they picked their shots.

Springaby

A portmanteau of “springbok” and “wallaby”, Springaby consisted of Australian tester Ben Kelly and South African Louise Perold. Like “Hey David!”, the winner of the previous CAST competition (2007), they used the tactic of sitting right next to the developer during the whole four hours. Just like last time, this method worked. It’s so simple: be friendly with the developer, help the developer, ask him questions, and maybe you will win the competition. Springaby won Developer’s Choice, which goes to David’s favorite team based on personal interactions during the competition, and they won for best test report… But mainly that was because Miagi-Do wiped out.

Note: Springaby reported one of their bugs in Japanese. However, the developer took this as a jest and did not mark them down.

Miagi-Do

Miagi-Do was kind of an all-star team, reminiscent of the Canadian team that should have won the competition in 2007 before they were disqualified for having Paul Holland (a conference organizer) on their side. This time we were very clear that no conference organizer could compete for a prize. But Miagi-Do, which consisted mainly of the friends and proteges of Matthew Heusser (a conference organizer), decided they would rather have him on their team and lose the prize money than not have him and lose the fun. Ah, sportsmanship!

The Miagi-Do team was serious from the start. Some prominent names were on it: Markus Gaertner, Ajay Balamurugadas, and Michael Larsen, to name three. Also, gutsy newcomers to our community, Adam Yuret and Elena Houser.

Miagi-Do got the best rating from my walkaround interviews. They were using session-based test management with facilitated debriefings, and Matt grilled me about the scoring. They talked with the developer, and also consulted with my brother Jon about their final report. I expected them to cruise to a clear victory.

In the end, they won the “Spectacular Wipeout” award (an honorary award made up on the spot), for the best example of losing at the last minute. More about that, below.

The Controversy: Bad Report or Bad Call?

Let’s contrast the final reports of Miagi-Do and Springaby.

This is the summary from Miagi-Do. Study it carefully:

Now this is the summary from Springaby:

Bottom line is this: They both criticize the product pretty strongly, but Miagi-Do insulted the developer as well. That’s the spectacular wipeout. David was incensed. He spouted unprintable replies to Miagi-Do.

The reason why Miagi-Do was the goat, while Springaby was the pet, is that Springaby did not impose their own standard of quality onto the product. They did not make a quality judgment. They made descriptive statements about the product. Calling it unstable, for instance, is not to say it’s a bad product. In fact, Springaby was the ONLY team who checked with David about what quality standard is appropriate for his product. The other teams made assumptions about what the standard should be. They were generally reasonable assumptions, but still, the vendor of the product was right there; why assume you know what the intended customer, use, and quality standard is when you can just ask?

Meanwhile, Miagi-Do claimed the product was not “worthy” to be tested. Oh my God. Don’t say things like that, my fellow testers. You can say that the effort to test the product further at this time may not be justified, but that’s not the same thing as questioning the “worthiness” of the product, which is a morally charged word. The reference to black flagging, in this case, also seems gratuitous. I coined the concept of black flagging bugs (my brother came up with the term itself, by borrowing from NASCAR). I like the idea, but it’s not a term you want to pull out and use in a test report unless everyone is already fully familiar with it. The attempt to define it in the test report makes it appear as if the tester is reaching for colorful metaphors to rub in how much the programmer, and his product, suck.

Springaby did not presume to know whether the product was bad or good, just that it was unstable and contained many potentially interesting bugs. They came to a meeting of minds with the developer, instead of dictating to him. Thus, even though both teams concurred in their technical findings, one team pleased their client while the other infuriated him.

This judgment of mine and David’s is controversial, because Adam Yuret, the up-and-coming tester who actually wrote the report, consulted with my brother Jon on the wording. Jon felt that the wording was good, and that the developer should develop a thicker skin. However, Jon wasn’t aware that Miagi-Do was working on the basis of their own imagined quality standard, rather than the one their client actually cared about. I think Adam did the right thing consulting with Jon (although if they had been otherwise eligible to win a prize, that consultation would have disqualified them). Adam tried hard and did what he thought was right. But it turns out the rest of the Miagi-Do team had not fully reviewed the test report, and perhaps if they had, they would have noticed the logical and diplomatic issues with it.

Well, there you go. I feel good about the scoring. I also learned something: most testers are poorly practiced at writing test reports. Start practicing, guys.

Who says ET is good for Medical Devices? The FDA!

In a new guidance document discussing the clinical testing of medical devices, the FDA includes a long section about the value of exploratory testing:

The Importance of Exploratory Studies in Pivotal Study Design

Medical devices often undergo design improvement during development, with evolution and refinement during lifecycles extending from early research through investigational use, initial marketing of the approved or cleared product, and on to later approved or cleared commercial device versions.

For new medical devices, as well as for significant changes to marketed devices, clinical development is marked by the following three stages: the exploratory (first-in-human, feasibility) stage, the pivotal stage (determines the safety and effectiveness of the device), and the postmarket stage (design improvement, better understanding of device safety and effectiveness and development of new intended uses). While these stages can be distinguished, it is important to point out that device development can be an ongoing, iterative process, requiring additional exploratory and pivotal studies as new information is gained and new intended uses are developed. Insights obtained late in development (e.g., from a pivotal study) can raise the need for additional studies, including clinical or non-clinical.

This section focuses on the importance of the exploratory work (in non-clinical and clinical studies) in developing a pivotal study design plan. Non-clinical testing (e.g., bench, cadaver, or animal) can often lead to an understanding of the mechanism of action and can provide basic safety information for those devices that may pose a risk to subjects. The exploratory stage of clinical device development (first-in-human and feasibility studies) is intended to allow for any iterative improvement of the design of the device, advance the understanding of how the device works and its safety, and to set the stage for the pivotal study.

Thorough and complete evaluation of the device during the exploratory stage results in a better understanding of the device and how it is expected to perform. This understanding can help to confirm that the intended use of the device will be aligned with sponsor expectations, and can help with the selection of an appropriate pivotal study design. A robust exploratory stage should also bring the device as close as possible to the form that will be used both in the pivotal trial and in the commercial market. This reduces the likelihood that the pivotal study will need to be altered due to unexpected results, which is an important consideration, since altering an ongoing pivotal study can increase cost, time, and patient resources, and might invalidate the study or lead to its abandonment.

For diagnostic devices, analytical validation of the device to establish performance characteristics such as analytical specificity, precision (repeatability/reproducibility), and limit of detection are often part of the exploratory stage. In addition, for such devices, the exploratory stage may be used to develop an algorithm, determine the threshold(s) for clinical decisions, or develop the version of the device to be used in the clinical study. For both in vivo and in vitro diagnostic devices, results from early clinical studies may prompt device modifications and thus necessitate additional small studies in humans or with specimens from humans.

Exploratory studies may continue even as the pivotal stage of clinical device development gets underway. For example, FDA may require continued animal testing of implanted devices at 6 months, 2 years and 3 years after implant. While the pivotal study might be allowed to begin after the six month data are available, additional data may also need to be collected. For example, additional animal testing might be required if pediatric use is intended. For in vitro diagnostic devices, it is not uncommon for stability testing of the device (e.g., for shelf life) to continue while (or even after) conducting the pivotal study.

While the pivotal stage is generally the definitive stage during which valid scientific evidence is gathered to support the primary safety and effectiveness evaluation of the medical device for its intended use, the exploratory stage should be used to finalize the device design, or the appropriate endpoints for the pivotal stage. This is to ensure that the investigational device is standardized as described in 21 CFR 860.7(f)(2), which states:

“To insure the reliability of the results of an investigation, a well-controlled investigation shall involve the use of a test device that is standardized in its composition or design and performance.”

This is what I’ve been arguing for a couple of years, now. If you want to test a medical device very well, then you have to test it in an exploratory way. This prepares the way for what the FDA here calls the “pivotal study”, which in software terms is basically a scripted demonstration of the product.

Yes, the FDA says, earlier in this guidance document, that it is intended to apply to clinical studies, not necessarily bench testing. But look at the reasoning: this exact reasoning does apply to software development. You might even say it is advocating an agile approach to product design.

Technique: Paired Exploratory Survey

I named a technique the other day. It’s another one of those things I’ve been doing for a while, but only now has come crisply into focus as a distinct heuristic of testing: the Paired Exploratory Survey (PES).

Definition: A paired exploratory survey is a process whereby two testers confront one product at the same time for the purpose of learning the product, preparing for formal testing, and/or characterizing its quality as rapidly as possible, in which one tester (the “driver”) is responsible for open-ended play and all direct interaction with the product, while the other tester (the “navigator” or “leader”) acts as documentarian, mission-minder, and co-test-designer.

Here’s a story about it.

Last week, I was on my way home from the CAST conference with my 17 year-old son Oliver when a client called me with an emergency assignment: “Get down to L.A. and test our product right away!” I didn’t have time to take Oliver home, so we bought some clean clothes, had Oliver’s ID flown in from Orcas Island by bush plane, and headed to SeaTac.

(I love emergencies. They’re exciting. It’s like James Bond, except that my Miss Moneypenny is named Lenore. I got to the airport and two first class tickets were waiting for us. However, a gentle note to potential clients: making me run around like a secret agent can be expensive.)

This was the first time I had Oliver with me while doing professional testing, so I decided to make use of him as an unpaid intern. Basically, this is the situation any tester is in when he employs a non-tester, such as a domain expert, as a partner. In such situations, the professional tester must ensure that the non-tester is strongly engaged and having fun. That’s why I like to make that “honorary tester” drive. I get them twiddling the knobs, punching the buttons, and looking for trouble. Then they’ll say “testing is fun” and help me the next time I ask.

(Oliver is a very experienced video gamer. He has played all the major offline games since he was 3 or 4, and the online ones for the last 5 years. I know from playing with him what this means: he can be relentless once he decides to figure out how a system works. I was hoping his gamer instinct would kick in for this, but I was also prepared for him to get bored and wander off. You shouldn’t set your expectations too high with teenagers.)

The client gave us a briefing about how the device is used. I had already studied up on this, but it was new to Oliver. The scene reminded me of the part in the movie Inception where Leonardo DiCaprio explains the dynamics of dream invasion. We had a workstation that controlled a power unit and connected to a probe, which was connected to a pump. It all looked Frankenstein-y.

(I can’t tell you much about the device, in this case. Let’s just say it zaps the patient with “healing energy” and has nothing whatsoever to do with weaponized subconscious projections.)

I set up a camera so that all the testing would be filmed.

(Video is becoming an indispensable tool in my work. My traveling kit consists of a little solid state Sony cam that plugs into the wall so I don’t have to worry about battery life, a micro-tripod so I can pose the camera at any desired angle, and a terabyte hard drive which stores all the work.)

Then I began the testing, just to demonstrate to Oliver the sort of thing I wanted to do. We would begin with a sanity check of the major functions and flows, while letting ourselves deviate as needed to pursue follow-up testing on anything we found that was anomalous. After about 15 minutes, Oliver became the driver, I became the navigator, and that’s how we worked for the next 6 or 7 hours.

Oliver quickly distinguished himself as a remarkable observer. He noticed flickers on the screen, small changes over time, quirks in the sound the device made. He had a good memory for what he had just been doing, and quickly constructed a mental model of the product.

From the transcript:

“What?!…That could be a problem…check this out…dad…look, right now…settings, unclickable…start…suddenly clickable, during operation…it’s possible to switch its entire mode to something else, when it should be locked!”

and later

“alright… you can’t see the error message every single time because it’s corrupted… but the error message… the error message is exactly what we were seeing before with the sequence bug… the error message comes up for a brief moment and then BOOM, it’s all gone… it’s like… it makes the bug we found with the sequence thing (that just makes it freeze) destructive and takes down the whole system… actually I think that’s really interesting. It’s like this bug is slightly more evolved…”

(You have to read this while imagining the voice of a triumphant teenager who’s just found an easter egg in HALO3. From his point of view, he’s finding ways to “beat the boss of the level.”)

At the start, I frequently took control of the process in order to reproduce the bugs, but as I saw Oliver’s natural enthusiasm and inquisitiveness blossom, I gave him room to run. I explained bug isolation and bug risk and challenged him to find the simplest, yet most compelling form of each problem he uncovered.

Meanwhile, I worked on my notes and recorded time stamps of interesting events. As we moved along, I would redirect him occasionally to collect more evidence regarding specific aspects of the evolving testing story.

How is this different from ordinary paired testing?

Paired testing simply means two testers testing one product on the same system at the same time. A PES is a kind of paired testing.

Exploratory testing means an approach to testing whereby learning, test design, and test execution are mutually supportive activities that run in parallel. A PES is exploratory testing, too.

A “survey session,” in the lingo of Session-Based Test Management, is a test session devoted to learning a product and characterizing the general risks and challenges of testing it, while at the same time noticing problems. A survey session contrasts with analysis sessions, deep coverage sessions, and closure sessions, among possible others that aren’t yet identified as a category. A PES is a survey test session.

It’s all of those things, plus one more thing: the senior tester is the one who takes the notes and makes sure that the right areas are touched and the right general information comes out. The senior tester is in charge of developing a compelling testing story. The senior tester does that so that his partner can get more engaged in the hunt for vital information. This “hunt” is a kind of play. A delicious dance of curiosity and analysis.

There are lots of ways to do paired testing. A PES is one interesting way.

Hey, I’ve done this before!

While testing with my son, I flashed back to 1997, in one of my first court cases, in which I worked with my brother Jon (who is now a director of testing at eBay, but was then a cub tester). Our job was to apply my Good Enough model of quality analysis to a specific product, and I let Jon drive that time, too. I didn’t think to give a name to that process, at the time, other than ET. The concept of paired testing hadn’t even been named in our community until Cem Kaner suggested that we experiment with it at the first Workshop on Heuristic and Exploratory Techniques in 2001.

I have seen different flavors of a PES, too. I once saw a test lead who stepped to the keyboard specifically because he wanted his intern to design the tests. He felt that letting the kid lean back in his chair and talk ideas to the ceiling (as he was doing when I walked in) would be the best way to harness certain technical knowledge the intern had which the test lead did not have. In this way, the intern was actually the driver.

I’m feeling good about the name Paired Exploratory Survey. I think it may have legs. Time will tell.

Here’s the report I filed with the client (all specific details changed, but you can see what the report looks like, anyway).

Avoiding My Curse on Tool Vendors

Adam Goucher noticed that I recently laid a curse upon commercial test tool vendors (with the exception of Hexawise, Blueberry Consultants, and Atlassian). He wondered to me how a tool vendor might avoid my curse.

First, I’m flattered that he would even care who I curse. But, it’s a good question. Here’s my answer:

Test tool vendors that bug me:

  • Any vendor who wants me to pay for every machine I use their tool upon. Guys, the nature of testing is that I need to work with a lot of machines. Sell me the tool for whatever you want to charge, but you are harming my testing by putting obstacles between me and my test lab.
  • Any vendor that sells tools conceived and designed by a goddamn developer who hates to goddamn test. How do I know about the developer of a test tool? Well, when I’m looking at a tool and I find myself asking “Have these vendor bozos ever actually had to test something in their lives? Did they actually want a tool like this to help them? I bet this tool will triple the amount of time and energy I have to put into testing, and make me hate every minute of it” then I begin to suspect there are no great lovers of testing in the house. This was my experience when I worked with Rational Test Manager, in 2001. I met the designer of that tool: a kid barely out of MIT with no testing or test management experience who informed me that I, a Silicon Valley test management veteran, wasn’t qualified to criticize his design.
  • Any vendor selling me the opportunity, at great cost, to simulate a dim-witted test executioner. Most tool vendors don’t understand the difference between testing and checking, and they think what I want is a way to “test while I sleep.” Yes, I do want the ability to extend my power as a tester, but that doesn’t mean I’m happy to continually tweak and maintain a brittle set of checks that have weak oracles and weak coverage.
  • Any vendor who designs tools by guessing what will impress top managers in large companies who know nothing about testing. In other words: tools to support ceremonial software testing. Cem and I once got a breathless briefing about a “risk-based test management” tool from Compuware. Cem left the meeting early, in disgust. I lingered and tried to tell them why their tool was worthless. (Have you ever said that to someone, and they reacted by saying “I know it’s not perfect” and you replied by saying “Yes, it’s not perfect. I said it’s worthless, therefore it would follow that it’s also not perfect. You could not pay me to use this tool. This tool further erodes my faith in the American public education system, and by extension the American experiment itself. I’m saying that you just ruined America with your stupid stupid tool. So yeah, it’s not perfect.”) I think what bugged Cem and me the most is that these guys were happy to get our endorsement, if we wanted to give it, but they were not at all interested in our advice about how the tool could be re-designed into being a genuine risk-based testing tool. Ugh, marketers.
  • Vendors who want to sell me a tool that I can code up in Perl in a day. I don’t see the value of Cucumber. I don’t need FIT (although to his credit, the creator of FIT also doesn’t see the big deal of FIT). But if I did want something like that, it’s no big deal to write a tool in Perl. And both of those tools require that you write code, anyway. They are not tools that take coding out of our hands. So why not DIY?
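To show what I mean by a day of Perl (and, incidentally, what I mean by checking), here is a minimal sketch of a homegrown table-driven runner. The pipe-delimited rows and the add() function under test are invented for illustration; it’s not FIT or Cucumber, just the same basic idea in a few lines.

    #!/usr/bin/perl
    # Minimal table-driven check runner: each row gives arguments and an
    # expected result; we run the function under test and report pass/fail.
    use strict;
    use warnings;

    # Stand-in for whatever function or command you actually want to check.
    sub add { my ($x, $y) = @_; return $x + $y; }

    my @table = (
        "2, 3 | 5",
        "0, 0 | 0",
        "-1, 1 | 0",
    );

    my ($pass, $fail) = (0, 0);
    for my $row (@table) {
        my ($args, $expected) = split /\s*\|\s*/, $row;
        my @args = split /\s*,\s*/, $args;
        my $got = add(@args);
        if ($got == $expected) {
            $pass++;
        } else {
            $fail++;
            print "FAIL: add($args) gave $got, expected $expected\n";
        }
    }
    print "$pass passed, $fail failed\n";

Point a thing like this at a real table file and a real function and you have the bones of the tool; its weak oracle (one expected value per row) is exactly the limitation of plain checking that I complain about above.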

Tool vendors I like:

  • Vendors who care what working testers think of their tools and make changes to impress them. Blueberry, Hexawise, and Sirius Software have all done that.
  • Vendors who have tools that give me vast new powers. I love the idea of virtual test labs. VMWare, for instance.
  • Vendors who don’t shackle me to restrictive licenses. I love ActivePerl, which I can use all over the place. And I happily pay for things like their development kit.
  • Vendors who enjoy testing. Justin Hunter, of Hexawise, is like that. He’s the only vendor speaking at CAST, this year, you know.

Bach Brothers Legion of Testing Merit

My brother and I are instituting a new award at the CAST conference on Monday: The Bach Brothers Legion of Testing Merit.

We will give this award periodically in recognition of certain testers who, we feel, deserve to be famous, but aren’t yet internationally recognized in the way they should be.

The first recipients of this award are:

  • Ajay Balamurugadas
  • Parimala Shankaraiah
  • Sharath Byregowda
  • Manoj Nair
  • Pradeep Soundararajan

The first four on this list are the founders of Weekend Testers, which is a grassroots testing professionalism phenomenon. It is non-commercial, and in most respects is completely out of step with the Indian testing industry. The people who participate in it are going against the flow and ignoring the typical reward structures. Contrary to the trend of commercialized efforts at “testing professionalism”, such as ISEB and ISTQB, these people are running Weekend Testers not for glory or money or to maximize the chance of being hired into a safe, boring job, but rather to achieve personal excellence in their craft.

We’re fortunate to have Ajay speaking at CAST, this year. Pradeep was invited to speak as well, but he couldn’t make it.

Pradeep Soundararajan is being given the award because, as far as Jon and I can tell, he has nearly single-handedly inspired the Context-Driven testing movement (in other words, the skilled testing culture) in India. The Weekend Tester founders credit him with inspiring them. Yes, there are other voices out there, too (Shrini Kulkarni and Meeta Prakash for instance). What makes Pradeep special is that he has suffered for his cause, enduring long periods out of work because he refused to do bad testing.

I wish I could say there was a large cash prize that goes with these awards, but at least there is honor. Jonathan Bach and James Bach honor them!

Now, go and save India, guys.

(NOTE: Do you see why we named this award as we did? We could have called it “The Context-Driven Testing Award” or some other neutral title. Why did we name the award after ourselves? Well, first, it’s not about ego. It’s about integrity. This award is based purely on the arbitrary and possibly unfair opinions of two guys named Bach. The value of the award is nothing more or less than the value of our reputations. Hence its title. And this is why I keep harping about how testers must protect and build their reputations by refusing to knowingly do bad work. Here’s a question for you: if YOU were to recognize a colleague for his excellence as a tester, would that tester feel honored… or just awkward? The quality of your reputation determines the answer to that question.)