On June 9th, 2002, Bret Pettichord posted a notice on the context-driven testing discussion list (at yahoogroups.com) about a study by NIST on the economic impact of inadequate software testing. I looked at the study, and somewhat grumpily dismissed it as bad science. A couple of other people, including author Rex Black, then spoke up to defend the study. Although I obviously disagree with Rex, I wanted to post his comments here as a counter-point. But since he hasn't yet had a chance to review and edit them, I have to wait on that. (Note: Although the entire thread that includes the defenders' comments is available if you are a member of the context-driven testing mailing list, I'm not allowed to repost the thread here.)
Below, I have reproduced three critiques of the study, by Bret Pettichord, Doug Hoffman, and Cem Kaner, respectively. They have said what I would like to have said in my first grumpy response, for the most part.
I'm sorry I have such a hard time keeping my temper about things like this. I guess I'm depressed about how easily some of my colleagues in the testing business are fooled by marketing documents dressed up as science. I want to be able to scientifically justify my existence as much as the next starving consultant. But not at the cost of selling my critical faculties to the devil. I call upon my fellow testers and software process people to hold themselves and each other to a higher standard.
Inspired by this latest assault on reason, and because the NIST study relies on a fundamentally flawed survey, I went to Reiter's bookstore in downtown D.C. and picked up some helpful books on research design. Among them:
Each of these books is highly readable and I recommend them. Improving Survey Questions includes a nice little set of principles, many of which the NIST study violates. Think of these the next time someone waves another survey in your face. Here is an abridged version of those principles:
The upshot of these principles and these books is that asking people what they think is a risky thing to do if you want to get at the truth. The people at Gallup understand this, and there is another interesting book called Trust in Numbers where you can read about how they do survey design.
A valid and helpful study about the impacts of software testing, positive or negative, must take into account the fact that software development and use is a highly social and psychological process. The universe of discourse is cloudy. Facts from one project are often not commensurable with facts from other projects. Many projects have no systematic means of tracking what happened and when, and those that do are often deeply flawed. This is not surprising, because the job of software people is to develop software, not collect research quality data for the next survey.
-- James Bach
http://www.satisfice.com

The NIST study makes an economic assessment of the costs associated with two scenarios:
The difference between these two is X billion dollars. They have focused on estimating this value for two segments of the software industry and then have extrapolated to the whole industry. Then the study asserts that this is the cost of the poor testing.
X is an estimate and I'm sure that it isn't very accurate. But I also trust that with greater effort more accurate estimates of this value could be made.
I am concerned with the conclusion that this number tells us anything about testing. Indeed it seems to me that any technique that claims to detect or prevent defects could put a claim on this number. The analysis could be used to put a value on the benefits of better training, use of better languages, or whatever software development practice you are peddling.
It seems to me that NIST wants to develop better software testing tools. This is surely a great thing. For example, they funded the development of Expect, a test tool that I have used. It is a great test tool, well adapted to its domain, open sourced and free. I am very grateful to NIST for funding this. It, however, was funded on the side. They let one staff member, Don Libes, steal some time to develop it.
Now they seem to want to do more development of test tools and test suites for software. So they asked an economics firm to estimate the potential value of really great tools. In effect they asked for an estimate of the size of the market for test tools. Except that, unlike a commercial firm, they are interested not in the commercial value of this market but in its economic value. This is the value of the test tools that both developers and customers would realize.
I see this value X as an upper bound. It is the maximum conceivable value of really great testing tools. In other words, if we had a great testing regimen that allowed us to detect defects as soon as they occurred, then this regimen would save the economy X dollars. If the regimen itself cost more than X to develop, then it would be a bad investment for the US Government.
The problem with this analysis is that we are talking about measuring a fantasy. No one thinks that we can accomplish this. Not with better testing. Not with other techniques. There is no magic silver bullet. Really great test tools are more likely to reduce this figure by say 10%. That's how progress is made in this industry.
Indeed the study gave hints of the types of testing tools that NIST wants to develop. I saw several references to interoperability conformance test suites, not that I'm quite sure what these would look like. It also contained many references to Beizer and Kit.
Personally my view is that big improvements in testing will come not with better tools but by building testability into software. This includes Design by Contract and other techniques. These aren't mentioned at all in the study.
In summary:
Generally, I'm disappointed in NIST for publishing such an obviously biased and unenlightened report. I have some specific observations to explain why I'm so harsh:
1. They're making the [fatally flawed] assumption that software defects affect customers only because of inadequate testing.
They're looking at software testing as the method for reducing defects -- a test and fix approach to good quality. This is akin to the QC methods used by the US auto industry in the 50's and 60's -- testing in quality. It overlooks the fact that the defects are put there regardless of testing, and depending on testing to give us acceptable quality is guaranteed to fail. [I firmly believe this is a primary cause of the problem of bad quality software -- the people who make the product expect someone else to make the quality good enough.]
Deming pounded the point that quality is the result of process, and the only way to improve quality is to improve the process. Testing is required to assure the process remains under control (and this is critically important), but testing only provides data about the quality, it doesn't change or control it. When more of us in SQE start to apply the principles of real quality engineering we will start to really make a difference in software quality. As long as organizations depend on testing to create quality they will continue to create crappy products.
By starting with this bad assumption, everything that follows in the report is questionable.
2. The authors are calling for standardized, certified, automated testing to overcome the testing inadequacies.
This waves several red flags for me.
- Why are standard testing techniques good for diverse software problem domains? How?
- Who is going to decide what is certified and what isn't? Application domains and technology are changing quickly and continuously. Computer science doesn't even have a vocabulary to discuss many of the concepts, and we certainly can find acknowledged experts who have successfully used vastly different approaches to solve [what appear to be] similar problems. Aside from stifling innovation, how does using certified test data, metrics, and automated test suites help? I've participated in several projects to 'tune' a product to improve performance measures using a standard benchmark. I can't say that anyone saved money by using the standard benchmarks, but I can say a whole lot of misleading benchmark timings were generated.
- Automated testing (and particularly automated test code generation) often degenerates into proving that the program does what the program does. (If you mechanically analyze the program to generate tests to exercise it, you aren't really testing the program -- you're testing whether the mechanical analysis and test generation mechanisms work. You may prove that the program does what it does, not whether it does what is desired or appropriate.)
3. I see a big red flag about the data they collected when 50% of the CAD respondents and 33% of financial users claim to have NO major software errors in the previous year, while the remaining users reported an average of 40 major errors.
I suspect this might be a sign that some respondents are covering their butts or interpreting the questions differently, not a sign that 50% of CAD respondents have development processes so good that there are no major errors in their code. (NIST could provide a huge boon to CS if they could explain how code can be created and released with zero discovered major errors.) I would expect statistical tests would show that there are two populations responding (and we should probably throw out the data from one of them).
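To make the two-populations concern concrete, here is a minimal sketch. The numbers are invented for illustration and are not the NIST data. The responses mimic the pattern described above: half the respondents report zero major errors, and the rest report counts that average 40. It is not a statistical test, only a picture of why pooling such groups gives a figure that describes nobody.

```python
from statistics import mean, median

# Hypothetical survey responses, invented for illustration only.
# These are NOT the NIST data: half the respondents report zero major
# errors and the other half report counts that average 40.
zero_reporters = [0] * 10
nonzero_reporters = [28, 30, 35, 38, 40, 40, 42, 45, 50, 52]
all_responses = zero_reporters + nonzero_reporters

pooled_mean = mean(all_responses)
nonzero_mean = mean(nonzero_reporters)
pooled_median = median(all_responses)

# Collect the respondents whose answer is within 5 of the pooled mean.
near_pooled_mean = [r for r in all_responses if abs(r - pooled_mean) <= 5]

print("pooled mean:", pooled_mean)
print("mean of non-zero reporters:", nonzero_mean)
print("pooled median:", pooled_median)
print("responses within 5 of the pooled mean:", len(near_pooled_mean))
```

With these made-up numbers the pooled mean is 20, yet not one respondent gave an answer within 5 of it. A single average over both groups hides the fact that they look like two different populations.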
4. They assert that standard tools and metrics would go a long way toward addressing the software testing problems.
I agree that we need realistic models of software and development processes that relate to the underlying properties of software. Given those models, we might be able to create standard tools and metrics, but until we understand the underlying properties, such tools and metrics may continue to cause more harm than good. We'll be ready to model standard tools and metrics when we can reasonably understand and agree upon concepts such as the size of a program, program complexity, how much testing we've done, ease of use, and what it means to stress a program. Meanwhile, I believe enforcing standards and certifications stifles our progress toward understanding fundamental software characteristics.
5. They assert that inadequate testing causes poor quality, high development costs, increased time to market, and increased market transaction costs. Based on that axiom they go on to prove that improved software testing would improve quality, reduce costs, release sooner, and lower transaction costs.
This is the same kind of circular argument that I saw in school to prove 1 = 0. By declaring the desired result as a self-evident truth, almost any argument will confirm it.
6. They define software testing as the "dynamic execution of software," thus eliminating any possibility of capturing requirements, specification, or design errors in their development phases.
They miss out on the fact that investments in better quality in earlier phases would be far less expensive and have a much higher impact on quality.
7. They assert that test infrastructures are inadequate to deal with interoperability, automated test code generation, rigorously determining when a product is good enough to release, and performance metrics and testing procedures.
I think it's absurd to assume that standardized testing infrastructure could deal with these issues. Consider the technological changes over the last ten years -- the best testing infrastructure solutions in 1992 would be useless today because the problem space shifted. The authors are assuming there is one correct answer to the issues, rather than realizing that substantially different answers are appropriate under different circumstances.
Anyway, I've spent several hours more time on the report than I think it's worth. I hope it doesn't do too much damage, but I put it into the same category as "GUI Testing made Painless;" it probably fits for some situation [which I'm not familiar with] - it's just too bad the authors don't know enough about the field to identify the limited applicability of their ideas.
In my last note on the NIST study, I pointed out that I had wanted to use this study at a curriculum planning meeting, but that I chose not to cite it because I didn't believe that a skeptical reader (as all scientific readers should be) would consider it credible.
I noted two strengths:
Regarding the definition of quality, I am more used to seeing a broader definition. For example, Joseph Juran defined quality in terms of two factors, satisfiers and dissatisfiers. Bugs are examples of dissatisfiers. Good features are examples of satisfiers.
The reason that Juran's distinction is important is this. If I gave a project manager an extra $100,000 and said, "Go forth and improve the quality of the product", I suspect that we might get a distribution of expenditure as follows:
Bret Pettichord has discussed this aspect of the study in detail also, making several other points.
Let me tie this back to credibility. Here I am in a room of (primarily) university professors who all want to improve the quality of software and give their students appropriate education for quality improvement. Coming into the room, I saw that they had allocated about 3 lecture hours (of roughly 2000 in a 4-year degree program) to software testing. They were allocating many more hours to software management, software process, and several more hours to psychology of group dynamics, history of the quality culture, human factors, etc. The people in the room were willing to listen but the decision to increase the number of testing hours (if such a decision has actually been made) was far from a done deal.
A credible study that the USA lost $umpty-billion per year because of cruddy testing would have been very, very useful in this room.
Suppose that I cited the NIST study. Several people did cite papers and studies and several of us took down references and decided to read them over the next week. If I cited that study, several of my peers would have read it.
If you were primarily a design-focused computer scientist and you read a study that attributed quality costs primarily to bad testing, you might have dismissed the study as absurd. By and large, the quality of the product is determined before it comes to testing. One of the mantras of our field is that you can't test quality into a product. Testing helps us improve our products, but there are limits. If you saw a definition of quality as the inverse of the total number of bugs, you might have protested that there is a lot more to high quality than bug count. I suspect that many of the people in the room would have dismissed the definition because it didn't map on to their notions of product quality, and that they might then have dismissed the study along with it.
Whenever you favorably cite and rely on a document, you tie yourself to that document in the eyes of anyone to whom you cited it. You are seen as attesting to the credibility of the document. A reader who finds the document incredible will also think somewhat less of your credibility. This is probably obvious, but if you're curious about it, read some of the lawyers' literature on trial advocacy. The lawyer picks the evidence she will present, and one of the factors governing her selection is its potential to raise or lower her credibility with a judge or jury.
ANOTHER KEY CONCERN: THE SURVEY
The survey is very long. It has 8 sections. It has 43 numbered questions, but actually asks you to fill in about 135 fields. It asks you for data (specific numbers) on:
The second survey needs only 20 minutes. It has 32 numbered questions, with 98 fields to fill in (counting alternate fields, such as one with Yes and one with No, as one field not two), again with specific numerical estimates.
This is a lot of data. I have never been in a position, in any company that I have worked at or consulted to, to have all that data at my fingertips. When the data was available to me (sometimes it was not), it would have taken substantial digging time, including asking questions of other people, to get even ballpark answers to some of these questions. In many companies, much of this data is simply unavailable. For example, many people do not think it is useful to think of waterfall development phases or to tie bugs found in one of these phases to the phase at which the bug was allegedly introduced.
To answer such a survey within an hour, at any company I have been at, even when I was a Director of development, I would have had to do a lot of guessing.
People get impatient with long studies that ask for information that they don't already have. One of the key rules of survey design is to keep it short and simple. Another is to ask questions that people are likely to be able to answer and answer accurately without much effort.
Last spring, a senior doctoral student at Florida Tech proposed a study that involved a survey that was probably no more complex than this one. If we could have trusted the answers to that survey, we would have gained a lot of information. In combination with work this student has already done, this would have been a good dissertation. This student has been around for a while, has done great work, and it is really time for this student to graduate. I really wanted to be able to approve this work, let the student collect it and write it up, smile at the results, and let this student graduate.
The student's committee (including me) rejected the survey. I can't speak for the others. I rejected it because I believed that the survey would not be carefully answered, in a way that would give us trustworthy data. I am not applying standards for surveys to this NIST study that I would not readily apply to my own work or the work of my students or colleagues.
I believe that this type of data could be collected more accurately. For example, if I had access to cooperating institutions, I would probably have sent researchers to the institutions to interview people and to sit with them while they (interviewees) read over the relevant corporate records or did their own searches and figured out the numbers. I would have expected the researcher in the field to gather different data from different people, eventually putting together a completed survey or report as a patchwork of information from multiple sources. In some cases (e.g. relating bugs found to phases when they were introduced), it might have taken a long time to develop answers to these questions.
So, again like Bret, I believe that these data could have been collected accurately, but I attach little credibility to these numbers.
SOME OTHER RED FLAGS
The study defines testing in a way that excludes exploratory testing. Testing must be done against a pre-planned result. Even SWEBOK acknowledges that exploratory testing (for better or worse) is the most commonly practiced form. If we are counting hours of testing, are we counting the non-testing testing or just the Real Testing? This might sound like a nit, but remember, this study is busy counting how much money is being spent on testing, how much more could be spent, and what the benefits of testing more would be. Definitions are important. Either something counts or it doesn't. If you count something outside of your definition, your count is invalid. Different counters will also be inconsistent.
The study defines development in terms of a waterfall and relies on the waterfall model to characterize when bugs are introduced or found. For companies that don't follow the waterfall, this is a problem. It is nontrivial to map events from a cyclic, spiral, evolutionary, (etc.) lifecycle back to the times at which they would have happened if only the company had been following the waterfall. To my eyes, this speaks to a bias on the part of the researchers, as well as introducing a source of opportunities for error into the data.
The study says "Recently, legal action has increased when failures are attributable to insufficient testing." That's interesting. Too bad they provide no citations. To the best of my knowledge, this is great hype but has little or no basis in fact. Lawsuits for product quality typically involve allegations of fraud or breach of contract. Occasionally, the fraud involves a false statement about the sufficiency of testing or the contract promises a certain level of sophistication of testers, but very few contracts that I've seen make such promises. More often, the fraud suit goes forward because the company lied about the quality of the product (not about the testing), denying the existence of a bug that it knew about (so lack of testing could not have been the problem -- they found the bug). A very few cases have involved allegations of personal injury or property damage. Read Nancy Leveson's book on software safety. It would be a gross oversimplification (and I think she would say it would be dead wrong) to say that dangerous conditions in the software exist or make it to the field primarily because of insufficient testing.
The study asks about after sale customer service costs. In the body of the study (e.g. chapter 3), it talks about after sales support costs as if they were all due to bugs (which are all, of course, due to bad testing). In the survey, the study asks the respondent to provide the total after-sale service cost and the percentage attributable "to bugs found by customers during business operations versus those costs related to user errors or other causes not related to defective software." This percentage estimate is very difficult. I've seen estimates from Microsoft as low as 2%, from the Software Support Professionals Association of 5%, from Borland and Apple of 33%, and from research I did with David Pels at 50%. I doubt that the difference between these numbers is due to differences in product quality. I think the differences relate to how we count. That is, if the Official Cost Counter at MS counted the Power Up bugs and service calls, they would probably have estimated 2% and if David and I had ransacked MS's records, we probably would have talked about 50%. I am not suggesting that MS is misleading anyone. I am suggesting that they count differently, including things we would not have included and excluding things we would have included. To me, that means that without further calibration, the numbers from different companies are incommensurable. You can't add a Microsoft 2% to a Power Up Software 50% to get an average of 26%. I didn't see this issue acknowledged or addressed in the study, but it goes directly to the data quality.
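The counting problem can be sketched in a few lines. Everything here is hypothetical: the call categories, the call counts, and the two counting rules are invented for illustration and do not describe how any company actually counts. The point is only that the same call log can yield very different "percentage due to bugs" figures depending on what the counter decides a bug is.

```python
# Hypothetical support-call log for ONE product, invented for illustration.
# Keys are made-up call categories; values are numbers of calls.
call_counts = {
    "confirmed_code_defect": 2,
    "design_flaw_confusing_users": 20,
    "documentation_error": 13,
    "incompatibility_with_other_software": 15,
    "how_to_question": 30,
    "billing_or_sales": 20,
}

# Two hypothetical counting rules. The narrow rule counts only confirmed
# code defects as bugs; the broad rule counts anything traceable to a
# flaw in the product or its documentation.
NARROW_RULE = {"confirmed_code_defect"}
BROAD_RULE = {
    "confirmed_code_defect",
    "design_flaw_confusing_users",
    "documentation_error",
    "incompatibility_with_other_software",
}


def bug_share(counts, bug_categories):
    """Return the percentage of calls attributed to bugs under a counting rule."""
    total = sum(counts.values())
    bug_calls = sum(n for cause, n in counts.items() if cause in bug_categories)
    return 100 * bug_calls / total


narrow_pct = bug_share(call_counts, NARROW_RULE)
broad_pct = bug_share(call_counts, BROAD_RULE)

# Averaging the two figures mixes two different definitions of "bug".
naive_average = (narrow_pct + broad_pct) / 2

print("narrow rule:", narrow_pct, "% of calls due to bugs")
print("broad rule:", broad_pct, "% of calls due to bugs")
print("naive average:", naive_average, "%")
```

In this sketch the product and the calls are identical under both rules, so the gap between the two percentages says nothing about quality. Averaging them produces a number that corresponds to neither definition.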
The study repeatedly talks about certification testing tools, tools that would perform standard tests and allow us to compare attributes of different products. Huh? I mean, yeah, OK, benchmark tests are good. But (a) how do these relate to all those bugs we are talking about, and (b) how many attributes will these tests test (given that programs differ a lot from each other -- even competing programs -- and these are presented as if they were standardized tests being automatically run by tools)?
The study says "Test early, test often is the mantra of experienced programmers." Then it cites Ed Kit (Software Testing in the Real World). I'm sorry, but who died and named Ed an expert on mantras of experienced programmers? With no offense meant to Ed by the comparison, that would be like asking me about programmer culture. I'm not a particularly credible source on that. If you want to make a quote about programmer culture, quote someone like McConnell, whose reputation is in that domain.
The sources you don't cite speak to your credibility too. If you don't cite an obviously better source, it suggests that your knowledge is so narrow that you don't know that source. (Does that mean that these particular authors haven't read better sources? Who knows? The credibility issue, though, is not what they really do or don't know but how they would be perceived.)
On the facts of the claim, go find 100 experienced programmers (especially experienced programmers who live in the world of waterfalls, as distinct from those obviously unconsidered anti-waterfallers who do extreme programming) and ask them if this is their mantra. How many do you think would agree? I'm not confident that it would be a high number. I think that assertions like this reduce the credibility of the study when read by someone who (a) doesn't agree with the assertion and (b) considers herself an experienced programmer or knowledgeable about experienced programmers.
Is this a big deal? No, it is definitely not a big deal. But it is an indicator. Unless they know you well, people judge your credibility using heuristic rules. Rex laid out some perfectly reasonable heuristics that he used to decide that this study was credible to him. I'm illustrating the application of some heuristics that led me to decide that this study is incredible to me. A heuristic rule is fallible. For (I think) many people, if a person makes a few small misstatements, and you know nothing else about that person, you are much less likely to trust larger statements from the same person.
The study says that most bugs are introduced at the unit stage. Well, hmm, I think most coding errors are introduced during coding. But I keep reading stories that most of the failure and maintenance costs are due to requirements related errors. That's one of the reasons that I push test teams so hard to test very broadly, with the stakeholders' interests, the subject domain and the software environment as main focusers rather than the code as the main focuser. Maybe I'm wrong. (After all, the numbers I see are from Capers Jones and Dick Bender and other folks and, even though those claims are consistent with my impressions from my experience, I don't know what research they actually did to support or discover those numbers.) But I would have liked to have seen some evidence that the authors realized they were making a claim that others do not support.
I'm done with the examples. There are more, but you've got the sense of them.
These small points would have bounced off me if the bigger points were not as troubling. But in conjunction with the two larger concerns, it would be easy for me to dismiss the study, as James Bach did and as I fear that many non-testing development experts might do. That is the problem of credibility. For me, even though I think some of the numbers and the relationships among the numbers are intriguing and bear some follow-up thinking, I don't think that I can cite this study as a source of numbers to rely on.