How to measure quality is a popular question. The answer I want to give is easy to say, but not so popular: “Stop! Quality can’t be measured, but it can be discussed and assessed. Focus on that instead.”
An answer like that doesn’t work well because it sounds like a pointless quibble over labels; most people think measurement and assessment are the same thing. It also sounds number-phobic, and my clients may wonder if I studied philology instead of engineering in university. (I love numbers! I never went to university!)
What kind of answer would work better? Two options come to mind. I could give a few shallow suggestions about some kinds of data worth considering. This is what my clients expect to hear, but a) it doesn’t really answer the question, b) it won’t ultimately lead them to clarity, control, and competence, and most importantly, c) it contributes to the damaging factory paradigm of technology development: the worldview that says everything important can be done by robots, or by people who are bullied into behaving like robots.
Or… I could be professorial, and carefully explain the difference between measurement and assessment and why it matters. In this article, I will take the professorial approach. Michael Bolton tells me I am pitching this material too high and you won’t read it. Maybe he’s right. Still, you know how sometimes truth wants you to speak it, whether or not anyone listens? That’s what I’m feeling today.
Measurement is “a process of experimentally obtaining one or more quantity values that can reasonably be attributed to a [property of a phenomenon, body, or substance, where the property has a magnitude that can be expressed as a number and a reference].” How is that for a start? Bracing, yes? A little cold wind in your face there. That’s from the International Vocabulary of Metrology, a fantastic reference for winning arguments about measurement.
The gist of that pedantic definition is that measurement is about assigning numbers to things in sensible ways, so that we can use the numbers to make sensible decisions.
In the sciences, metrology is serious business. In the field of software development, it should be serious, but more often it’s just theatre: an exercise in the ritualistic use of numbers to cast a scientific aura around an otherwise irrational management process. People keep trying to count things that don’t make any sense to count, such as test cases. Since a test case is essentially just a container for something related to performing testing, counting test cases is like judging a business by counting all the briefcases, drawers, cabinets, and envelopes in the building… “you have 42,564 containers of business stuff! Good businessing!”
Here’s why you should be skeptical about measurement of software product quality:
1. Quality cannot be measured objectively and comprehensively.
Measurement is one way to discover the status of something. We want to understand the status of our products, but while many things can be measured, the quality of a product is not one of those things. At least, it can’t be measured in any sense that covers all of what customers, users, and business people are talking about when they wonder if the “quality of a product is good,” or whether “the quality is better or worse than it was last month.”
What’s so hard about measuring quality? When we think of quality we are thinking of the “goodness” of a product. Yet goodness is not an intrinsic property of anything. It’s a social construct. So, we face at least four problems:
• Goodness, by definition, is always rooted in some person’s feeling, which may not be shared by other people. You know this from being human.
• Goodness is at least a three-way relationship between people, product, and context. People and contexts can and do shift over time, so the quality of something can change even if the thing itself does not. You know this vividly from social media: a joke you made on Twitter that was considered funny ten years ago may cause you to lose your job when it is re-discovered today.
• Goodness is a phenomenon produced by the mysterious interaction of many variables. We observe these variables over time and filter them through our distorted lenses of perception and memory. There is no single kind of data to focus on; it’s a mix. And there are no reliable and uncontroversial formulae (what measurement experts call a self-consistent “quantity calculus”) by which we can map this mix to a value on a single scale. We also lack a measurement model by which such a quantity calculus could be derived. The mysteries of measuring goodness are evident in machine learning systems that hilariously mis-classify things. Or in ourselves when we hilariously fall in love with the wrong people. I bet you’ve experienced this sort of thing.
• Goodness judgments tend to be socially risky. We change our statements about quality based on how they affect other people. For instance, if I ask you whether you think a product is bad, and you do think it is bad, but you know that your honest and straight answer will cause me to cancel your project and throw all your friends out of work, are you really going to tell me the flinty truth? Or will you give a rosy tint to your response? You must have encountered this, too.
2. Quality can be partially measured.
Good news! Let nothing I wrote, above, disturb you too much. It’s not wrong to define and clarify requirements in ways that may facilitate measurement. Although I claim that whatever you do along that line will not result in an objective “measurement of quality” of your service or product, it may be useful! It may be good enough. It may give you what you need to make an assessment of quality.
Often people respond to the challenge of measuring quality simply by redefining it from something customers understand into something limited and legalistic. They define requirements in rigorous contractual terms. They make a set of “acceptance tests” that are asserted to prove that “the product works” if none of the tests “fail.” (Am I getting near the scare quote limit for this version of WordPress?) That’s a kind of game called goal displacement, and the resulting measurements are known by measurement experts as surrogate measures. If you can get your customers to agree to play that game (to accept the displaced and surrogate goals and let go of the desire to feel happy with your product), and if you choose not to care whether they are actually happy with your product, maybe you will feel happy. However, the history of technology is littered with abandoned products and withered companies that ignored the elusive but persistent reality of quality. I, personally, don’t want to get rich while upsetting customers. Are you with me?
Still, maybe a partial measure of quality is really all I need. Maybe if I have measures of the purity and mass of a lump of gold, I won’t mind that there is technically no way to tell if gold is really “good” in the sense of whether it helps or hurts society, or any other dimension of goodness. If all I want to do is sell the gold, then the economically quantifiable aspects of gold may be the only aspects that matter to me.
How about instead of saying we are measuring quality, we say we can measure clues about quality? We can collect indicators and make sense of them. We can use measurable data of many kinds for that purpose.
3. Your true goal is probably to arrive at a useful assessment of the status of the product.
Subjectivity can be helpful. I can ask you to rate a product on a scale of 1 to 5 stars, and you can reply. That would be your own statement of feeling, mapped to a number. That’s subjective assessment. It’s an assessment that you came to by interpreting and integrating experiences over time, then filtering them through your story-telling machinery. Such ratings are notoriously unreliable (if I am angry with a product, I am not going to give a four-star review, even if it is mostly good), partly because I am biased and partly because I want my opinion to have an impact. And yet a third party can make meaning from subjectively chosen numbers by looking at trends, discontinuities, and context. In such a case, it’s not that the rating itself has any fixed meaning, but that it participates in a gestalt with other collected data.
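To make the “trends and discontinuities” idea concrete, here is a small hypothetical sketch (mine, not a tool the article describes). It takes a stream of subjective star ratings, smooths them with a rolling average, and flags sudden drops that a human might want to investigate. The data and thresholds are invented for illustration:

```python
# Hypothetical sketch: subjective 1-5 star ratings gain meaning in aggregate.
# No single rating is "true," but a sharp drop in the rolling average is a
# clue worth discussing: it prompts a human to go read the reviews in context.

def rolling_mean(values, window=3):
    """Trailing rolling mean over a list of numbers."""
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means

def flag_discontinuities(ratings, window=3, drop=0.9):
    """Return indices where the rolling mean falls sharply from one rating to the next."""
    means = rolling_mean(ratings, window)
    return [i for i in range(1, len(means)) if means[i - 1] - means[i] >= drop]

# Invented ratings: steady 4s and 5s, then a collapse (a bad release, perhaps?)
ratings = [4, 5, 4, 5, 4, 1, 1, 2, 1, 1]
print(flag_discontinuities(ratings))  # → [5, 6]
```

The flagged indices carry no fixed meaning by themselves; they are only an invitation to ask what happened around that point, which is exactly the gestalt-style interpretation described above.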
We can improve on simple subjective assessment by systematically reviewing and discussing different dimensions of relevant data. It is still subjective, but the assessment becomes more accurate and reliable.
I recently learned that the word assessment comes from the Latin for “to sit with.” As I interpret the etymology and common usage, assessment means to use evidence to make an evaluation. This is bigger than measurement. You can assess using measurements, but you can also assess with any sort of evidence, regardless of whether it is the result of a measurement process. You can, for instance, assess that your tire is flat, without using a ruler or a pressure gauge, simply because it looks obviously flat.
Measurement, then, is just one means to an end. That end is the marshaling of good evidence to inform assessments of quality that can be used to make business decisions (e.g., when to ship the product; whether the product team is performing well; whether problems with the product or the team are persisting, accumulating, or getting resolved).
We can assess quality by gathering evidence about:
- How we tested. What did we test and what did we ignore? There are many interesting dimensions of coverage, and without an understanding of coverage, our findings will be impossible to interpret.
- What we found. The specific problems or interesting absence of problems. We need to understand the implications of the findings, not just count the reports.
- What that tells us about business risk. We must relate our findings to the purposes of our project; to the business and the customers.
None of this reduces to any simple formula. It all requires human social competence and often a great deal of discussion.
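As one hypothetical way to support such a discussion (my sketch, not a practice the article prescribes), the three kinds of evidence above could be kept side by side in a plain record. Note that nothing here computes a quality score; the structure only organizes material for humans to talk about. All names and example strings are invented:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a record that keeps the three kinds of assessment
# evidence together. It deliberately has no "score" field; the point is to
# feed a conversation, not to replace one.

@dataclass
class QualityAssessmentNotes:
    coverage: list[str] = field(default_factory=list)  # how we tested; what we ignored
    findings: list[str] = field(default_factory=list)  # problems found, or notable absences
    risks: list[str] = field(default_factory=list)     # what the findings mean for the business

    def summary(self) -> str:
        return (f"{len(self.coverage)} coverage notes, "
                f"{len(self.findings)} findings, "
                f"{len(self.risks)} risk notes")

notes = QualityAssessmentNotes()
notes.coverage.append("Tested payment flow on Chrome only; skipped Safari")
notes.findings.append("Intermittent timeout when the cart holds 50+ items")
notes.risks.append("Checkout failures during a holiday sale would hit revenue")
print(notes.summary())  # → 1 coverage notes, 1 findings, 1 risk notes
```

Counting the notes, as `summary()` does, is itself only a clue; the value lies in reading and discussing the notes themselves.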
4. Any system of measuring people, used for evaluation, becomes a means of systematic deception.
When you try to measure software quality, you are indirectly measuring the competence and performance of the people who created it, because software is literally the frozen traces of human problem-solving. It has no other substance than that. What’s my point? Well, you know, workers don’t just sit idly while management creates systems to surveil them. They discover how to use those systems to create a positive impression and to avoid negative ones. This is especially easy to do when management wants measurements to be an alternative to complex and multi-structured evidence; where management doesn’t want to engage in discussions that explore what matters. When all the relevant dimensions are reduced to just a handful of “KPIs”, there is a lot of room for the clever to maneuver.
As anyone who works with automation in testing knows, if you reveal how much time you spend and trouble you have getting the automation to basically work and stay working, it can have terrible consequences. You might be accused of incompetence based on a reckless belief that output checks are easy to automate (all the advertisements say so!). So, you hide those extra hours. You make it look easy. And by doing so you make management believe that it really is easy— perpetuating their fantasy about automation. This is deception that you won’t need if no one is “measuring” your progress in terms of test cases automated or run per unit of time.
This is the problem of false optimization. Any context that involves human judgment (where a tacit process exists that is only partly measurable) is vulnerable to this dysfunction. People tend to optimize toward the light and hide inefficiencies in the darkness. You do this when you get ready for a party at your house by tossing all the clutter into a back room and locking the door. The problem is that there are almost always many ways to optimize that cause harm instead of health to a product or a project. This does not necessarily mean outright fraud, although I would say, in my experience, fraud in such cases is depressingly common. The more important problem is that any “bad number” can be made “better” using a variety of methods, only some of which actually make the system better.
Are you using a bug count to measure quality? Here are ways to make the quality look better without being better:
- Don’t openly report bugs, and don’t track them.
- Don’t establish a reliable or easy way for customers to report problems from the field.
- Demote bug reports to low severity, then make a rule to ignore low severity bugs.
- Make a rule that bugs can be reported only if there is indisputable proof that they violate written requirements and are easily reproducible.
- Report multiple problems in each single bug report form.
- Hire only junior and untrained testers.
- Say that 100% automation is your goal, automate a lot of simplistic checks, then brag about how you are doing “thousands of tests.” You won’t find many bugs that way, but you will look busy and have lots of code to show for it.
- Display anger and sorrow when bugs are reported so testers will feel discouraged about reporting anything.
- Outsource your testing to a company that wants to please you, and make it known you are not pleased by any complaints about quality.
- Get rid of testers and say “quality and testing is everyone’s responsibility.” Fewer bugs will be reported and they will be easy bugs.
5. The urge to measure is often driven by the desire to control people as if they were inanimate objects.
People are messy, as you well know. People can be hard to work with. Anyone who complains that “there are too many meetings” is invariably referring to other people’s meetings. We complain about bad documentation, by which we mean documentation some other guy wrote. Managers frequently yearn for happy, willing, pliable workers who leave their own opinions and styles at home. Conform. Conform, you scalawags!
That’s why I think the urge to measure product quality is motivated not by any love or need of cold rationality, but rather by fear of hot debate; by the hope that “objective measurement” will prevent awkwardness and anger. Otherwise, it makes no sense! I like mathematics, statistics, probability theory, and quality engineering. (I could say I own a stack of books on these subjects, but the truth is I have so many books about them that if you made a stack you’d need scaffolding and a safety video.) You would have to admit I am an enthusiast about numbers. And still I have no urge to measure quality. When I helped ship award-winning software as one of the test managers for Borland C++, I was part of a team of 12 managers who discussed the quality and together we decided if it was good enough. No one ever suggested measuring quality, and I think that’s because we got along with each other.
Maybe instead of measuring quality, make some friends.
In summary, I’m not against collecting data.
- There are many useful kinds of data, only some of which can be measured.
- The notion that objective measurements of quality are rationally required to run a business is not rationally justified.
- An alternative to measurement is assessment, which is actually what everybody is doing in the real world, even when they claim to be measuring quality.
- Assessing quality is a process that can incorporate measures without being itself a measurement process.