How to measure quality is a popular question. The answer I want to give is easy to say, but not so popular: “Stop! Quality can’t be measured, but it can be discussed and assessed. Focus on that instead.”
An answer like that doesn’t work well because it sounds like a pointless quibble over labels; most people think measurement and assessment are the same thing. It also sounds number-phobic, and my clients may wonder if I studied philology instead of engineering in university. (I love numbers! I never went to university!)
What kind of answer would work better? Two options come to mind. I could give a few shallow suggestions about some kinds of data worth considering. This is what my clients expect to hear, but a) it doesn’t really answer the question, b) it won’t ultimately lead them to clarity, control, and competence, and most importantly, c) it contributes to the damaging factory paradigm of technology development: the worldview that says everything important can be done by robots, or by people who are bullied into behaving like robots.
Or… I could be professorial, and carefully explain the difference between measurement and assessment and why it matters. In this article, I will take the professorial approach. Michael Bolton tells me I am pitching this material too high and you won’t read it. Maybe he’s right. Still, you know how sometimes truth wants you to speak it, whether or not anyone listens? That’s what I’m feeling today.
Measurement is “a process of experimentally obtaining one or more quantity values that can reasonably be attributed to a [property of a phenomenon, body, or substance, where the property has a magnitude that can be expressed as a number and a reference].” How is that for a start? Bracing, yes? A little cold wind in your face there. That’s from the International Vocabulary of Metrology, a fantastic reference for winning arguments about measurement.
The gist of that pedantic definition is that measurement is about assigning numbers to things in sensible ways, so that we can use the numbers to make sensible decisions.
In the sciences, metrology is serious business. In the field of software development, it should be serious, but more often it’s just theatre: an exercise in the ritualistic use of numbers to cast a scientific aura around an otherwise irrational management process. People keep trying to count things that don’t make any sense to count, such as test cases. Since a test case is essentially just a container for something related to performing testing, counting test cases is like judging a business by counting all the briefcases, drawers, cabinets, and envelopes in the building… “you have 42,564 containers of business stuff! Good businessing!”
Here’s why you should be skeptical about measurement of software product quality:
1. Quality cannot be measured objectively and comprehensively.
Measurement is one way to discover the status of something. We want to understand the status of our products, but while many things can be measured, the quality of a product is not one of those things. At least, it can’t be measured in any sense that covers all of what customers, users, and business people are talking about when they wonder if the “quality of a product is good,” or whether “the quality is better or worse than it was last month.”
What’s so hard about measuring quality? When we think of quality we are thinking of the “goodness” of a product. Yet goodness is not an intrinsic property of anything. It’s a social construct. So, we face at least four problems:
• Goodness, by definition, is always rooted in some person’s feeling, which may not be shared by other people. You know this from being human.
• Goodness is at least a three-way relationship between people, product, and context. People and contexts can and do shift over time, so the quality of something can change even if the thing itself does not. You know this vividly from social media: a joke you made on Twitter that was considered funny ten years ago may cause you to lose your job when it is re-discovered today.
• Goodness is a phenomenon produced by the mysterious interaction of many variables. We observe these variables over time and filter them through our distorted lenses of perception and memory. There is no single kind of data to focus on; it’s a mix. And there are no reliable and uncontroversial formulae (what measurement experts call a self-consistent “quantity calculus”) by which we can map this mix to a value on a single scale. We also lack a measurement model by which such a quantity calculus could be derived. The mysteries of measuring goodness are evident in machine learning systems that hilariously mis-classify things. Or in ourselves when we hilariously fall in love with the wrong people. I bet you’ve experienced this sort of thing.
• Goodness judgments tend to be socially risky. We change our statements about quality based on how they affect other people. For instance, if I ask you whether you think a product is bad, and you do think it is bad, but you know that your honest and straight answer will cause me to cancel your project and throw all your friends out of work, are you really going to tell me the flinty truth? Or will you give a rosy tint to your response? You must have encountered this, too.
2. Quality can be partially measured.
Good news! Let nothing I wrote, above, disturb you too much. It’s not wrong to define and clarify requirements in ways that may facilitate measurement. Although I claim that whatever you do along that line will not result in an objective “measurement of quality” of your service or product, it may be useful! It may be good enough. It may give you what you need to make an assessment of quality.
Often people respond to the challenge of measuring quality simply by redefining it from something customers understand into something limited and legalistic. They define requirements in rigorous contractual terms. They make a set of “acceptance tests” that are asserted to prove that “the product works” if none of the tests “fail.” (Am I getting near the scare quote limit for this version of WordPress?) That’s a kind of game called goal displacement, and the resulting measurements are known by measurement experts as surrogate measures. If you can get your customers to agree to play that game (to accept the displaced and surrogate goals and let go of the desire to feel happy with your product), and if you choose not to care whether they are actually happy with your product, maybe you will feel happy. However, the history of technology is littered with abandoned products and withered companies that ignored the elusive but persistent reality of quality. I, personally, don’t want to get rich while upsetting customers. Are you with me?
Still, maybe a partial measure of quality is really all I need. Maybe if I have measures of the purity and mass of a lump of gold, I won’t mind that there is technically no way to tell if gold is really “good” in the sense of whether it helps or hurts society, or any other dimension of goodness. If all I want to do is sell the gold, then the economically quantifiable aspects of gold may be the only aspects that matter to me.
How about instead of saying we are measuring quality, we say we can measure clues about quality? We can collect indicators and make sense of them. We can use measurable data of many kinds for that purpose.
3. Your true goal is probably to arrive at a useful assessment of the status of the product.
Subjectivity can be helpful. I can ask you to rate a product on a scale of 1 to 5 stars, and you can reply. That would be your own statement of feeling, mapped to a number. That’s subjective assessment. It’s an assessment that you came to by interpreting and integrating experiences over time, then filtering them through your story-telling machinery. Such ratings are notoriously unreliable (if I am angry with a product, I am not going to give a four-star review, even if it is mostly good), partly because I am biased and partly because I want my opinion to have an impact. And yet a third party can make meaning from subjectively chosen numbers by looking at trends, discontinuities, and context. In such a case, it’s not that the rating itself has any fixed meaning, but that it participates in a gestalt with other collected data.
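To make the “trends and discontinuities” idea concrete, here is a small hypothetical sketch (mine, not a tool the article describes). It takes a stream of subjective star ratings, smooths them with a rolling average, and flags sudden drops that a human might want to investigate. The data and thresholds are invented for illustration:

```python
# Hypothetical sketch: subjective 1-5 star ratings gain meaning in aggregate.
# No single rating is "true," but a sharp drop in the rolling average is a
# clue worth discussing: it prompts a human to go read the reviews in context.

def rolling_mean(values, window=3):
    """Trailing rolling mean over a list of numbers."""
    means = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means

def flag_discontinuities(ratings, window=3, drop=0.9):
    """Return indices where the rolling mean falls sharply from one rating to the next."""
    means = rolling_mean(ratings, window)
    return [i for i in range(1, len(means)) if means[i - 1] - means[i] >= drop]

# Invented ratings: steady 4s and 5s, then a collapse (a bad release, perhaps?)
ratings = [4, 5, 4, 5, 4, 1, 1, 2, 1, 1]
print(flag_discontinuities(ratings))  # → [5, 6]
```

The flagged indices carry no fixed meaning by themselves; they are only an invitation to ask what happened around that point, which is exactly the gestalt-style interpretation described above.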
We can improve on simple subjective assessment by systematically reviewing and discussing different dimensions of relevant data. It is still subjective, but the assessment becomes more accurate and reliable.
I recently learned that the word assessment comes from the Latin for “to sit with.” As I interpret the etymology and common usage, assessment means to use evidence to make an evaluation. This is bigger than measurement. You can assess using measurements, but you can also assess with any sort of evidence, regardless of whether it is the result of a measurement process. You can, for instance, assess that your tire is flat, without using a ruler or a pressure gauge, simply because it looks obviously flat.
Measurement, then, is just one means to an end. That end is the marshaling of good evidence to inform assessments of quality that can be used to make business decisions (e.g., when to ship the product; whether the product team is performing well; whether problems with the product or the team are persisting, accumulating, or getting resolved).
We can assess quality by gathering evidence about:
- How we tested. What did we test and what did we ignore? There are many interesting dimensions of coverage, and without an understanding of coverage, our findings will be impossible to interpret.
- What we found. The specific problems or interesting absence of problems. We need to understand the implications of the findings, not just count the reports.
- What that tells us about business risk. We must relate our findings to the purposes of our project; to the business and the customers.
None of this reduces to any simple formula. It all requires human social competence and often a great deal of discussion.
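As one hypothetical way to support such a discussion (my sketch, not a practice the article prescribes), the three kinds of evidence above could be kept side by side in a plain record. Note that nothing here computes a quality score; the structure only organizes material for humans to talk about. All names and example strings are invented:

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a record that keeps the three kinds of assessment
# evidence together. It deliberately has no "score" field; the point is to
# feed a conversation, not to replace one.

@dataclass
class QualityAssessmentNotes:
    coverage: list[str] = field(default_factory=list)  # how we tested; what we ignored
    findings: list[str] = field(default_factory=list)  # problems found, or notable absences
    risks: list[str] = field(default_factory=list)     # what the findings mean for the business

    def summary(self) -> str:
        return (f"{len(self.coverage)} coverage notes, "
                f"{len(self.findings)} findings, "
                f"{len(self.risks)} risk notes")

notes = QualityAssessmentNotes()
notes.coverage.append("Tested payment flow on Chrome only; skipped Safari")
notes.findings.append("Intermittent timeout when the cart holds 50+ items")
notes.risks.append("Checkout failures during a holiday sale would hit revenue")
print(notes.summary())  # → 1 coverage notes, 1 findings, 1 risk notes
```

Counting the notes, as `summary()` does, is itself only a clue; the value lies in reading and discussing the notes themselves.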
4. Any system of measuring people, used for evaluation, becomes a means of systematic deception.
When you try to measure software quality, you are indirectly measuring the competence and performance of the people who created it, because software is literally the frozen traces of human problem-solving. It has no other substance than that. What’s my point? Well, you know, workers don’t just sit idly while management creates systems to surveil them. They discover how to use those systems to create a positive impression and to avoid negative ones. This is especially easy to do when management wants measurements to be an alternative to complex and multi-structured evidence; where management doesn’t want to engage in discussions that explore what matters. When all the relevant dimensions are reduced to just a handful of “KPIs”, there is a lot of room for the clever to maneuver.
As anyone who works with automation in testing knows, if you reveal how much time you spend and trouble you have getting the automation to basically work and stay working, it can have terrible consequences. You might be accused of incompetence based on a reckless belief that output checks are easy to automate (all the advertisements say so!). So, you hide those extra hours. You make it look easy. And by doing so you make management believe that it really is easy— perpetuating their fantasy about automation. This is deception that you won’t need if no one is “measuring” your progress in terms of test cases automated or run per unit of time.
This is the problem of false optimization. Any context that involves human judgment (where a tacit process exists that is only partly measurable) is vulnerable to this dysfunction. People tend to optimize toward the light and hide inefficiencies in the darkness. You do this when you get ready for a party at your house by tossing all the clutter into a back room and locking the door. The problem is that there are almost always many ways to optimize that cause harm instead of health to a product or a project. This does not necessarily mean outright fraud, although I would say, in my experience, fraud in such cases is depressingly common. The more important problem is that any “bad number” can be made “better” using a variety of methods, only some of which actually make the system better.
Are you using a bug count to measure quality? Here are ways to make the quality look better without being better:
- Don’t openly report bugs, and don’t track them.
- Don’t establish a reliable or easy way for customers to report problems from the field.
- Demote bug reports to low severity, then make a rule to ignore low severity bugs.
- Make a rule that bugs can be reported only if there is indisputable proof that they violate written requirements and are easily reproducible.
- Report multiple problems in each single bug report form.
- Hire only junior and untrained testers.
- Say that 100% automation is your goal, automate a lot of simplistic checks, then brag about how you are doing “thousands of tests.” You won’t find many bugs that way, but you will look busy and have lots of code to show for it.
- Display anger and sorrow when bugs are reported so testers will feel discouraged about reporting anything.
- Outsource your testing to a company that wants to please you, and make it known you are not pleased by any complaints about quality.
- Get rid of testers and say “quality and testing is everyone’s responsibility.” Fewer bugs will be reported and they will be easy bugs.
5. The urge to measure is often driven by the desire to control people as if they were inanimate objects.
People are messy, as you well know. People can be hard to work with. Anyone who complains that “there are too many meetings” is invariably referring to other people’s meetings. We complain about bad documentation, by which we mean documentation some other guy wrote. Managers frequently yearn for happy, willing, pliable workers who leave their own opinions and styles at home. Conform. Conform, you scalawags!
That’s why I think the urge to measure product quality is motivated not by any love or need of cold rationality, but rather by fear of hot debate; by the hope that “objective measurement” will prevent awkwardness and anger. Otherwise, it makes no sense! I like mathematics, statistics, probability theory, and quality engineering. (I could say I own a stack of books on these subjects, but the truth is I have so many books about them that if you made a stack you’d need scaffolding and a safety video.) You would have to admit I am an enthusiast about numbers. And still I have no urge to measure quality. When I helped ship award-winning software as one of the test managers for Borland C++, I was part of a team of 12 managers who discussed the quality and together we decided if it was good enough. No one ever suggested measuring quality, and I think that’s because we got along with each other.
Maybe instead of measuring quality, make some friends.
In summary, I’m not against collecting data.
- There are many useful kinds of data, only some of which can be measured.
- The notion that objective measurements of quality are rationally required to run a business is not rationally justified.
- An alternative to measurement is assessment, which is actually what everybody is doing in the real world, even when they claim to be measuring quality.
- Assessing quality is a process that can incorporate measures without being itself a measurement process.