Justin Rohrman writes that measurement is one of the biggest problems he's experienced in test management. How do we measure quality, how do we know those measurements are good, and how do we use them to tell a story to executives? In this article, Justin explains how to speak to your business using measurements.
“All models are wrong, but some are useful.”
That familiar quote is attributed to George E. P. Box, a statistician and professor at the University of Wisconsin-Madison during the 1960s.
Measurement is one of the biggest problems I’ve experienced in test management. How do we measure quality, how do we know those measurements are good, and how do we use them to tell a story to executives? It’s a tough problem for lots of reasons. I’d like to start the conversation in part with a book review of Reliability and Validity in Qualitative Research and in part with a discussion of test management. The book is pretty academic and took me some time to get through, but I think there are a lot of ideas in here that are worth thinking about.
A Little about the Book
Reliability and Validity in Qualitative Research is a pretty short book at seventy-three pages, but those are some dense pages. Some of the key points in the book are on the problem of reliability, the problem of validity, and there is a bit on the development of qualitative research as a field of study. There are a lot of gems from social science in these pages that I think will be useful to technologists wanting to communicate to a business in a more meaningful way.
One chapter covers the origins of qualitative research and some of the founding figures. The story about Franz Boas, one of the early founders of qualitative research, really stuck with me. Boas wanted to rid the world of amateur anthropologists by way of training, making fieldwork central to the anthropology experience, and promoting fieldwork notes without personal comment or interpretation. Boas was known for dismissing bad practice in his field of work, but at the same time, he was demonized for not presenting any workable alternatives.
A Little Bit about Measurement
In my experience, no one measure did a great job of telling the story about my software ecosystem. I’ve been deceived by groups of measures, too, because I misunderstood their weaknesses.
If we are so easily deceived by measurements, imagine what happens when we send them off to others who need quick, high-level information.
It is all too easy to get stuck in the software-tester role and fixate on risk without mentioning possible ways to mitigate that risk. This was one of the great downfalls of the anthropologist Franz Boas. Let’s talk about some alternatives you can use immediately to improve your understanding of your software and customer, and communicate more clearly with your business. None of these are perfect, but they might be useful.
Watch Your Customer and Your Dev Group
Go out into the world and hang out with the folks that use your product. Learn about them, who they are, what they do, and how they interact with the software you helped make. You may learn things about your customers that you could never glean from issues they report to you. Spending some time watching how people use your product can be a fantastic learning tool. On more than one occasion, I have seen mysterious bugs become less mysterious by watching how people use the software.
Likewise, step away from your role for a few minutes and observe how your software is really made. This trick has worked better for me in the past when I was looking at a group I didn’t normally work with. Stepping into the unknown is an interesting way to add perspective.
Interviews and Questionnaires
These are in the same vein as qualitative research, but you don’t have to convince your employer to send you off to a far-away land to get the information. A carefully crafted questionnaire can be a low-cost way to get a view on how your users feel about your product. You can also use these to see changes in sentiment over a period of time. Asking questions that can be answered on a spectrum such as “How does this software affect the speed of your work?” can be enlightening.
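As a rough sketch of how changes in sentiment might be tracked across questionnaire rounds, assume the question above is answered on a 1-to-5 scale. The scale, the round labels, and every number below are invented for illustration.

```python
from statistics import median

# Hypothetical answers: 1 = "slows me down a lot", 5 = "speeds me up a lot".
responses_by_round = {
    "2024-Q1": [2, 3, 3, 4, 2, 3],
    "2024-Q2": [3, 3, 4, 4, 2, 4],
    "2024-Q3": [4, 4, 5, 3, 4, 4],
}

def summarize(responses):
    """Return the median answer and the share of favorable answers (4 or 5)."""
    favorable = sum(1 for r in responses if r >= 4)
    return median(responses), favorable / len(responses)

summary = {label: summarize(answers) for label, answers in responses_by_round.items()}

for label, (mid, share) in summary.items():
    print(f"{label}: median={mid}, favorable={share:.0%}")
```

The median and the share of favorable answers are used here rather than an average, since answers on a spectrum are ordered categories. Even so, a summary like this is only a prompt for follow-up conversations with the people who answered.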
Combine Your Strategies
Combining different types of observation can tell a more complete story about your product. Where one tool is weak, another may shine. Using your different forms of observation helps to create a narrative about your product instead of an oversimplified report.
OK, let’s go back to the book now for a few minutes and think about it in terms of how measurement is often used in software.
Validity is the extent to which your measure represents the thing you are measuring. There are a few different types of validity, but those probably aren’t too important to talk about just yet, with one exception. Problems with theoretical validity are what we experience a lot in test management. Theoretical (sometimes called construct) validity is the extent to which your observation corresponds to a theory.
Here are a few examples from test management we can use to make sense of that:
Test Cases Passed
Measuring how many test cases have passed is sometimes thought to be an indicator of how much testing you have done, how much more testing you need to do, and what the current product quality is. There are some problems, though: passed tests aren’t good at exposing risk, and passing tests don’t defend a product if customers find problems.
As a thought exercise, can you think of some other ways test cases passed may not tell you what you initially thought it was telling you?
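One possible answer, sketched with made-up data: a suite can pass completely while saying nothing about areas it never touches. The test names, product areas, and the list of risky areas are all hypothetical.

```python
# Hypothetical test results: each test names the product area it exercises.
results = [
    {"name": "login_valid_user", "area": "login", "passed": True},
    {"name": "login_bad_password", "area": "login", "passed": True},
    {"name": "login_locked_account", "area": "login", "passed": True},
    {"name": "profile_update_name", "area": "profile", "passed": True},
]

# Areas the team considers risky (also an assumption for this sketch).
risky_areas = {"login", "payments", "data_export"}

pass_rate = sum(r["passed"] for r in results) / len(results)
tested_areas = {r["area"] for r in results}
untested_risky_areas = sorted(risky_areas - tested_areas)

print(f"pass rate: {pass_rate:.0%}")
print(f"risky areas with no tests at all: {untested_risky_areas}")
```

Every test passes here, yet two of the three risky areas have no tests at all, which is the gap between the measure and the theory it is supposed to support.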
Test Velocity
Test velocity is how fast testing is moving along. I have a hard time wrapping my head around this one. There are a few ways people like to measure this, such as test cases run per period of time or stories completed over a period of time, but this is all so tightly woven into the development process that I have a hard time thinking in terms of test time. Folks into lean sometimes think of test velocity in terms of the takt concept. Here are some problems with measuring test velocity: tests take different amounts of time to run, so velocity isn’t a consistent measure, and the measure is further skewed by activities like data setup, bug investigation, and reporting.
Can you share a few more ways to find validity problems with test velocity?
The term reliability is used to describe how consistent the results of a measurement are. This book categorizes reliability into three types: quixotic, diachronic, and synchronic. Quixotic reliability applies readily to test measures. Measures with quixotic reliability are unvaryingly consistent, but trivial and misleading.
Here are a couple of metrics that suffer from reliability issues:
Number of Test Cases
The number of test cases is a measure that has lots of reliability problems. Here are a couple to think about: test case count can be gamed to inflate numbers, and it is unclear how to count tests that aren’t documented in a traditional way. What other ways might counting test cases be unreliable?
You probably noticed that most of the solutions I like for measurement problems aren’t actually measurement. A guy named Taiichi Ohno had huge success with this technique at a company he worked for in post-World-War-Two Japan.
Ohno spent a considerable amount of time talking and working side by side with factory workers and customers. This helped him to quickly learn what was and wasn’t working and make immediate changes. His work reshaped how the manufacturing world thought about business.
You may have heard of that company—it’s called Toyota. I'd love for you to try some of these ideas out and tell me about how they work for you!
Number of test cases and Number of test cases passed are both meaningless metrics. What is key is how effective the test cases are - in other words, what percentage of the application's requirements are being tested, and what percentage of the application's requirements have been shown to function properly. As often as not, testers perform thousands of automated tests, most of which pass, without having any idea what percentage of requirements they have covered. In most cases, coverage is pitifully low - usually less than 50%.
Can't coverage be equally meaningless though? As far as I understand, coverage is only meaningful when you are talking about what you are covering. Requirements could be an example of that, method coverage could be another. The problem I see with these is that they tell you very little about the testing that was done. In measuring requirements coverage you know that a requirement was tested somehow but you don't know that the testing was meaningful or useful.
I do like using coverage models to show that something important may have been missed.