Beyond the Numbers: When Measures Become Misguided Targets
Discover how misused metrics can lead to unintended consequences in software development, as explained by Goodhart's Law, and how to avoid these pitfalls.
In 2016, Wells Fargo was the third-largest bank in the United States by assets. That same year, however, the bank came under massive regulatory scrutiny for widespread fraud and was fined a total of $185 million. What happened?
About 15 to 20 years before the fraud came to light in 2016, the bank had recognized significant growth potential in cross-selling products. Cross-selling is a method in which companies focus on existing clients and offer incentives for purchasing additional products instead of seeking new customers. In banking, this might mean convincing checking account clients to add savings accounts or mortgages.
Wells Fargo began tracking its cross-selling success by measuring the average number of products held per customer. While cross-selling itself is not fraudulent, Wells Fargo executives were so eager to maximize revenue that they pressured sales employees to boost this metric. To meet increasingly unattainable sales targets, employees began opening accounts with customers' data but without their consent. An article in the American Bankruptcy Institute Journal revealed that employees "opened as many as 1.5 million checking and savings accounts, and more than 500,000 credit cards, without customers' authorization."
The focus on a specific metric, combined with immense pressure to hit targets on that same metric, led to unintended behavior. Wells Fargo's story is not an isolated case. Similar examples include urban legends such as the Soviet nail factory, where factories met production quotas by making either tiny or massive nails that were useless, depending on whether output was measured by quantity or by weight. Another is India's "cobra effect," where a bounty on cobras led people to breed them for profit, ultimately increasing the cobra population.
In 1975, the economist Charles Goodhart described this effect in an article. The idea later became known as Goodhart's Law and is often paraphrased as:
When a measure becomes a target, it ceases to be a good measure.
Measuring the average number of banking products per customer is not wrong in itself. In software development, we do this all the time. We track technical aspects like CPU load, network latency, and lines of code, as well as functional aspects like daily active users, purchases, and registrations. There's nothing wrong with gathering data and using it to make informed decisions, right?
Goodhart's Law targets the intention with which metrics are used. Measuring is beneficial, but using metrics as targets and goals can be problematic. Let's dive deeper into this dynamic.
Metrics in Software Development
Software development is product development. Every software team builds a software product that others use to derive value. It follows that the quality of this product matters since users prefer products of higher quality. In software, this usually concerns things users recognize, like performance, reliability, and usability. It also includes things developers observe, like maintainability, extensibility, and operability.
These aspects are often subjective and difficult to quantify. Think of a product you like using, such as your phone or bike. Why do you like it? What you will likely come up with is a gut feeling rather than hard, data-based evidence. However, we like to use metrics to make quality aspects measurable and to implement a baseline that we can improve on. In "Architectural Fitness Functions: Building Effective Feedback Mechanisms for Quality Attributes", I wrote:
You can gather feedback on multiple aspects. Among the most crucial are important quality attributes, like performance or reliability: Do we need to build a software service that is both highly scalable and highly available? Is high performance crucial for our product? We need to assess whether the system we build holds up to those aspects.
We need effective and (if possible) automated mechanisms, especially for quality attributes. The book "Building Evolutionary Architectures: Automated Software Governance" introduced a term for feedback mechanisms on quality attributes borrowed from evolutionary algorithms: (architectural) fitness functions.
In short, an (architectural) fitness function evaluates the system with regard to a quality attribute.
These fitness functions can act as metrics that feed evaluative or corrective information (i.e., feedback) back to the development team, helping to detect areas for improvement.
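To make this concrete, here is a minimal sketch of an automated fitness function for the performance quality attribute: a test that asserts a latency budget for a single request against a running system. The endpoint URL and the 300 ms budget are illustrative assumptions, not values from a real project.

```python
import time
import urllib.request

# A minimal fitness function sketch: the deployed system must answer
# a key request within a latency budget. URL and budget are hypothetical.
ENDPOINT = "https://staging.example.com/api/search?q=test"
LATENCY_BUDGET_SECONDS = 0.3

def test_search_endpoint_meets_latency_budget():
    start = time.perf_counter()
    with urllib.request.urlopen(ENDPOINT, timeout=5) as response:
        assert response.status == 200
    elapsed = time.perf_counter() - start
    assert elapsed <= LATENCY_BUDGET_SECONDS, (
        f"Search took {elapsed:.3f}s; budget is {LATENCY_BUDGET_SECONDS}s"
    )
```

Run in a CI pipeline against a staging environment, a check like this feeds corrective information back to the team automatically.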
It is precisely in this use of feedback as corrective information that Goodhart's Law becomes relevant. As soon as you use metrics as a framework for evaluation and decision-making, issues can arise. Focusing on a few metrics tends to narrow your view of the system. People start asking, "Why is the metric down?" instead of reasoning about the actual quality of the entire system. We use metrics because the aspects we care about, such as subjective quality, are often fuzzy and not directly measurable. While we aim for something intangible, we rely on metrics as proxies for that goal. When we conflate the two, we risk prioritizing actions that improve the metric at the expense of the actual objective. The metric is not the goal itself.
Take test coverage, for example. It measures the proportion of code statements exercised by the unit tests in a test suite. It is a proxy metric: the actual goal is software that functions correctly, and test coverage merely approximates how thoroughly the test suite verifies that behavior. Taking the proxy as the goal can be problematic, however. Incentivizing teams and individuals to hit 90% test coverage can encourage writing "easy" tests that cover a lot of code, especially under time pressure. Think of testing getters and setters. Other examples of proxy metrics are cyclomatic complexity or the load placed on a particular service.
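To illustrate the coverage example, consider this hypothetical class and test. The test executes every statement, so it yields 100% statement coverage, yet it would never catch the bug in apply_discount because the result is never asserted.

```python
class Order:
    def __init__(self, total: float) -> None:
        self.total = total

    def apply_discount(self, percent: float) -> float:
        # Bug: should be self.total * (1 - percent / 100)
        return self.total * (1 - percent)

# This test "covers" every statement of Order but asserts nothing
# about the discount calculation, so the bug above goes unnoticed.
def test_order_inflates_coverage():
    order = Order(100.0)
    order.apply_discount(10)  # executed, result silently discarded
    assert order.total == 100.0  # only checks the trivial part
```

Coverage tooling reports this code as fully tested; the metric improves while the actual objective, correct software, does not.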
Holistic Metrics
You can classify fitness functions on a scale from atomic to holistic. I provided an overview of this characterization in "Architectural Fitness Functions: Building Effective Feedback Mechanisms for Quality Attributes":
Atomic fitness functions evaluate a part of the system on a specific quality attribute, like measuring the length of methods within classes or monitoring the status of nodes within a system. Those fitness functions are easier to build, but they only give you a narrow view of the system's quality.
Fitness functions can also be holistic. They test the entire system or large parts of it and assess specific quality attributes. For example, they might measure response times or observe the system after being put under load. Holistic fitness functions are usually more complex and expensive to build.
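As a sketch of how cheap the atomic end of this scale can be, the following checks one narrow property: no Python function may exceed a line-count limit. The 50-line threshold and the src directory are assumptions for illustration.

```python
import ast
import pathlib

# An atomic fitness function: flag every function definition longer
# than a fixed number of lines. Threshold and directory are assumptions.
MAX_FUNCTION_LINES = 50

def overlong_functions(source_dir: str) -> list[str]:
    violations = []
    for path in pathlib.Path(source_dir).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                length = node.end_lineno - node.lineno + 1
                if length > MAX_FUNCTION_LINES:
                    violations.append(
                        f"{path}:{node.lineno} {node.name} ({length} lines)"
                    )
    return violations

def test_no_overlong_functions():
    assert overlong_functions("src") == []
```

Such a check is cheap to build and useful as an indicator, but it is exactly the kind of narrow measure that invites gaming if it becomes a hard target.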
A rule of thumb: the more a fitness function (or a metric in general) leans towards the atomic side of the scale, the more prone it is to Goodhart's Law. To discuss this effect in detail, we must first distinguish between outputs and outcomes.
Outputs are the deliverables of our software project - specifically, its code and executables. Outcomes are the effects these deliverables create in the world, like users interacting with the system or developers interacting with the code. The more you focus on measuring outputs instead of outcomes, the more you are building fitness functions on the atomic side. Meanwhile, focusing on measuring outcomes leads to producing holistic fitness functions since you are more likely to measure the actual effect of the system acting in its context.
Outcome measurements have issues as well. Not only are they more challenging to obtain, but they are also more prone to bias and distortion, since not all aspects that affect the measurement are ones you intend to measure. Take a holistic, outcome-focused measurement like the number of daily active users. It is less prone to Goodhart's Law because it is harder to game. However, if daily active users go down, what is the root cause? Is the system's performance or usability degrading? Do newer features not prove valuable to users? Or do competing products simply outperform ours? Figuring this out is complex, and the metric alone can't give us an answer, only an indication.
Gaming Is a Systemic Symptom, Not the Fault of Individuals
Above, I often phrased actions as “gaming the system.” I want to clarify that I do not believe metric gaming is the fault of teams or individuals. “Gaming” is a response to a flawed system of measurements and metrics. If leaders put pressure on developers to achieve a certain percentage of test coverage, they should not be surprised if the quality and purpose of some automated tests decline. Blame the flawed metric and incentive, not the individuals.
Conclusion
When building metrics and fitness functions, it is important to consider Goodhart’s Law. Avoid using atomic feedback mechanisms as targets; instead, focus on improving holistic measurements. Use atomic mechanisms as useful indicators to identify where to dig deeper after observing degradation in holistic metrics.
Further Reading
If you want to read more about the Wells Fargo cross-selling scandal, check out the article on Wikipedia.