A post about how numeric operations that look plausible out of context, like the average of averages, lead to wrong reports and bad decisions, and why that motivated introducing a number type that doesn’t allow addition.
Some years ago I had an interaction with my manager that went like this:
manager> Hey, there’s something wrong with the numbers coming out of your metrics system?
me> unlikely, but not impossible – what’s up?
manager> I’m looking at this report – all of the averages by category are correct, but the averages for the total are wrong – here, check this out [screenshot of a PDF produced by another team’s system consuming my system’s data]
me> Those are using literally the same underlying math – my guess is [the other team] messed up their queries. I’ll get to the bottom of it.
And then, two days later, in the group chat:
me> So, here’s why we’re introducing a number type which doesn’t allow addition:
A number is a number right? 3 is a number, π is a number, a billion is a number. What are things we can do with numbers? Add them, subtract them, multiply them, divide them? Well zero is a number, right? Yet, you can’t divide something by zero. In the context of division, you might say that all real numbers are potential dividends which can be divided by potential divisors, which are all real numbers excluding zero.
Computers encourage us to think of things as numbers. There are a variety of reasons for this: numbers are efficient to store and work with, and many concepts from timestamps to unique IDs can be represented as numbers. 1773628147.944 in the context of “number of seconds since Jan 1, 1970” is a way of representing the current time, and when I type uuidgen in my computer’s terminal I get 08e0532f-0863-4c06-b24b-5ef53aab2933, a 128-bit number (conventionally written in hexadecimal) so big that many environments require a special library to do arithmetic with it. Computer touchers are used to numbers of all sorts, but that familiarity is sometimes dangerous.
When you represent a timestamp as a number, there are certain arithmetic operations that make sense and others that don’t. What does it mean to add, multiply, or divide two timestamps? While you can add 507343153 and 1000213980 to get 1507557133, in the context of the number of seconds since Jan 1, 1970 the result is entirely meaningless – these are pointless operations.
Yet, subtracting one timestamp from another gives us the number of seconds between the two – a duration – and you can then add, multiply, or divide that. Subtracting the above timestamps yields 492870827 seconds, about 136,908 hours, or a bit over fifteen and a half years.
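Python’s standard datetime type encodes exactly this rule, so a quick sketch can make the point concrete (using the same two timestamps as above):

```python
from datetime import datetime, timezone

a = datetime.fromtimestamp(507343153, tz=timezone.utc)
b = datetime.fromtimestamp(1000213980, tz=timezone.utc)

# Subtracting two points in time is meaningful: it yields a duration (timedelta).
duration = b - a
assert duration.total_seconds() == 492870827.0

# Adding two points in time is not meaningful, and the type says so.
try:
    b + a
except TypeError:
    print("datetime + datetime is rejected")
```

The interesting part is that the rejection comes from the type itself, not from a reviewer remembering to catch it.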
We can operate on numbers based on the context in which they hold meaning, and this is something that is generally intuitive – adding ‘$19.99’ and ‘€9.90’ doesn’t produce ‘€$29.89’, because adding values in different units requires a conversion, and converting monetary units requires an exchange rate, and exchange rates fluctuate.
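That unit rule is easy to encode. Here’s a minimal sketch using a hypothetical Money type – the class and its API are illustrative, not from any real system:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Money:
    """A monetary value tagged with its currency (illustrative sketch)."""
    amount: float
    currency: str

    def __add__(self, other):
        if not isinstance(other, Money):
            return NotImplemented
        if self.currency != other.currency:
            # Adding across currencies needs an exchange rate,
            # which this type deliberately knows nothing about.
            raise ValueError(f"cannot add {self.currency} and {other.currency}")
        return Money(self.amount + other.amount, self.currency)

print(Money(20.0, "USD") + Money(5.0, "USD"))  # Money(amount=25.0, currency='USD')

try:
    Money(19.99, "USD") + Money(9.90, "EUR")
except ValueError as e:
    print(e)  # cannot add USD and EUR
```

Same-currency addition works; cross-currency addition fails loudly instead of silently producing ‘€$29.89’.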
Which brings me to statistics, an easy refuge of those who would lie with numbers. You can say things like the average family with some number of children has 1.8 kids and this is a lie with a kernel of truth – it is both obviously false and gives us an idea about what things look like in aggregate.
But if you were to say in State A, the average family with some number of children has 2.1 children and in State B, that number is 1.5, there are certain things you can and can’t do with these numbers: You can compare them, you can take their difference to quantify a comparison, and you can even divide one by the other to get a multiple, so you can say that families with children in State A have 1.4 times as many children as in State B.
What you can’t do is add those numbers together (3.6 has no meaning in this context), much less take their sum and then divide it by two to get an aggregate mean value and claim that families with children in States A and B have 1.8 children on average. I mean, you could, but you’re very likely wrong. There’s simply no way of knowing without at least the sums and value counts underlying the provided 2.1 and 1.5 values.
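The arithmetic is easy to check with made-up group sizes (the family counts below are invented purely for illustration):

```python
# Two groups of families, deliberately different in size.
state_a = [2, 2, 2, 2, 2, 2, 2, 3, 2, 2]  # 10 families, mean 2.1
state_b = [1, 2]                          # 2 families, mean 1.5

mean_a = sum(state_a) / len(state_a)  # 2.1
mean_b = sum(state_b) / len(state_b)  # 1.5

# The "average of averages" ignores group sizes entirely.
naive = (mean_a + mean_b) / 2  # 1.8 -- only right if the groups are equal-sized

# The true aggregate mean needs the underlying sums and counts.
true_mean = (sum(state_a) + sum(state_b)) / (len(state_a) + len(state_b))
print(naive, true_mean)  # 1.8 2.0
```

With 10 families in one group and 2 in the other, the naive answer is off by 0.2 children per family, and nothing in the two means alone warns you.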
This “average of averages” operation is exactly why the report mentioned up at the top of the page was wrong: the other team was getting some numbers from my system and using them in ways that seemed to make sense and yet were wrong in the context in which those numbers existed. They had done this intending to look clever by saving resources, but they didn’t realize it wasn’t producing valid results, and for several months they had produced reports containing incorrect data – reports on which people made decisions.
This is why I introduced a new numeric type to our in-house type system. There were already a few for various domain-specific measurable things, helping people to avoid operations that didn’t make sense in their contexts, and I had simply found another place that needed similar mistake-proofing.
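I don’t know what that in-house type system looked like beyond what’s described here, but the shape of such a type might look like this hypothetical Python sketch: an average that carries its underlying sum and count, refuses `+`, and offers a correct combine operation instead.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mean:
    """An average that refuses addition (illustrative sketch).

    It keeps the sum and count it was computed from, so it can be
    combined correctly instead of being averaged naively.
    """
    total: float
    count: int

    @property
    def value(self) -> float:
        return self.total / self.count

    def __add__(self, other):
        raise TypeError("averages cannot be added; use Mean.combine()")

    def combine(self, other: "Mean") -> "Mean":
        return Mean(self.total + other.total, self.count + other.count)

state_a = Mean(total=21, count=10)  # mean 2.1
state_b = Mean(total=3, count=2)    # mean 1.5

print(state_a.combine(state_b).value)  # 2.0, the real aggregate mean
```

Anyone reaching for `state_a + state_b` gets an immediate error pointing at the correct operation, rather than a plausible-looking wrong number months later.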
Once you know to look for it, this sort of abstraction & affordance mismatch happens more often than you’d think. When something is presented in a particular form, we make assumptions about what that form offers: a door is meant for opening, a guitar meant for playing, a chair meant for sitting; and when those assumptions don’t hold, it’s generally obvious from context.
In computers, as with many mostly-abstract things, it’s up to us to create that context when base assumptions don’t hold. Is a number meant for adding? How can one tell, given the cues available? Are you relying on people’s domain expertise to know? What happens when you bring in someone who isn’t a domain expert? Are you sure the only people ever touching this code will be domain experts? Are you comfortable relying on the training you think you’ll give them?
I know many people used to dealing with dynamically-typed languages tend to dislike this sort of abstraction, or even the concept of an in-house type system which spans multiple systems, but I’ve seen the value of it – many times over – in helping people who are good with logic but not necessarily domain experts on their project’s subject matter avoid potentially catastrophic mistakes.