foobuzz

by Valentin, October 29 2024, in tech

The case for 100% code coverage

When it comes to testing, I'm a proponent of aiming for 100% code coverage, and I've encountered numerous opinions claiming that such a goal is useless, worthless, idiotic, or merely amusing (and other variations on the "bad idea" part of the spectrum). I have found that many critics confuse necessary with sufficient, as if my point were that the "100%" stamp ensures the code is perfect™, which it obviously doesn't. And, throwing the baby out with the bathwater, if it isn't sufficient, then surely it's worthless, I guess? Anyway, the point of this article is to explain why I think 100% is valuable.

Code that never runs

When you have a piece of code, you ideally want to test that it runs fine with all possible combinations of its variables. This is either very hard or impossible to do, because the space of possible values is infinite or very large, and/or because you don't have the time to invest in formal proof methods (or your code is too complicated for them). So you settle on testing a set of specific combinations of variables, which are either "plausible" relative to what you expect from production (nominal cases) or "extreme" relative to the limits of your code's parameters (edge cases). This tested subset gives you some amount of confidence in the correctness of your code, but certainly no proof.
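As a purely illustrative sketch (the parse_age function and its tests are hypothetical, not taken from any real codebase), this is what such a subset typically looks like in practice:

def parse_age(text: str) -> int:
    """Hypothetical helper: parse an age between 0 and 150 from a string."""
    value = int(text.strip())
    if value < 0 or value > 150:
        raise ValueError("age out of range")
    return value

def test_parse_age_nominal_cases():
    # "Plausible" inputs, relative to what we expect from production
    assert parse_age("42") == 42
    assert parse_age(" 7 ") == 7

def test_parse_age_edge_cases():
    # "Extreme" inputs, relative to the limits of the parameters
    assert parse_age("0") == 0
    assert parse_age("150") == 150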

However, there is also the reverse situation: some pieces of code are broken in such a way that there is no combination of variables for which they run: they are always wrong (or they always crash).

Here is an example:

def get_best_paid(employees: list[tuple[str, int]]) -> str:
    """
    Precondition: `employees` must not be empty
    """
    max_salary = None
    best_paid = None

    for employee_name, salary in employees:
        if salary > max_salary:
            max_salary = salary
            best_paid = employee_name

    return best_paid

This Python function computes who is the best-paid employee in a list of employees, by simply keeping track of the current maximum salary while iterating over the list. The bug is that the max_salary variable has been initialized to None instead of 0, so the comparison salary > max_salary raises a TypeError on the very first item of the list. There is no possible input complying with the type hints and the precondition for which this function executes successfully. I call this: code that never runs.

Contrary to "code that always runs", which can never be proved no matter how many tests you throw at it, "code that never runs" can be disproved with one single test (any test that successfully executes the code). In this sense, 1 test (compared to 0) is vastly more interesting than N tests (compared to N-1), since that 1 test can at least prove something.
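To make that concrete, here is a minimal sketch of such a test (pytest-style, assuming it lives next to the get_best_paid function above; the employee data is made up):

def test_get_best_paid_executes():
    # Any input satisfying the precondition will do: one successful execution
    # is enough to disprove "code that never runs". Against the buggy version
    # above, this test fails with:
    # TypeError: '>' not supported between instances of 'int' and 'NoneType'
    assert get_best_paid([("Alice", 50_000), ("Bob", 60_000)]) == "Bob"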

A couple of claims

Now, a couple of claims (I don't have any strong evidence for them, so feel free to challenge them, since they're kind of the centerpiece of the argument):

  • Claim n°1: Programmers regularly produce code that never runs. Here is what makes me believe this: regularly, when contributing to a well-tested module, I change one very specific thing, but when I run the tests, all of them are red. I reckon this is an experience any programmer is familiar with. Of course, what happened is that I mistakenly introduced a bug that completely broke the code for every case. I produced code that never runs.
  • Claim n°2: There are diminishing returns in the amount of confidence gained about code correctness as a function of the number of tests. In other words, the gain in confidence between N and N+1 tests is bigger than the gain in confidence between N+1 and N+2 tests. Consequently, the difference between 0 and 1 test is the biggest gain in confidence you can get about a piece of code: you have proved that your code does not "never run".

Now, if only there were a way to benefit from those principles on an entire codebase; a way to prove that there is no code that "never runs" in the codebase (which, I claim, is not a rare thing to produce). Well, there is: just have a test suite that runs each piece of code at least once. This is, by definition, 100% code coverage. Programmers happily let their tooling make assertions about 100% of their code when it comes to formatting, typing, and other cool things, but God forbid they aim to prove that 100% of the code is not utterly broken.
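For what it's worth, enforcing this is usually a one-liner in the test tooling. A sketch with Python's coverage.py (assuming pytest as the test runner; other ecosystems have equivalents):

# Run the test suite while recording which lines get executed...
coverage run -m pytest
# ...then fail the command (e.g. in CI) if any line was never run.
coverage report --fail-under=100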

Again, some disclaimers for those who struggle with nuance: I'm not saying that 100% code coverage is sufficient in any way (it's far, very far, from sufficient). I'm also not saying that 100% code coverage is easy, nor that it should be prioritized above everything else. In fact, I have never maintained a codebase with 100% code coverage, and do not even have this hygiene on my personal project.

I'm claiming a way more conservative thing: that there is value in 100% code coverage.