Test coverage only matters if it's at 100%
Summary: all lines of code should be covered by a test or explicitly excluded from test coverage.
What's your test coverage? What do you consider a good test coverage percentage?
Those questions often get asked during engineering interviews. They give the interviewee a sense of how much the company cares about quality. They're also good questions for the candidate, because there's no cargo-cult answer: you need to argue your position, which gives useful insight into one's quality mindset.
I believe the only right value to track for a test coverage metric is 100% of the lines that should be covered (not the lines that are written), which ensures that the lines that you don't want to cover are explicit, and not implicit.
Please don't just skim the article: if you read it through, you'll find that the point I'm making is pretty common sense. If you still disagree, have a look at the responses to common objections at the end.
Note: I'm using Python for all the examples below, but the principles are applicable regardless of the programming language.
Why you should measure test coverage
If you're not already monitoring test coverage, you should probably start now
(if you don't have tests, that's for another article). Some programming
languages have better support than others. Python has excellent support for it,
through pytest
and pytest-cov
(using coverage
under the hood).
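For instance, running the tests with coverage enabled could look like this (a minimal sketch: the toaster package name is hypothetical, but the flags are standard pytest-cov options):

pytest --cov=toaster --cov-report=term-missing

The term-missing report lists, for each file, the line numbers that were never executed during the run.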
Test all functionality
Test coverage allows you to check which lines of code are run while the tests are executed. If some lines are missing coverage, it means you're not testing ("covering") the underlying behavior. You could introduce a bug, and no test would fail, which is not desirable. Introducing a bug should ideally trigger a test to fail.
Let's see a contrived example:
def toast(bread: str) -> str:
    if bread == "brioche":
        ...
        1 / 0  # let's say a bug is hidden here
        return "ok"
    return bread


def test_toast():
    assert toast("baguette") == "baguette"
The test will pass, even though toast("brioche") would fail. Adding a test for this case will let you find the ZeroDivisionError.
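For instance, a test like the following (a hypothetical test_toast_brioche, written against the expected behavior) would immediately fail with a ZeroDivisionError and expose the bug:

def test_toast_brioche():
    # fails with ZeroDivisionError until the hidden bug is fixed
    assert toast("brioche") == "ok"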
Ensure all tests are run
Another thing you want to check is that your tests actually run. In this contrived example, the tests will pass, even though one of them does not test anything:
def identity(bread: str) -> str:
    return bread


def test_identity():
    assert identity("brioche") == "brioche"


def test_identity_noop():
    if 1 == 1:  # this is a contrived example, there shouldn't be any 'if' in tests
        return
    assert identity("baguette") == "baguette"
Once you've agreed that you need test coverage, the question remains: what is the right level of coverage? It is very costly to ensure that absolutely every line of code is tested. Should it be 80%? 90%?
Why you should check for 100% coverage
TOTAL 18435 16 1742 0 99%
FAIL Required test coverage of 100% not reached. Total coverage: 99.86%
100% is clear cut.
Enforcing 100% coverage requires you to be explicit about the lines that don't
have to be covered ("explicit is better than implicit"). By including the no cover
marker (or whatever your language/coverage lib supports) next to the
code that is excluded from coverage, you allow code reviewers to see your
coverage decisions, and challenge them.
The problem with having a goal that is not 100% is that it leaves room for interpretation and negotiation. If we're currently at 80% and I bring coverage down to 79.9% because I'm too lazy to test my code (or to exclude it from coverage), can't we say that it's still ok? Explicit markers in the code make the discussion more factual and objective.
The other problem with a target other than 100% is that you might be adding coverage in one area and reducing it in another without knowing it. The total might still be 80%, but the reduction in coverage might have happened in a crucial area of your code.
Will 100% coverage prevent any bug from being introduced?
Of course not. Just because a line is run in a test does not mean it is 100% correct. You might not be testing all variations. The test might assert the wrong thing.
Some lines of code should be tested twice to ensure they're correct. See this contrived example:
def can_toast(bread: str) -> bool:
    return bread not in ["croissant"]  # this line is covered


def test_can_toast():
    assert can_toast("baguette") is True
All the lines are covered, but I'm not testing that you can't toast "croissant". Let's say this behavior is critical (toasting a croissant is a crime!). Without another test, somebody could change the code like this, and nothing would break.
def can_toast(bread: str) -> bool:
    return True  # this line is still covered


def test_can_toast():
    assert can_toast("baguette") is True
Therefore, you sometimes definitely should test the same line more than once, covering both the failure case and the success case. I'd strongly recommend this for permission checks, for instance.
def test_can_toast():
    assert can_toast("baguette") is True


def test_can_toast_cannot():
    assert can_toast("croissant") is False
Is this realistic?
You don't need to cover 100% of the lines you write. Some are not worth covering, as you'll see in the examples below. However, you need to cover 100% of the lines you want to be covered, and you need to be explicit about excluding the other lines from coverage. Tools such as Python's coverage allow you to specify which lines should be excluded.
Using pytest-cov, you can achieve this pretty simply:
PLUGGED = True


def toast(bread: str) -> str:
    if not PLUGGED:  # pragma: no cover
        return "ok"
    ...
    return "ok"


def test_toast():
    assert toast("bread") == "ok"
With this setup, test coverage will be at 100% even though we did not test the case where the toaster is not plugged: it is too simple, and writing a test for this "functionality" would not be worth it. Therefore, we explicitly exclude it from coverage. You should revisit this decision if that code becomes more complex and needs to be tested.
What can/should be excluded from coverage?
It depends on your codebase and your quality expectations. Also, as with any rule, there are exceptions. The examples are written in Python, but the principles are not language-dependent.
Simple error handling
def toast(bread: str) -> str:
    if not bread:  # pragma: no cover - the handling is too obvious
        raise ValueError("no bread provided")
    ...
Anything more complicated than that should probably be covered.
Proxy functions
from external_lib import Toaster


def toast(bread: str) -> str:  # pragma: no cover - we're just proxying
    return Toaster.toast(bread)
It also applies to thin wrappers you might have in front of your dependencies.
Code as configuration
def get_temperature(bread: str) -> int:  # pragma: no cover - no complexity there
    if bread == "brioche":
        return 1
    elif bread == "baguette":
        return 3
    ...
    return 2
Simple early return/continue
from typing import List


def toast(breads: List[str]) -> str:
    for bread in breads:
        if not bread:  # pragma: no cover - simple early continue
            continue
        ...
    return "ok"
If the early return/continue is part of the logic, or leads to complicated side effects, it should be tested.
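As a counter-example, here is a hypothetical toast_all helper where the early continue encodes business logic (brioche is never toasted), so it deserves a test of its own rather than a no cover marker:

from typing import List


def toast_all(breads: List[str]) -> int:
    toasted = 0
    for bread in breads:
        if bread == "brioche":  # part of the logic: brioche is never toasted
            continue
        toasted += 1
    return toasted


def test_toast_all_skips_brioche():
    assert toast_all(["baguette", "brioche"]) == 1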
Code that is too difficult to test (rare)
In rare cases, it is a perfectly fine decision to exclude some code from coverage if the tests are too expensive to write (because of the required setup, or the use of an external library, for instance). Again, this should only happen rarely, for instance when you're using an external library that is not easy to mock. In most cases, if your code is complicated to test, that probably means it needs to be refactored: easy-to-test code is usually better designed.
from packagea import A
from packageb import B


def toast(bread: str) -> str:  # pragma: no cover - too complicated to test
    a = A.expensive_function_that_needs_to_be_mocked()
    b = B.expensive_function_that_needs_to_be_mocked()
    ...  # lots of code
    return "ok"
Make sure that this code is easy to test manually, and that the code calling it is well tested.
Debugging/manual test code
import os

DEBUG = os.environ.get("DEBUGGING")


def toast(bread: str) -> str:
    if DEBUG:  # pragma: no cover - debugging code
        print(f"toasting {bread}")
    ...
    return "ok"
What should be covered
Here's a list of things that should be covered, ideally more than once.
Branches
If your language supports it, make sure you test all branches:
def toast(bread: str) -> str:       # 1
    if bread == "baguette":         # 2
        bread = "toasted_baguette"  # 3
    return bread                    # 4


def test_toast_baguette():
    assert toast("baguette") == "toasted_baguette"
You're technically at 100% line coverage, but the case where bread is not "baguette" is never tested: the if on line 2 is never evaluated to False, so execution never jumps from line 2 to line 4. If you activate branch coverage, this will be flagged as a partial branch, and coverage won't be at 100%. To fully test it, you'll need:
def toast(bread: str) -> str:       # 1
    if bread == "baguette":         # 2
        bread = "toasted_baguette"  # 3
    return bread                    # 4


def test_toast_baguette():
    assert toast("baguette") == "toasted_baguette"  # 2 -> 3 -> 4


def test_toast_other():
    assert toast("brioche") == "brioche"  # 2 -> 4
As shown in this example, this lets you test more functionality, thus reducing the probability of bugs.
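If you're using pytest-cov, branch coverage can be enabled with the standard --cov-branch flag, or through coverage's own configuration (a sketch, assuming a pyproject.toml-based setup):

pytest --cov --cov-branch

# or, equivalently, in pyproject.toml:
[tool.coverage.run]
branch = true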
Security: permission check, input sanitization...
Anything that relates to security should be heavily tested, with all sorts of inputs.
import re

import pytest

REGEX = re.compile("[a-z]+")


def toast(bread: str) -> str:
    match = REGEX.match(bread)
    if not match:
        raise ValueError("Invalid bread variant")
    ...


def test_toast_fail():  # note: you can use pytest's parametrize feature here.
    with pytest.raises(ValueError):
        toast("12")
    with pytest.raises(ValueError):
        toast("@#*$")
    with pytest.raises(ValueError):
        toast("<script>window.alert('hello world!')</script>")
    with pytest.raises(ValueError):
        toast("'; DROP TABLE users;")
Complex error handling
If error handling is sophisticated enough and can be considered a feature (e.g. recovering gracefully from some error) - it needs to be covered.
import pytest
from doubles import expect


def toast(bread: str) -> None:
    # `toaster` is assumed to be an object defined elsewhere (e.g. a module-level instance)
    if bread == "brioche":
        toaster.unplug()
        toaster.cleanup()
        raise ValueError()
    ...


def test_toast_fail():
    expect(toaster).unplug()
    expect(toaster).cleanup()
    with pytest.raises(ValueError):
        toast("brioche")
New functionality (even for proof-of-concept)
While it might be tempting to exclude new prototyping code from coverage, I think it's a bad idea because that's one more thing to remember and track. Moreover, once a feature is shipped, we usually move on to the next feature and never clean up our tests. So, it's safer to get into the habit of covering all new code.
How to get to 100%
- Start tracking and displaying the test coverage. Make it easy to find uncovered lines (for instance with an HTML report). Have an open discussion with your team.
- Ensure all new code is fully covered, and that overall coverage never decreases. Some GitHub bots let you enforce this.
- Before writing new code, check the coverage of the existing functionality and bring it to 100%. Try to structure your tests so that running a single test file (e.g. test_toaster.py) covers 100% of its associated code file (toaster.py). It allows for quicker iterations.
- Once you're at 100%, use your preferred tool's configuration to fail the tests if coverage is not 100%.
Coverage HTML written to dir htmlcov
Required test coverage of 100% reached. Total coverage: 100.00%
Answers to common objections
100% test coverage does not mean you're testing everything right.
Absolutely - this point is explicitly stated in this article. I even give an example situation showing how just checking the test coverage leads to missing an important test.
100% test coverage does not make your code invulnerable, and it should evidently not be your only metric. This article is only about the test coverage metric.
A test suite that covers 80% is pretty good
Absolutely. It is a good number and a good goal for any codebase.
However, what about the remaining 20%? Why is it not tested? Will it still be clear in 2 months why it was not tested? In 6 months? In a year? While it may make perfect sense not to test it, you should be explicit about that decision and keep the reason in the code.
If you don't keep the test coverage metric at 100%, then you leave it up to the code reviewer to challenge your test coverage assumption.
100% is a blanket rule that leaves no room for negotiation
Once again, the goal is not to cover 100% of the lines of code - that would be almost impossible. Thanks to no cover markers, you can still decide to exclude code from coverage. It actually makes this negotiation explicit in the code, as opposed to implicit during the code review phase.
Consider the example below:
def toast(bread: str) -> str:
    if bread == "brioche":
        raise ValueError("...")
Let's say you're fixing a bug in this code, and you find out that the if branch is not covered. You're left to wonder why. Did the developer not have enough time? Did they decide it was too trivial to need a test?
With an explicit marker, the intent is clear, and it gives you insight into what is considered appropriate coverage for this codebase:
def toast(bread: str) -> str:
    if bread == "brioche":  # pragma: no cover, trivial error case
        raise ValueError("...")
It is not feasible in an old codebase
Right, that's probably the case. However, this is no different from adopting proper testing, monitoring, or logging practices: decide if it's important, start with something attainable that brings you closer to the goal, and iterate.
Also, if it is not a desirable goal for the codebase, then for sure don't monitor test coverage!
Enforcing 100% test coverage leads to bad tests
It bears repeating: this is not about testing 100% of the lines; this is about keeping the code coverage metric at 100%. I am not sure how that would lead to bad tests.
Putting too much focus on one particular metric might lead to gaming behavior. That does not mean that no metric should be enforced. I believe there isn't enough material and training about what it takes to write good tests; that would be beneficial regardless of whether you enforce test coverage. Also, the more you talk about test coverage, the more you iterate and learn about what it takes to write good tests in your language and codebase.
What's certain is that it is straightforward to write bad tests. It takes a lot of skill, experience, and hygiene to write great, maintainable tests. That's a topic for another article.
100% coverage only matters for languages without a type checker
No. Sure, a type checker might catch some classes of bugs that would require a test in a language without type checking, but you still need to test correctness (among others).
You're creating blind spots
Unless you're reviewing your test coverage report after every single commit, leaving explicit markers in the code and keeping the metric at 100% is actually a much safer practice.
When working in code that has been excluded, you'll immediately see the no cover
marker, perhaps with a comment explaining why the code is excluded. This lets you reconsider the coverage decision.
Any regression in coverage will break the tests.
You should not exclude code from coverage because of setup cost
This article is not about end-to-end vs. unit testing. I have provided some examples of code that I sometimes exclude from testing, but your mileage may vary.
Changelog
- 09/07/2019: added answers to common objections
- 09/06/2019: added the summary, clarified that not testing complicated code should only apply to extreme cases, and a few other points.