The medical test paradox, and redesigning Bayes' rule

By 3Blue1Brown

Summary

Key takeaways

  • Accurate tests can be misleading: A highly accurate medical test doesn't guarantee a correct diagnosis; the probability of a positive result indicating the actual disease can be surprisingly low due to factors like prevalence. [00:07]
  • Test accuracy has two parts: Test accuracy is defined by two separate numbers: sensitivity (correctly identifying those with the disease) and specificity (correctly identifying those without the disease), not a single overall correctness percentage. [02:44]
  • Doctors misinterpret test results: Even medical professionals can incorrectly estimate the predictive value of a positive test, often overestimating it due to a misunderstanding of how prevalence impacts the outcome. [04:02]
  • Bayes factor simplifies updating: The Bayes factor, calculated as sensitivity divided by the false positive rate, provides a multiplier to update prior odds, offering a more intuitive and less calculation-intensive way to understand test results. [08:37]
  • Odds vs. probability: Reframing probabilities into odds (ratio of positive to negative cases) allows Bayes' rule to be a simple multiplication by the Bayes factor, making Bayesian updating more straightforward. [11:01]
  • Prior context is crucial: The same test result can mean vastly different things depending on the patient's prior probability of having the disease; the Bayes factor quantifies how much the prior odds are updated, not the final probability itself. [14:48]

Topics Covered

  • Accurate tests can be misleading without context.
  • Doctors misinterpret test accuracy due to abstract numbers.
  • Bayesian updating reframes tests as updating chances, not giving answers.
  • Odds, not probabilities, simplify Bayesian updating.
  • The Bayes factor separates test accuracy from prior conditions.

Full Transcript

Some of you may have heard this paradoxical fact about medical tests.

It's very commonly used to introduce the topic of Bayes' rule in probability.

The paradox is that you could take a test which is highly accurate,

in the sense that it gives correct results to a large majority of the people taking it.

And yet, under the right circumstances, when assessing the

probability that your particular test result is correct,

you can still land on a very low number, arbitrarily low, in fact.

In short, an accurate test is not necessarily a very predictive test.

Now when people think about math and formulas,

they don't often think of it as a design process.

I mean, maybe in the case of notation it's easy to see that different choices

are possible, but when it comes to the structure of the formulas themselves

and how we use them, that's something that people typically view as fixed.

In this video, you and I will dig into this paradox,

but instead of using it to talk about the usual version of Bayes' rule,

I'd like to motivate an alternate version, an alternate design choice.

Now, what's up on the screen now is a little bit abstract,

which makes it difficult to justify that there really is a substantive difference here,

especially when I haven't explained either one yet.

To see what I'm talking about though, we should really start by spending some

time a little more concretely, and just laying out what exactly this paradox is.

Picture a thousand women and suppose that 1% of them have breast cancer.

And let's say they all undergo a certain breast cancer screening,

and that 9 of those with cancer correctly get positive results,

and there's one false negative.

And then suppose that among the remainder without cancer,

89 get false positives, and 901 correctly get negative results.

So if all you know about a woman is that she does the screening and she gets a positive

result, you don't have information about symptoms or anything like that,

you know that she's either one of these 9 true positives or one of these 89 false

positives.

So the probability that she's in the cancer group given the test

result is 9 divided by 9 plus 89, which is approximately 1 in 11.

In medical parlance, you would call this the positive predictive value of the test,

or PPV, the number of true positives divided by the total number of positive test results.

You can see where the name comes from.

To what extent does a positive test result actually predict that you have the disease?
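
As a quick sanity check, here is that arithmetic as a minimal Python sketch, using the counts from the thousand-woman example above (the variable names are just for illustration):

```python
# Sample population from the example: 1,000 women, 1% prevalence
true_positives = 9     # of the 10 women with cancer, 9 test positive
false_positives = 89   # of the 990 women without cancer, 89 test positive

ppv = true_positives / (true_positives + false_positives)
print(ppv)  # ≈ 0.092, roughly 1 in 11
```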

Now, hopefully, as I've presented it this way where we're thinking

concretely about a sample population, all of this makes perfect sense.

But where it comes across as counterintuitive is if you just look

at the accuracy of the test, present it to people as a statistic,

and then ask them to make judgments about their test result.

Test accuracy is not actually one number, but two.

First, you ask how often is the test correct on those with the disease.

This is known as the test sensitivity, as in how

sensitive is it to detecting the presence of the disease.

In our example, test sensitivity is 9 in 10, or 90%.

And another way to say the same fact would be to say the false negative rate is 10%.

And then a separate, not necessarily related number is how often it's correct for those

without the disease, which is known as the test specificity,

as in are positive results caused specifically by the disease,

or are there confounding triggers giving false positives.

In our example, the specificity is about 91%.

Or another way to say the same fact would be to say the false positive rate is 9%.
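
Putting those definitions into symbols, using the counts from the sample population above (TP, FN, TN, FP for true/false positives and negatives):

$$
\text{sensitivity} = \frac{TP}{TP + FN} = \frac{9}{10} = 90\%,
\qquad
\text{specificity} = \frac{TN}{TN + FP} = \frac{901}{990} \approx 91\%
$$

$$
\text{false negative rate} = 1 - \text{sensitivity} = 10\%,
\qquad
\text{false positive rate} = 1 - \text{specificity} \approx 9\%
$$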

So the paradox here is that in one sense, the test is over 90% accurate.

It gives correct results to over 90% of the patients who take it.

And yet, if you learn that someone gets a positive result without any added information,

there's actually only a 1 in 11 chance that that particular result is accurate.

This is a bit of a problem, because of all of the places for math to be counterintuitive,

medical tests are one area where it matters a lot.

In 2006 and 2007, the psychologist Gerd Gigerenzer gave a series of statistics

seminars to practicing gynecologists, and he opened with the following example.

A 50-year-old woman, no symptoms, participates in a routine mammography screening.

She tests positive, is alarmed, and wants to know from you

whether she has breast cancer for certain or what her chances are.

Apart from the screening result, you know nothing else about this woman.

In that seminar, the doctors were then told that the prevalence of

breast cancer for women of this age is about 1%,

and then to suppose that the test sensitivity is 90% and that its specificity is 91%.

You might notice these are exactly the same numbers

from the example that you and I just looked at.

This is where I got them.

So, having already thought it through, you and I know the answer.

It's about 1 in 11.

However, the doctors in this session were not primed with the suggestion to

picture a concrete sample of a thousand individuals, the way that you and I had.

All they saw were these numbers.

They were then asked, how many women who test positive actually have breast cancer?

What is the best answer?

And they were presented with these four choices.

In one of the sessions, over half the doctors present

said that the correct answer was 9 in 10, which is way off.

Only a fifth of them gave the correct answer, which is worse

than what it would have been if everybody had randomly guessed.

It might seem a little extreme to be calling this a paradox.

I mean, it's just a fact.

It's not something intrinsically self-contradictory.

But, as these seminars with Gigerenzer show, people, including doctors,

definitely find it counterintuitive that a test with high accuracy can give you such a

low predictive value.

We might call this a veridical paradox, which refers to facts that are provably true,

but which nevertheless can feel false when phrased a certain way.

It's sort of the softest form of a paradox, saying

more about human psychology than about logic.

The question is how we can combat this.

Where we're going with this, by the way, is that I want you to be able

to look at numbers like this and quickly estimate in your head that it

means the predictive value of a positive test should be around 1 in 11.

Or, if I changed things and asked, what if it

was 10% of the population who had breast cancer?

You should be able to quickly turn around and say

that the final answer would be a little over 50%.

Or, if I said imagine a really low prevalence,

something like 0.1% of patients having cancer,

you should again quickly estimate that the predictive value of the test is around 1 in

100, that 1 in 100 of those with positive test results in that case would have cancer.

Or, let's say we go back to the 1% prevalence, but I make the test more accurate.

I tell you to imagine the specificity is 99%.

There, you should be able to relatively quickly

estimate that the answer is a little less than 50%.

The hope is that you're doing all of this with minimal calculations in your head.

Now, the goals of quick calculations might feel very different from the goals of

addressing whatever misconception underlies this paradox,

but they actually go hand in hand.

Let me show you what I mean.

On the side of addressing misconceptions, what would you

tell the people in that seminar who answered 9 in 10?

What fundamental misconception are they revealing?

What I might tell them is that in much the same way that you shouldn't think

of tests as telling you deterministically whether you have a disease,

you shouldn't even think of them as telling you your chances of having a disease.

Instead, the healthy view of what tests do is that they update your chances.

In our example, before taking the test, a patient's

chances of having cancer were 1 in 100.

In Bayesian terms, we call this the prior probability.

The effect of this test was to update that prior by almost an order of magnitude,

up to around 1 in 11.

The accuracy of a test is telling us about the strength of this updating.

It's not telling us a final answer.

What does this have to do with quick approximations?

Well, a key number for those approximations is something called the Bayes factor,

and the very act of defining this number serves to reinforce this

central lesson about reframing what it is the tests do.

You see, one of the things that makes test statistics so very confusing

is that there are at least 4 numbers that you'll hear associated with them.

For those with the disease, there's the sensitivity and the false negative rate,

and then for those without, there's the specificity and the false positive rate,

and none of these numbers actually tell you the thing you want to know.

Luckily, if you want to interpret a positive test result,

you can pull out just one number to focus on from all this.

Take the sensitivity divided by the false positive rate.

In other words, how much more likely are you to see

the positive test result with cancer versus without?

In our example, this number is 10.

This is the Bayes factor, also sometimes called the likelihood ratio.

A very handy rule of thumb is that to update a small prior,

or at least to approximate the answer, you simply multiply it by the Bayes factor.

So in our example, where the prior was 1 in 100,

you would estimate that the final answer should be around 1 in 10,

which is in fact slightly above the true correct answer.

So based on this rule of thumb, if I asked you what would happen if the

prior from our example was instead 1 in 1000, you could quickly estimate

that the effect of the test should be to update those chances to around 1 in 100.

And in fact, take a moment to check yourself by thinking through a sample population.

In this case, you might picture 10,000 patients where only 10 of them really have cancer.

And then based on that 90% sensitivity, we would

expect 9 of those cancer cases to give true positives.

And on the other side, a 91% specificity means that

9% of those without cancer are getting false positives.

So we'd expect 9% of the remaining patients, which is around 900,

to give false positive results.

Here, with such a low prevalence, the false positives

really do dominate the true positives.

So the probability that a randomly chosen positive case from this population

actually has cancer is only around 1%, just like the rule of thumb predicted.
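
If you want to replay that population-counting exercise for any prior, here is a small Python sketch (the function name and default population size are mine, not from the video):

```python
def ppv(prevalence, sensitivity, specificity, population=10_000):
    """Positive predictive value via the sample-population picture."""
    with_disease = population * prevalence
    without_disease = population - with_disease
    true_positives = with_disease * sensitivity
    false_positives = without_disease * (1 - specificity)
    return true_positives / (true_positives + false_positives)

print(ppv(0.001, 0.90, 0.91))  # ≈ 0.0098, about 1 in 100
print(ppv(0.01, 0.90, 0.91))   # ≈ 0.092, about 1 in 11

# Rule of thumb for small priors: prior times the Bayes factor
print(0.001 * (0.90 / 0.09))   # ≈ 0.01, close to the exact 1-in-100 answer
```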

Now, this rule of thumb clearly cannot work for higher priors.

For example, it would predict that a prior of

10% gets updated all the way to 100% certainty.

But that can't be right.

In fact, take a moment to think through what the answer should be,

again, using a sample population.

Maybe this time we picture 10 out of 100 having cancer.

Again, based on the 90% sensitivity of the test,

we'd expect 9 of those true cancer cases to get positive results.

But what about the false positives?

How many do we expect there?

About 9% of the remaining 90.

About 8.

So, upon seeing a positive test result, it tells you that you're

either one of these 9 true positives or one of the 8 false positives.

So this means the chances are a little over 50%, roughly 9 out of 17, or 53%.

At this point, having dared to dream that Bayesian updating could look

as simple as multiplication, you might tear down your hopes and pragmatically

acknowledge that sometimes life is just more complicated than that.

Except it's not.

This rule of thumb turns into a precise mathematical fact as long as we

shift away from talking about probabilities to instead talking about odds.

If you've ever heard someone talk about the chances of an event being 1 to 1 or 2 to 1,

things like that, you already know about odds.

With probability, we're taking the ratio of the number

of positive cases out of all possible cases, right?

Things like 1 in 5 or 1 in 10.

With odds, what you do is take the ratio of all positive cases to all negative cases.

You commonly see odds written with a colon to emphasize the distinction,

but it's still just a fraction, just a number.

So an event with a 50% probability would be described as having 1 to 1 odds.

A 10% probability is the same as 1 to 9 odds.

An 80% probability is the same as 4 to 1 odds.

You get the point.

It's the same information.

It still describes the chances of a random event,

but it's presented a little differently, like a different unit system.

Probabilities are constrained between 0 and 1, with even chances sitting at 0.5.

But odds range from 0 up to infinity, with even chances sitting at the number 1.
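
The conversion between the two "unit systems" is simple enough to write down directly; a minimal sketch (the helper names are mine):

```python
def prob_to_odds(p):
    """Probability of an event -> odds of positive to negative cases."""
    return p / (1 - p)

def odds_to_prob(o):
    """Odds of positive to negative cases -> probability."""
    return o / (1 + o)

print(prob_to_odds(0.5))   # 1.0    (1 to 1 odds)
print(prob_to_odds(0.1))   # ≈ 0.111 (1 to 9 odds)
print(prob_to_odds(0.8))   # ≈ 4.0   (4 to 1 odds)
print(odds_to_prob(1.0))   # 0.5
```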

The beauty here is that a completely accurate way to frame Bayes' rule,

with no approximation at all, is to say:

express your prior using odds, and then just multiply by the Bayes' factor.

Think about what the prior odds are really saying.

It's the number of people with cancer divided by the number without it.

Here, let's just write that down as a normal fraction for a moment so we can multiply it.

When you filter down just to those with positive test results,

the number of people with cancer gets scaled down,

scaled down by the probability of seeing a positive test result given

that someone has cancer.

And then similarly, the number of people without cancer also gets scaled down,

this time by the probability of seeing a positive test result in that case, without cancer.

So the ratio between these two counts, the new odds upon seeing the test,

looks just like the prior odds except multiplied by this term here,

which is exactly the Bayes' factor.
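
Written out symbolically (my notation: $D$ for having cancer, $+$ for a positive result), that filtering argument reads:

$$
\underbrace{\frac{P(D \mid +)}{P(\neg D \mid +)}}_{\text{posterior odds}}
= \frac{P(D)\,P(+ \mid D)}{P(\neg D)\,P(+ \mid \neg D)}
= \underbrace{\frac{P(D)}{P(\neg D)}}_{\text{prior odds}}
\times
\underbrace{\frac{P(+ \mid D)}{P(+ \mid \neg D)}}_{\text{Bayes factor}}
$$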

Look back at our example, where the Bayes' factor was 10.

And as a reminder, this came from the 90% sensitivity

divided by the 9% false positive rate.

How much more likely are you to see a positive result with cancer versus without?

If the prior is 1%, expressed as odds, this looks like 1 to 99.

So by our rule, this gets updated to 10 to 99,

which if you want you could convert back to a probability.

It would be 10 divided by 10 plus 99, or about 1 in 11.

If instead, the prior was 10%, which was the example that tripped up

our rule of thumb earlier, expressed as odds, this looks like 1 to 9.

By our simple rule, this gets updated to 10 to 9,

which you can already read off pretty intuitively.

It's a little above even chances, a little above 1 to 1.

If you prefer, you can convert it back to a probability.

You would write it as 10 out of 19, or about 53%.

And indeed, that is what we already found by thinking

things through with a sample population.
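
Both of those updates take one line each; here is a quick Python sketch of the same arithmetic:

```python
bayes_factor = 0.90 / 0.09                     # sensitivity / false positive rate = 10

# 1% prior, i.e. odds of 1 to 99
posterior_odds = (1 / 99) * bayes_factor       # 10 to 99
print(posterior_odds / (1 + posterior_odds))   # ≈ 0.092, about 1 in 11

# 10% prior, i.e. odds of 1 to 9
posterior_odds = (1 / 9) * bayes_factor        # 10 to 9
print(posterior_odds / (1 + posterior_odds))   # ≈ 0.526, about 53%
```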

Let's say we go back to the 1% prevalence, but I make the test more accurate.

Now what if I told you to imagine that the false positive rate was only 1% instead of 9%?

What that would mean is that our Bayes factor is 90 instead of 10.

The test is doing more work for us.

In this case, with the more accurate test, it gets updated to 90 to 99,

which is a little less than even chances, something a little under 50%.

To be more precise, you could make the conversion

back to probability and work out that it's around 48%.

But honestly, if you're just going for a gut feel, it's fine to stick with the odds.

Do you see what I mean about how just defining this

number helps to combat potential misconceptions?

For anybody who's a little hasty in connecting test accuracy directly to your probability

of having a disease, it's worth emphasizing that you could administer the same test with

the same accuracy to multiple different patients who all get the same exact result,

but if they're coming from different contexts,

that result can mean wildly different things.

However, the one thing that does stay constant in every case

is the factor by which each patient's prior odds get updated.

And by the way, this whole time we've been using the prevalence of the disease,

which is the proportion of people in a population who have it,

as a substitute for the prior, the probability of having it before you see a test.

However, that's not necessarily the case.

If there are other known factors, things like symptoms,

or in the case of a contagious disease, things like known contacts,

those also factor into the prior, and they could potentially make a huge difference.

As another side note, so far we've only talked about positive test results,

but way more often you would be seeing a negative test result.

The logic there is completely the same, but the Bayes

factor that you compute is going to look different.

Instead, you look at the probability of seeing this negative

test result with the disease versus without the disease.

So in our cancer example, this would have been the 10% false

negative rate divided by the 91% specificity, or about 1 in 9.

In other words, seeing a negative test result in that example

would reduce your prior odds by about an order of magnitude.
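
A quick sketch of that negative-result update in Python, reusing the numbers from the cancer example (the variable names are mine):

```python
sensitivity = 0.90
specificity = 0.91

# Bayes factor for a NEGATIVE result:
# P(negative | disease) / P(negative | no disease)
bf_negative = (1 - sensitivity) / specificity   # ≈ 0.11, about 1 in 9

prior_odds = 1 / 99                             # 1% prevalence, as odds
posterior_odds = prior_odds * bf_negative       # ≈ 0.0011, about 1 to 900
print(posterior_odds / (1 + posterior_odds))    # ≈ 0.0011, roughly 0.1%
```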

When you write it all out as a formula, here's how it looks.

It says your odds of having a disease given a test result equals your

odds before taking the test, the prior odds, times the Bayes factor.
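
In symbols, with $D$ for the disease and $E$ for the test result you observe:

$$
O(D \mid E) \;=\; O(D) \times \underbrace{\frac{P(E \mid D)}{P(E \mid \neg D)}}_{\text{Bayes factor}}
$$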

Now let's contrast this with the usual way Bayes' rule is written,

which is a bit more complicated.

In case you haven't seen it before, it's essentially just what we were

doing with sample populations, but you wrap it all up symbolically.

Remember how every time we were counting the number of true positives and

then dividing it by the sum of the true positives and the false positives?

We do just that, except instead of talking about absolute amounts,

we talk of each term as a proportion.

So the proportion of true positives in the population comes

from the prior probability of having the disease multiplied

by the probability of seeing a positive test result in that case.

Then we copy that term down again into the denominator,

and then the proportion of false positives comes from the prior

probability of not having the disease times the probability of a positive

test in that case.
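
For reference, the formula being described term by term here is the standard form of Bayes' rule for a positive test result (again with $D$ for the disease):

$$
P(D \mid +) \;=\; \frac{P(D)\,P(+ \mid D)}{P(D)\,P(+ \mid D) \;+\; P(\neg D)\,P(+ \mid \neg D)}
$$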

If you want, you could also write this down with words instead of symbols,

if terms like sensitivity and false positive rate are more comfortable.

And this is one of those formulas where once you say it out loud it seems like a bit

much, but it really is no different from what we were doing with sample populations.

If you wanted to make the whole thing look simpler,

you often see this entire denominator written just as the probability of seeing a

positive test result, overall.

While that does make for a really elegant little expression,

if you intend to use this for calculations, it's a little disingenuous,

because in practice, every single time you do this you need to break

down that denominator into two separate parts, breaking down the cases.

So taking this more honest representation of it,

let's compare our two versions of Bayes' rule.

And again, maybe it looks nicer if we use the words sensitivity and false positive rate.

If nothing else, it helps emphasize which parts of the

formula are coming from statistics about the test accuracy.

I mean, this actually emphasizes one thing I really like about the framing with

odds and a Bayes' factor, which is that it cleanly factors out the parts that

have to do with the prior and the parts that have to do with the test accuracy.

But over in the usual formula, all of those are very intermingled together.

And this has a very practical benefit.

It's really nice if you want to swap out different priors and easily see their effects.

This is what we were doing earlier.

But with the other formula, to do that, you have to recompute everything each time.

You can't leverage a precomputed Bayes' factor the same way.

The odds framing also makes things really nice if you want to do

multiple different Bayesian updates based on multiple pieces of evidence.

For example, let's say you took not one test, but two.

Or you wanted to think about how the presence of symptoms plays into it.

For each piece of new evidence you see, you always ask the question,

how much more likely would you be to see that with the disease versus without the disease?

Each answer to that question gives you a new Bayes' factor,

a new thing that you multiply by your odds.
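
A minimal sketch of that chained updating in Python; note that multiplying factors like this treats each piece of evidence as independent given disease status, and the function name and numbers here are purely illustrative:

```python
def update_odds(prior_odds, bayes_factors):
    """Multiply the prior odds by one Bayes factor per piece of evidence."""
    odds = prior_odds
    for bf in bayes_factors:
        odds *= bf
    return odds

# Hypothetical: 1% prevalence, then two positive tests, each with Bayes factor 10
posterior_odds = update_odds(1 / 99, [10, 10])   # 100 to 99
print(posterior_odds / (1 + posterior_odds))     # ≈ 0.50
```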

Beyond just making calculations easier, there's something I really like about

attaching a number to test accuracy that doesn't even look like a probability.

I mean, if you hear that a test has, for example,

a 9% false positive rate, that's just such a disastrously ambiguous phrase.

It's so easy to misinterpret it to mean there's a

9% chance that your positive test result is false.

But imagine if instead the number that we heard tacked on to test

results was that the Bayes' factor for a positive test result is, say, 10.

There's no room to confuse that for your probability of having a disease.

The entire framing of what a Bayes' factor is,

is that it's something that acts on a prior.

It forces your hand to acknowledge the prior as something that's separate entirely,

and highly necessary to drawing any conclusion.

All that said, the usual formula is definitely not without its merits.

If you view it not simply as something to plug numbers into,

but as an encapsulation of the sample population idea that we've been using throughout,

you could very easily argue that that's actually much better for your intuition.

After all, it's what we were routinely falling back on in order to check

ourselves that the Bayes' factor computation even made sense in the first place.

Like any design decision, there is no clear-cut objective best here.

But it's almost certainly the case that giving serious consideration

to that question will lead you to a better understanding of Bayes' rule.

Also, since we're on the topic of kind of paradoxical things,

a friend of mine, Matt Cook, recently wrote a book all about paradoxes.

I actually contributed a small chapter to it with thoughts

on the question of whether math is invented or discovered.

And the book as a whole is this really nice collection of thought-provoking

paradoxical things ranging from philosophy to math and physics.

You can, of course, find all the details in the description.
