Gemello Engineering AB - Compute This!

In times like these you might have wondered how to test a person for a specific disease and how accurate such a test would, or should, be. This brings us to the topic of this blog post:

A patient goes to see a doctor. The doctor performs a test with 99 percent reliability, that is, 99 percent of people who are sick test positive and 99 percent of the healthy people test negative. The doctor knows that only 1 percent of the people in the country are sick. Now the question is: if the patient tests positive, what are the chances the patient is sick?

Note that these numbers have nothing to do with the ongoing Covid-19 situation; they are just some numbers to work with.

Well... when I see the numbers presented above I would say (without thinking too much) that there is a pretty high probability that a person who tests positive actually has the disease... I mean, 99% of the people that have the disease test positive... Perhaps this is your feeling as well. Let’s try to do some analysis.

First of all, let's think about what we know and what we would like to know. It is stated in the text above that the test is 99% reliable i.e. 99% of the persons that are sick are tested positive (this is often called the Sensitivity of the test) and 99% of the persons that are healthy will be tested negative (called Specificity of the test). Let's try to express this in a sort of "mathy" way. We would then say something like:

The probability that the test result is positive given that the person is sick is 99%

and likewise:

The probability that the test result is negative given that the person is healthy is 99%

These are examples of what is called conditional probability, i.e. the probability that some stuff will happen given that some other things have happened. Now, let's increase the geekiness level even more and introduce some fancy notation:

P(P|S) = 0.99
P(N|H) = 0.99

Here we have used S and H to indicate if the person is Sick or Healthy and we have used P and N to indicate if the test is Positive or Negative. Moreover, the pipe operator, |, represents the conditional probability. Let's just look at the expression P(P|S). The expression P() is read as "the probability of" and the thing that is inside the parentheses, P|S, represents "The test is positive given that the person is sick". Nice! So far we have managed to express what we know as some conditional probabilities and we have introduced some fancy notation for that as well. Now... Let's get to the important stuff: what is it actually that we would like to know? If we read the last sentence in the problem formulation above very carefully and try to connect this to the notation about conditional probability we just introduced, we would get something like:

What is the probability that the person is sick given that the test result is positive?

Or, in a more mathy way of expressing it:

P(S|P) = ?

Alright! We have now defined what we would like to know! The question is now, how do we calculate it? Before we start with that, let's just reflect on what is stated above. The two quantities P(P|S) and P(S|P) are deceptively similar but unfortunately they are (in general) not the same. Now, if we think of probabilities as ratios it seems reasonable to calculate the quantity P(S|P) as:

P(S|P) = (number of sick persons with a positive test) / (total number of persons with a positive test)

Then, how to calculate this... Often it is a good thing to draw a picture of what is going on, so let's try to do that. To do so, imagine a large number of persons, say 1 000 000. The first thing that happens is that a person can be either Sick or Healthy. According to the text above, 1% of the population has the disease and 99% are healthy. Let's draw that: of the 1 000 000 persons, 10 000 are sick and 990 000 are healthy.

Then, each person is tested and the test can be either positive or negative. For persons that are sick there is a 99% probability that the test will be positive and, accordingly, there is a 1% probability that the test will (incorrectly) be negative. Moreover, for the persons that are healthy there is a 99% probability that the test is negative and, likewise, there is a 1% probability that the test is (incorrectly) positive. Let's add that to the drawing.

Nice! We have now used the information in the text to calculate the number of persons that are in each of the four categories:

Sick person with a Positive test result
Sick person with a Negative test result
Healthy person with a Negative test result
Healthy person with a Positive test result
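
The four counts can also be worked out in a few lines of Python. This is only a sketch of the tree drawn above; the function name `count_categories` and its parameter names are made up for this illustration, and the inputs are the numbers from the problem statement.

```python
def count_categories(population=1_000_000, prevalence_pct=1,
                     sensitivity_pct=99, specificity_pct=99):
    """Split a population into the four sick/healthy x positive/negative groups."""
    # First layer of the tree: sick or healthy.
    sick = population * prevalence_pct // 100
    healthy = population - sick

    # Second layer of the tree: positive or negative test.
    sick_positive = sick * sensitivity_pct // 100
    sick_negative = sick - sick_positive
    healthy_negative = healthy * specificity_pct // 100
    healthy_positive = healthy - healthy_negative

    return {
        "sick_positive": sick_positive,
        "sick_negative": sick_negative,
        "healthy_negative": healthy_negative,
        "healthy_positive": healthy_positive,
    }


for name, count in count_categories().items():
    print(f"{name}: {count}")
# sick_positive: 9900
# sick_negative: 100
# healthy_negative: 980100
# healthy_positive: 9900
```

Integer arithmetic is used on purpose so that the counts come out as whole persons without any rounding noise.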

Now, this is where the magic happens! In the picture above, mark the categories that have gotten a positive test result.

The sum of these two categories is the total number of persons that have gotten a positive test result. Moreover, the number of persons that correctly have gotten a positive test result is marked in green (Sick persons with a Positive test), so we can now calculate the ratio of correct positive tests compared to the total number of positive tests.

What?? Only 50% probability that a person that got a positive test result actually is sick!! Hmmm, 50% is not as accurate as I would have expected.
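
As a quick check of that ratio, here is a minimal Python sketch. The two counts follow from the numbers in the text (99% of the 10 000 sick persons and 1% of the 990 000 healthy persons test positive); the function name `prob_sick_given_positive` is just a label chosen for this example.

```python
from fractions import Fraction


def prob_sick_given_positive(sick_positive, healthy_positive):
    """Share of all positive tests that belong to persons who really are sick."""
    return Fraction(sick_positive, sick_positive + healthy_positive)


sick_positive = 9_900      # 99% of the 10 000 sick persons
healthy_positive = 9_900   # 1% of the 990 000 healthy persons

p = prob_sick_given_positive(sick_positive, healthy_positive)
print(p, float(p))  # 1/2 0.5
```

The two groups happen to be exactly the same size, which is why the ratio lands on one half.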

Test again

What if we had developed another test method that had the exact same performance as the test described above, but relied on some other measurement? If we added this test to the first one we could ask:

If the patient is tested with two different test methods and both tests are positive, what are now the chances the patient is sick?

To do this we just add another layer to the picture above: every branch that ended in a positive or negative first test now splits once more into a positive or negative second test.

As you can see, the picture is getting more and more busy as more and more branches appear. Now, just as above, calculate the ratio of persons with two correctly positive tests divided by the total number of persons that got two positive tests.

OK! There is a 99% probability that a person whose two different tests are both positive actually is sick!
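
The two-test numbers can be sketched in Python as well. An assumption that the text only hints at ("relied on some other measurement") is made explicit here: for a given person, the outcome of the second test is taken to be independent of the outcome of the first. The function name `two_test_posterior` is invented for this illustration.

```python
from fractions import Fraction


def two_test_posterior(population=1_000_000,
                       prevalence=Fraction(1, 100),
                       sensitivity=Fraction(99, 100),
                       specificity=Fraction(99, 100)):
    """Counts of persons with two positive tests, assuming independent tests."""
    sick = population * prevalence
    healthy = population - sick

    # Sick persons who test positive twice (both tests are correct).
    sick_two_positive = sick * sensitivity * sensitivity
    # Healthy persons who test positive twice (both tests are wrong).
    healthy_two_positive = healthy * (1 - specificity) * (1 - specificity)

    ratio = sick_two_positive / (sick_two_positive + healthy_two_positive)
    return sick_two_positive, healthy_two_positive, ratio


sick_pp, healthy_pp, p = two_test_posterior()
print(sick_pp, healthy_pp, float(p))  # 9801 99 0.99
```

If the two tests tended to fail on the same persons, the second test would add less information and the result would be lower than 99%.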

Time to conclude

Well, the answer to the original question was not according to (my) intuition. The lesson might be that you must think quite hard about what it is that you would like to know. I mean, P(S|P) and P(P|S) seem quite similar (they both contain the same letters…) but as shown above they are not. What is hidden beneath the calculations above is called Bayes’s theorem. We have not mentioned it in this post but the plan is to write several blog posts in the future related to Bayesian statistics and Bayes’s theorem. So, we will come back to this post later this year.
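
As a small preview, and without any derivation, the counting done in the pictures corresponds to the following formula, written with the same S, H, P and N notation as above. The value P(P|H) = 0.01 is not stated directly in the problem; it follows from P(N|H) = 0.99.

```latex
\[
P(S \mid P)
  = \frac{P(P \mid S)\,P(S)}{P(P \mid S)\,P(S) + P(P \mid H)\,P(H)}
  = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99}
  = 0.5
\]
```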

The topic of this blog post actually came from one of our readers, thanks for that! If there is some interesting topic you would like us to write about, just send us an email, info@gemello.se, and we will see what we can do.
