Let's take a look at a sort of classic example in probability theory. When you went to school you might have wondered:
What is the probability that at least two persons have the same birthday in a school class?
So the task for this blog post can be: why don't we create a plot of the probability that at least two persons have the same birthday as a function of the size of the class? You can of course solve this problem using "pen and paper" (check out Wikipedia), but let's use a computational approach where we use simulations to estimate the probabilities.
Such a simulation can contain the following steps:
- Assuming that every day of the year is equally likely for a child to be born, we generate N whole numbers from a uniform distribution between 0 and 364. Each number represents one child's birthday.
- Check how many unique birthdays there are among the N birthdays
- If the number of unique birthdays is less than N, at least two persons have the same birthday
Let's write that in code! (first some imports)
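As a minimal sketch of the three steps, assuming only Python's standard `random` module: the function name `simulate_class` and its `rng` parameter are made up for this illustration, while the variable `same_birthday` follows the text.

```python
import random

DAYS = range(365)  # day numbers 0..364, each assumed equally likely


def simulate_class(n, rng=random):
    """Return 1 if at least two of n children share a birthday, else 0."""
    # Step 1: draw one birthday per child, uniformly from 0..364.
    birthdays = rng.choices(DAYS, k=n)
    # Step 2: count how many distinct birthdays there are.
    n_unique = len(set(birthdays))
    # Step 3: fewer unique birthdays than children means a shared birthday.
    same_birthday = 1 if n_unique < n else 0
    return same_birthday


same_birthday = simulate_class(20)
print(same_birthday)
```

Running it several times for a class of 20 kids should give a mix of 1s and 0s.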
In the box above, if the variable same_birthday gets the value 1, the class had at least two persons with the same birthday. If the variable gets the value 0, all persons in the class have unique birthdays. If I run this code a couple of times, I get some classes where persons share a birthday and some classes where all birthdays are unique.
So... how do we get from here to estimating the probability of at least two persons having the same birthday?
The probability can be seen as the long-run frequency of an event occurring. In this case: if we look at a huge number of classes of N persons each, what proportion of them has at least two persons with the same birthday? Well, that is quite easy! We can take the code above and just put a loop on top of it. It would look something like this for a class size of 20 kids (here a huge number of classes is 100 000):
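Here is one way the loop could look, again as a sketch using only the standard library. The function name `estimate_probability` and the `seed` parameter are assumptions added for this example.

```python
import random

DAYS = range(365)  # day numbers 0..364, each assumed equally likely


def estimate_probability(n, n_classes=100_000, seed=None):
    """Estimate P(at least two of n children share a birthday)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_classes):
        birthdays = rng.choices(DAYS, k=n)
        # Count this class if some birthday occurs more than once.
        if len(set(birthdays)) < n:
            hits += 1
    # Proportion of simulated classes with a shared birthday.
    return hits / n_classes


print(estimate_probability(20))
```

The estimate wobbles a little from run to run; passing a `seed` makes a run repeatable.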
OK, a 41% probability for a class of 20 persons. Now let's try this for a number of different class sizes, say between 5 and 60 and make a nice looking plot of this.
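The sweep over class sizes could be sketched like this. The helper `exact_probability` is an addition for this example: it is the standard "pen and paper" product formula, included only as a sanity check on the simulation. The number of classes per size is kept small (2 000, an arbitrary choice) so the sketch runs quickly, and the resulting lists can be handed to any plotting library.

```python
import random

DAYS = range(365)  # day numbers 0..364, each assumed equally likely


def estimate_probability(n, n_classes, rng):
    """Simulated share of classes of size n with a shared birthday."""
    hits = sum(len(set(rng.choices(DAYS, k=n))) < n for _ in range(n_classes))
    return hits / n_classes


def exact_probability(n):
    """1 minus the probability that all n birthdays are different."""
    p_unique = 1.0
    for k in range(n):
        p_unique *= (365 - k) / 365
    return 1.0 - p_unique


rng = random.Random(0)
class_sizes = list(range(5, 61))
estimates = [estimate_probability(n, 2_000, rng) for n in class_sizes]

# Print every fifth class size next to the exact value for comparison.
for n, p in zip(class_sizes, estimates):
    if n % 5 == 0:
        print(f"{n:2d} kids: simulated {p:.3f}, exact {exact_probability(n):.3f}")
```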
And here is the plot
Nice! So, for a class of 23 persons there is a probability of about 50% that at least two persons have the same birthday. Moreover, if the size of the school class is above 55 persons, it is almost certain that at least two persons will have the same birthday. Well, not too bad. We get the answer to an interesting problem in just a few lines of code, awesome!
The birthday paradox - Gemello style
Now let's put a personal touch on this problem. When I grew up I always had at least two persons with the same birthday in my school class, for two reasons:
- I have a twin brother (with the same birthday as myself)
- Growing up in the "countryside" there was just one school and one class per age group, so my brother and I had to go to the same class
Why don't we try to incorporate this into the estimate of the probability of at least two persons having the same birthday in a school class?
Let's take a look at some data for multiple births in Sweden during the last 50 years. I have downloaded data from Socialstyrelsen's homepage and plotted it in the graph below.
Back in the days when I was born, approximately 1.0% of births were multiple births. This share then increased, reached a peak somewhere around the year 2000, and decreased slightly from there on. For the analysis, let's take a value from around 2010, when the probability of a multiple birth was 1.4%. Let's also assume that every multiple birth is a twin birth (just to make things a bit easier...)
First of all we can write some code that calculates the number of twin births among N births
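One possible sketch: treat each of the N births as an independent coin flip that turns out to be a twin birth with probability 1.4%. The independence assumption, the constant `P_TWIN` and the function name `count_twin_births` are all choices made for this example.

```python
import random

P_TWIN = 0.014  # probability that a birth is a twin birth (value from the text)


def count_twin_births(n_births, p_twin=P_TWIN, rng=random):
    """Return how many of n_births are twin births."""
    # rng.random() is uniform on [0, 1), so each birth is a twin birth
    # with probability p_twin, independently of the others.
    return sum(rng.random() < p_twin for _ in range(n_births))


print(count_twin_births(20))
```

For 20 births the result is 0 most of the time, since the expected number of twin births is only 20 × 0.014 = 0.28.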
If we assume that this is the "countryside", just like where I grew up, then a twin birth means there will be at least two persons in the class with the same birthday. So let's try to mash these two analyses together.
First of all, let's make some small adjustments to the strategy. Instead of directly investigating the size of the class, we look at a fixed number of births and adjust the class size if there are one or more twin births (when we do not "allow" for twin births, these two quantities are the same).
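A sketch of the adjusted strategy could look like this. The vector names `Same_birthday_twin` and `Class_size_twin` are the ones used in the text; the function `simulate_births`, the upper limit of 60 births and the 1 000 repetitions per number of births are assumptions made to keep the example short and fast.

```python
import random

DAYS = range(365)  # day numbers 0..364, each assumed equally likely
P_TWIN = 0.014     # probability that a birth is a twin birth


def simulate_births(n_births, rng, p_twin=P_TWIN):
    """Return (same_birthday, class_size) for one class built from n_births."""
    n_twin_births = sum(rng.random() < p_twin for _ in range(n_births))
    # Every twin birth adds one extra child to the class.
    class_size = n_births + n_twin_births
    # One birthday per birth; twins share theirs by definition.
    birthdays = rng.choices(DAYS, k=n_births)
    shared = n_twin_births > 0 or len(set(birthdays)) < n_births
    return (1 if shared else 0), class_size


rng = random.Random(0)
Same_birthday_twin = []
Class_size_twin = []
for n_births in range(4, 61):   # start at 4 so a class of 5 can contain twins
    for _ in range(1_000):      # repetitions per number of births
        same_birthday, class_size = simulate_births(n_births, rng)
        Same_birthday_twin.append(same_birthday)
        Class_size_twin.append(class_size)

print(len(Same_birthday_twin), min(Class_size_twin), max(Class_size_twin))
```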
Alright, now we have two vectors. The vector Same_birthday_twin contains either 1 or 0, where a 1 means that there are at least two persons in the class with the same birthday. The vector Class_size_twin contains the size of the class. Note that the loop starts at 4 births... this is because we are interested in class sizes of 5 to 60 kids, and if the loop started at 5 births there could not be a class of 5 kids that includes twins.
Cool. Last thing to do, plot the results together with the previous plot for comparison.
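Before plotting, the two vectors have to be turned into one probability per class size. A sketch of that grouping step follows; the function name `probability_by_class_size` and the tiny hand-made input are only for illustration, and the plotting itself is left to whatever library you prefer.

```python
from collections import defaultdict


def probability_by_class_size(same_birthday, class_size, lo=5, hi=60):
    """Share of classes with a shared birthday, per class size in [lo, hi]."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for flag, size in zip(same_birthday, class_size):
        if lo <= size <= hi:
            hits[size] += flag   # flag is 1 for a shared birthday, else 0
            totals[size] += 1
    return {size: hits[size] / totals[size] for size in sorted(totals)}


# Tiny hand-made input, only to show the shape of the result.
demo_flags = [1, 0, 0, 1, 1, 0]
demo_sizes = [5, 5, 5, 6, 6, 4]
print(probability_by_class_size(demo_flags, demo_sizes))
```

Class sizes outside 5 to 60 are dropped, so classes that came from 4 births without twins do not show up in the plot.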
And finally, the plot
And we are done! When we included the possibility of multiple births and assumed "country side conditions" there is a 50% chance of at least two persons having the same birthday "already" at a class size of 19.
Time to wrap up
Once again we have solved a quite complicated problem with just a few lines of code, awesome! Just as a final remark, when I went to school there were just 13 kids in the class (I told you it was the countryside), so that would be a probability of just above 30% that two kids or more would have the same birthday (for a probability of multiple births of 1.4%).
By the way, me and my twin brother are identical twins... what is the probability of that...
That is all for now and, as always, do not hesitate to drop us a message at info@gemello.se