The Monty Hall paradox

The "Monty Hall" problem is a statistical paradox that became famous after Marilyn vos Savant wrote about it in an American magazine in 1990. The problem goes as follows:

"Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1 [but the door is not opened], and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, 'Do you want to pick door No. 2?' Is it to your advantage to switch your choice?"

In other words: you are on a game show and must choose one of three doors. Behind one of the doors is a car (the prize); behind the other two are goats. If you choose, say, door number 1, the host opens another door, for instance number 3, and behind it is a goat. He now offers you the chance to pick a door again. Is it to your advantage to switch?

Intuitively, switching should not help. You already know that the chance of finding the car is 1/3. Once the host shows you a door with a goat, the car must be behind either the door you chose or the other unopened door. Should it then matter whether you switch? The paradoxical part (in the sense that it may surprise you) is that it always pays to switch. Why? There are several intuitive ways to explain this.

The explanation is surprisingly simple: if you switch doors, you lose if and only if you picked the car the first time. The chance of picking the car the first time is 1/3. Hence the chance of winning by switching is 1-\frac{1}{3}=\frac{2}{3}.

Or we can test the problem empirically (with a little logical reasoning). I programmed two Monty Hall functions in SAGE (Python code). One function switches doors; the other does not. Then we run the test. Here is the code for the function that switches doors:

def montyhall_change(choice): # switching doors
    if choice == 1:
        return False
    return True

A short explanation of the function: we may assume the car is behind door number 1. If you choose to switch and you picked door number 1, you lose, so the function returns False. Otherwise, Monty shows you a door with a goat behind it. There are only two doors with goats, so he must open the one you did not pick. If you then switch, the only remaining choice is the door with the car, and the function therefore returns True.

We also write a function for keeping the door. It is naturally somewhat simpler:

def montyhall_keep(choice):
    if choice == 1:
        return True
    return False

Finally, we simulate the game a large number of times and look at the numbers we get:

k = 10000
s_change = 0
s_keep = 0
for n in range(k):
    choice = ZZ.random_element(3) + 1
    if montyhall_change(choice):
        s_change += 1
    if montyhall_keep(choice):
        s_keep += 1

print "Number of trials: ", k
print "Successes switching: ", s_change, " Successes keeping: ", s_keep
print "Fraction of wins switching: ", s_change/float(k)
print "Fraction of wins keeping: ", s_keep/float(k)

The code above runs the game 10,000 (ten thousand) times. Every time we win the car, either s_change or s_keep is incremented by 1, depending on which strategy we used. The result is shown below:

Number of trials:  10000
Successes switching:  6721  Successes keeping:  3279
Fraction of wins switching:  0.6721
Fraction of wins keeping:  0.3279

This agrees well with the calculation we did.
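The functions above fix the car behind door number 1 and randomize only the contestant's pick. As a cross-check, here is a sketch in plain Python (not SAGE; the function names are my own) that also randomizes the car's position and plays out Monty's door-opening explicitly:

```python
import random

def play_round(switch, rng):
    """Play one Monty Hall round; return True if the contestant wins the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # switch to the single remaining closed door
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(0)
trials = 10000
wins_switch = sum(play_round(True, rng) for _ in range(trials))
wins_keep = sum(play_round(False, rng) for _ in range(trials))
print(wins_switch / float(trials))  # close to 2/3
print(wins_keep / float(trials))    # close to 1/3
```

With the fully randomized setup, the switching strategy still wins roughly two thirds of the rounds, in line with the analysis above.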

The probability distribution of the length of pop songs

Taking a course in probability can be extremely fun if you try to apply some of the acquired knowledge to real-world data sets. Finding data that are easy to gather and at the same time interesting is a problem in itself. In a flash of inspiration, I came to think of this incredibly easy example: the length (in seconds) of pop songs.

After a successful Google search, I ended up with NRK's Spotify list "NRK mP3 - siste 400" and sampled the first 42 song lengths. Using the Python function below, the lengths were converted to seconds:

def mintilsek(n):
    # convert a length written as min.sec (e.g. 3.52 for 3:52) into seconds
    m = floor(n)
    s = (n - m) * 100
    return int(round(m * 60 + s))
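As a quick sanity check, here is the same conversion in plain Python (floor is a SAGE built-in, so outside SAGE we need math.floor), together with a couple of example values:

```python
import math

def min_to_sec(n):
    # interpret n as min.sec, e.g. 3.52 means 3 minutes 52 seconds
    minutes = math.floor(n)
    seconds = (n - minutes) * 100
    return int(round(minutes * 60 + seconds))

print(min_to_sec(3.52))  # 232
print(min_to_sec(5.21))  # 321
```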

The sample data ended up as follows:

w = [252, 241, 223, 231, 225, 242, 263, 222, 200, 220, 238, 213, 210, 213,
     210, 183, 321, 265, 228, 200, 206, 200, 228, 193, 269, 289, 197, 211,
     211, 236, 252, 222, 224, 184, 232, 198, 207, 220, 178, 192, 258, 192]

The natural question to ask is: which distribution do these numbers come from? As so many real-world populations are normally distributed, the first thing I did was to plot the observed sample values in a normal QQ-plot (that is, normal quantiles on one axis and sample quantiles on the other). I did that using the following R code:

> qqnorm(w); qqline(w)

Where w is the data set. The result is seen below:

The fit is not too bad, but there is noticeable probability mass in the tails, so a probability distribution with heavier tails should also be considered.

Let us for the moment assume that the distribution is normal. We can do a maximum likelihood estimation to estimate the parameters \mu, \sigma - the mean and standard deviation, respectively. R (obviously) has a function for this:

> fitdistr(w, "normal")
      mean          sd    
  223.785714    29.214751 
 (  4.507934) (  3.187591) 

That is, if we assume that the distribution is normal, a maximum likelihood estimate of the parameters tells us that the mean song length is 3:44 with a standard deviation of 29 seconds. In other words, about 68% of all songs are roughly between 3:15 and 4:13 long.
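The fitdistr output is easy to reproduce by hand, since for a normal model the maximum likelihood estimates are simply the sample mean and the biased (n-denominator) sample standard deviation. A small sketch using Python's standard library (with the first sample value read as 252, which is what the reported mean 223.785714 over 42 songs forces):

```python
import math
import statistics

# song lengths in seconds (the sample from the text)
w = [252, 241, 223, 231, 225, 242, 263, 222, 200, 220, 238, 213, 210, 213,
     210, 183, 321, 265, 228, 200, 206, 200, 228, 193, 269, 289, 197, 211,
     211, 236, 252, 222, 224, 184, 232, 198, 207, 220, 178, 192, 258, 192]

mu_hat = statistics.mean(w)       # ML estimate of the mean
sigma_hat = statistics.pstdev(w)  # ML estimate of sigma uses the n-denominator
se_mu = sigma_hat / math.sqrt(len(w))  # the standard error fitdistr reports

print(mu_hat, sigma_hat, se_mu)  # matches the fitdistr output above
```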

If we instead assume that the population has a Cauchy distribution, which has heavier tails, we get the following:

> fitdistr(w, "cauchy")
    location      scale   
  217.829335    15.684886 
 (  3.897386) (  3.090949)

Since neither the mean nor the variance exists for a Cauchy distribution (because of its heavy tails), these numbers are difficult to compare with the previous ones. According to this fit, however, 50% of all pop songs are shorter than about 3:38 minutes.

Finally: we know that for sufficiently large samples, the sample average \overline{X} is approximately normal. Thus we can find an approximate confidence interval for the true mean \mu (page 386, Devore & Berk). The sample mean is 223.8 and the sample standard deviation is 29.6. By the formula \overline{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}} we find that a confidence interval for \mu with approximately 95% confidence is (215, 233).
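The interval is quick to verify numerically (a plain-Python sketch of the formula above, using z_{\alpha/2} ≈ 1.96 for 95% confidence):

```python
import math

n = 42
xbar = 223.8   # sample mean
s = 29.6       # sample standard deviation (n-1 denominator)
z = 1.96       # z_{alpha/2} for 95% confidence

half_width = z * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width
print(round(lo), round(hi))  # 215 233
```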

Conclusion: though I could, and maybe should, have written a lot more, assuming that song lengths are normally distributed seems fair, given that the data fit was nearly linear for values near the mean. (That is, if we ignore the "extremes", the distribution is certainly "very" normal.)

This shows that playing with statistics can be fun (I have just been doing it for the last 3-4 hours).

The St Petersburg paradox

We toss a coin. If you get heads on the first toss, you get €2. If not, we toss again. If you get heads on the second toss, you get €4. In general, if you get heads for the first time on the n'th toss, you get €2^n. What is your expected gain?

Basic probability theory tells us that the expected value is the sum of the probabilities of the possible outcomes multiplied by the corresponding gains. One can consider the coin tossing as a discrete random variable, taking the values 0 or 1, each with probability \frac 12. Getting heads on the first toss has probability \frac 12 and earns you €2, getting heads for the first time on the second toss has probability \frac 14 and earns you €4, and so on. Thus the expected value of this game is \sum_{i=1}^\infty 2^{i} \frac{1}{2^i}=1+1+1+\cdots=\infty.
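The original simulation script is linked at the end of this post; as a stand-in, here is a minimal sketch (the names are my own) of a single round of the game and of the running mean:

```python
import random

def st_petersburg(rng):
    """Play one round: toss until heads; pay 2^n if the first heads is toss n."""
    n = 1
    while rng.random() < 0.5:  # tails, toss again
        n += 1
    return 2 ** n

rng = random.Random(1)
rounds = 100000
total = 0.0
for i in range(rounds):
    total += st_petersburg(rng)
mean_gain = total / rounds
print(mean_gain)  # finite, despite the infinite expectation
```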

This sounds surprising. After all, you cannot reasonably expect to earn an infinite amount of money - there will never be an infinite run of tails. However, in this case the gain grows exactly as fast as the probability of successive tails diminishes.

A simulation of the game shows that, indeed, the (arithmetic) mean grows as the number of tosses grows, but very slowly. The plot below shows the mean value for up to 10^7 tosses. Since the expected value is infinite, a mean gain of about €25 after ten million tosses doesn't sound like much.

The plots show the same script run twice. The dissimilarity of the two runs indicates how slowly the accuracy of the simulation improves (indeed, if the simulation told the real story, the plots would be too large for this world).

Let us, just for fun, change the rules of the game. Instead of getting €2^n when the first heads comes on the n'th toss, you now get €n. As we know, linear functions grow substantially slower than exponential ones, so we should expect a change in the expected value. Indeed, a calculation shows that the expected value is \sum_{n=1}^\infty \frac{n}{2^n} = 2. Plotting this:
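The value 2 is easy to check numerically by truncating the series; the tail beyond n = 60 is negligible at double precision:

```python
# partial sum of E[gain] = sum_{n >= 1} n / 2^n for the modified game
total = sum(n / 2.0 ** n for n in range(1, 61))
print(total)  # approximately 2
```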

We see that already at 100 tosses, we can be pretty sure what the expected value is. This is far from the case with the original rule. From the plots, the expected value could just as well be converging (albeit very slowly).

What do these considerations imply? Firstly, that simulating events with extremely low probabilities demands extreme accuracy and very large numbers of trials. Secondly, that using expected value alone leads to "paradoxes". Say you were an economist - then you'd naïvely assume that your rational course of action is to play this game: after all, if you play it many times, you will certainly get rich. But since the probability of winning anything more than a few € is so low, it would take ages before you got rich. Economists solved this by considering time as a resource. Read Wikipedia's article for more information.

The Python file used to generate the plots.