The probability distribution of the length of pop songs

Taking a course in probability can be extremely fun if you try to apply some of the acquired knowledge to real-world data sets. Finding data that are easy to gather and at the same time interesting is a problem in itself. In a flash of inspiration, I came to think of this incredibly easy example: the length (in seconds) of pop songs.

After a successful Google search, I ended up with NRK’s Spotify list “NRK mP3 – siste 400” and sampled the first 42 song lengths. Using the Python function below, the lengths were converted to seconds:

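from math import floor

def mintilsek(n):
    # n encodes a length as minutes.seconds, e.g. 3.43 means 3 min 43 s
    minutes = floor(n)
    seconds = (n - minutes) * 100
    return int(round(minutes * 60 + seconds))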
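
If the lengths are available as text rather than as decimal numbers, the conversion can avoid floating point altogether. The following is only a sketch under that assumption; the name mmss_to_seconds is hypothetical and was not part of my original script.

```python
def mmss_to_seconds(text):
    """Convert a length written as 'm:ss' (for example '3:43') to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


print(mmss_to_seconds("3:43"))  # 223
```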

The sample data ended up as follows:

w = [252 241 223 231 225 242 263 222 200 220 238 213 210 213 210 183 321
 265 228 200 206 200 228 193 269 289 197 211 211 236 252 222 224 184
 232 198 207 220 178 192 258 192]

The natural question to ask is: which distribution do these numbers follow? Since so many real-world populations are normally distributed, the first thing I did was to plot the observed sample values in a normal QQ-plot (that is, normal quantiles on one axis and sample quantiles on the other). I did that using the following R code:

> qqnorm(w); qqline(w)

Here w is the data set. The result is seen below:

The fit is not too bad, but there is noticeable probability mass in the tails, so a probability distribution with heavier tails should be considered.

Let us for the moment assume that the distribution is normal. We can use maximum likelihood estimation to estimate the parameters $$\mu,\sigma$$, the mean and standard deviation respectively. R has (obviously) a function for this, fitdistr from the MASS package:

> fitdistr(w, "normal")
      mean          sd    
  223.785714    29.214751 
 (  4.507934) (  3.187591) 

That is, if we assume that the distribution is normal, the maximum likelihood estimates tell us that the mean song length is about 3:44, with a standard deviation of 29 seconds. Under this model, about 68% of all songs are between 3:15 and 4:13 long.
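
The normal estimates can be reproduced without R, since the maximum likelihood estimates are just the sample mean and the standard deviation with divisor n. Below is a minimal Python sketch. The helper names normal_mle and as_min_sec are my own, and the first data value is taken as 252, which is what the reported mean of 223.79 over 42 songs implies.

```python
from math import sqrt

# Song lengths in seconds. The first value is taken as 252, an assumption
# that matches the reported mean of 223.79 over 42 songs.
w = [252, 241, 223, 231, 225, 242, 263, 222, 200, 220, 238, 213, 210, 213,
     210, 183, 321, 265, 228, 200, 206, 200, 228, 193, 269, 289, 197, 211,
     211, 236, 252, 222, 224, 184, 232, 198, 207, 220, 178, 192, 258, 192]


def normal_mle(xs):
    """Return the maximum likelihood estimates (mean, sd) for a normal model."""
    n = len(xs)
    mean = sum(xs) / n
    # The MLE of the variance divides by n, not by n - 1.
    variance = sum((x - mean) ** 2 for x in xs) / n
    return mean, sqrt(variance)


def as_min_sec(seconds):
    """Format a number of seconds as m:ss, rounded to the nearest second."""
    total = int(round(seconds))
    return f"{total // 60}:{total % 60:02d}"


mean, sd = normal_mle(w)
print(round(mean, 2), round(sd, 2))
print(as_min_sec(mean - sd), as_min_sec(mean), as_min_sec(mean + sd))
```

The times are rounded to the nearest second, so they can differ by a second from values obtained by truncating.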

If we now however assume that the population has a Cauchy distribution, which has heavier tails, then we get the following:

> fitdistr(w, "cauchy")
    location      scale   
  217.829335    15.684886 
 (  3.897386) (  3.090949)

Since neither the mean nor the variance exists for a Cauchy distribution because of its heavy tails, these numbers are difficult to compare with the previous ones. The location parameter is the median of the distribution, however, so according to this fit 50% of all pop songs are shorter than about 3:38.
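
To get a feeling for what "heavier tails" means here, one can compare how much probability each fitted model puts beyond the longest song in the sample (321 seconds). This is a small sketch using the standard closed-form distribution functions and the parameter estimates reported by fitdistr above. The function names are my own.

```python
from math import atan, erf, pi, sqrt


def cauchy_cdf(x, location, scale):
    """Distribution function of a Cauchy distribution."""
    return 0.5 + atan((x - location) / scale) / pi


def normal_cdf(x, mean, sd):
    """Distribution function of a normal distribution."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))


# Parameter estimates as reported by fitdistr in the text.
cauchy_location, cauchy_scale = 217.829335, 15.684886
normal_mean, normal_sd = 223.785714, 29.214751

# The location parameter is the median of the Cauchy fit.
print(cauchy_cdf(cauchy_location, cauchy_location, cauchy_scale))  # 0.5

# Probability of a song longer than 321 seconds under each model.
cauchy_tail = 1 - cauchy_cdf(321, cauchy_location, cauchy_scale)
normal_tail = 1 - normal_cdf(321, normal_mean, normal_sd)
print(cauchy_tail, normal_tail)
```

The Cauchy fit assigns far more probability to such a long song than the normal fit does.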

Finally: we know that for sufficiently large samples, the sample average $$\overline{X}$$ is approximately normal. Thus we can find an approximate confidence interval for the true mean $$\mu$$ (page 386, Devore & Berk). The sample mean is $$223.8$$ and the sample standard deviation is $$29.6$$. Thus by the formula $$\overline{x} \pm z_{\alpha/2} \frac{s}{\sqrt{n}}$$ we find that a confidence interval for $$\mu$$ with approximately 95% confidence is $$(215, 233)$$.
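
The arithmetic behind the interval is easy to check. Here is a minimal sketch, assuming the usual value 1.96 for $$z_{\alpha/2}$$ at 95% confidence. The function name normal_ci is hypothetical.

```python
from math import sqrt


def normal_ci(mean, s, n, z=1.96):
    """Large-sample confidence interval: mean +/- z * s / sqrt(n)."""
    half_width = z * s / sqrt(n)
    return mean - half_width, mean + half_width


# Sample mean, sample standard deviation and sample size from the text.
low, high = normal_ci(223.8, 29.6, 42)
print(round(low), round(high))  # 215 233
```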

Conclusion: Though I could, and maybe should, have written a lot more, it seems that assuming song lengths are normally distributed is fair, given that the QQ-plot was nearly linear for values near the mean (that is, if we ignore the “extremes”, the distribution is certainly “very” normal).

This proves that playing with statistics can be fun (as I have just been doing it for the last 3-4 hours).

The St Petersburg paradox

We toss a coin. If you get heads on the first toss, you get €2. If not, we toss again. If you get heads on the second toss, you get €4. In general, if the first heads comes on the n’th toss, you get €$$2^n$$. What is your expected gain?

Basic probability theory tells us that the expected value is the sum of the probabilities of the possible events multiplied by the corresponding gains. Each coin toss can be considered a discrete random variable taking the values 0 or 1, each with probability $$\frac 12$$. Getting the first heads on the first toss has probability $$\frac 12$$ and earns you €2, getting it on the second toss has probability $$\frac 14$$ and earns you €4, and so on. Thus the expected value of this game is $$\sum_{i=1}^\infty 2^{i} \frac{1}{2^i}=1+1+1+\cdots=\infty$$.

This sounds surprising. After all, you cannot reasonably expect to earn an infinite amount of money, since there will never be an infinite run of tails. However, in this case the gain grows exactly as fast as the probability of successive tails shrinks.

A simulation of the game shows that the (arithmetic) mean does indeed grow as the number of tosses grows, but very slowly. The plot below shows the mean value up to $$10^7$$ tosses. Since the expected value is infinity, a mean gain of about €25 after ten million tosses doesn’t sound like much.
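
A simulation of this kind takes only a few lines. The following is a minimal sketch and not the original script behind the plots. The function names and the fixed seed are my own choices, and the number of games is kept small so that it runs quickly.

```python
import random


def play_once(rng):
    """Play one game: toss until the first heads, win 2**n if it comes on toss n."""
    n = 1
    while rng.random() < 0.5:  # tails with probability 1/2, so toss again
        n += 1
    return 2 ** n


def mean_gain(num_games, seed=0):
    """Arithmetic mean of the gain over num_games independent games."""
    rng = random.Random(seed)
    return sum(play_once(rng) for _ in range(num_games)) / num_games


for num_games in (10**2, 10**4, 10**5):
    print(num_games, mean_gain(num_games))
```

Changing the seed gives noticeably different means, since rare long runs of tails dominate the average.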

The plots show the same script run twice. The dissimilarity of the plots indicates the slow growth of the accuracy of the simulation (indeed, if the simulation told the real story, the plots would be too large for this world).

Let us, just for fun, change the rules of the game. Instead of getting $$2^n$$ on the n’th toss, you now get $$n$$. As we know, linear functions grow substantially slower than exponential ones, so we should expect a change in the expected value. Indeed, a calculation shows that the expected value is 2. Plotting this:
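
One way to see that the value is 2 (there are others): write $$n$$ as a sum of $$n$$ ones and swap the order of summation, which is allowed since all terms are positive.

$$\sum_{n=1}^\infty \frac{n}{2^n} = \sum_{n=1}^\infty \sum_{k=1}^{n} \frac{1}{2^n} = \sum_{k=1}^\infty \sum_{n=k}^\infty \frac{1}{2^n} = \sum_{k=1}^\infty \frac{1}{2^{k-1}} = 2$$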

We see that already at 100 tosses, we can be pretty sure what the expected value is. This is far from the case with the original rule. From those plots alone, one could just as well believe that the mean converges (albeit very slowly).

What do these considerations imply? Firstly, that simulating events with extremely low probabilities demands an extremely large number of trials. Secondly, that using only the expected value leads to “paradoxes”. Say you were an economist: then you would naïvely assume that your rational course of action is to play this game. After all, if you play it many times, you will certainly get rich. But as the probability of earning anything more than a few € is so low, it will take ages before you get rich. Economists solved this by considering time as a resource. Read Wikipedia’s article for more information.

The Python file used to generate the plots.

“Fearless Symmetry” by Avner Ash, Robert Gross

At a book store in a shopping center on the coast of California I found this gem of a book. I skimmed through the table of contents and bought it without much further thought. In retrospect, it is safe to say that it was worth the $23.95 plus Californian tax.

As the title suggests, the book is very much about symmetry, but the title is also slightly misleading. The book is really about number theory and the theory that led to the proof of Fermat’s Last Theorem.

The book’s main mission is to explore the absolute Galois group $$G=G(\mathbb Q^{alg}/\mathbb Q)$$ through representations, that is, morphisms from $$G$$ to better-known groups, such as matrix groups and finite fields. As such, the book is more about representation theory than symmetry. But it doesn’t stop there! A main theme in the book is how representation theory is behind generalized reciprocity laws in number theory and how reciprocity laws are used in advanced mathematics (an example of a reciprocity law is quadratic reciprocity: for distinct odd primes $$p$$ and $$q$$ we have $$(p/q)(q/p)=(-1)^{\frac{p-1}{2}\frac{q-1}{2}}$$, where $$(p/q)$$ is the Legendre symbol. That is, knowing whether $$p$$ is a square mod $$q$$ tells us whether $$q$$ is a square mod $$p$$, and conversely).
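
Quadratic reciprocity in its standard form, $$(p/q)(q/p)=(-1)^{\frac{p-1}{2}\frac{q-1}{2}}$$ for distinct odd primes $$p$$ and $$q$$, is easy to check numerically for small primes. The sketch below computes the Legendre symbol with Euler's criterion. It is my own illustration and does not come from the book, and the function names are hypothetical.

```python
def legendre(a, p):
    """Legendre symbol (a/p) for an odd prime p, via Euler's criterion."""
    r = pow(a, (p - 1) // 2, p)
    # Euler's criterion gives 0, 1 or p - 1; the last one represents -1.
    return -1 if r == p - 1 else r


def reciprocity_holds(p, q):
    """Check (p/q)(q/p) = (-1)^(((p-1)/2)((q-1)/2)) for distinct odd primes."""
    sign = (-1) ** (((p - 1) // 2) * ((q - 1) // 2))
    return legendre(p, q) * legendre(q, p) == sign


primes = [3, 5, 7, 11, 13, 17, 19, 23]
print(all(reciprocity_holds(p, q) for p in primes for q in primes if p != q))
```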

The book is written in leisurely language, contains no difficult proofs and avoids technical definitions, all without losing substance. Number theory is presented as a rich subject with lots of tools and abstractions.

The presentation was very inspirational, and this next semester will be like Christmas for me.

Mathematics – Form and function by Saunders Mac Lane

I have just finished reading “Mathematics – form and function” by Saunders Mac Lane. The main goal of the book is to present the author’s philosophy of mathematics, answering the question “what is mathematics?”. In doing so, he also answers the question “is mathematics true?” and demonstrates that it is a non-question. He presents mathematics as a set of tightly interwoven formal rules, wherein deduction is only allowed following the “rules of deduction”.

Banach-Tarski paradox, very weak version

I post this mainly to test the MathJax JavaScript on this blog (if you have for some paranoid reason turned off JavaScript, you will see only LaTeX code below). However, this post is not totally without substance. The usual Banach-Tarski paradox states that for any two bounded subsets A, B of $$\mathbb{R}^3$$ with non-empty interior, it is possible to partition A into finitely many pieces, move the pieces around using rotations and translations only, and end up with a copy of B (possibly somewhere else). The proof is lengthy and involves the Axiom of Choice. This should not foster any doubts about the Axiom of Choice, however. It is possible to find paradoxical constructions without it (not to mention all its reasonable consequences). The free group on two generators is paradoxical without Choice, by the way. Anyhow, I’m drifting away from the main topic of this post.

Very weak version of the Banach-Tarski paradox: Let $$S^1$$ be the usual unit circle in $$\mathbb{R}^2$$. Remove one point from the circle (for example $$1$$), and call this punctured unit circle $$C$$. Then it is possible to find two subsets $$A,B$$ of $$C$$ such that after rotating $$B$$ to $$r(B)$$, we have $$A \cup r(B)=S^1$$.

Proof: The proof is short and easy. Identify the plane with $$\mathbb{C}$$ for convenience. Let $$C=S^1-\{1\}$$. Let $$B=\{ e^{in} | n \in \mathbb{N} \}$$ (we adopt the convention that $$0 \notin \mathbb{N}$$). Since $$\pi$$ is irrational, the points $$e^{in}$$ are pairwise distinct and none of them equals $$1$$. Hence $$B$$ is a proper (countably) infinite subset of $$C$$. Let $$A=C-B$$. We have $$C=A \cup B$$. Now, let $$r:\mathbb{C} \to \mathbb{C}$$ be the rotation $$z \mapsto ze^{-i}$$. Then $$r(e^{in})=e^{i(n-1)}$$, so $$r(B)=B \cup \{ 1 \}$$ and $$A \cup r(B)=S^1$$.
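
The key step, that the rotation sends $$e^{in}$$ to $$e^{i(n-1)}$$ and therefore produces the missing point $$1$$ from $$e^{i}$$, can be illustrated numerically. This is only a floating-point sanity check under my own naming, not a proof.

```python
import cmath


def b_point(n):
    """The point e^{in} on the unit circle."""
    return cmath.exp(1j * n)


def rotate(z):
    """The rotation r: z -> z * e^{-i}."""
    return z * cmath.exp(-1j)


# r maps e^{in} to e^{i(n-1)}; in particular e^{i} is sent to the removed point 1.
shift_ok = all(abs(rotate(b_point(n)) - b_point(n - 1)) < 1e-12
               for n in range(1, 50))
print(shift_ok, abs(rotate(b_point(1)) - 1) < 1e-12)

# No e^{in} with 1 <= n <= 1000 coincides with 1, though some come close.
min_dist = min(abs(b_point(n) - 1) for n in range(1, 1001))
print(min_dist > 0)
```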

The proof is easy to follow, but the technique is conceptually the same as the technique used to prove the strong version of the paradox.  The technique is basically to find some infinite subset of $$C$$ that is regular enough to “almost rotate into itself” – with “almost” here meaning “all but finitely many points”.

Why is the above result interesting? In my personal opinion, because of its conceptual content: it gives us an idea of how to prove stronger results such as the Banach-Tarski paradox. It also shows that mathematics is not always intuitive.

I’ll end this post with  a link to my own undergrad project paper about the strong Banach-Tarski paradox (contains some minor typos, by the way). CLICK HERE.

Ian Stewart – From here to infinity

I have just finished the book “From here to infinity – A Guide to Today’s Mathematics” by Ian Stewart. The title is very descriptive: the book is mostly about what mathematicians have been doing for the last couple of hundred years, and it covers a very broad spectrum of topics (prime numbers (i.e. the Riemann hypothesis), knot theory, topology (e.g. the four colour theorem), algebra (e.g. the unsolvability of the quintic equation), and computability (e.g. Gödel’s famous theorem)).

The book is very inspiring for a budding mathematician. We hear about how long it takes to prove important results, about unsolved problems, and a bit about how a mathematician works.

The book was first published in 1987, so some of the information is already outdated. For example, both the Poincaré conjecture and Fermat’s Last Theorem are solved problems today. But that is only motivating! (even today mathematicians have something to do)

The concepts in the book are introduced in an easy-to-read way, and it can be read by anyone with a decent general education. I read the book for pleasure, in the sense that I did not take the time to thoroughly understand every single new concept that was introduced.

The author focuses a lot on how much mathematics has been applied in the society around us, and describes how the development of new mathematics has been inspired by progress in physics, biology, economics and so on. He predicts that “the mathematics of the twenty-first century” (i.e. today’s and onwards!) will be strongly marked by the interplay between computers/physics and modern technology. I do not have this impression myself, but it is still interesting to read how the future of mathematics was imagined at the end of the 80s.

The book can be borrowed at Matematisk Bibliotek at Blindern (once I have returned it!).

Symmetry – Marcus du Sautoy

I have recently finished the book “Symmetry – A journey into the patterns of nature” by Marcus du Sautoy. I started reading it on the plane home from Bangkok (thanks to fortunate circumstances I was seated in Business Class), and finished it a few days ago.

The book is a gem for all mathematics students, and for others interested in mathematics too. It is a fine blend of the history of mathematics, the mathematics behind symmetry (told in a very accessible way) and what it is like to be a mathematician. The last ingredient is perhaps the tastiest of the three.

As the title hints, the book is about symmetry, and what it really is. “Ordinary”, untrained people have a vague definition in the back of their minds, and perhaps think of mirrors and flowers. The book tries to tell us what a mathematician means by symmetry, and how a mathematical concept has been created that lets us calculate with “symmetries”. It talks about symmetry groups, and the classification of the finite simple groups.

The mathematical part of the book is simple, and most of it is understandable to most people. Still, I think it can be an advantage to know some of the concepts (group theory, in particular) when reading the book, so that one understands what it is talking about. Then it becomes really exciting!

The historical parts of the book tell how people have worked with the concept of symmetry throughout history, how it led us to start counting (by seeing similarities between individual things), and how the concept of symmetry helped us prove that the quintic equation has no solution formula.

Du Sautoy tells what it is like to work as a mathematician: what it is like to travel around the world to conferences, about eccentric mathematicians, and what it is like to work at a university.

As a subtle symmetric detail, the book begins and ends with the same event.

Recommended.

Mathematics, a very short introduction

I just finished the book “Mathematics – A Very Short Introduction” by Timothy Gowers, in the well-known Very Short series. Although the book did not exactly give me much that was new subject-wise (since I study the subject at university level), it gave me insight into the communication of mathematics. Gowers has managed what not very many manage, namely to convey the essence of how professional mathematicians work, and what mathematics is (just for the record, I am not (yet) a professional mathematician), and to do so in a very understandable way.

A recurring message in the book is the mathematical method, which he also calls “the abstract method”. Mathematics consists of choosing some axioms (and finding out whether they are compatible with each other), and deriving propositions from them. He points out that it is not relevant what things are, but rather which rules they obey. An example: it is not important to know what numbers are; it is important to know, for example, that $$a(b+c)=ab+ac$$.

The book ends with a “frequently asked questions” section which, at least for me, was very interesting to read. Some of the questions he tries to answer are “Is it true that mathematicians are past it by 30?”, “Why are there so few women in mathematics?”, “Are famous mathematical problems ever solved by amateurs?”.

The book can be read by anyone who has completed lower secondary school. If you are interested in mathematics and like reading short, instructive books, this one is for you =)

The book costs only about 55 Norwegian kroner on Amazon. I would also recommend that anyone interested take a look at Gowers’s blog.