Taking a course in probability can be extremely fun if you try to apply some of the acquired knowledge to real-world data sets. Finding data that is easy to gather and at the same time interesting is a problem in itself. In a flash of inspiration, I came to think of this incredibly easy example: the length (in seconds) of pop songs.
After a successful Google search, I ended up with NRK's Spotify list "NRK mP3 - siste 400" and sampled the first 42 song lengths. Using the Python function below, the lengths (entered as m.ss, so 3.43 means 3 minutes and 43 seconds) were converted to seconds:
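from math import floor

def mintilsek(n):
    # n encodes a length as m.ss, e.g. 3.43 means 3 minutes and 43 seconds
    minutes = floor(n)
    seconds = (n - minutes) * 100
    return int(round(minutes * 60 + seconds))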
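The function above relies on the length being typed in as a decimal number. If the lengths are available as "m:ss" text instead, the conversion can be done without any floating point arithmetic. The following is only a sketch of that variant; the name `mmss_to_seconds` and the example strings are made up for illustration and are not taken from the playlist.

```python
def mmss_to_seconds(text):
    """Convert a duration such as "3:43" into whole seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


# Hypothetical durations, used only to show the conversion.
examples = ["3:43", "4:12", "0:59"]
print([mmss_to_seconds(t) for t in examples])  # [223, 252, 59]
```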
The sample data ended up as follows:
w = [252 241 223 231 225 242 263 222 200 220 238 213 210 213 210 183 321 265 228 200 206 200 228 193 269 289 197 211 211 236 252 222 224 184 232 198 207 220 178 192 258 192]
The natural question to ask is: which distribution do these numbers belong to? As so many real-world populations are normally distributed, the first thing I did was to plot the observed sample values in a normal QQ-plot (that is, normal quantiles on one axis, and sample quantiles on the other axis). I did that using the following R code:
> qqnorm(w); qqline(w)
Here w is the data set. The result is seen below:
[Figure: normal QQ-plot of the sample w, with the reference line drawn by qqline]
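To make clear what such a plot contains, here is a rough Python sketch of the points it draws: each sorted observation is paired with a standard normal quantile. It assumes the plotting positions (i - 0.5)/n, which is what R's ppoints() uses for samples larger than 10. The first data value is taken as 252, which is what the fitted mean of 223.785714 reported by R implies. The function name `normal_qq_points` is made up for this illustration.

```python
from statistics import NormalDist

# Song lengths in seconds (first value taken as 252, see the note above).
w = [
    252, 241, 223, 231, 225, 242, 263, 222, 200, 220, 238, 213, 210, 213,
    210, 183, 321, 265, 228, 200, 206, 200, 228, 193, 269, 289, 197, 211,
    211, 236, 252, 222, 224, 184, 232, 198, 207, 220, 178, 192, 258, 192,
]


def normal_qq_points(sample):
    """Pair each sorted observation with a standard normal quantile."""
    n = len(sample)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n for i = 1, ..., n.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(sample)))


points = normal_qq_points(w)
# Show the three smallest and three largest points of the plot.
for quantile, seconds in points[:3] + points[-3:]:
    print(f"{quantile:6.2f}  {seconds}")
```

If the sample were exactly normal, these points would lie on a straight line; the largest observation (321 seconds) is the kind of "extreme" that bends away from it.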
Let us for the moment assume that the distribution is normal. We can do a maximum likelihood estimation to estimate the parameters $\mu$ and $\sigma$, the mean and standard deviation respectively. R has (obviously) a function for this, fitdistr from the MASS package:
> fitdistr(w, "normal")
      mean          sd
  223.785714    29.214751
 (  4.507934) (  3.187591)
That is, if we assume that the distribution is normal, then a maximum likelihood estimate of the parameters tells us that the mean song length is about 3:44 (223.8 seconds) with a standard deviation of about 29 seconds. Since about 68% of a normal distribution lies within one standard deviation of the mean, roughly 68% of all songs are between 3:15 and 4:13 long.
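For the normal distribution the maximum likelihood estimates have a closed form: the sample mean, and the standard deviation computed with divisor n rather than n - 1. A small Python sketch, using only the standard library, reproduces the numbers reported by R. The first data value is taken as 252, which is what the reported mean implies, and the helper `to_mmss` is made up for this illustration.

```python
from statistics import fmean, pstdev

# Song lengths in seconds (first value taken as 252, see the note above).
w = [
    252, 241, 223, 231, 225, 242, 263, 222, 200, 220, 238, 213, 210, 213,
    210, 183, 321, 265, 228, 200, 206, 200, 228, 193, 269, 289, 197, 211,
    211, 236, 252, 222, 224, 184, 232, 198, 207, 220, 178, 192, 258, 192,
]


def to_mmss(seconds):
    """Format a number of seconds as m:ss, rounded to the nearest second."""
    minutes, rest = divmod(round(seconds), 60)
    return f"{minutes}:{rest:02d}"


mu_hat = fmean(w)      # maximum likelihood estimate of the mean
sigma_hat = pstdev(w)  # divisor n, the maximum likelihood estimate of sigma

print(f"mean = {mu_hat:.6f}, sd = {sigma_hat:.6f}")
print("about 68% of songs between",
      to_mmss(mu_hat - sigma_hat), "and", to_mmss(mu_hat + sigma_hat))
```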
If we instead assume that the population has a Cauchy distribution, which has heavier tails, we get the following:
> fitdistr(w, "cauchy")
    location       scale
  217.829335    15.684886
 (  3.897386) (  3.090949)
Since neither the mean nor the variance exists for a Cauchy distribution because of its heavy tails, these numbers are difficult to compare with the previous ones. The location parameter is, however, the median of the distribution, so according to this fit 50% of all pop songs are shorter than about 3:38.
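What the two Cauchy parameters do mean can be read off the distribution function $F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\left(\frac{x - x_0}{\gamma}\right)$, where $x_0$ is the location and $\gamma$ the scale: the median is $x_0$ and the quartiles are $x_0 \pm \gamma$. A short Python sketch with the estimates quoted above; the function names are made up for this illustration.

```python
from math import atan, pi, tan


def cauchy_cdf(x, location, scale):
    """Distribution function of a Cauchy(location, scale) variable."""
    return 0.5 + atan((x - location) / scale) / pi


def cauchy_quantile(p, location, scale):
    """Inverse of cauchy_cdf for 0 < p < 1."""
    return location + scale * tan(pi * (p - 0.5))


# Estimates reported by fitdistr(w, "cauchy").
location, scale = 217.829335, 15.684886

median = cauchy_quantile(0.50, location, scale)  # equals the location
q1 = cauchy_quantile(0.25, location, scale)      # location - scale
q3 = cauchy_quantile(0.75, location, scale)      # location + scale

print(f"median = {median:.1f} s, quartiles = {q1:.1f} s and {q3:.1f} s")
```

Under this fit, half of all songs would thus fall within about 16 seconds of the median.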
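Finally: We know that for sufficiently large samples, the sample average is approximately normal (the central limit theorem). Thus we can find an approximate confidence interval for the true mean $\mu$ (page 386, Devore & Berk). The sample mean is $\bar{x} \approx 223.79$ and the sample standard deviation is $s \approx 29.57$. Thus by the formula $\bar{x} \pm z_{\alpha/2} \, s/\sqrt{n}$, with $n = 42$ and $z_{0.025} \approx 1.96$, we find that a confidence interval for $\mu$ with approximately 95% confidence is $(214.8, 232.7)$ seconds, that is, roughly 3:35 to 3:53.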
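The large-sample interval is easy to recompute. The sketch below assumes the usual form $\bar{x} \pm z_{\alpha/2} \, s/\sqrt{n}$, with $s$ computed using divisor n - 1, and takes the first data value as 252, which is what the fitted mean reported by R implies.

```python
from math import sqrt
from statistics import NormalDist, fmean, stdev

# Song lengths in seconds (first value taken as 252, see the note above).
w = [
    252, 241, 223, 231, 225, 242, 263, 222, 200, 220, 238, 213, 210, 213,
    210, 183, 321, 265, 228, 200, 206, 200, 228, 193, 269, 289, 197, 211,
    211, 236, 252, 222, 224, 184, 232, 198, 207, 220, 178, 192, 258, 192,
]

n = len(w)
x_bar = fmean(w)                 # sample mean
s = stdev(w)                     # sample standard deviation, divisor n - 1
z = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence

half_width = z * s / sqrt(n)
ci = (x_bar - half_width, x_bar + half_width)

print(f"x_bar = {x_bar:.2f}, s = {s:.2f}")
print(f"approximate 95% CI for the mean: ({ci[0]:.1f}, {ci[1]:.1f}) seconds")
```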
Conclusion: Though I could, and maybe should, have written a lot more, it seems that assuming song lengths are normally distributed is a fair assumption, given that the points in the QQ-plot were nearly on a line for values near the mean (that is, if we ignore the "extremes", the distribution looks very close to normal).
This proves that playing with statistics can be fun (as I have just been doing it for the last 3-4 hours).