Is the Weather Biased?
July 29, 2010 — Jon McLoone, Director, Technical Communication & Strategy
My mother has a theory: “The nicest weather is when you are at work, and then it rains on the weekend.” Hearing this from her once again, I think it is time to expose her theory to the facts and prove her wrong.
We’ll start by setting up some tools to help retrieve and categorize the data in terms of the type of day. In the United Kingdom, the weekend is Saturday and Sunday.
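The Mathematica input for this step is not reproduced in the text. As a rough illustration in Python rather than the post's own code, the day classification could be sketched as follows; the function name `is_weekend` is an assumption made for this example.

```python
from datetime import date


def is_weekend(day: date) -> bool:
    """Return True for Saturday or Sunday (the UK weekend)."""
    # date.weekday() numbers Monday as 0, so Saturday is 5 and Sunday is 6.
    return day.weekday() >= 5


print(is_weekend(date(2010, 7, 31)))  # a Saturday
print(is_weekend(date(2010, 7, 29)))  # a Thursday
```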
Now we can take the available weather data and split it into weekend values and weekday values, and discard dates where no value was recorded…
Finally, we need to average the data and present it clearly. We’ll do this as a separate step, as we’ll also want to access the raw data for further analysis.
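The split-then-average steps above can be sketched in Python (not the Mathematica used in this post). The helper names `split_by_day_type` and `summarize` are hypothetical, a missing reading is assumed to be represented by `None`, and the sample values are made up purely to exercise the functions.

```python
from datetime import date
from statistics import mean


def split_by_day_type(readings):
    """Split (date, value) pairs into weekend and weekday value lists.

    Readings whose value is None (nothing recorded) are discarded.
    """
    weekend, weekday = [], []
    for day, value in readings:
        if value is None:
            continue
        # weekday() is 5 for Saturday and 6 for Sunday.
        (weekend if day.weekday() >= 5 else weekday).append(value)
    return weekend, weekday


def summarize(readings):
    """Average each group separately; the raw lists stay available from the splitter."""
    weekend, weekday = split_by_day_type(readings)
    return {"weekend": mean(weekend), "weekday": mean(weekday)}


# Made-up readings for illustration; not real weather data.
sample = [
    (date(2010, 7, 26), 1.0),   # Monday
    (date(2010, 7, 27), None),  # Tuesday, nothing recorded
    (date(2010, 7, 28), 3.0),   # Wednesday
    (date(2010, 7, 31), 2.0),   # Saturday
    (date(2010, 8, 1), 4.0),    # Sunday
]
print(summarize(sample))
```

Keeping the split and the averaging in separate functions mirrors the reason given above: the raw weekend and weekday values are needed again for the significance tests.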
OK, now I am ready to start. In the UK, the definition of “nice” weather is simple: the drier the better and the hotter the better. Let’s look at the rain first…
Oh dear, she might be on to something. There has been on average nearly 0.02 mm more rain on weekend days. Before conceding defeat, we have to know whether that is a significant result. Fortunately, Mathematica has all the statistics we need to test this.
If you take two samples from the same population, they are unlikely to have exactly the same mean. The question is: how likely is it that the means would differ by as much as these do purely because of the randomness of sampling? If a difference this large would arise by chance less than 5% of the time (the conventional threshold), we call it significant and conclude that there is probably another cause (i.e., that the two samples come from populations that actually have different means).
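One way to make this reasoning concrete is a permutation test, which is a different technique from the test used in this post: shuffle the pooled values, re-split them at random, and count how often the gap in means is at least as large as the observed one. The sketch below is in Python, the name `permutation_p_value` and its parameters are assumptions for this example, and the data are made up.

```python
import random
from statistics import mean


def permutation_p_value(a, b, trials=2000, seed=0):
    """Estimate how often a random split gives a mean gap at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into groups of the original sizes.
        gap = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if gap >= observed:
            hits += 1
    return hits / trials


# Made-up values for illustration; not real weather data.
weekend = [2.1, 0.0, 3.4, 1.2, 0.0, 5.0]
weekday = [1.9, 0.0, 2.8, 1.5, 0.3, 4.1]
print(permutation_p_value(weekend, weekday))
```

A returned value below 0.05 would correspond to the 5% threshold described above.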
My hypothesis is that the weather has no bias for weekends, and therefore that both sets of data come from the same population. So we set the hypothesized difference in means to zero, assume that the variances of both populations are equal, and make it a TwoSided test, as the hypothesis is symmetrical.
We’ll create a shortcut function for this as we will do this test several times.
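For readers without Mathematica, a test with the same settings (zero hypothesized difference, equal variances, two-sided) can be sketched in Python. This is not the post's own function: the name `mean_difference_test` is hypothetical, and the p-value uses a normal approximation to Student's t, which is an assumption that is only reasonable for large samples such as the thousands of daily values here.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def mean_difference_test(a, b):
    """Two-sided, equal-variance test of the hypothesis that two samples share a mean.

    Returns (t, p). The p-value is a large-sample normal approximation,
    not an exact Student's t tail probability.
    """
    na, nb = len(a), len(b)
    # Pooled sample variance, weighting each group by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    # Two-sided: count deviations in either direction.
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p


# Tiny made-up samples just to show the call; far too small for the approximation.
t, p = mean_difference_test([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(t, p)
```

Wrapping the test in one function serves the same purpose as the shortcut described above: the same settings are reused for rain, temperature, and the combined measure.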
Phew! This tells us that the difference isn’t significant. Randomly dividing the data into two sets will produce this much difference in mean about 20% of the time.
What about temperature?
The average weekend temperature is 0.03°C higher—the opposite of her prediction—but again, let’s see if that is significant…
Not surprisingly, this isn’t significant either. But “niceness” is a combination of both values, and we don’t like it too windy either. Let’s factor in windiness; perhaps an effect appears only when the three data sets are combined.
The weighting between the three values is somewhat subjective, but here is my formula. You will have to come up with your own to suit your climate and tastes.
Unfortunately, the weather stations that Mathematica is querying sometimes have missing data points when individual instruments or weather stations have been offline. We have to collect the data by day, reject any days where any of the values is missing, apply the “niceness” metric, and categorize as weekend or not. This is the kind of complicated data analysis that is made easy by Mathematica.
We are retrieving three data sets with around 20,000 points in each. My first naive approach was to search each set nearly 20,000 times. This was a bit slow, so instead we’ll tag each value with its type (rain, temperature, or wind), throw everything in together, and let the built-in Mathematica command GatherBy match up the dates. As a rule of thumb, if there is a built-in command close to what you want, try to use it rather than replicating it with something similar. This version takes a couple of seconds.
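The tag, pool, and group-by-date idea can be sketched in Python, with a dictionary keyed by date standing in for GatherBy. Everything named here is an assumption for illustration: `niceness_by_day` is hypothetical, missing readings are assumed to be `None`, and `example_niceness` is a placeholder weighting, not the formula used in this post.

```python
from collections import defaultdict
from datetime import date


def niceness_by_day(rain, temperature, wind, niceness):
    """Combine three lists of (date, value) readings into weekend and weekday scores.

    `niceness` is a caller-supplied function of (rain, temperature, wind).
    Days missing any of the three values are rejected.
    """
    # Tag every reading with its type and pool them into one list.
    tagged = (
        [(day, "rain", v) for day, v in rain]
        + [(day, "temperature", v) for day, v in temperature]
        + [(day, "wind", v) for day, v in wind]
    )
    # One pass groups the readings by date, instead of searching each set per day.
    by_day = defaultdict(dict)
    for day, kind, value in tagged:
        if value is not None:
            by_day[day][kind] = value
    weekend, weekday = [], []
    for day, values in sorted(by_day.items()):
        if len(values) < 3:
            continue  # at least one instrument had no reading that day
        score = niceness(values["rain"], values["temperature"], values["wind"])
        (weekend if day.weekday() >= 5 else weekday).append(score)
    return weekend, weekday


def example_niceness(rain, temperature, wind):
    """Placeholder weighting for illustration only; not the post's formula."""
    return temperature - rain - wind


# Made-up readings: Friday, Saturday, and a Sunday with no rain value.
fri, sat, sun = date(2010, 7, 30), date(2010, 7, 31), date(2010, 8, 1)
rain = [(fri, 0.0), (sat, 2.0), (sun, None)]
temperature = [(fri, 20.0), (sat, 18.0), (sun, 19.0)]
wind = [(fri, 5.0), (sat, 3.0), (sun, 4.0)]
print(niceness_by_day(rain, temperature, wind, example_niceness))
```

Grouping in a single pass is what avoids the repeated searches of the naive approach described above.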
Not significant again.
At this point, I present the results. Peer review does not go well. My mother alleges selective choice of Oxford, a place where she has never lived.
I sense that it is time to change the goal of my research. Where is her theory MOST true? Let’s scan the largest 68 cities in the UK to see if it applies somewhere…
The places with the greatest differences between weekend and weekday weather are at the ends of the sorted list.
While both of these are significant with the 5% test, applying that test to data that we have systematically selected as the most extreme of 68 cities is enough of an abuse of statistical method to make the result meaningless. And I point out that Brighton, the place where her theory is least true, IS somewhere that she has lived.
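A little arithmetic shows why picking the extremes of 68 tests is a problem. The Python sketch below assumes the 68 city tests are independent, which they are not (neighboring cities share weather), so treat the numbers as a rough guide. The Bonferroni threshold at the end is a standard correction added here for illustration, not something used in this post.

```python
alpha = 0.05   # the 5% significance threshold
cities = 68    # number of cities scanned

# Expected number of "significant" results if there is no real effect anywhere.
expected_false_positives = alpha * cities

# Chance that at least one city passes the 5% test purely by luck.
chance_of_at_least_one = 1 - (1 - alpha) ** cities

# Bonferroni-corrected threshold: a stricter per-city cutoff.
bonferroni_threshold = alpha / cities

print(f"{expected_false_positives:.1f}")
print(f"{chance_of_at_least_one:.3f}")
print(f"{bonferroni_threshold:.5f}")
```

Under that independence assumption, about 3.4 cities would be expected to pass the 5% test by chance alone, and the probability that at least one does is roughly 97%, so finding a significant city at each end of the list is unremarkable.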
We agree that the definitive answer would be to take into account every single UK weather station (or perhaps all those that she has lived near, weighted by how long she was there). It will only take a few more lines of Mathematica code, but I have concluded that if I still want to get birthday cards, it would be best to leave it as a mystery.