Dr Angus H Wright
2022-02-08
In the simplest possible terms, there are two skills required to analyse any question using data and statistics. The first is understanding how to describe a given dataset, which frequently involves reducing the information contained within the data into interpretable chunks. The second is being able to draw justifiable conclusions from the data at hand. In this section we develop the first skill, starting with the simplest tools and building complexity from there.
When working in empirical science, modelling and understanding datasets is paramount. In this module we start by discussing the fundamentals of data modelling. We begin with theories of point and interval estimation, in the context of summary statistics (e.g. expectation values, confidence intervals), and the estimation of data correlation and covariance. Students will learn the fundamentals of data mining and analysis, in a way that is applicable to all physical sciences.
Topics include:
When using point estimates as summaries of data, it is useful to understand some fundamental properties of each statistic. In each of the subcategories below we detail useful properties of each estimator, and important conceptual details about them.
Recall that, for an arbitrary dataset of variables \({\bf x} = {x_1,x_2,\dots,x_n}\), the mean is defined as:
\[ {\rm mean}(x)\equiv\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i. \]
\[ {\rm mean}(k{\bf x}) = k\times{\rm mean}({\bf x}) \]
\[ {\rm mean}({\bf x}+c) = {\rm mean}({\bf x})+c \]
\[ \sum_{i=1}^n (x_i - {\rm mean}({\bf x})) = 0 \]
\[ \underset{\mu}{\rm argmin{}}\sum_{i=1}^{n}(x_i-\mu)^2=\bar{x} \]
We can prove that the mean minimises the mean square distance to all data by finding the minimum of the function with respect to \(\mu\): \[ \begin{align} \frac{\delta}{\delta \mu}\sum_{i=1}^{n}(x_i-\mu)^2 &= -2\sum_{i=1}^{n}(x_i-\mu) \\ &=0 \end{align} \] so: \[ \begin{align} \sum_{i=1}^{n}(x_i-\mu)&=0 \\ \sum_{i=1}^{n}x_i - \sum_{i=1}^{n}\mu &= 0 \\ \therefore\mu &= \frac{1}{n}\sum_{i=1}^{n}{x_i}\\ &\equiv \bar{x} \end{align} \]
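These properties are straightforward to verify numerically; a quick sketch with arbitrary simulated data:
#Numerical check of the properties of the mean (arbitrary simulated data)
set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)
k <- 3; const <- 10
mean(k * x) - k * mean(x)             #scaling: ~0 (to floating-point precision)
mean(x + const) - (mean(x) + const)   #translation: ~0
sum(x - mean(x))                      #residuals about the mean sum to ~0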
Recall that, for an arbitrary dataset of variables \({\bf x} = {x_1,x_2,\dots,x_n}\), the sample standard deviation is defined as:
\[ \begin{align} {\rm std}({\bf x})\equiv s &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2} \\ &= \sqrt{{\rm mean}\left[(x_i-\bar{x})^2\right]}. \end{align} \]
The unbiased estimator of the population standard deviation is: \[ \hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}. \]
\[ {\rm std}({\bf x}+C) = {\rm std}({\bf x}) \]
\[ {\rm std}(k{\bf x}) = k\times{\rm std}({\bf x}) \]
Assume we construct a length \(n\) dataset of variable \(y\) with \(m\) data that are \(k\) standard deviations from the mean. The fraction of data beyond \(k\) standard deviations is \(r=m/n\). Furthermore, let’s assume \(\bar{y}=0\) (which is fine, because of the translation property above). Therefore: \[ {\rm std}({\bf y})\equiv s_y=\sqrt{\frac{1}{n}\sum_{i=1}^{n}y_i^2} \] Let’s now make our dataset as pathological as possible. To start, we’ll assign \(n-m\) data points the value \(y_0=0\), because these contribute nothing to the standard deviation. We’ll then place the other \(m\) elements at exactly \(k\) standard deviations from zero: \(y_1=ks_y\). For this very strange dataset, the standard deviation becomes: \[ \begin{align} s_y&=\sqrt{\frac{1}{n}\left(\sum_{i=1}^{m}y_1^2 + \sum_{i=m+1}^{n} y_0^2\right)} \\ &=\sqrt{\frac{mk^2s_y^2}{n}} \\ &=\sqrt{rk^2s_y^2} \end{align} \] so: \[ s_y^2=rk^2s_y^2 \\ \therefore r=\frac{1}{k^2} \] As this was the most pathological dataset possible, we conclude that for any dataset, the maximal fraction of data that can sit \(k\) (or more) standard deviations away from the mean is \(r=k^{-2}\).
Given the formula for the standard deviation: \[ s = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2} \] where again, without loss of generality, we can consider a dataset with \(\bar{x}=0\): \[ s_0 = \sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2} \] So \[ n\times s^2 = \sum_{i=1}^{n}x_i^2 \] The right hand side here is the sum of all squared deviations from the mean. However: \[ \sum_{i=1}^{n}x_i^2 \leq n\times {\rm max}(x_i^2). \] That is, the sum of all squared deviations must be less than or equal to \(n\) times the maximal squared deviation. Therefore: \[ \begin{align} n\times s^2 &\leq n\times {\rm max}(x_i^2)\\ s^2 &\leq {\rm max}(x_i^2). \end{align} \] So there must be at least one data point whose (absolute) deviation from the mean is greater than or equal to the standard deviation.
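These properties (including the two results just derived) are also easy to verify numerically; a quick sketch with arbitrary simulated data:
#Numerical check of the properties of the standard deviation
set.seed(42)
x <- rnorm(100, mean = 5, sd = 2)
sd_pop <- function(x) sqrt(mean((x - mean(x))^2))    #the 1/n definition used above
sd_pop(x + 10) - sd_pop(x)   #translation invariance: ~0
sd_pop(3 * x) / sd_pop(x)    #scaling: ~3
#At most 1/k^2 of the data can sit k or more standard deviations from the mean
k <- 2
mean(abs(x - mean(x)) >= k * sd_pop(x)) <= 1 / k^2   #TRUE
#At least one datum deviates from the mean by at least one standard deviation
max(abs(x - mean(x))) >= sd_pop(x)                   #TRUE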
Recall that, for an arbitrary dataset of variables \({\bf x} = {x_1,x_2,\dots,x_n}\), the median is defined as:
\[ {\rm med}({\bf x}) \equiv \tilde{x}_{0.5} = \begin{cases} x_{[(n+1)/2]} & n\in 2\mathbb{Z}-1 \\ (x_{[n/2]}+x_{[n/2+1]})/2 & n\in2\mathbb{Z} \end{cases} \]
\[ {\rm med}(k{\bf x}) = k\times{\rm med}({\bf x}) \]
\[ {\rm med}({\bf x}+c) = {\rm med}({\bf x})+c \]
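A trivial numerical check of the two cases in the definition:
#Median of odd- and even-length samples
median(c(1, 3, 7))        #the middle value: 3
median(c(1, 3, 7, 100))   #the mean of the two central values: 5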
Recall that, for an arbitrary dataset of variables \({\bf x} = {x_1,x_2,\dots,x_n}\), the normalised median absolute deviation from median (nMAD) is defined as:
\[ {\rm nMAD}(x) = 1.4826\times{\rm med}(|x_i-\tilde{x}_{0.5}|). \]
\[ {\rm nMAD}({\bf x}+C) = {\rm nMAD}({\bf x}) \]
\[ {\rm nMAD}(k{\bf x}) = k\times{\rm nMAD}({\bf x}) \]
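In R the nMAD is implemented by the base function mad(), whose default constant is exactly the 1.4826 used above:
#nMAD of a simulated Gaussian sample
x <- rnorm(1e3, mean = 5, sd = 2)
mad(x)                                #~2 for Gaussian data with sd = 2
1.4826 * median(abs(x - median(x)))   #the explicit definition; identical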
Recall that the quantile function is the inverse of the cumulative distribution function, which in practice we compute using linear interpolation between rank-ordered values of the arbitrary variable \(x\).
The quantile function is useful for many reasons, but one principal use is in the definition of dispersion measures such as the “interquartile range”.
A “quartile” is a quantile that splits the dataset into quarters. The four standard quartiles are therefore defined as being the data within the \(0^{\rm th}\) to \(25^{\rm th}\) percentiles (first quartile), the \(25^{\rm th}\) to \(50^{\rm th}\) percentiles (second quartile), the \(50^{\rm th}\) to \(75^{\rm th}\) percentiles (third quartile), and the \(75^{\rm th}\) to \(100^{\rm th}\) percentiles (fourth quartile).
The Interquartile Range is defined as the difference between the \(25^{\rm th}\) and \(75^{\rm th}\) percentiles (i.e. the range containing the inner two quartiles, or the middle \(50\%\) of the data):
\[ {\rm iqr}({\bf x})={\rm percentile}({\bf x},0.75)-{\rm percentile}({\bf x},0.25) \]
\[ {\rm iqr}({\bf x}+C) = {\rm iqr}({\bf x}) \]
\[ {\rm iqr}(k{\bf x}) = k\times{\rm iqr}({\bf x}) \]
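In R, quantiles and the IQR are available via quantile() and IQR(), both of which use the linear-interpolation definition described above by default:
#Quartiles and interquartile range of a simulated Gaussian sample
x <- rnorm(1e3, mean = 5, sd = 2)
quantile(x, probs = c(0.25, 0.5, 0.75))  #the inner quartile boundaries
IQR(x)                                   #same as diff(quantile(x, c(0.25, 0.75)))
IQR(x) / sd(x)                           #~1.35 for Gaussian data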
How do these different statistics compare? What makes one more useful than another for a particular dataset?
For Gaussian data with many observations, the mean and median are essentially equivalent. However real world datasets are messy, and data is very rarely “purely” Gaussian.
The primary pitfall with measures of central tendency comes from the presence of outlier data. Given that the mean minimises the mean squared distance to all data, the presence of outliers in a dataset can catastrophically bias this statistic.
As an example, let’s explore a dataset similar to one we’ve already seen. In Section 1a we discussed the sizes of mice depicted in Disney movies. Let’s look at a similar sort of dataset, but with many more observations:
A statistician is working part-time as a wildlife consultant. They are called to catalogue the sizes of all mice present on the grounds in and around the Disney World resort in Florida, for conservation purposes. Being a conscientious statistician, they do exactly as they are asked. The resulting dataset is below:
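The mice_nov dataset itself is constructed in a chunk that is not echoed here. A hypothetical reconstruction of such a contaminated sample is sketched below; the specific numbers (sample sizes, mascot height, random seed) are assumptions chosen only to roughly match the summary statistics quoted later, not the original data:
#Hypothetical reconstruction of the November survey: ~1000 real mice (in cm)
#plus a handful of 6ft-tall costumed mascots (~4.3% of the sample)
set.seed(1912)
mice_nov <- c(rnorm(1e3, mean = 7, sd = 1.5),  #real mice
              rnorm(45, mean = 183, sd = 2))   #mascots in costume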
If we compute the point statistics of central tendency for this dataset:
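The chunk producing these values is not echoed; it presumably amounts to calls like:
mean(mice_nov)    #the arithmetic mean of the November survey
median(mice_nov)  #the median of the November survey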
## [1] 14.25428
## [1] 7.081477
The presence of the mascots completely ruins our mean estimate of the mice heights. However, the median statistic doesn’t fall into the same trap.
The story gets even worse when we want to calculate the dispersion statistics:
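Again the chunk producing the values below is not echoed; calls along the following lines (an assumed reconstruction) would generate them:
sd(mice_nov)    #the sample standard deviation
mad(mice_nov)   #the nMAD
IQR(mice_nov)   #the interquartile range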
## [1] 34.38242
## [1] 1.557553
## [1] 2.105527
The catastrophic failure of the standard deviation here is a combination of the problematic mean estimate entering its calculation, and the requirement that only a small fraction of data can sit many standard deviations from the mean. Recall our formula for the fraction of data that can reside \(k\) standard deviations from the mean: \(r\leq \frac{1}{k^2}\). In this dataset, the outliers make up 4.31\(\%\) of the data. Therefore, they can reside at most 4.82 standard deviations from the mean. In reality, though, the outliers here aren’t drawn from the same Gaussian distribution as the rest of the data, and in truth reside 112.67 standard deviations from the mean.
Crucially, the nMAD statistic is robust to the outliers, as it uses median statistics in its computation.
To really drive home the importance of understanding the datasets you are analysing (and of selecting the appropriate tools to summarise them), let’s consider a second dataset.
A different person performed the same survey as our original statistician, but did so 6 months beforehand. This earlier survey was conducted by an up-and-coming star in the statistics world, who is particularly well respected because of her ability (unlike some other statisticians) to consistently distinguish between a real mouse and a 6ft tall human in a mouse costume.
Her survey results, compared to our initial dataset, are shown below.
#The second (chronologically earlier) survey: 1000 real mice
mice_may<-rnorm(1e3,mean=7,sd=1.5)
#Overplot histograms of the two surveys
maghist(mice_nov,breaks=seq(0,200,by=1),xlab='Size of Mice (cm)', ylab='Frequency',
        col='blue3',
        verbose=FALSE,freq=TRUE)
maghist(mice_may,breaks=seq(0,200,by=1),
        col=hsv(a=0.5),add=TRUE,
        verbose=FALSE,freq=TRUE)
legend('topright',legend=c("First Survey","Second Survey"),
       pch=15,col=c('red','blue3'), bty='n',inset=0.1)
The results of these two surveys, ignoring the outliers, are identical. However if we were to summarise these data only using mean statistics…
## [1] 14.25428
## [1] 7.012647
…then we would be forced to draw the conclusion that the activities at Disney World (in the intervening 6 months between the surveys) have caused the local mouse population to double in their average size!
As is clear from the above, it is frequently useful to explore data distributions with multiple statistics, to describe the data in a more complete fashion and to ensure that our analysis is robust to systematic failures.
Additionally, one can explore univariate distributions using tools other than point statistics.
We have already seen how graphical tools are particularly useful for exploring datasets. In the previous lecture we discussed the utility of histograms and KDEs for plotting distributions of data in one dimension. Here we extend this to some other useful visualisations, and particularly focus on comparisons between different samples of observations of a single variable.
We saw in the previous slides how the histogram made it clear that the dataset we were analysing was contaminated by outliers. This is also possible (and arguably easier to see) in a KDE:
magplot(density(mice_nov,from=0,to=200,bw=1),xlab='Size of Mice (cm)', ylab='Density',
col='blue3',type='l')
lines(density(mice_may,from=0,to=200,bw=1),col=hsv(a=0.5))
legend('topright',legend=c("First Survey","Second Survey"),
lty=1,lwd=2,col=c('red','blue3'), bty='n',inset=0.1)
KDEs are therefore a flexible and useful tool for comparing distributions of data in a single dimension, across different samples of observations.
When the number of samples becomes large, though, overplotting many lines (i.e. from many KDEs) can become confusing and/or make it difficult to quickly visualise key information.
The box-and-whisker plot is a solution to this. The box-and-whisker plot essentially visualises a number of quantiles of a given dataset simultaneously, while also showing the outliers. The box-and-whisker plots for our mice datasets look as below:
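Such a figure can be produced with base R’s boxplot() function; a minimal sketch for our two surveys (the styling here is an assumption, not the original chunk) is:
#Box-and-whisker plots of the two mice surveys (notches show the median uncertainty)
boxplot(list(May = mice_may, Nov = mice_nov),
        notch = TRUE, ylab = 'Size of Mice (cm)')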
Notice that, because the boxplot shows quantiles, the default range here (with the outliers included) is somewhat unhelpful. Instead we can zoom in on the important area of the plot:
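For example, by restricting the y-limits (again a sketch, assuming the same boxplot() call as above):
#The same box-and-whisker plots, zoomed onto the bulk of the data
boxplot(list(May = mice_may, Nov = mice_nov),
        notch = TRUE, ylab = 'Size of Mice (cm)', ylim = c(0, 20))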
The ‘box’ in the box-and-whisker plots shows the interquartile range of the data, while the line inside the box shows the median. The “whiskers” extend to the largest data point that is not deemed to be an “outlier”. Outliers are shown as crosses. Finally, the “notch” on the box shows the uncertainty on the median, which is computed using the IQR:
\[ \Delta \tilde{x}_{0.5} = \pm 1.58 \frac{\rm IQR}{\sqrt{n}} \]
Violin plots are a way of constructing graphs that utilise the full KDE information, without the crowding that is associated with overlapping KDEs.
#Load the ggplot2 library
library(ggplot2)
#Create a combined dataset
df<-data.frame(height=c(mice_may,mice_nov),
survey=c(rep("May",length(mice_may)),
rep("Nov",length(mice_nov))))
#plot the violin diagram for both datasets
ggplot(df, aes(x = survey, y = height)) +
geom_violin(aes(fill = survey), trim = FALSE) +
theme_classic() +
theme(legend.position = "none")
Again we can trim down the y-limits if we want:
#plot the violin diagram with points
ggplot(df, aes(x = survey, y = height)) +
ylim(c(3,12))+
geom_violin(aes(fill = survey), trim = FALSE) +
theme_classic() +
theme(legend.position = "none")
## Warning: Removed 52 rows containing non-finite values (stat_ydensity).
## Warning: Removed 113 rows containing missing values (geom_violin).
The utility of ggplot here (and of the violin plot in general) is that it is easy to include additional information that helps understanding, without causing a saturation of information.
For example, we can add the individual data to the figure:
#plot the violin diagram with points
ggplot(df, aes(x = survey, y = height)) +
ylim(c(3,12))+
geom_violin(aes(fill = survey), trim = FALSE) +
geom_point(pch='.') +
theme_classic() +
theme(legend.position = "none")
## Warning: Removed 52 rows containing non-finite values (stat_ydensity).
## Warning: Removed 113 rows containing missing values (geom_violin).
## Warning: Removed 52 rows containing missing values (geom_point).
Or even add a box-and-whisker plot within the violin:
#plot the violin diagram with boxplot
ggplot(df, aes(x = survey, y = height)) +
ylim(c(3,12))+
geom_violin(aes(fill = survey), trim = FALSE) +
geom_boxplot(width = 0.1) +
theme_classic() +
theme(legend.position = "none")
## Warning: Removed 52 rows containing non-finite values (stat_ydensity).
## Warning: Removed 52 rows containing non-finite values (stat_boxplot).
## Warning: Removed 113 rows containing missing values (geom_violin).
Another tool for comparing univariate distributions is the quantile-quantile (or QQ) plot. The QQ plot has a number of uses that will become increasingly relevant in the later sections, but for now we’ll introduce them generally.
The QQ plot does what it says in the name: it plots the quantiles of two distributions against one-another:
qqplot(mice_may,mice_nov,xlab='Mice Heights (May Survey)',ylab='Mice Heights (Nov Survey)')
abline(0,1,col='red',lty=2)
Here you can see that the vast majority of the data lie on the one-to-one relation (shown by the red dashed line). This is clearer if we require the plot to have a fixed 1:1 aspect ratio (and convert the plot type to a line so that we can see the data extends beyond the edge of the plot!):
qqplot(mice_may,mice_nov, xlim=c(0,20), ylim=c(0,20),type='l',lwd=2,
xlab='Mice Heights (May Survey)',ylab='Mice Heights (Nov Survey)')
abline(0,1,col='red',lty=2)
QQ plots are very useful because they encode a lot of information about the differences between one-dimensional data. Let’s take a few example datasets.
#Construct some Gaussian data with different properties
zero_mean<-rnorm(1e3,mean=0,sd=1) #A standard mean=0, stdev=1 gaussian
negative_mean<-rnorm(1e3,mean=-2,sd=1) #A Gaussian with negative mean, stdev=1
positive_mean<-rnorm(1e3,mean=2,sd=1) #A Gaussian with positive mean, stdev=1
layout(matrix(1:4,ncol=2,nrow=2,byrow=TRUE)); par(mar=c(3,3,1,1))
qqplot(zero_mean,positive_mean); abline(0,1,col='red',lty=2)
qqplot(zero_mean,negative_mean); abline(0,1,col='red',lty=2)
qqplot(positive_mean,negative_mean); abline(0,1,col='red',lty=2)
So, shifts in the mean are clearly shown as displacement in the QQ data away from the 1:1 line. Similarly, differences in the dispersion of the data are visible:
#Construct some Gaussian data with different properties
standard_scatter<-rnorm(1e3,mean=0,sd=1) #A standard mean=0, stdev=1 gaussian
smaller_scatter<-rnorm(1e3,mean=0,sd=0.5) #A Gaussian with less dispersion
larger_scatter<-rnorm(1e3,mean=0,sd=2) #A Gaussian with greater dispersion
layout(matrix(1:4,ncol=2,nrow=2,byrow=TRUE)); par(mar=c(3,3,1,1))
qqplot(standard_scatter,smaller_scatter); abline(0,1,col='red',lty=2)
qqplot(standard_scatter,larger_scatter); abline(0,1,col='red',lty=2)
qqplot(larger_scatter,smaller_scatter); abline(0,1,col='red',lty=2)
So differences in the distribution dispersion are shown as differences in the slope of the QQ plot with respect to the 1:1 line.
We can also find the influence of outliers (which we can probably guess from the first dataset…):
#Construct some Gaussian data with different properties
standard_normal<-rnorm(1e3,mean=0,sd=1) #A standard mean=0, stdev=1 gaussian
outlier_normal<-c(rnorm(1e3,mean=0,sd=1),
runif(100,min=-10,max=10))
layout(1)
qqplot(standard_normal,outlier_normal); abline(0,1,col='red',lty=2)
And so outliers cause shifts in the tails of the QQ plot. Bringing all this together, we can compare (for example) a fictitious dataset \(Y\) to an expected model distribution that follows \(X_{\rm model}\sim N(\mu=0,\sigma=1)\):
#Construct our Model dataset
model<-rnorm(1e5,mean=0,sd=1)
#Compare the model to the dataset "Y"
qqplot(Y,model); abline(0,1,col='red',lty=2)
What can you conclude about our dataset Y compared to our model?
Until now, we have essentially explored datasets with only one variable (and how to compare different sets of observations of this one variable). We now want to extend our analysis to datasets that contain two (or more) variables.
The first step in essentially any comparison between multiple variables is almost always to produce a scatter plot: the simplest method of comparing observations of multiple variables.
#Create a dataset with two variables
df<-data.frame(X=rnorm(1e3,mean=0,sd=1),Y=rnorm(1e3,mean=100,sd=1))
#Plot the data as a scatter plot
magplot(df$X,df$Y,xlab="X",ylab="Y")
The first important point about plotting distributions of data in multiple dimensions is that you should always be aware of the scales of the axes. For data that are expected to have the same dynamic range in both dimensions (that is, similar dispersion), it is almost always preferable to plot the data with a fixed aspect ratio:
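The figure that follows is produced by a chunk that is not echoed; it presumably amounts to re-plotting the same data with asp=1:
#Plot the data as a scatter plot with a fixed 1:1 aspect ratio
magplot(df$X, df$Y, xlab = "X", ylab = "Y", asp = 1)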
For data with very different dynamic ranges, however, this is not practical (and can lead to misinterpretation of relationships between variables where none exist):
#Create a dataset with two variables of different dispersion
df<-data.frame(X=rnorm(1e3,mean=0,sd=1),Y=rnorm(1e3,mean=100,sd=50))
#Plot the data as a scatter plot with fixed aspect ratio
magplot(df$X,df$Y,xlab="X",ylab="Y",asp=1)
This gives us two options when plotting data of different dynamic ranges: either let the two axes have independent scales (and interpret any apparent trends with care), or transform the variables onto a common scale, for example using standard coordinates.
Standard coordinates are useful for visualising relationships between two variables when we are not interested (primarily) in the absolute values of the variable central tendency and dispersion measures.
Standard coordinates are achieved by subtracting the arithmetic mean from each variable, and dividing by the sample standard deviation. In other words, we transform the data to have \(\bar{x}=0\) and \(\sigma_x=1\).
#Transform the dataset of two variables of different dispersion to
#Standard coordinates
df_standard<-as.data.frame(scale(df)) #scale returns a matrix by default
colnames(df_standard)<-colnames(df)
#Plot the data as a scatter plot with fixed aspect ratio
magplot(df_standard$X,df_standard$Y,
xlab="X (standard coords)",ylab="Y (standard coords)",asp=1)
Once again, we can utilise KDEs to visualise relationships between datasets, this time in two dimensions. As with our \(1\)D KDEs, we need to specify a kernel (with which we effectively smooth-out the data), but now our kernel needs to be two-dimensional.
An example of a 2D KDE is available within the magicaxis package in the form of the function magcon. This function allows us to plot the 2D KDE as an image, with contours, etc., easily and flexibly.
#We'll use the "faithful" built-in dataset here
faithful<-get(data("faithful"))
#2D KDE of the faithful dataset in magicaxis
magcon(faithful$eruptions,faithful$waiting,
xlab="Eruption length of Old Faithful (min)",
ylab="Waiting Time till next Eruption (min)",
xlim=c(0.5, 6),ylim=c(40, 110))
As always, the ggplot2 suite of plotting functions also has very nice implementations of 2D KDEs:
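For instance, geom_density_2d_filled() produces a filled 2D KDE in a few lines; the styling below is an assumption rather than a reproduction of the original figure:
#2D KDE of the faithful dataset with ggplot2
library(ggplot2)
ggplot(faithful, aes(x = eruptions, y = waiting)) +
  geom_density_2d_filled() +
  labs(x = "Eruption length of Old Faithful (min)",
       y = "Waiting Time till next Eruption (min)") +
  theme_classic() +
  theme(legend.position = "none")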
When provided with datasets containing multiple dimensions, we are frequently interested in determining relationships between variables. Looking again at our “faithful” dataset:
There is a clear relationship between the duration of the eruption and the time until the next eruption. This relationship may have an underlying physical cause that we are interested in, or it may be coincidental. Determining the relationship between variables, and their significance, is therefore an important topic in statistics.
We’ve previously explored the concept of variance and standard deviation. For a single variable, recall that the variance was defined as:
\[ \begin{align} \tilde{s}^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i-\bar{x})^2 \\ &= {\rm mean}\left[\left(x_i-{\rm mean}(x_i)\right)^2\right] \end{align} \]
The covariance of two variables is then defined as the joint variance between each variable:
\[ {\rm cov}(X,Y) = {\rm mean}[(x_i-{\rm mean}(x_i))\times(y_i-{\rm mean}(y_i))] \]
We will discuss the covariance formula more later in the course, but for now you can see that the definition formally makes sense if you compute the covariance of a variable with itself:
\[ \begin{align} {\rm cov}(X,X) &= {\rm mean}[(x_i-{\rm mean}(x_i))\times(x_i-{\rm mean}(x_i))] \\ &= {\rm mean}[(x_i-{\rm mean}(x_i))^2] \\ &\equiv \tilde{s}^2 \end{align} \]
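A quick numerical check of this identity (note that R’s cov() and var() both use the \(n-1\) normalisation rather than the \(1/n\) above, so they still agree with one another exactly):
#The covariance of a variable with itself is its variance
x <- rnorm(1e3)
cov(x, x)   #identical to...
var(x)      #...the sample variance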
The covariance of two variables describes the degree of joint variation that exists between two variables. For our “faithful” dataset, we find that the covariance is 13.9778078. This value, though, is dependent on the absolute dispersion of the dataset. That is, if we were to convert the faithful dataset into standard coordinates, the covariance changes: 0.9008112.
It is therefore often useful to compute a measure of the correlation between variables that is invariant under scaling of the variables. For this we can compute the so-called Pearson correlation coefficient:
\[ {\rm cor}(X,Y) = \frac{{\rm cov}(X,Y)}{s(X)s(Y)} \]
The correlation coefficient varies between \(-1\) (for perfectly negatively correlated data), and \(1\) (for perfectly positively correlated data). For our faithful dataset, we have:
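These values can be produced either with cor() directly or by normalising the covariance by the two standard deviations; the calls below are an assumed reconstruction of the un-echoed chunk:
#Pearson correlation computed directly...
cor(faithful$eruptions, faithful$waiting)
#...and via the covariance normalised by the standard deviations
cov(faithful$eruptions, faithful$waiting) /
  (sd(faithful$eruptions) * sd(faithful$waiting))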
## [1] 0.9008112
## [1] 0.9008112
The covariance and correlation values are useful for quantifying the relationship between any two variables. For datasets with two or more variables, the covariance can be computed for every pair of variables, to create the covariance matrix and correlation matrix:
\[ {\rm cov}(\Omega) = \begin{pmatrix} {\rm var}(X_1) & {\rm cov}(X_1,X_2) & \dots & {\rm cov}(X_1,X_n) \\ {\rm cov}(X_2,X_1) & {\rm var}(X_2) & \dots & {\rm cov}(X_2,X_n) \\ \vdots & \vdots & \ddots & \vdots \\ {\rm cov}(X_n,X_1) & {\rm cov}(X_n,X_2) & \dots & {\rm var}(X_n) \end{pmatrix} \] \[ {\rm cor}(\Omega) = \begin{pmatrix} 1 & {\rm cor}(X_1,X_2) & \dots & {\rm cor}(X_1,X_n) \\ {\rm cor}(X_2,X_1) & 1 & \dots & {\rm cor}(X_2,X_n) \\ \vdots & \vdots & \ddots & \vdots \\ {\rm cor}(X_n,X_1) & {\rm cor}(X_n,X_2) & \dots & 1 \end{pmatrix} \]
We can compute these matrices for our faithful dataset:
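In R these are a single call each (presumably the un-echoed chunk amounts to):
cov(faithful)   #the covariance matrix of the faithful dataset
cor(faithful)   #the correlation matrix of the faithful dataset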
## eruptions waiting
## eruptions 1.302728 13.97781
## waiting 13.977808 184.82331
## eruptions waiting
## eruptions 1.0000000 0.9008112
## waiting 0.9008112 1.0000000
The Pearson correlation, however, should be used with caution. For linear data the coefficient is sensible, but for strongly non-linear data it is much less interpretable:
There are other correlation formalisms that attempt to circumvent the problems of the Pearson correlation coefficient; one is an associated measure called the Spearman Rank Correlation. The Spearman Rank Correlation is defined as the Pearson correlation between the rank-orders of the variables.
This formalism means that highly non-linear data which are simple (monotonic) transformations of one another have high Spearman correlation, even when the Pearson correlation is small.
As a demonstration, we’ll construct a dataset with one non-linear variable, but which is exactly correlated to another (in the intuitive sense: knowing one perfectly informs the other).
#Define a simple Gaussian dataset
obs<-data.frame(X=rnorm(1e3,mean=0,sd=1))
#Create a non-linear variable
obs$Y<-obs$X^3
#Plot the data
magplot(obs,xlab="X",ylab="Y")
## X Y
## X 1.0000000 0.7386731
## Y 0.7386731 1.0000000
So Pearson tells us that the variables are correlated at the \(\sim70\%\) level, while a quick look at our figure shows us that this is clearly an underestimate.
If we compute the Pearson correlation of the rank-ordered variables, though:
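(The un-echoed call here is presumably the Pearson correlation of the ranks:)
#Pearson correlation of the rank-ordered variables
cor(rank(obs$X), rank(obs$Y))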
## [1] 1
This makes sense intuitively, because the function \(Y\) is monotonically increasing and perfectly correlated to \(X\) (albeit non-linearly). This is the Spearman Rank correlation:
## X Y
## X 1 1
## Y 1 1
Let’s now see whether the Spearman correlation can recover the correlation in one of the strange Pearson results from our earlier figure:
#Define a simple dataset with Zero Pearson Correlation
obs<-data.frame(X=runif(1e5,min=-1/2*pi,max=1/2*pi))
obs$Y<-runif(1e5,min=-0.3,max=0.3)-cos(obs$X)
magplot(obs,xlab='X',ylab='Y',pch='.')
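The correlation matrices below are presumably computed as (an assumed reconstruction of the un-echoed chunk):
cor(obs)                        #Pearson correlation matrix
cor(obs, method = "spearman")   #Spearman rank correlation matrix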
## X Y
## X 1.000000000 -0.004813213
## Y -0.004813213 1.000000000
## X Y
## X 1.000000000 -0.003053893
## Y -0.003053893 1.000000000
Notice that the rank correlation is unable to recover the correlation between variables that are not monotonically related! In this way, the correlation coefficients are describing the joint information between two variables. In the monotonically increasing case, knowledge of \(X\) perfectly informs us about \(Y\) and vice-versa. In the non-monotonic case, however, knowledge of \(X\) perfectly informs \(Y\), but the converse is not true.
If you only learn one thing from this course, this is probably what you should learn.
We are inclined (as humans) to interpret correlations as causal relationships. Sometimes this may be justified, but often it is not.
Here we can see that the number of divorces in the US state of Maine is inversely correlated with the consumption of margarine per person in the USA.
The conclusion is clear: Eating margarine in LA will invariably lead to a divorce in Maine.
Or more accurately: no, of course it won’t.
This is an example of a spurious correlation. Such correlations are possible (and indeed likely!) when you have few observations of many variables.
As a demonstration, let’s say that here we have made \(10\) observations and found a correlation of \(\sim 0.8\). Let’s assume that this discovery came from a catalogue containing \(p=1000\) variables, all of which are totally random. What fraction of these variables do we expect to have a correlation of \(80\%\) or more?
## 0.65 % of variables have 80% correlation or more
Said differently, there is a 1 in 153 chance that two totally random variables will have an absolute correlation of 0.8 or higher.
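The calculation behind these numbers can be sketched with a short simulation. The set-up below (including the seed) is an assumption consistent with the stated \(n=10\) observations and \(p=1000\) random variables, not the original chunk:
#Simulate p purely random variables, each observed n times, and correlate
#each of them against one additional random "measurement" vector
set.seed(666)
n_obs <- 10
n_var <- 1000
measurement <- rnorm(n_obs)
random_vars <- matrix(rnorm(n_obs * n_var), nrow = n_obs, ncol = n_var)
#Pearson correlation of every random variable with the measurement
cors <- cor(measurement, random_vars)[1, ]
#Fraction of variables with an absolute correlation of 0.8 or more
cat(round(100 * mean(abs(cors) >= 0.8), 2), "% of variables have 80% correlation or more\n")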
The likelihood of finding “significant” correlations between truly random data is non-zero, and grows with decreasing numbers of observations and increasing numbers of observed variables.
The problem is further complicated by the existence of confounding variables.
A confounding variable is one that acts upon both the dependent and independent variables in a measurement of correlation, and thereby creates a spurious correlation between the two.
#A simple Gaussian dataset
obs<-data.frame(Z=rnorm(1E3,mean=0,sd=1))
#A new variable that correlates with Z
obs$X<-sin(obs$Z)+runif(1e3,min=-0.2,max=0.2)
#And another new variable that correlates with Z
obs$Y<-obs$Z^3+runif(1e3,min=-0.2,max=0.2)
We’ve created two variables that correlate with \(Z\). But what if we never actually observed the variable \(Z\)… We would instead plot \(X\) and \(Y\):
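The un-echoed chunk presumably just plots the two observed variables and computes their Pearson correlation; something along the lines of:
#Scatter plot of the two variables we actually observed
magplot(obs$X, obs$Y, xlab = 'X', ylab = 'Y', pch = '.')
#Pearson correlation between X and Y
cor(obs$X, obs$Y)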
## [1] 0.9064995
And be tempted to decide that there is a causal relationship between these two parameters, when in fact none exists.
Thus we have demonstrated the common but extremely important statistical fact:
Correlation does not equal Causation!