Statistical Arbitrage – Correlation vs Cointegration

What is statistical arbitrage (stat arb)?

The premise of statistical arbitrage, stat arb for short, is that there is a statistical mispricing between a set of securities which we look to exploit. Typically a strategy requires going long one set of stocks and short another. StatArb evolved from pairs trading, where one would go long a stock and short its competitor as a hedge; in pairs trading the aim is to select a stock that is going to outperform its peers. StatArb is all about mean reversion: in essence you are saying that the spread between any two stocks should be constant (or slowly evolving through time), so any deviation from the spread presents a trading opportunity, since in StatArb we believe the spread is mean reverting. Contrary to the name, statistical arbitrage isn’t about making risk-free money (deterministic arbitrage is risk free).

What type of stocks make good pairs?

The best stocks to use in StatArb are those where there is a fundamental reason for believing that the spread is mean reverting / stationary. Typically this means that the stocks are in the same market sector or, even better, the same company (some companies have A and B shares with different voting rights, or trade on different exchanges). Some examples of fundamentally similar pairs would be Royal Dutch Shell A vs Royal Dutch Shell B shares, Goldman Sachs vs JP Morgan, Apple vs ARM (their chip supplier) and ARM vs ARM ADR. Some cross-sector groups may also work, such as Gold Mining vs Gold Price.

A poor example would be Royal Bank of Scotland vs Tesco since their businesses are completely different / don’t impact each other.

What is the mathematical definition of a good pair?

Having come up with a good fundamental stock pairing, you next need a mathematical test for determining whether it’s a good pair. The most common test is to look for cointegration, as this would imply that some combination of the pair is stationary (the spread is stable around a fixed level) and hence statistically mean reverting. When testing for cointegration a hypothesis test is performed and a p-value obtained, so we can express a level of confidence in the pair being mean reverting.
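As a rough illustration of what such a test involves, here is a minimal sketch of a two-step (Engle-Granger style) check in base R. The helper name eg_tstat and the simulated series are made up for this example. Instead of a p-value it returns the Dickey-Fuller t-statistic, which is usually compared against a tabulated critical value (roughly -3.34 at the 5% level for two series with a constant; treat that figure as an assumption and check a table before relying on it). Packages such as tseries (adf.test) offer packaged unit-root tests that report a p-value.

```r
# Hypothetical helper: two-step cointegration check using base R only.
eg_tstat <- function(a, b) {
  # Step 1: estimate the hedge ratio by regressing a on b (with intercept).
  fit <- lm(a ~ b)
  s <- residuals(fit)                # estimated spread
  # Step 2: Dickey-Fuller regression on the spread: ds_t = rho * s_{t-1} + e_t
  ds   <- diff(s)
  slag <- s[-length(s)]
  df_fit <- lm(ds ~ 0 + slag)
  t_stat <- summary(df_fit)$coefficients["slag", "t value"]
  list(hedge_ratio = unname(coef(fit)["b"]), t_stat = t_stat)
}

# Simulated data (an assumption for illustration, not real prices):
set.seed(1)
n_obs <- 500
b <- cumsum(rnorm(n_obs))            # a random walk
a <- 2 * b + rnorm(n_obs)            # tied to b with hedge ratio 2
res <- eg_tstat(a, b)
print(res$hedge_ratio)               # should be close to 2
print(res$t_stat)                    # strongly negative => spread looks stationary
```

A strongly negative t-statistic is evidence against a unit root in the spread, i.e. evidence that the pair is cointegrated.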

What is the difference between correlation and cointegration?

When talking about statistical arbitrage, many people get confused between correlation and cointegration.

  • Correlation – If two stocks are correlated then when stock A has an up day, stock B will tend to have an up day too
  • Cointegration – If two stocks are cointegrated then it is possible to form a stationary pair from some linear combination of stock A and B

One of the best explanations of cointegration is as follows: “A man leaves a pub to go home with his dog. The man is drunk and goes on a random walk; the dog also goes on a random walk. They approach a busy road and the man puts his dog on a lead; the man and the dog are now cointegrated. They can both go on random walks but the maximum distance they can move away from each other is fixed, i.e. the length of the lead.”

So in essence the distance/spread between the man and his dog is fixed. Also note from the story that the man and dog are still on a random walk; there is nothing to say whether their movements are correlated or uncorrelated.

With correlated stocks, the two will move in the same direction most of the time but the magnitude of the moves is unknown. This means that if you’re trading the spread between two stocks, the spread can keep growing and growing, showing no signs of mean reversion. This is in contrast to cointegration, where we say the spread is “fixed” and that if the spread deviates from the “fixing” then it will mean revert.
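The man-and-dog story can be sketched in a few lines of R. This is only an illustration under assumptions: the lead is modelled as a soft pull (the dog closes a fraction lead_pull of the gap each step) rather than a hard maximum distance, and all names and parameter values are made up. The two sets of steps are drawn independently, so the step-by-step moves are essentially uncorrelated, yet the gap between the two stays bounded.

```r
set.seed(42)
n_steps   <- 2000
lead_pull <- 0.1                     # hypothetical: fraction of the gap closed per step
step_man  <- rnorm(n_steps)          # the man's random steps
step_dog  <- rnorm(n_steps)          # the dog's random steps, drawn independently
man <- numeric(n_steps)
dog <- numeric(n_steps)
for (i in 2:n_steps) {
  man[i] <- man[i-1] + step_man[i]
  # The dog wanders too, but the lead pulls it back towards the man
  dog[i] <- dog[i-1] + lead_pull * (man[i-1] - dog[i-1]) + step_dog[i]
}
gap <- man - dog
move_cor <- cor(diff(man), diff(dog))   # correlation of the step-by-step moves
print(move_cor)                         # near zero
print(sd(gap))                          # stays small however long they walk
print(range(man))                       # the man himself can wander far
```

So cointegration is a statement about the gap, not about whether individual moves line up.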

Let’s explore cointegration some more:

Equation for Brownian motion with drift (read A as the log price, as the code further down does, and the price itself follows geometric Brownian motion)

A_{t+T} = A_{t}+N(\mu_{a}T,\sigma_{a}\sqrt{T})

where A stands for the (log) price of stock A and N(m,s) is a normal random variable with mean m and standard deviation s
\Delta A = N(\mu_{a}\Delta t,\sigma_{a}\sqrt{\Delta t})
\Delta A = \mu_{a}\Delta t+N(0,\sigma_{a}\sqrt{\Delta t})
E_{t}[\Delta A] = \mu_{a}\Delta t
From the equation we can see that the expected change in A over a step \Delta t (E_{t}[\Delta A]) is \mu_{a}\Delta t; in other words A is not stationary (assuming \mu_{a} isn’t zero).

We want to find a cointegrated / stationary pair.

Equation for going long stock A and short n lots of stock B

\Delta Spread=\Delta A-n\Delta B
\Delta Spread=\mu_{a}\Delta t+N(0,\sigma_{a}\sqrt{\Delta t})-n\mu_{b}\Delta t-nN(0,\sigma_{b}\sqrt{\Delta t})
E_{t}[\Delta Spread]=E_{t}[\Delta A-n\Delta B] = (\mu_{a}-n\mu_{b})\Delta t
Setting n=\frac{\mu_{a}}{\mu_{b}}
E_{t}[\Delta Spread]=E_{t}[\Delta A-n\Delta B] = (\mu_{a}-\frac{\mu_{a}\mu_{b}}{\mu_{b}})\Delta t=(\mu_{a}-\mu_{a})\Delta t=0
Hence the spread has zero expected drift, i.e. it is stationary in its mean, assuming that the hedge ratio, n, remains constant. Zero drift on its own does not bound the variance of the spread; for that the two series must also pull back towards each other, which is what the cointegrated simulation further down builds in.
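The hedge-ratio argument can be checked numerically. The sketch below assumes \Delta t = 1 and independent noise terms, and borrows parameter values of the same size as those in the simulation code; the variable names are illustrative. It draws a large number of increments for A and B, sets n=\frac{\mu_{a}}{\mu_{b}}, and confirms that the average change in the spread is close to zero even though A on its own drifts.

```r
set.seed(7)
n_draws <- 100000
mu_a <- 0.0002; sigma_a <- 0.010     # drift and volatility of A per step
mu_b <- 0.0005; sigma_b <- 0.005     # drift and volatility of B per step
n <- mu_a / mu_b                     # hedge ratio that cancels the drift
dA <- rnorm(n_draws, mean = mu_a, sd = sigma_a)
dB <- rnorm(n_draws, mean = mu_b, sd = sigma_b)
dSpread <- dA - n * dB
print(mean(dA))                      # roughly mu_a
print(mean(dSpread))                 # roughly zero
# With independent noise the per-step volatility of the spread is
sd_theory <- sqrt(sigma_a^2 + n^2 * sigma_b^2)
print(c(sd(dSpread), sd_theory))
```

Note that the per-step noise in the spread does not vanish, so removing the drift alone does not stop the cumulative spread from wandering.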


Example of correlated stocks: Notice the spread blowing up

Example of cointegrated stocks: Notice the spread looks oscillatory

Onto the code to generate those plots (mainly taken from

 #Code largely copied from
library(MASS)   #Provides mvrnorm()
#The input data
nsim <- 250  #Number of data points
mu_a <- 0.0002  #Mu_a growth rate for stock a
sigma_a <- 0.010   #Sigma_a volatility for stock a
mu_b <- 0.0005  #Mu_b growth rate for stock b
sigma_b <- 0.005   #Sigma_b volatility for stock b
corxy <- 0.8    #Correlation coefficient between the returns of a and b
#Calculate a correlated return series
#Build the covariance matrix and generate the correlated random results
(covmat <- matrix(c(sigma_a^2, corxy*sigma_a*sigma_b, corxy*sigma_a*sigma_b, sigma_b^2), nrow=2))
res <- mvrnorm(nsim, c(mu_a, mu_b), covmat)    #Draw from the multivariate normal distribution
plot(res[,1], res[,2])
#Calculate the stats of res[] so they can be checked with the input data
cor(res[,1], res[,2])
#Turn the log returns into price paths
path_a <- exp(cumsum(res[,1]))
path_b <- exp(cumsum(res[,2]))
spread <- path_a - path_b
par(mfrow=c(2,1))   #Set the plotting area to a 2 by 1 grid
#Plot the two price series that have correlated returns
plot(path_a, ylim=range(path_a, path_b), main="Two Price Series with Correlated Returns", ylab="Price", type="l", col="red")
lines(path_b, col="blue")
plot(spread, type="l")
##Cointegrated pair
#The input data
nsim <- 250  #Number of data points
mu_a <- 0.0002  #Mu_a growth rate for stock a
sigma_a <- 0.010   #Sigma_a volatility for stock a
mu_b <- 0.0002  #Mu_b growth rate for stock b
sigma_b <- 0.005   #Sigma_b volatility for stock b
coea <- 0.0200    #Co-integration coefficient for a
coeb <- 0.0200    #Co-integration coefficient for b
#Generate the noise terms for a and b
rana <- rnorm(nsim, mean=mu_a, sd=sigma_a) #White noise for a
ranb <- rnorm(nsim, mean=mu_b, sd=sigma_b) #White noise for b
#Generate the co-integrated series a and b
a <- numeric(nsim)
b <- numeric(nsim)
a[1] <- 0
b[1] <- 0
for (i in 2:nsim) {
  #Logic here is that if b>a then we add on the difference so that
  #a starts to catch up with b, hence causing the spread to close
  a[i] <- a[i-1] + (coea * (b[i-1] - a[i-1])) + rana[i-1]
  b[i] <- b[i-1] + (coeb * (a[i-1] - b[i-1])) + ranb[i-1]
}
#Plot a and b as prices
path_a <- exp(a)
path_b <- exp(b)
ylim <- range(path_a, path_b)
spread <- path_a - path_b
plot(path_a, ylim=ylim, type="l", main=paste("Co-integrated Pair (coea=",coea,",  coeb=",coeb,")", sep=""), ylab="Price", col="red")
lines(path_b, col="blue")
legend("bottomleft", c("exp(a)", "exp(b)"), lty=c(1, 1), col=c("red", "blue"), bg="white")
plot(spread, type="l")   #The spread oscillates rather than blowing up