Statistical Arbitrage – Testing for Cointegration – Augmented Dickey Fuller

In my last post, “Statistical Arbitrage Correlation vs Cointegration”, we discussed what statistical arbitrage is and which property of a pair of stocks we aim to exploit. The mathematics showed that if you go long stock A and short stock B with an appropriate hedging factor to cancel out the drift/growth terms in the Brownian motion equation, you are left with a stationary signal: the spread between the two stocks. The maths also showed that in expectation the daily change in the spread is zero (E_{t}[\Delta Spread]=0), and hence any deviation from this presents the opportunity for a trade.

It was assumed that the stock growth terms \mu_{a} and \mu_{b} are constant (or drifting slowly over time, both at the same rate), or in other words that the hedge ratio is constant. This post details how to find the hedge ratio and presents the Augmented Dickey Fuller test, which can be used to identify with a certain level of confidence whether the spread is stationary and hence whether the pair is cointegrated.

Determining the hedge factor

Spread=A-nB, and setting the spread to 0 gives A=nB

It is required to find the hedge factor, n, that satisfies the above equation. The hedge factor is easily found by regressing the prices of A against the prices of B. Notice that it is the prices being regressed and not the daily returns.
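As a minimal sketch of that regression, the snippet below estimates n on simulated prices using base R only. The series priceA and priceB, the true factor of 1.5 and the noise levels are all assumptions made up for illustration; they are not data from this post.

```r
# Hypothetical simulated prices (not real market data)
set.seed(42)
n <- 500
priceB <- 50 + cumsum(rnorm(n, mean = 0, sd = 0.5))  # random-walk price for stock B
noise <- rnorm(n, mean = 0, sd = 0.3)                # stationary deviation around zero
priceA <- 1.5 * priceB + noise                       # stock A tracks 1.5 * B

# Regress the PRICES of A on the PRICES of B.
# The "+ 0" drops the intercept, matching Spread = A - nB centred on zero.
fit <- lm(priceA ~ priceB + 0)
hedgeFactor <- unname(coef(fit)[1])

# The spread implied by the estimated hedge factor
spread <- priceA - hedgeFactor * priceB
print(hedgeFactor)
```

Dropping the intercept is a design choice that mirrors the equation above; if the two price series sit at a constant offset from each other, keeping the intercept may be the better fit.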

Examining the residuals

Once the hedge factor has been found, the residuals of the regression must be analysed. The residual is how much the spread has changed for a given day. The whole idea of the regression was to identify the hedge factor that keeps the daily change of the spread as close to zero as possible (we want the residuals to be zero).

  • If the residuals contain a trend then this implies that the daily change of the spread has a net direction and will cause the spread to consistently widen or contract.
  • If the residuals contain no trend then this implies that the daily change of the spread will oscillate around zero (is stationary).
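One simple way to look for a trend, sketched here under the assumption that a straight line is a reasonable first check, is to fit a least-squares line through the residuals against time and inspect the slope. The two residual series and the helper trendSlope are hypothetical and simulated purely for illustration.

```r
# Hypothetical residual series, simulated for illustration only
set.seed(7)
n <- 300
noTrend <- rnorm(n, sd = 0.5)                   # oscillates around zero
withTrend <- 0.01 * (1:n) + rnorm(n, sd = 0.5)  # has a net upward direction

# Slope of a least-squares line through the residuals against time
trendSlope <- function(res) {
  timeIndex <- seq_along(res)
  fit <- lm(res ~ timeIndex)
  unname(coef(fit)["timeIndex"])
}

slopeNoTrend <- trendSlope(noTrend)
slopeWithTrend <- trendSlope(withTrend)
print(c(noTrend = slopeNoTrend, withTrend = slopeWithTrend))
```

A slope close to zero is consistent with the second bullet, while a clearly non-zero slope is consistent with the first. This is only a rough visual-style check and not a formal stationarity test.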

Which residuals to analyse is open to debate. You may want to take another data set from a different point in time, apply the hedge factor you calculated in the above regression to this new set, and analyse those residuals. This helps to identify any overfitting during the regression.
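The out-of-sample idea can be sketched as follows: fit the hedge factor on a training window, then apply it unchanged to a later window. The simulated prices, the 400/200 split and the variable names are assumptions chosen for illustration.

```r
# Hypothetical simulated prices (not real market data)
set.seed(1)
n <- 600
priceB <- 40 + cumsum(rnorm(n, sd = 0.4))
priceA <- 0.8 * priceB + rnorm(n, sd = 0.25)

trainIdx <- 1:400    # window used to fit the hedge factor
testIdx <- 401:600   # later window the regression has never seen

# Fit the hedge factor on the training window only
fit <- lm(priceA[trainIdx] ~ priceB[trainIdx] + 0)
hedgeFactor <- unname(coef(fit)[1])

# Apply the SAME hedge factor to both windows
inSample <- priceA[trainIdx] - hedgeFactor * priceB[trainIdx]
outSample <- priceA[testIdx] - hedgeFactor * priceB[testIdx]

# Similar mean and standard deviation in both windows suggest the fit generalises
print(c(inMean = mean(inSample), outMean = mean(outSample)))
print(c(inSd = sd(inSample), outSd = sd(outSample)))
```

If the out-of-sample residuals drift away from zero or are far more volatile than the in-sample ones, the hedge factor was probably fitted to noise in the training window.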

Dickey Fuller Test

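For an AR(1) process y_{t}=ay_{t-1}+\epsilon_{t} where y_{0}=0 and \epsilon_{t} \sim N(0,\sigma^{2}), what value of a results in a stationary signal?

Substituting the equation into itself repeatedly gives:

y_{t}=a(ay_{t-2}+\epsilon_{t-1})+\epsilon_{t}=a^{2}y_{t-2}+a\epsilon_{t-1}+\epsilon_{t}

y_{t}=a^{t}y_{0}+\Sigma_{j=0}^{t-1}a^{j}\epsilon_{t-j}

Take expectations, using E[y_{0}]=0 and E[\epsilon_{t}]=0:

E[y_{t}]=a^{t}*0+\Sigma_{j=0}^{t-1}a^{j}*0=0, so the mean is zero and constant.

Examining the variance:
  • If |a| < 1 then as t \to \infty, V[y_{t}] \to constant
  • If |a| >= 1 then V[y_{t}] grows with time, hence the signal is not stationary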
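The two variance cases can be checked with a short derivation. It assumes the \epsilon_{t} are independent of each other with common variance \sigma^{2} (the usual white noise assumption, which is not stated explicitly above) and uses y_{0}=0.

```latex
y_{t}=\Sigma_{j=0}^{t-1}a^{j}\epsilon_{t-j}

V[y_{t}]=\Sigma_{j=0}^{t-1}a^{2j}V[\epsilon_{t-j}]=\sigma^{2}\Sigma_{j=0}^{t-1}a^{2j}

\text{If } |a|<1: \quad V[y_{t}]=\sigma^{2}\frac{1-a^{2t}}{1-a^{2}} \to \frac{\sigma^{2}}{1-a^{2}} \text{ as } t \to \infty

\text{If } a=1: \quad V[y_{t}]=t\sigma^{2}
```

For |a| < 1 the geometric sum converges, so the variance settles at a constant. For a = 1 the variance grows linearly with t, and for |a| > 1 it grows geometrically.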
This forms the premise of the Dickey Fuller test: perform a linear regression on the residuals and see what the value of a is. If it is greater than or equal to one, this indicates that the signal isn’t stationary and therefore isn’t suitable for stat arb.

The Dickey Fuller test was invented by D.A. Dickey and W.A. Fuller. They produced the Dickey-Fuller distribution, which relates the number of test samples and the value of a to the probability of a unit root. Further reading can be found at Dickey Fuller Test – Wikipedia.

The Dickey Fuller test is a hypothesis test whose null hypothesis is that a signal contains a unit root; we want to reject this hypothesis (a unit root means our signal is non-stationary). The test gives a pValue: the lower this number, the more confident we can be that we have found a stationary signal. Pairs with pValues less than 0.1 are considered to be good candidates.
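To make the regression concrete, here is a rough base R sketch of the simplest (non-augmented, no constant) form of the test. Subtracting y_{t-1} from both sides of the AR(1) equation gives \Delta y_{t}=(a-1)y_{t-1}+\epsilon_{t}, so a unit root corresponds to a coefficient of zero on y_{t-1}. The helpers simulateAR1 and dfStatistic are hypothetical names written for this illustration, and the data is simulated.

```r
set.seed(123)

# Simulate an AR(1) process y_t = a * y_{t-1} + eps_t with y_0 = 0
simulateAR1 <- function(a, n = 1000, sigma = 1) {
  eps <- rnorm(n, sd = sigma)
  y <- numeric(n)
  y[1] <- eps[1]                       # y_1 = a * y_0 + eps_1 with y_0 = 0
  for (t in 2:n) y[t] <- a * y[t - 1] + eps[t]
  y
}

# Dickey Fuller style statistic: t value of (a - 1) in
# delta_y_t = (a - 1) * y_{t-1} + eps_t
dfStatistic <- function(y) {
  dy <- diff(y)
  yLag <- y[-length(y)]
  fit <- lm(dy ~ yLag + 0)
  summary(fit)$coefficients["yLag", "t value"]
}

stationary <- simulateAR1(a = 0.5)   # |a| < 1, should look stationary
randomWalk <- simulateAR1(a = 1.0)   # unit root, should not

print(c(stationary = dfStatistic(stationary),
        randomWalk = dfStatistic(randomWalk)))
```

A strongly negative statistic is evidence against a unit root. The statistic has to be compared against the Dickey-Fuller distribution rather than the usual t-table, which is the job a packaged ADF test does when it reports a pValue.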

Royal Dutch Shell Case Study
As mentioned in the previous post, two different shares for the same company are usually good stat arb candidates. Here the hypothesis that Royal Dutch Shell A & B shares are cointegrated will be tested.

The beta or Hedge Ratio is: 0.965378555498888

Augmented Dickey-Fuller Test

Dickey-Fuller = -3.6192, Lag order = 0, p-value = 0.03084
alternative hypothesis: stationary

The low pValue (0.03) means the unit root hypothesis can be rejected, which indicates that this pair is cointegrated.

Royal Bank of Scotland vs Barclays Case Study

The beta or Hedge Ratio is: 0.645507426925485

Augmented Dickey-Fuller Test

Dickey-Fuller = -1.6801, Lag order = 0, p-value = 0.7137
alternative hypothesis: stationary

The high pValue (0.71) means the unit root hypothesis cannot be rejected, so there is no evidence that the spread is stationary. The spread in the bottom right graph is clearly in a strong trend and isn’t range bound. This would have been a horrific stat arb!
On to the code:
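#Load the required packages: quantmod (getSymbols, Cl, Delt) and tseries (adf.test)
library("quantmod")
library("tseries")
#Specify dates for downloading data, training models and running simulation
hedgeTrainingStartDate = as.Date("2008-01-01") #Start date for training the hedge ratio
hedgeTrainingEndDate = as.Date("2010-01-01") #End date for training the hedge ratio
#Choose ONE pair to test, the other pair is left commented out
#NOTE: the ticker symbols below are assumptions (the post does not list them) and may no longer resolve on Yahoo
#symbolLst<-c("RDS-A","RDS-B")
#title<-c("Royal Dutch Shell A vs B Shares")
symbolLst<-c("RBS.L","BARC.L")
title<-c("Royal Bank of Scotland vs Barclays")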
### SECTION 1 - Download Data & Calculate Returns ###
#Download the data
symbolData <- new.env() #Make a new environment for quantmod to store data in
getSymbols(symbolLst, env = symbolData, src = "yahoo", from = hedgeTrainingStartDate, to=hedgeTrainingEndDate)
stockPair <- list(
 a = coredata(Cl(symbolData[[symbolLst[1]]]))   #Closing prices for Stock A
,b = coredata(Cl(symbolData[[symbolLst[2]]])) #Closing prices for Stock B
,name = title) #Title used when plotting the pair
testForCointegration <- function(stockPairs){
  #Pass in a pair of stocks and do the necessary checks to see if it is cointegrated
  #Plot the pairs
  priceRange <- range(stockPairs$a, stockPairs$b) #Lowest and highest price across both stocks
  plot(stockPairs$a, main=stockPairs$name, ylab="Price", type="l", col="red", ylim=c(0.99*priceRange[1], 1.01*priceRange[2]))
  lines(stockPairs$b, col="blue")
  #Step 1: Calculate the daily returns
  dailyRet.a <- na.omit((Delt(stockPairs$a,type="arithmetic")))
  dailyRet.b <- na.omit((Delt(stockPairs$b,type="arithmetic")))
  dailyRet.a <- dailyRet.a[is.finite(dailyRet.a)] #Strip out any non-finite returns
  dailyRet.b <- dailyRet.b[is.finite(dailyRet.b)]
  #Step 2: Regress the daily returns onto each other
  #The "+ 0" drops the intercept, so the regression fits retA = BETA * retB
  regression <- lm(dailyRet.a ~ dailyRet.b + 0)
  beta <- coef(regression)[1]
  print(paste("The beta or Hedge Ratio is: ",beta,sep=""))
  plot(x=dailyRet.b,y=dailyRet.a,type="p",main="Regression of RETURNS for Stock A & B") #Plot the daily returns
  lines(x=dailyRet.b,y=(dailyRet.b*beta),col="blue") #Plot the fitted regression line
  #Step 3: Use the regression coefficient to generate the spread
  spread <- stockPairs$a - beta*stockPairs$b #Price spread built with the hedge ratio from Step 2
  spreadRet <- Delt(spread,type="arithmetic")
  spreadRet <- na.omit(spreadRet)
  plot((spreadRet), type="l",main="Spread Returns") #Plot the daily returns of the spread
  plot(spread, type="l",main="Spread Actual") #Plot the spread itself
  #For a cointegrated spread the series should stay range bound
  #For a non-cointegrated spread the series will likely show some trending characteristics
  #Step 4: Use the ADF to test if the spread is stationary
  #adf.test is provided by the tseries package
  adfResults <- adf.test((spread),k=0,alternative="stationary")
  print(adfResults) #Show the test statistic, lag order and p-value
  if(adfResults$p.value <= 0.05){
    print(paste("The spread is likely Cointegrated with a pvalue of ",adfResults$p.value,sep=""))
  } else {
    print(paste("The spread is likely NOT Cointegrated with a pvalue of ",adfResults$p.value,sep=""))
  }
}
#Run the test on the chosen pair
testForCointegration(stockPair)