# Statistical Arbitrage – Testing for Cointegration – Augmented Dickey-Fuller

In my last post, “Statistical Arbitrage Correlation vs Cointegration”, we discussed what statistical arbitrage is and which property of a pair of stocks we aim to exploit. The mathematics essentially showed that if you go long stock A and short stock B, with an appropriate hedging factor to cancel out the drift/growth terms in the Brownian motion equation, then you are left with a stationary signal, which is the spread between the two stocks. The maths also showed that in expectation the daily change in the spread is zero ($E_{t}[\Delta Spread]=0$), and hence any deviation from this presents the opportunity for a trade.

It was assumed that the stock growth terms $\mu_{a}$ and $\mu_{b}$ are constant (or drifting slowly over time, both at the same rate); in other words, the hedge ratio is constant. This post details how to find the hedge ratio and presents the Augmented Dickey-Fuller test, which can be used to identify, with a certain level of confidence, whether the spread is stationary and hence whether the pair is cointegrated.

## Determining the hedge factor

$Spread=A-nB$

We need to find the hedge factor, $n$, that keeps the spread as close to $0$ as possible, i.e. $A \approx nB$. The hedge factor is easily found by regressing the prices of $A$ against the prices of $B$. Notice that it is the prices being regressed and not the daily returns.
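As a rough illustration of the regression described above, here is a minimal sketch in base R. The prices are synthetic (a random walk for B, with A built as a known multiple of B plus noise), and the names `price_a`, `price_b`, `true_n` and `hedge_factor` are made up for the example. The `+ 0` in the formula removes the intercept so that the fit matches $Spread=A-nB$.

```r
set.seed(1)

n_days <- 500
true_n <- 2  # hypothetical "true" hedge factor used to build the synthetic data

# Synthetic prices: B is a random walk, A tracks true_n * B plus stationary noise
price_b <- 50 + cumsum(rnorm(n_days, sd = 0.5))
price_a <- true_n * price_b + rnorm(n_days, sd = 0.3)

# Regress PRICES (not returns) of A on B, with no intercept
fit <- lm(price_a ~ price_b + 0)
hedge_factor <- unname(coef(fit)[1])

# The spread is what is left of A after hedging with hedge_factor units of B
spread <- price_a - hedge_factor * price_b

print(hedge_factor)
print(summary(spread))
```

On real data you would replace the two synthetic vectors with aligned closing prices for the pair; the estimated hedge factor should then be read as "units of B to short per unit of A held long".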

## Examining the residuals

Once the hedge factor has been found, the residuals of the regression must be analysed. Because prices were regressed, the residual for a given day is the value of the spread ($A-nB$) on that day. The whole idea of the regression was to identify the hedge factor that keeps the spread as close to zero as possible (we want the residuals to be zero).

• If the residuals contain a trend then this implies that the spread has a net direction and will consistently widen or contract.
• If the residuals contain no trend then this implies that the spread will oscillate around zero (is stationary).

Which residuals to analyse is open to debate. You may want to take another data set from a different point in time, apply the hedge factor you calculated in the above regression to this new set, and analyse the resulting residuals. This helps to identify overfitting during the regression.
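The out-of-sample check can be sketched as follows in base R. This is only an illustration on synthetic prices: the 50/50 split, the variable names and the data-generating process are all assumptions made for the example, not part of the original analysis.

```r
set.seed(7)

n_days <- 600
price_b <- 50 + cumsum(rnorm(n_days, sd = 0.5))   # synthetic random-walk prices for B
price_a <- 1.5 * price_b + rnorm(n_days, sd = 0.3) # A built to be cointegrated with B

train <- 1:300    # period used to fit the hedge factor
test  <- 301:600  # later period used only for checking

# Fit the hedge factor on the training period only
train_data <- data.frame(a = price_a[train], b = price_b[train])
hedge_factor <- unname(coef(lm(a ~ b + 0, data = train_data))[1])

# Apply the SAME hedge factor to both periods
in_sample_spread  <- price_a[train] - hedge_factor * price_b[train]
out_sample_spread <- price_a[test]  - hedge_factor * price_b[test]

# A crude trend check: slope of the out-of-sample spread against time
day <- seq_along(out_sample_spread)
oos_slope <- unname(coef(lm(out_sample_spread ~ day))[2])

print(c(in_sample_sd = sd(in_sample_spread), out_sample_sd = sd(out_sample_spread)))
print(oos_slope)
```

If the out-of-sample spread is much more volatile than the in-sample spread, or shows a clear slope, that suggests the hedge factor was fitted to noise in the training window.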

## Dickey-Fuller Test

For an AR(1) process $y_{t}=ay_{t-1}+\epsilon_{t}$, where $y_{0}=0$ and the $\epsilon_{t} \sim N(0,\sigma^{2})$ are independent, what value of $a$ results in a stationary signal?

$y_{1}=ay_{0}+\epsilon_{1}$

$y_{2}=ay_{1}+\epsilon_{2}=a(ay_{0}+\epsilon_{1})+\epsilon_{2}$

$y_{t}=a^{t}y_{0}+\Sigma_{j=0}^{t-1}a^{j}\epsilon_{t-j}$

Take expectations; we already know $E[y_{0}]=0$ and $E[\epsilon_{t}]=0$

$E[y_{t}]=a^{t}E[y_{0}]+\Sigma_{j=0}^{t-1}a^{j}E[\epsilon_{t-j}]$

$E[y_{t}]=a^{t}*0+\Sigma_{j=0}^{t-1}a^{j}*0=0$, so the mean is zero and constant.

Examining the variance (the $\epsilon$ terms are independent, so their variances add, and each coefficient $a^{j}$ is squared):

$V[y_{t}]=0+\Sigma_{j=0}^{t-1}a^{2j}V[\epsilon_{t-j}]=\sigma^{2}\Sigma_{j=0}^{t-1}a^{2j}$

• If $|a| < 1$ then as $t \to \infty$, $V[y_{t}]\to \sigma^{2}/(1-a^{2})$, a constant
• If $|a| \geq 1$ then $V[y_{t}]$ grows with time, hence the signal is not stationary
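The behaviour of the variance can be checked numerically. The following is a minimal simulation sketch in base R, with made-up helper names (`simulate_ar1`, `variance_at`) and arbitrary choices of 2000 paths and 200 steps; it estimates the variance of $y_{t}$ across many simulated paths at an early and a late time, once for $a=0.5$ and once for $a=1$.

```r
set.seed(42)

# Simulate n_paths independent AR(1) paths y_t = a * y_{t-1} + eps_t with y_0 = 0.
# Row 1 holds y_0, row t + 1 holds y_t.
simulate_ar1 <- function(a, n_steps, n_paths, sigma = 1) {
  y <- matrix(0, nrow = n_steps + 1, ncol = n_paths)
  for (t in seq_len(n_steps)) {
    y[t + 1, ] <- a * y[t, ] + rnorm(n_paths, mean = 0, sd = sigma)
  }
  y
}

# Variance of y_t measured across the simulated paths
variance_at <- function(paths, t) var(paths[t + 1, ])

stationary <- simulate_ar1(a = 0.5, n_steps = 200, n_paths = 2000)
unit_root  <- simulate_ar1(a = 1.0, n_steps = 200, n_paths = 2000)

var_stationary <- c(t50 = variance_at(stationary, 50), t200 = variance_at(stationary, 200))
var_unit_root  <- c(t50 = variance_at(unit_root, 50),  t200 = variance_at(unit_root, 200))

print(var_stationary)  # should be roughly the same at both times
print(var_unit_root)   # should be much larger at t = 200 than at t = 50
```

With $a=0.5$ the two variance estimates should be close to each other, while with $a=1$ the variance keeps growing with $t$, which is the non-stationary case the test is designed to detect.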
This forms the premise of the Dickey-Fuller test: perform a linear regression of the residuals on their own lagged values and see what the value of $a$ is. If it’s greater than or equal to one, this indicates the signal isn’t stationary and therefore isn’t suitable for stat arb.

The Dickey-Fuller test was invented by D. A. Dickey and W. A. Fuller, who produced the Dickey-Fuller distribution relating the number of test samples and the value of $a$ to the probability of a unit root. Further reading can be found at Dickey Fuller Test – Wikipedia.

The Dickey-Fuller test is a hypothesis test whose null hypothesis is that a signal contains a unit root; we want to reject this hypothesis (a unit root means our signal is non-stationary). The test gives a pValue: the lower this number, the more confident we can be that we have found a stationary signal. pValues less than 0.1 are considered to be good candidates.
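To make the regression behind the test concrete, here is a hand-rolled sketch of the simplest Dickey-Fuller regression (no constant, no trend, no lagged differences) in base R. It regresses $\Delta y_{t}$ on $y_{t-1}$, whose coefficient is $a-1$, and returns the t-statistic of that coefficient. The function names `df_statistic` and `ar1_path` are invented for this example, and this is not what `adf.test` from `tseries` does internally in every detail; treat it as an illustration only.

```r
set.seed(123)

# t-statistic of (a - 1) in the regression  diff(y)_t = (a - 1) * y_{t-1} + eps_t
df_statistic <- function(y) {
  dy    <- diff(y)           # y_t - y_{t-1}
  y_lag <- y[-length(y)]     # y_{t-1}, aligned with dy
  fit   <- lm(dy ~ y_lag + 0)
  summary(fit)$coefficients["y_lag", "t value"]
}

# One AR(1) path of length n + 1 starting from y_0 = 0
ar1_path <- function(a, n) {
  y <- numeric(n + 1)
  eps <- rnorm(n)
  for (t in seq_len(n)) {
    y[t + 1] <- a * y[t] + eps[t]
  }
  y
}

stat_stationary  <- df_statistic(ar1_path(a = 0.5, n = 500))
stat_random_walk <- df_statistic(ar1_path(a = 1.0, n = 500))

print(stat_stationary)   # expected to be strongly negative
print(stat_random_walk)  # expected to be much closer to zero
```

The statistic has to be compared against Dickey-Fuller critical values rather than ordinary t tables; for this no-constant form the commonly tabulated 5% critical value is roughly -1.95. A packaged test such as `adf.test` takes care of the critical values and reports a pValue directly.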

## Royal Dutch Shell Case Study

As mentioned in the previous post, two different shares for the same company are usually good stat arb candidates. Here the hypothesis that Royal Dutch Shell A & B shares are cointegrated will be tested.

The beta or Hedge Ratio is: 0.965378555498888

Augmented Dickey-Fuller Test

Dickey-Fuller = -3.6192, Lag order = 0, p-value = 0.03084
alternative hypothesis: stationary

The low pValue indicates that this pair is cointegrated.

## Royal Bank of Scotland vs Barclays Case Study

The beta or Hedge Ratio is: 0.645507426925485

Augmented Dickey-Fuller Test

Dickey-Fuller = -1.6801, Lag order = 0, p-value = 0.7137
alternative hypothesis: stationary

The pValue indicates that the spread isn’t stationary, and the plot of the spread (the bottom right graph) shows a strong trend rather than a range-bound series. This would have been a horrific stat arb!

On to the code:

    library("quantmod")
    library("PerformanceAnalytics")
    library("fUnitRoots")
    library("tseries")

    #Specify dates for downloading data, training models and running simulation
    hedgeTrainingStartDate = as.Date("2008-01-01") #Start date for training the hedge ratio
    hedgeTrainingEndDate = as.Date("2010-01-01") #End date for training the hedge ratio

    symbolLst<-c("RDS-A","RDS-B")
    title<-c("Royal Dutch Shell A vs B Shares")

    #This second pair overrides the first; comment it out to test the Shell pair
    symbolLst<-c("RBS.L","BARC.L")
    title<-c("Royal Bank of Scotland vs Barclays")

    ### SECTION 1 - Download Data & Calculate Returns ###
    #Download the data
    symbolData <- new.env() #Make a new environment for quantmod to store data in
    getSymbols(symbolLst, env = symbolData, src = "yahoo", from = hedgeTrainingStartDate, to=hedgeTrainingEndDate)

    #Look each symbol up by name in the environment and keep the closing prices
    stockPair <- list(
       a = coredata(Cl(symbolData[[symbolLst[1]]])) #Stock A
      ,b = coredata(Cl(symbolData[[symbolLst[2]]])) #Stock B
      ,name=title)

    testForCointegration <- function(stockPairs){
      #Pass in a pair of stocks and do the necessary checks to see if it is cointegrated

      #Plot the pairs
      dev.new()
      par(mfrow=c(2,2))
      priceRange <- c((0.99*min(rbind(stockPairs$a,stockPairs$b))),(1.01*max(rbind(stockPairs$a,stockPairs$b))))
      print(priceRange)
      plot(stockPairs$a, main=stockPairs$name, ylab="Price", type="l", col="red", ylim=priceRange)
      lines(stockPairs$b, col="blue")
      print(length(stockPairs$a))
      print(length(stockPairs$b))

      #Step 1: Calculate the daily returns
      dailyRet.a <- na.omit((Delt(stockPairs$a,type="arithmetic")))
      dailyRet.b <- na.omit((Delt(stockPairs$b,type="arithmetic")))
      dailyRet.a <- dailyRet.a[is.finite(dailyRet.a)] #Strip out any Infs (first ret is Inf)
      dailyRet.b <- dailyRet.b[is.finite(dailyRet.b)]
      print(length(dailyRet.a))
      print(length(dailyRet.b))

      #Step 2: Regress the daily returns onto each other
      #Regression finds BETA in retA = BETA * retB (the "+ 0" forces the intercept C to zero)
      regression <- lm(dailyRet.a ~ dailyRet.b + 0)
      beta <- coef(regression)[1]
      print(paste("The beta or Hedge Ratio is: ",beta,sep=""))
      plot(x=dailyRet.b,y=dailyRet.a,type="p",main="Regression of RETURNS for Stock A & B") #Plot the daily returns
      lines(x=dailyRet.b,y=(dailyRet.b*beta),col="blue") #Plot the fitted regression line

      #Step 3: Use the regression coefficient to generate the spread from the prices
      spread <- stockPairs$a - beta*stockPairs$b
      spreadRet <- Delt(spread,type="arithmetic")
      spreadRet <- na.omit(spreadRet)
      plot((spreadRet), type="l",main="Spread Returns") #Plot the daily returns of the spread
      plot(spread, type="l",main="Spread Actual") #Plot the spread itself
      #For a cointegrated spread the spread should not deviate very far from 0
      #For a non-cointegrated spread the spread will likely show some trending characteristics

      #Step 4: Use the ADF to test if the spread is stationary
      #can use tSeries library
      adfResults <- adf.test((spread),k=0,alternative="stationary")
      print(adfResults)
      if(adfResults$p.value <= 0.05){
        print(paste("The spread is likely Cointegrated with a pvalue of ",adfResults$p.value,sep=""))
      } else {
        print(paste("The spread is likely NOT Cointegrated with a pvalue of ",adfResults$p.value,sep=""))
      }
    }

    testForCointegration(stockPair)

## 17 thoughts on “Statistical Arbitrage – Testing for Cointegration – Augmented Dickey-Fuller”

1. Great work, I had created something to do the exact same thing but yours is much more elegant and I am quite the R noob. I was wondering if there was a way to adjust the input to take any CSV or txt file of closing prices. I tried to modify it using

read.csv(choose.files(), stringsAsFactors=F)

but ran into a lot of problems I didn’t understand. I have historical databases in csv and would like to use your test against it. Anyway great work.

2. Hi Derek,

I’m a bit confused by the statement that prices are regressed in the hedge ratio calculation. The paragraph above says it is prices, but the code says it is returns?

• I’m currently travelling so can’t give a full answer but it appears that the regression co-efficient used to make the spread is wrong.

Let’s use intuition to work out what should happen. I have two stocks, stock A price 100, stock B price 500. These are special stocks: if A changes by some % then B will change by the same %.

The problem is that in dollar terms B changes by approx 5 times as much as A due to being of different prices. Hence we regress prices of stock A with B to figure out, for each long of stock A, how many of stock B I need to short so that the dollar loss matches the gain of A.

So whilst I’ve plotted returns and drawn the line of best fit, I should also regress prices.

3. You can find all the ebooks for quantitative finance on this blog

4. Hi Derek,

I noticed that you used the returns in the regression to calculate the spread, but there is a problem with using the returns: they are stationary time series, so you can’t calculate the cointegration between them.
This is the first definition of cointegration, right?

Anyway your work is great, it is just this detail that caught my attention.

André

• Hi I used the above “model” to develop a working system for futures and currencies cross asset. Can see it being traded via IB simulated account. We use 5 minute price data. http://www.statarbtrading.wordpress.com

Was happy to see the exact paper here !

5. Pingback: Beyond Pairs | Systematic Edge

6. Thanks Gekko for the backtesting code. It is very useful. Couple of comments below:

1) Another reader has already commented about this above. movingAvg needs to be amended by adding align="right" in order to have the first moving avg number on day 90:

movingAvg = rollmean(spread,lookback,align="right", na.pad=TRUE)

2) since we enter trades at end of day, the return on trade date shouldn’t count. we can simply shift every element in the “positions” vector down by using the “shift” function in the taRifx library.

Also, I don’t believe daily return is (aRet - stockPair$hedgeRatio*bRet). Imagine if you had a large hedge ratio, i.e. if stock A is priced at \$100 and stock B is priced at \$10, then the hedgeRatio would be in the neighborhood of 10. Since aRet and bRet are in % terms, the formula won’t work. Daily return should be aRet - bRet * (ratio between dollar neutral ratio vs hedge ratio). See amendments:

    library(taRifx)
    #Calculate spread daily ret
    aRet <- Delt(t[,1],k=1,type="arithmetic")
    bRet <- Delt(t[,2],k=1,type="arithmetic")
    dollarNeutralRatio <- stockPair$a/stockPair$b
    hedgeRatioOVERdollarNeutralRatio <- stockPair$hedgeRatio/shift(dollarNeutralRatio,-1)
    dailyRet <- aRet - bRet*hedgeRatioOVERdollarNeutralRatio
    dailyRet[is.na(dailyRet)] <- 0
    tradingRet <- dailyRet * shift(positions,-1)
    simulateTrading <- tradingRet
    }

7. Great work! You beautifully summarized this relatively messy concept with very simple statistical and intuitive relationships. I came across your blog only an hour ago, and I can’t wait to read every post and contribute (if I can) to your views and discussions. Well done, and looking forward to reading more from you.

Aykut

• Gekko, this is amazing. My only question is what data are we using? Are you just picking one day, seeing if that day is co-integrated, and concluding that the whole pair is co-integrated?

8. Hello Gekko,
Nice article. But I would like to know whether just applying an ADF test to the spread A – wB is not a bit ‘abrupt’ as a way to determine if the spread is stationary and hence cointegrated. I mean, testing for stationarity is one thing and cointegration is another. For example, if you apply the KPSS test to the first spread, I’m not sure it will give you a stationary result. IMHO, to test cointegration, you first need to determine whether each series is integrated of order one, I(1), which most stocks are, and then you need to check if their residuals are stationary, with Engle-Granger or, a bit more complicated, with Johansen. So my question was: do you think the fact that we don’t do these tests undermines the efficacy of the strategy?

thanks

9. Great post on a fascinating topic of cointegration. Do you know of any research on how stable a cointegration time series can be? All tests (ADF, EG, Johansen) test within a certain time frame and can say if during that timeframe the series are cointegrated. However, how do you know if this cointegration (hedge values) will be stable in the future?

10. Recently, someone asked me an interesting question: Why would I set the lag to 0 when doing such cointegration tests? I reflected on this for some time. My answer is that we’re using the stationarity test to figure out whether spreads are reverting to 0. Setting a lag different from 0 would imply that we expect the spread series to actually contain an auto-regressive part in it. Now, this is a really tricky question to answer: Would this expectation actually make sense or not? I tend to deny this, but I’m not entirely certain. Any opinions?