In my last post, “Statistical Arbitrage Correlation vs Cointegration”, we discussed what statistical arbitrage is and which property of a pair of stocks we aim to exploit. The mathematics essentially showed that if you go long stock A and short stock B, with an appropriate hedging factor to cancel out the drift/growth terms in the Brownian motion equation, then you are left with a stationary signal: the spread between the two stocks. The maths also showed that *in expectation* the daily change in the spread is zero, and hence any deviation from this presents the opportunity for a trade.

It was assumed that the growth terms of the two stocks are constant (or drift slowly over time, both at the same rate), or in other words that the hedge ratio is constant. This post details how to find the hedge ratio and presents the Augmented Dickey-Fuller test, which can be used to identify, with a certain level of confidence, whether the spread is stationary and hence whether the pair is cointegrated.

**Determining the hedge factor**

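Define the spread as

$$\text{spread}_t = A_t - n B_t$$

where $A_t$ and $B_t$ are the prices of stock A and stock B. It is required to find the hedge factor, $n$, to satisfy the above equation. The hedge factor is easily found by regressing the prices of stock A on the prices of stock B. Notice that it is the prices being regressed and not the daily returns.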
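
The price regression can be sketched on simulated data. Everything in this block is an assumption made for illustration (the seed, the variable names, and a true hedge factor of 0.2 with stock B near 500 and stock A near 100); it is not the data used in this post. The regression has no intercept so that it matches a spread of the form A - n * B.

```r
set.seed(42)
n_days <- 500

# Hypothetical prices: B is a random walk starting at 500
price_b <- 500 + cumsum(rnorm(n_days, mean = 0, sd = 2))
# A tracks 0.2 * B plus stationary noise, so the true hedge factor is 0.2
price_a <- 0.2 * price_b + rnorm(n_days, mean = 0, sd = 0.5)

# Regress prices (not returns), with no intercept
price_fit <- lm(price_a ~ price_b + 0)
hedge_factor <- unname(coef(price_fit)[1])

# For comparison: the same regression on daily arithmetic returns
ret_a <- diff(price_a) / head(price_a, -1)
ret_b <- diff(price_b) / head(price_b, -1)
return_beta <- unname(coef(lm(ret_a ~ ret_b + 0))[1])

# The spread built from the price-based hedge factor
spread <- price_a - hedge_factor * price_b

print(hedge_factor)
print(return_beta)
```

In this setup the price regression recovers a hedge factor close to 0.2 (shares of B per share of A), while the return regression measures how percentage moves relate and so lands on a different scale. Only the price-based number tells you how many shares to short.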

**Examining the residuals**

Once the hedge factor has been found, the residuals of the regression must be analysed. The residual is how much the spread has changed for a given day. The whole idea of the regression was to identify the hedge factor that keeps the daily change of the spread as close to zero as possible (we want the residuals to be zero).

- If the residuals contain a trend then this implies that the daily change of the spread has a net direction and will cause the spread to consistently widen or contract.
- If the residuals contain no trend then this implies that the daily change of the spread will oscillate around zero (is stationary).

Which residuals to analyse is open to debate. You may want to take another data set from a different point in time, apply the hedge factor calculated in the above regression to this new set and analyse those residuals; this helps to identify any overfitting during the regression.
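
One way to run such an out-of-sample check is sketched below. The simulated pair, the 300/300 split and the use of a simple time-trend regression as the "trend" check are all assumptions for illustration, not the procedure used in this post.

```r
set.seed(7)
n_days <- 600

# Hypothetical cointegrated pair with a true hedge factor of 1.5
price_b <- 100 + cumsum(rnorm(n_days, sd = 1))
price_a <- 1.5 * price_b + rnorm(n_days, sd = 1)

train <- 1:300     # window used to fit the hedge factor
test  <- 301:600   # later window the regression never saw

# Fit the hedge factor on the training window only
hedge_factor <- unname(coef(lm(price_a[train] ~ price_b[train] + 0))[1])

# Apply the fixed hedge factor to the unseen window
spread_test <- price_a[test] - hedge_factor * price_b[test]

# Look for a trend by regressing the out-of-sample spread on time
day <- seq_along(spread_test)
trend_fit <- lm(spread_test ~ day)
trend_slope <- unname(coef(trend_fit)["day"])
trend_pvalue <- summary(trend_fit)$coefficients["day", "Pr(>|t|)"]

print(c(slope = trend_slope, p_value = trend_pvalue))
```

A slope that stays near zero on the unseen window suggests the fitted hedge factor still holds there. A clear slope would suggest the factor was fitted too closely to the training window, or that the relationship has moved.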

**Dickey-Fuller Test**

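For an AR(1) process $y_t = \rho y_{t-1} + \epsilon_t$, where $\epsilon_t$ is white noise with mean zero and variance $\sigma^2$ and $y_0 = 0$, what value of $\rho$ results in a stationary signal?

Take expectations, already knowing $E[\epsilon_t] = 0$ and $E[\epsilon_t^2] = \sigma^2$. Substituting repeatedly gives $y_t = \sum_{i=0}^{t-1} \rho^i \epsilon_{t-i}$, so $E[y_t] = 0$ and $\mathrm{Var}(y_t) = \sigma^2 \sum_{i=0}^{t-1} \rho^{2i}$.

- If $|\rho| < 1$ then as $t \to \infty$ then $\mathrm{Var}(y_t) \to \sigma^2 / (1 - \rho^2)$, a finite constant, hence $y_t$ settles into a stationary signal
- If $|\rho| \ge 1$ then $\mathrm{Var}(y_t)$ grows with time, hence $y_t$ is not stationary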
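
To connect this to the test itself: writing the AR(1) process as $y_t = \rho y_{t-1} + \epsilon_t$ (the symbol names are assumptions for illustration) and subtracting $y_{t-1}$ from both sides gives $\Delta y_t = (\rho - 1) y_{t-1} + \epsilon_t$. The Dickey-Fuller test regresses the change on the lagged level and asks whether the coefficient $\gamma = \rho - 1$ is significantly below zero. The following is a minimal sketch on simulated data using only base R. It is not the implementation inside `adf.test`, which also includes a constant and a trend term and supplies the p-value.

```r
set.seed(1)
n_obs <- 1000
shocks <- rnorm(n_obs)

# Simulate y_t = rho * y_{t-1} + e_t starting from y_0 = 0
simulate_ar1 <- function(rho, shocks) {
  y <- numeric(length(shocks))
  prev <- 0
  for (i in seq_along(shocks)) {
    y[i] <- rho * prev + shocks[i]
    prev <- y[i]
  }
  y
}

# Dickey-Fuller regression with no constant: diff(y) = gamma * lagged y + e
df_statistic <- function(y) {
  dy <- diff(y)
  y_lag <- head(y, -1)
  fit <- summary(lm(dy ~ y_lag + 0))
  c(gamma  = fit$coefficients["y_lag", "Estimate"],
    t_stat = fit$coefficients["y_lag", "t value"])
}

stationary  <- simulate_ar1(0.5, shocks)  # |rho| < 1
random_walk <- simulate_ar1(1.0, shocks)  # rho = 1, a unit root

print(df_statistic(stationary))
print(df_statistic(random_walk))
```

For the stationary series the estimated gamma sits near -0.5 with a strongly negative t-statistic, while for the random walk gamma sits near zero. The t-statistic has to be compared with Dickey-Fuller critical values rather than the usual t table, which is what `adf.test` takes care of when it reports a p-value.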

**Royal Dutch Shell Case Study**

The beta or Hedge Ratio is: 0.965378555498888

Augmented Dickey-Fuller Test

Dickey-Fuller = -3.6192, Lag order = 0, p-value = 0.03084

alternative hypothesis: stationary

The low p-value (0.03084, below the 0.05 threshold used in the code) indicates that this pair is likely cointegrated.

**Royal Bank of Scotland vs Barclays Case Study**

The beta or Hedge Ratio is: 0.645507426925485

Augmented Dickey-Fuller Test

Dickey-Fuller = -1.6801, Lag order = 0, p-value = 0.7137

alternative hypothesis: stationary

The high p-value (0.7137, well above the 0.05 threshold used in the code) indicates that this pair is likely not cointegrated over the training period.


    library("quantmod")
    library("PerformanceAnalytics")
    library("fUnitRoots")
    library("tseries")

    #Specify dates for downloading data, training models and running simulation
    hedgeTrainingStartDate = as.Date("2008-01-01") #Start date for training the hedge ratio
    hedgeTrainingEndDate = as.Date("2010-01-01") #End date for training the hedge ratio

    #Two example pairs; the second assignment overrides the first, so comment out the pair you do not want
    symbolLst<-c("RDS-A","RDS-B")
    title<-c("Royal Dutch Shell A vs B Shares")

    symbolLst<-c("RBS.L","BARC.L")
    title<-c("Royal Bank of Scotland vs Barclays")

    ### SECTION 1 - Download Data & Calculate Returns ###
    #Download the data
    symbolData <- new.env() #Make a new environment for quantmod to store data in
    getSymbols(symbolLst, env = symbolData, src = "yahoo", from = hedgeTrainingStartDate, to=hedgeTrainingEndDate)

    stockPair <- list(
       a = coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[1],"\"",sep=""))))) #Stock A
      ,b = coredata(Cl(eval(parse(text=paste("symbolData$\"",symbolLst[2],"\"",sep=""))))) #Stock B
      ,name=title)

    testForCointegration <- function(stockPairs){
      #Pass in a pair of stocks and do the necessary checks to see if it is cointegrated

      #Plot the pairs
      dev.new()
      par(mfrow=c(2,2))
      print(c((0.99*min(rbind(stockPairs$a,stockPairs$b))),(1.01*max(rbind(stockPairs$a,stockPairs$b)))))
      plot(stockPairs$a, main=stockPairs$name, ylab="Price", type="l", col="red",ylim=c((0.99*min(rbind(stockPairs$a,stockPairs$b))),(1.01*max(rbind(stockPairs$a,stockPairs$b)))))
      lines(stockPairs$b, col="blue")
      print(length(stockPairs$a))
      print(length(stockPairs$b))

      #Step 1: Calculate the daily returns
      dailyRet.a <- na.omit((Delt(stockPairs$a,type="arithmetic")))
      dailyRet.b <- na.omit((Delt(stockPairs$b,type="arithmetic")))
      dailyRet.a <- dailyRet.a[is.finite(dailyRet.a)] #Strip out any Infs (first ret is Inf)
      dailyRet.b <- dailyRet.b[is.finite(dailyRet.b)]
      print(length(dailyRet.a))
      print(length(dailyRet.b))

      #Step 2: Regress the daily returns onto each other
      #The "+ 0" removes the intercept, so only BETA is fitted: retA = BETA * retB
      #NOTE: the text above describes regressing prices, whereas this step regresses daily returns
      regression <- lm(dailyRet.a ~ dailyRet.b + 0)
      beta <- coef(regression)[1]
      print(paste("The beta or Hedge Ratio is: ",beta,sep=""))
      plot(x=dailyRet.b,y=dailyRet.a,type="p",main="Regression of RETURNS for Stock A & B") #Plot the daily returns
      lines(x=dailyRet.b,y=(dailyRet.b*beta),col="blue") #Plot the fitted regression line

      #Step 3: Use the regression co-efficients to generate the spread
      #Could use the residual from the regression instead, but it is the same thing only for a regression on prices
      spread <- stockPairs$a - beta*stockPairs$b
      spreadRet <- Delt(spread,type="arithmetic")
      spreadRet <- na.omit(spreadRet)
      #spreadRet[!is.na(spreadRet)]
      plot((spreadRet), type="l",main="Spread Returns") #Plot the daily returns of the spread
      plot(spread, type="l",main="Spread Actual") #Plot the spread itself
      #For a cointegrated spread the cumsum should not deviate very far from 0
      #For a non-cointegrated spread the cumsum will likely show some trending characteristics

      #Step 4: Use the ADF to test if the spread is stationary
      #can use tSeries library
      adfResults <- adf.test((spread),k=0,alternative="stationary")
      print(adfResults)
      if(adfResults$p.value <= 0.05){
        print(paste("The spread is likely Cointegrated with a pvalue of ",adfResults$p.value,sep=""))
      } else {
        print(paste("The spread is likely NOT Cointegrated with a pvalue of ",adfResults$p.value,sep=""))
      }
    }

    testForCointegration(stockPair)


Great work, I had created something to do the exact same thing but yours is much more elegant and I am quite the R noob. I was wondering if there was a way to adjust the input to take any CSV or txt file of closing prices. I tried to modify it using

read.csv(choose.files(), stringsAsFactors=F)

but ran into a lot of problems I didn’t understand. I have historical databases in csv and would like to use your test against it. Anyway great work.

Hi Derek,

I’m a bit confused by the statement that prices are regressed in the hedge ratio calculation. The paragraph above says it is prices, but the code says it is returns?

I’m currently travelling so can’t give a full answer but it appears that the regression co-efficient used to make the spread is wrong.

Let’s use intuition to work out what should happen. I have two stocks: stock A priced at 100 and stock B priced at 500. These are special stocks: if A changes by some % then B will change by the same %.

The problem is that in $ terms B changes by approx 5 times as much as A, due to the two stocks being at different prices.

Hence we regress prices of stock A with B to figure out for each long of stock A how many of stock B do I need to short so that the $ loss matches the gain of A.

So whilst I’ve plotted returns and drawn the line of best fit, I should also regress prices.

[…] not the right programs. One of the sites where I found the right programs (to access the program click here): the program for carrying out this task is interesting but required me […]

You can find all the ebooks for quantitative finance on this blog

Hi Derek,

I noticed that you used the returns in the regression to calculate the spread, but there is a problem with using the returns: they are stationary time series, so you can’t calculate the cointegration between them.

This is the first definition of cointegration, right?

Anyway, your work is great; it was just this detail that caught my attention.

André

Hi I used the above “model” to develop a working system for futures and currencies cross asset. Can see it being traded via IB simulated account. We use 5 minute price data. http://www.statarbtrading.wordpress.com

Was happy to see the exact paper here !

[…] Gekko Quants 3 Part Series on Stat Arb ( Part I, II, III ) […]

Thanks Gekko for the backtesting code. It is very useful. Couple of comments below:

1) Another reader has already commented about this above. movingAvg needs to be amended by adding align="right" in order to have the first moving avg number on day 90:

movingAvg = rollmean(spread,lookback,align="right", na.pad=TRUE)

2) Since we enter trades at end of day, the return on the trade date shouldn’t count. We can simply shift every element in the “positions” vector down by using the “shift” function in the taRifx library.

Also, I don’t believe daily return is (aRet - stockPair$hedgeRatio*bRet). Imagine if you had a large hedge ratio, i.e. if stock A is priced at $100 and stock B is priced at $10, then the hedgeRatio would be in the neighborhood of 10. Since aRet and bRet are in % terms, the formula won’t work. Daily return should be aRet - bRet * (ratio between dollar neutral ratio vs hedge ratio).

See amendments:

    library(taRifx)
    #Calculate spread daily ret
    aRet <- Delt(t[,1],k=1,type="arithmetic")
    bRet <- Delt(t[,2],k=1,type="arithmetic")
    dollarNeutralRatio <- stockPair$a/stockPair$b
    hedgeRatioOVERdollarNeutralRatio <- stockPair$hedgeRatio/shift(dollarNeutralRatio,-1)
    dailyRet <- aRet - bRet*hedgeRatioOVERdollarNeutralRatio
    dailyRet[is.na(dailyRet)] <- 0
    tradingRet <- dailyRet * shift(positions,-1)
    simulateTrading <- tradingRet
    }

Apologies, this comment was meant for the other page:

http://gekkoquant.com/2013/01/21/statistical-arbitrage-trading-a-cointegrated-pair/

Great work! You beautifully summarized this relatively messy concept with very simple statistical and intuitive relationships. I came across your blog only an hour ago, and I can’t wait to read every post and contribute (if I can) to your views and discussions. Well done, and looking forward to reading more from you.

Aykut

Gekko, this is amazing. My only question is what data are we using? Are you just picking one day and seeing if that day is co-integrated and thus the whole pair is co-integrated?

Hello Gekko,

Nice article. But I would like to know whether just applying an ADF test to the spread A - wB is not a bit ‘abrupt’ as a way to determine if the spread is stationary and hence cointegrated. I mean, testing for stationarity is one thing and cointegration is another. For example, if you apply the KPSS test to the first spread, I am not sure it will give you a stationary result. IMHO, to test cointegration you first need to determine whether each series is stationary (most stocks are, in I(1) form), and then you need to check whether their residuals are stationary, with Engle-Granger or, a bit more complicated, with Johansen. So my question is: do you think the fact that we don’t do these tests undermines the efficacy of the strategy?

thanks

Which statistical test can be used to test the trade factors?

Great post on a fascinating topic of cointegration. Do you know of any research on how stable a cointegration time series can be? All tests (ADF, EG, Johansen) test within a certain time frame and can say if during that timeframe the series are cointegrated. However, how do you know if this cointegration (hedge values) will be stable in the future?

Recently, someone asked me an interesting question: Why would I set the lag to 0 when doing such cointegration tests? I reflected on this for some time. My answer is that we’re using the stationarity test to figure out whether spreads are reverting to 0. Setting a lag different from 0 would imply that we expect the spread series to actually contain an auto-regressive part in it. Now, this is a really tricky question to answer: Would this expectation actually make sense or not? I tend to deny this, but I’m not entirely certain. Any opinions?