Hidden Markov Models – Forward & Viterbi Algorithm Part 2 of 4

In the previous post the Hidden Markov Model was defined, however efficient algorithms are need to calculate some the probabilities / perform the marginalisation over hidden states. Two algorithms that can be used are the forward algorithm and the Viterbi algorithm.

The forward algorithm calculates the likelihood of the data given the model over all possible state sequences.

The Viterbi algorithm calculates the likelihood of the data given the model over the single most likely state sequence.

The forward algorithm

The forward algorithm allows for efficient calculation of the likelihood function p(\mathbf{O}|\lambda).

The forward variable is the likelihood of the HMM producing all the observations up to time t
and occupying state j at time t, it is defined as:
\alpha_{j}(t)=p(\mathbf{o_{1},...,o_{t}},x(t)=j|\lambda)

It is calculated recursively by calculating the forward variable for time t-1 being in state k and then calculating the probability of moving to state j at time t:
\alpha_{j}(t)=p(\mathbf{o_{1},...,o_{t}},x(t-1)=k,x(t)=j|\lambda)=\alpha_{k}(t-1)a_{kj}b_{j}(\mathbf{o_{t}})

Where a_{kj} is the probability of a jump from state k to state j, and b_{j}(\mathbf{o_{t}}) is the probability of generating feature vector \mathbf{o_{t}} from state j.

\alpha_{j}(t)=\sum_{k=1}^{N}p(\mathbf{o_{1},...,o_{t},}x(t-1)=k,x(t)=j|\lambda)

0.1 Forward Algorithm Initialisation

\alpha_{1}(0)=1,\alpha_{j}(0)=0 for 1<j\leq N and \alpha_{1}(t)=0 for 1<t\leq T

0.2 Recursion

For t=1,2,...,T
…… For j=2,3,...,N-1
……………..\alpha_{j}(t)=b_{j}(\mathbf{o_{t}})[\sum_{k=1}^{N-1}\alpha_{k}(t-1)a_{kj}]

0.3 Termination

p(\mathbf{O}|\lambda)=\sum_{k=2}^{N-1}\alpha_{k}(T)a_{kN}

The Viterbi Algorithm

The forward algorithm calculated p(\mathbf{O}|\lambda) by summing over all state sequences, it is sometimes preferable to approximate p(\mathbf{O}|\lambda) which used all state sequences with \hat{p}(\mathbf{O}|\lambda) which will use the single most likely state sequence. This is known as the Viterbi algorithm, the algorithm finds the most likely state sequence.

\hat{p}(\mathbf{O}|\lambda)=max_{X}[p(\mathbf{O},X|\lambda)]\text{ Where \ensuremath{X} is the most likely state sequence}

The probability of the best partial path of length t through the HMM ended at state j is defined as: \phi_{j}(t)=max_{X^{(t-1)}}[p(\mathbf{o_{1},...,o_{t},}x(t)=j|\lambda)]. Where X^{(t-1)} is the best partial path / state sequence.

As with the forward variable \phi
can be calculated recursively \phi_{j}(t)=max_{i}[\phi_{i}(t-1)a_{ij}b_{j}(\mathbf{o_{t}})]

0.4 Viterbi Algorithm Initialisation

\phi_{1}(0)=1,\phi_{j}(0)=0 for 1<j<N and \phi_{1}(t)=0 for 1\leq t\leq T

0.5 Recursion

For t=1,2,...,T
…… For j=2,3,...,N-1
……………..\phi_{j}(t)=max_{1\leq k<N}[\phi_{k}(t-1)a_{kj}]b_{j}(\mathbf{o_{t}})
…………….. store the preceding node pred(j,t)=k

0.6 Termination

p(\mathbf{O},X|\lambda)=max_{1<k<N}\phi_{k}(T)a_{kN}
store the preceding node pred(N,T)=k

The most likely path is found by following the preceding node information backwards that is stored in pred(j,t)

Underflow Errors

The direct calculation of p(\mathbf{O}|\lambda) will most likely cause arithmetic underflow errors. The probabilities can become so small that the computer is unable to calculate them correctly. You should instead calculate the log likelihood e.g log(p(\mathbf{O}|\lambda))

Hidden Markov Models – Model Description Part 1 of 4

Hidden Markov Models

This post will develop a general framework for classification tasks using hidden markov models. The tutorial series will cover how to build and train a hidden markov models in R. Initially the maths will be explained, then an example in R provided and then an application on financial data will be explored.

General Pattern Recognition Framework

A set of features \mathbf{o} are derived from data set \mathbf{d} and a class \omega identified by finding the most likely class given the data \mathbf{o}

\hat{\omega}=argmax_{\omega}P(\omega|\mathbf{o})

However P(\omega|\mathbf{o}) is unknown, so Bayes’ rule must be used.

\hat{\omega}=argmax_{\omega}\frac{P(\mathbf{o}|\omega)P(\omega)}{P(\mathbf{o})}=argmax_{\omega}P(\mathbf{o}|\omega)P(\omega)

Since the maximisation does not depend upon P(\mathbf{o}) we can ignore it. The terms P(\mathbf{o}|\omega) and P(\omega) , are the likelihood of the data given the class and prior probability of a class respective, both terms are defined by a model. The feature model P(\mathbf{o}|\omega) will be described by the hidden markov model (HMM), each class will have it’s own HMM.

The Task at Hand

First we need to generate a set of features \mathbf{o} from the raw data \mathbf{d}. I will skip this step for now because it is specific to the application of your hidden markov model, for example in finance \mathbf{d} may be various stock prices and \mathbf{o} could be a set of technical indicators / volatility calculations applied to the data \mathbf{d}. HMM’s are popular in speech recognition and typically \mathbf{o} is a vector describing the characteristics of the frequency spectrum of the speech.

Secondly the feature vector \mathbf{o} must then be assigned a class from the HMM. This is done the via maximum likelihood estimation, the HMM is a generative model, choose the class that is most likely to have generated the feature vector \mathbf{o}.
For finance the class might be a market regime (trending/mean reverting) or in speech recognition the class is a word.

Example HMM Specification

hidden markov model

N The number of states in the HMM

a_{ij} The probability of transitioning from state i to state j

b_{j}(\mathbf{o}) The probability of generating feature vector \mathbf{o} upon entering state j (provided j is not the entry or exit state)

The HMM \lambda may be written as \lambda=[N,a_{ij},b_{j}]

\mathbf{O}=[\mathbf{o_{1},}\mathbf{o_{2},o_{T}]} the observed feature vectors

X=[x_{1},x_{2},x_{T}] the specified state sequence

The joint probability is the probability of jumping from one state to the next multiplied by the prob of generating the feature vector in that state:

p(\mathbf{O},X|\lambda)=a_{x_{0},x_{1}}\prod_{t=1}^{T}b_{x_{t}}(\mathbf{o_{t}})a_{x_{t},x_{t+1}}

Where x_{0} is always the entry state 1, and x_{T+1} is always the exit state N.

Likelihood Calculation

In the above joint probability calculation we have assumed a state sequence X. However this is a latent variable, we do not know it, it is hidden (hence the name HIDDEN markov model)! However if we sum over all possible state sequences we can marginalise it out.

p(\mathbf{O}|\lambda)=\sum_{X}p(\mathbf{O},X|\lambda)

This can be problematic due to the number of possible state sequences (especially in a real-time application), luckily algorithms exist to effectively perform the calculation without needing to explore every state sequence. One such algorithm is the forward algorithm.

What is b_{j}(\mathbf{o})?

This is the output distribution for a given state j. The distribution can be anything you like however it should hopefully match the distribution of the data at state j, and it must be mathematically tractable. The most natural choice at this stage is to assume \mathbf{o} can be described by the multivariate Gaussian. As a word of caution if the elements of your feature vector are highly correlated then \Sigma, the covariance matrix, has a lot of parameters to measure. See if you can collapse \Sigma
to a diagonal matrix.

E.g b_{j}(\mathbf{o})\sim N(\mathbf{o};\mu_{j},\Sigma_{j})

How to train b_{j}(\mathbf{o}) / Viterbi Parameter Estimation

We already know how to fit a normal distribution, the MLE for \mu is the mean, and \Sigma the covariance of the feature vector. However we must only calculate the mean and covariance on feature vectors that came from state j, this is known as Viterbi Segmentation. Viterbi Segmentation means there is a hard assignment between feature vector and the state that generated it, an alternative method is called Balm-Welch which probabilistically assigns feature vectors to multiple states.

State j generated observations starting at t_{j}

\hat{\mu}_{j}=\frac{1}{t_{j+1}-t_{j}}\sum_{t=t_{j}}^{t_{j+1}-1}\mathbf{o_{t}}

\hat{\mathbf{\Sigma}}_{j}=\frac{1}{t_{j+1}-t_{j}}\sum_{t=t_{j}}^{t_{j+1}-1}[(\mathbf{o_{t}-\hat{\mu}_{j}})(\mathbf{o_{t}-\hat{\mu}_{j}})

It is not known in advance which state generated which observation vector, fortunately there is an algorithm called the Viterbi algorithm to approximately solve this problem.

hmm training outline

The forward algorithm for efficient calculation of p(\mathbf{O}|\lambda) and the Viterbi algorithm will be explored in my next post.

Economic Data In R – Nonfarms Payroll & Other FED Data

This post is a simple demo to show how to get economic data into R.The data comes from the Federal Reserve Bank of St Louis http://research.stlouisfed.org/fred2/.

I will show how to download Nonfarms Payroll numbers, although it is very easy to modify the code below to download GDP, CPI etc…

non-farms payroll plot

non-farms payroll vs s&p 500 regression

The top chart shows non-farms plotted with the S&P 500. It is interesting to note that in the % change charts there is a crash in the market around mid 08 this is then followed by a crash in the non-farms numbers. Although not a very rigorous analysis it looks like non-farms numbers LAG the market.

The second chart regress the % change in payrolls with the % change in the S&P for the month. It is seen in the scatter plot that there is no clear relationship between payroll change and S&P change.

The second regression on the right takes this months payroll change and regress it against next months S&P return, ie try and see if the numbers from this month can tell us anything about the return in the S&P for the coming month. Payrolls don’t look very predictive at the 1month time horizon. I think a more interesting analysis would look at payrolls on the market over 10,20,30min horizons intraday.

Onto the code:

?View Code RSPLUS
library("quantmod")
 
#To see what the datasets are available from the FED goto the link below
#http://research.stlouisfed.org/fred2/
 
economicData <- new.env() #Make a new environment for quantmod to store data in
 
startDate = as.Date("2000-01-01") #Specify what date to get the prices from
getSymbols("PAYEMS",src="FRED",env=economicData,from=startDate) #Payems is non-farms payrolls
getSymbols("^GSPC",env=economicData,from=startDate) #S&P 500
 
 
 
economicData$PAYEMS <- window(economicData$PAYEMS,start=startDate) #Window our data (FRED ignores the from parameter above) :@
economicData$GSPC <- window(economicData$GSPC,start=startDate) #Window our data
 
mergedData <- merge(economicData$PAYEMS,Cl(economicData$GSPC),all=FALSE) #join the two datasets based on their SHARED dates
 
#Calculate the % diff
mergedPercDiff<- mergedData
mergedPercDiff$PAYEMS <- diff(mergedData$PAYEMS)/Lag(mergedData$PAYEMS)
mergedPercDiff$GSPC.Close <- diff(mergedData$GSPC.Close)/Lag(mergedData$GSPC.Close)
 
dev.new()
par(mfrow=c(2,2))
plot(mergedData$PAYEMS, main="Non-Farm Payrolls",ylab="Thousands of Persons")
plot(mergedPercDiff$PAYEMS, main="Non-Farm Payrolls", ylab="% Change")
plot(mergedData$GSPC.Close, main="S&P 500 Close",ylab="Close Price")
plot(mergedPercDiff$GSPC.Close, main="&P 500 Close",ylab="% Change")
 
#Function to plot data and add regression line
doPlot <- function(x,y,title,xlabel,ylabel){
  x<-as.vector(x)
  y<-as.vector(y)
  regression <- lm(y~x)
  print(regression)
  plot(y~x,main=title, xlab=xlabel,ylab=ylabel)
  abline(regression,col="red",lwd=1.5)
  legend("bottomleft",paste("y=",regression$coefficients[2],"x+",regression$coefficients[1],sep=""),bg="lightblue")
}
 
dev.new()
par(mfrow=c(1,2))
doPlot(mergedPercDiff$PAYEMS,mergedPercDiff$GSPC.Close,"Regress Non-Farms Payroll with S&P Monthly Returns","Non-Farms Monthly % Change","S&P 500 Monthly % Change")
doPlot(Lag(mergedPercDiff$PAYEMS),mergedPercDiff$GSPC.Close,"Regress Non-Farms Payroll with NEXT MONTH'S S&P Monthly Return","Non-Farms Monthly % Change","S&P 500 Monthly % Change")