Hidden Markov Models
This post develops a general framework for classification tasks using hidden Markov models. The tutorial series will cover how to build and train hidden Markov models in R. First the maths is explained, then an example in R is provided, and finally an application to financial data is explored.
General Pattern Recognition Framework
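A set of features $o$ is derived from a data set $d$, and a class $\omega$ is identified by finding the most likely class given the data $o$:
$$\hat{\omega} = \arg\max_{\omega} P(\omega \mid o)$$
However $P(\omega \mid o)$ is unknown, so Bayes' rule must be used:
$$\hat{\omega} = \arg\max_{\omega} \frac{P(o \mid \omega)\, P(\omega)}{P(o)} = \arg\max_{\omega} P(o \mid \omega)\, P(\omega)$$
Since the maximisation does not depend upon $P(o)$ we can ignore it. The terms $P(o \mid \omega)$ and $P(\omega)$ are the likelihood of the data given the class and the prior probability of the class respectively; both terms are defined by a model. The feature model $P(o \mid \omega)$ will be described by the hidden Markov model (HMM), and each class will have its own HMM.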
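The decision rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify`, the class labels, and all the numbers are made up, and the log-likelihoods stand in for whatever each class's HMM would report. Working in logs turns the product of likelihood and prior into a sum.

```python
import math

def classify(log_likelihoods, priors):
    """Pick the class maximising log P(o | class) + log P(class)."""
    scores = {
        label: log_likelihoods[label] + math.log(priors[label])
        for label in log_likelihoods
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical values: one log-likelihood per class, as each HMM might report.
log_likelihoods = {"trending": -12.0, "mean_reverting": -11.5}
priors = {"trending": 0.7, "mean_reverting": 0.3}

best, scores = classify(log_likelihoods, priors)
print(best, scores)
```

With these made-up numbers the prior tips the decision: the class with the lower likelihood still wins because it is considered more probable beforehand.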
The Task at Hand
First we need to generate a set of features $o$ from the raw data $d$. I will skip this step for now because it is specific to the application of your hidden Markov model. For example, in finance $d$ may be various stock prices and $o$ could be a set of technical indicators or volatility calculations applied to the data $d$. HMMs are popular in speech recognition, where $o$ is typically a vector describing the characteristics of the frequency spectrum of the speech.
Secondly the feature vector $o$ must be assigned a class from the HMM. This is done via maximum likelihood estimation: the HMM is a generative model, so choose the class that is most likely to have generated the feature vector $o$.
In finance the class might be a market regime (trending or mean reverting); in speech recognition the class is a word.
Example HMM Specification
$N$: the number of states in the HMM
$a_{ij}$: the probability of transitioning from state i to state j
$b_j(o)$: the probability of generating feature vector $o$ upon entering state j (provided j is not the entry or exit state)
The HMM $\lambda$ may be written as $\lambda = [N, a_{ij}, b_j]$
$O = [o_1, o_2, \dots, o_T]$: the observed feature vectors
$X = [x_1, x_2, \dots, x_T]$: the specified state sequence
The joint probability is the probability of jumping from one state to the next multiplied by the probability of generating the feature vector in that state:
$$p(O, X \mid \lambda) = a_{x_0 x_1} \prod_{t=1}^{T} b_{x_t}(o_t)\, a_{x_t x_{t+1}}$$
where $x_0$ is always the entry state 1, and $x_{T+1}$ is always the exit state N.
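As a minimal Python sketch of the joint probability for one given state sequence, consider a toy model with four states: state 1 is the entry, state 4 is the exit, and states 2 and 3 emit. Everything here is an assumption for illustration: the names `A`, `B` and `joint_probability`, the numbers, and the use of discrete output symbols ("up"/"down") instead of continuous feature vectors.

```python
# Transition probabilities A[i][j]; missing entries mean probability zero.
A = {1: {2: 1.0}, 2: {2: 0.6, 3: 0.4}, 3: {3: 0.7, 4: 0.3}}
# Output probabilities B[j][symbol] for the emitting states only.
B = {2: {"up": 0.8, "down": 0.2}, 3: {"up": 0.3, "down": 0.7}}
ENTRY, EXIT = 1, 4

def joint_probability(obs, states):
    """Probability of the observations AND this particular state sequence."""
    path = [ENTRY] + list(states) + [EXIT]
    # Leave the entry state ...
    p = A[path[0]].get(path[1], 0.0)
    # ... then, for each step, emit the observation and move on.
    for t, o in enumerate(obs, start=1):
        p *= B[path[t]].get(o, 0.0) * A[path[t]].get(path[t + 1], 0.0)
    return p

p = joint_probability(["up", "up", "down"], [2, 2, 3])
print(p)
```

The path 1 → 2 → 2 → 3 → 4 multiplies out as 1.0 × (0.8 × 0.6) × (0.8 × 0.4) × (0.7 × 0.3).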
In the above joint probability calculation we have assumed a state sequence $X$. However this is a latent variable: we do not know it, it is hidden (hence the name HIDDEN Markov model)! If we sum over all possible state sequences we can marginalise it out:
$$p(O \mid \lambda) = \sum_{X} p(O, X \mid \lambda)$$
This can be problematic due to the number of possible state sequences (especially in a real-time application). Luckily, algorithms exist to perform the calculation efficiently without needing to explore every state sequence. One such algorithm is the forward algorithm.
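To see why the direct sum gets expensive, here is a minimal Python sketch that marginalises by brute force, enumerating every state sequence. It is deliberately the naive approach, not the forward algorithm. The toy model (names `A`, `B`, `EMITTING`, discrete "up"/"down" outputs, all numbers) is hypothetical.

```python
from itertools import product

A = {1: {2: 1.0}, 2: {2: 0.6, 3: 0.4}, 3: {3: 0.7, 4: 0.3}}
B = {2: {"up": 0.8, "down": 0.2}, 3: {"up": 0.3, "down": 0.7}}
ENTRY, EXIT = 1, 4
EMITTING = [2, 3]  # states that generate observations

def joint(obs, states):
    """Joint probability of the observations and one state sequence."""
    path = [ENTRY] + list(states) + [EXIT]
    p = A[path[0]].get(path[1], 0.0)
    for t, o in enumerate(obs, start=1):
        p *= B[path[t]].get(o, 0.0) * A[path[t]].get(path[t + 1], 0.0)
    return p

def marginal_brute_force(obs):
    """Sum the joint probability over every possible state sequence."""
    total, n_paths = 0.0, 0
    for states in product(EMITTING, repeat=len(obs)):
        total += joint(obs, states)
        n_paths += 1
    return total, n_paths

likelihood, n_paths = marginal_brute_force(["up", "up", "down"])
print(likelihood, n_paths)
```

Three observations and two emitting states already give 2^3 = 8 sequences to evaluate; in general the count is the number of emitting states raised to the power T, which is what makes the naive sum impractical for long sequences.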
What is $b_j(o_t)$?
This is the output distribution for a given state j. The distribution can be anything you like, but it should match the distribution of the data at state j as closely as possible, and it must be mathematically tractable. The most natural choice at this stage is to assume $o_t$ can be described by the multivariate Gaussian:
$$b_j(o_t) = \mathcal{N}(o_t;\, \mu_j, \Sigma_j)$$
As a word of caution, if the elements of your feature vector are highly correlated then $\Sigma_j$, the covariance matrix, has a lot of parameters to measure. See if you can collapse $\Sigma_j$ to a diagonal matrix.
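A minimal Python sketch of what the diagonal simplification buys. With a diagonal covariance the Gaussian log density is just a sum of independent one-dimensional terms, and the number of covariance parameters drops from d(d+1)/2 to d for a d-dimensional feature vector. The function names are hypothetical.

```python
import math

def diag_gaussian_logpdf(o, mean, var):
    """Log density of a Gaussian whose covariance is diagonal.

    `var` holds the diagonal entries (one variance per feature).
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

def covariance_parameter_counts(dim):
    """Free parameters in a full symmetric covariance versus a diagonal one."""
    return dim * (dim + 1) // 2, dim

full, diag = covariance_parameter_counts(10)
print(full, diag)
print(diag_gaussian_logpdf([0.1, -0.2], mean=[0.0, 0.0], var=[1.0, 0.5]))
```

For ten features that is 55 covariance parameters against 10, per state.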
How to train $b_j(o_t)$ / Viterbi Parameter Estimation
We already know how to fit a normal distribution: the MLE for $\mu$ is the mean of the feature vectors, and for $\Sigma$ it is their covariance. However we must only calculate the mean and covariance on feature vectors that came from state j; this is known as Viterbi Segmentation. Viterbi Segmentation means there is a hard assignment between a feature vector and the state that generated it. An alternative method, called Baum-Welch, probabilistically assigns feature vectors to multiple states.
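Suppose state j generated the observations starting at $t_j$ and ending just before $t_{j+1}$, where the next state takes over. Then:
$$\hat{\mu}_j = \frac{1}{t_{j+1} - t_j} \sum_{t=t_j}^{t_{j+1}-1} o_t$$
$$\hat{\Sigma}_j = \frac{1}{t_{j+1} - t_j} \sum_{t=t_j}^{t_{j+1}-1} (o_t - \hat{\mu}_j)(o_t - \hat{\mu}_j)^{\top}$$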
It is not known in advance which state generated which observation vector. Fortunately there is an algorithm, the Viterbi algorithm, that approximately solves this problem.
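Assuming the assignment of observations to states is already available (here it is simply made up), the hard-assignment estimate can be sketched in Python as below. The function name, the toy observations, and the alignment are all hypothetical, and for brevity the sketch estimates only per-feature variances (a diagonal covariance) rather than a full covariance matrix.

```python
from collections import defaultdict

def estimate_state_gaussians(observations, alignment):
    """Per-state MLE mean and per-feature variance under a hard assignment.

    alignment[t] is the state assumed to have generated observations[t].
    """
    grouped = defaultdict(list)
    for o, state in zip(observations, alignment):
        grouped[state].append(o)

    params = {}
    for state, vectors in grouped.items():
        n, dim = len(vectors), len(vectors[0])
        mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
        # MLE variance divides by n (not n - 1).
        var = [sum((v[k] - mean[k]) ** 2 for v in vectors) / n for k in range(dim)]
        params[state] = (mean, var)
    return params

# Hypothetical data: five 2-dimensional feature vectors and their states.
observations = [[1.0, 0.0], [3.0, 0.0], [10.0, 2.0], [12.0, 4.0], [14.0, 6.0]]
alignment = [2, 2, 3, 3, 3]

params = estimate_state_gaussians(observations, alignment)
print(params)
```

Each state's parameters depend only on the vectors assigned to it, which is exactly the hard assignment described above; Baum-Welch would instead weight every vector by its probability of belonging to each state.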
The forward algorithm for efficient calculation of $p(O \mid \lambda)$ and the Viterbi algorithm will be explored in my next post.