## Nov 26, 2008

### How Did Topic Models Evolve

`          Finite models                    Infinite models       --------------------------------------------------------------------                                                                          |          Latent Dirichlet allocation ---> Hierarchical Dirichlet process | mixtures            A                                A                            |             |                                |                            |Start -->  Finite mixture model ---------> Infinite mixture model         | mixture            |                                |                            |            V                                V                            |          Hidden Markov model -----------> Infinite hidden Markov model   | temporal`

Finite Mixture Model
This is the very basic modeling technique --- represent a complex thing by a weighted sum of simple things --- e.g., Fourier transformation represents a complex signal by a weighted sum of sine curves. A typical example of finite mixture model is Gaussian mixture, which is often learned using EM. Another example is multinomial mixture with Dirichlet prior, which can be learned using Gibbs sampling because Dirichlet and multinomial are conjugate.

Infinite Mixture Model
By modeling the component prior using Dirichlet process, finite mixture model is extended to have infinite mumber of mixture components. If the likelihood is conjugate with the Dirichlet process prior, the model can be conveniently learned using Gibbs sampling [Neal 1998, Markov Chain Sampling Methods for Dirichlet Process Mixture Models]. Another interesting model is [Rasmussen 2000, Infinite Gaussian Mixture Model].

Note that the Dirichlet process can be defined on any measure space, even if a space whose each sample point is a component (parameters). So the generative process becomes that a Dirichlet process G generates a component θ for each observed object x. Because the discrete-and-clustering property of Dirichlet process, components of multiple x's are often the same, that is, these x's are of the same component/cluster. So we can say that G (of infinite mixture model) encodes both components and their weights (of finite mixture model). We often use another Dirichlet process parameterized by H as the prior of G. Because G~DP(H) and θ~G are conjugate, G can be integrated out. The likelihood x~F(θ) can be of any form (is this true? need to read the infinite Gaussian mixture model paper.).

Latent Dirichlet Allocation
Mixture models represent the probability distribution of "a bag of" samples. However, in some cases, we need to model more than bags, where objects in each bag is represented by a mixture model. To capture the commonality between bags, we want mixture models of these bags share the same set of components, but each bag has its own set of weights. When the components are multinomial and weight sets of bags share a Dirichlet prior, this hierarchical mixture model is known as latent Dirichlet allocatioin.

Exact inference of LDA is intractable. Approximate inference can be done using variational methods. In particular, if a Dirichlet prior is assumed to generate the set of multinomial components (shared by mixture models of documents), these components can be integrated out, and the model inference can be done using Gibbs sampling.

Hierarchical Dirichlet Process
In each bag j, each objects xji is generated by its correspond component θji. In the bag, all θji's are samples of Gj, a Dirichlet process (which encodes components and their weights within bag j). In the introduction of LDA, we explained that we hope that Gj's of various documents share components. This can be done by defining all Gj's on the same measure space, which covers all possible values of θ, and assuming that Gj's are samples of a global Dirichlet process G0. Note that Gj~G0 mimics a multinomial sampling, and θji~Gj also mimics a multinomial sampling, so we put a Dirichlet process prior on G0 and make the whole generation process conjugate, thus G0 and Gj's can all be integrated out. This is key to estimate HDP using Gibbs sampling.

Hidden Markov Model
In mixture model, components are exchangable. If we assume first-order temporal dependency between components, mixture model becomes hidden Markov model (HMM), where each output pdf of the HMM corresponds to a component of the mixture model. Inference of HMM can be done using the Viterbi algorithm. A usual hierarchical extension of HMM is to extend each output pdf by a mixture model.

Infinite Hidden Markov Model
HMM assumes the state/component in time t, denoted by θt, depends only on θt-1. Some high-order extensions to HMM assumes that θt depends only on θt-1, ... θt-L, where L is the known as the order of HMM and reflect the length of memory/history. Most high-order HMMs assumes L a fixed positive integer, and can be represented by a first-order HMM whose each state corresponds to a combination of θt-1, ... θt-L. In contrast to these fixed-order HMMs, the Variable-Length HMM (VLHMM) learns a set of combinations with variable lengths. This is under the intuition that in some times we need only short memory/history to predict the future accurately, while in some times, we need longer memory/history. Comparing with VLHMM, which facilitates a prefix-tree prunning approach to learn the memories with max length L, the infinite HMM seems much more elegant and can learn memories with infinite length. (I need to read [Beal etc, The Infinite Hidden Markov Model] carefully to ensure that above description is not wrong).