Nov 30, 2008

The Sum-Product Algorithm and Highly Scalable LDA?

Months ago, Yuzhou Zhang pointed me to an interesting paper,
Clustering by Passing Messages Between Data Points. I am very interested in this paper, because if an algorithm can be expressed in terms of message passing, it can be scaled up well using parallel computing frameworks like MPI and BSP.

Last week, Zhiyuan told me that the basis of clustering by message passing is the sum-product algorithm, a well-studied probabilistic inference algorithm for tree models expressed as factor graphs. Over the weekend, on my flight to Guangzhou, I learned the sum-product algorithm and its variant, max-sum, from Chapter 8 of Pattern Recognition and Machine Learning.

I have been curious about the fact that max-sum and sum-product are used for inference in tree or poly-tree models, whereas clustering is, in most cases, equivalent to learning a mixture model. In a mixture model, each observable variable has at least two parents --- the component index (label variable) and the component parameters. This makes the graphical model no longer a tree (nor a poly-tree). The question, then, is how max-sum and sum-product can be used to do clustering.

After discussing the paper with Zhiyuan, I feel that the answer is as follows. It is true that the problem of "clustering data points" can be modeled by a mixture model. However, the paper addresses a problem of "graph partitioning", or "clustering of interconnected vertices". The authors model the problem by a poly-tree --- the roots are constraints like "if a vertex v is a centroid, it must be the centroid of itself"; the second-level nodes correspond to the centroid assignments of vertices in the graph; and the third-level nodes are the vertices in the graph. This poly-tree can be inferred using sum-product.
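As far as I understand the paper, the max-sum messages on this factor graph reduce to two kinds of real-valued messages, "responsibilities" and "availabilities", exchanged between data points. Below is a minimal numpy sketch of these updates as I read them from the paper; the function name affinity_propagation, the damping factor lam, and the iteration count are my own choices, and the similarity matrix S (with preferences on its diagonal) is assumed to be given.

import numpy as np

def affinity_propagation(S, lam=0.5, n_iter=200):
    """Sketch of the responsibility/availability updates from the paper;
    S is an n-by-n similarity matrix whose diagonal holds the preferences."""
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i,k)
    A = np.zeros((n, n))  # availabilities  a(i,k)
    for _ in range(n_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [ a(i,k') + s(i,k') ]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first_max = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second_max = AS.max(axis=1)
        R_new = S - first_max[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second_max
        R = lam * R + (1 - lam) * R_new          # damping
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())       # keep r(k,k) itself
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = A_new.diagonal().copy()             # a(k,k) has no min(0, .)
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)
        A = lam * A + (1 - lam) * A_new
    # exemplar of each point: argmax_k [ a(i,k) + r(i,k) ]
    return np.argmax(A + R, axis=1)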

On page 9 of the supporting material of Clustering by Passing Messages Between Data Points, the authors explain that:
The max-sum algorithm for the factor graph in Fig. S2A can be derived in a straightforward fashion (c.f. (S6)) and consists of sending messages from variables to functions and from functions to variables in a recursive fashion.
Here, (S6) refers to an introductory paper, Factor Graphs and the Sum-Product Algorithm. I read it and feel that it is a pretty good introduction to sum-product, but, to understand factor graphs, it is better to refer to the book Pattern Recognition and Machine Learning.

My intention is this: if max-sum/sum-product can be used for inference in a mixture model, I would like to find out whether it is also applicable to inference in LDA. If so, we may have a novel, highly scalable solution to LDA.

As this paper uses sum-product to do graph partitioning rather than clustering a bag of words, even if we can generalize the sum-product approach to handle more than one graph (as LDA generalizes the mixture model to handle multiple bags of words), the resulting solution may look more like frequent sub-graph mining than a novel inference algorithm for LDA ...

Nov 27, 2008

Some Questions In Reading PRML

Here follow some questions that I met while reading Pattern Recognition and Machine Learning. I hope to get through them in the next week:
  1. Why must the Markov blanket of a node in a Bayesian network include co-parents, whereas the Markov blanket of a node in a Markov random field includes only the direct neighbors? (ref. P383, Figure 8.26 and P385, Figure 8.28)
  2. Lagrange multiplier with constraint g(X)>0. "Now, however, the sign of the Lagrange multiplier is crucial, because the function f(x) will only be at a maximum if its gradient is oriented away from the region g(x) > 0, as illustrated in Figure E.3. We therefore have ∇f(x) = −λ∇g(x) for some value of λ > 0."
  3. On variational inference (P465, Equation 10.6): how to dissect out the dependence on one of the factors qj(Zj).
  4. Factor graphs and the sum-product algorithm (also known as belief propagation).
  5. Variational inference and expectation propagation.

How to Label/Tag Documents?

I am interested in labeling an arbitrary document with predefined or learned labels. I prefer to use LDA for this problem, because we have a highly scalable implementation of LDA, and LDA can explain each document by topics. Nevertheless, I am doing a survey before programming. I have read the following papers. Would anyone please provide any hints?
  • Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative learning for dimensionality reduction and classification. NIPS 2008.
  • D. Blei and M. Jordan. Modeling Annotated Data. SIGIR 2003.
  • Chemudugunta, C., Holloway, A., Smyth, P., & Steyvers, M. Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning. In: 7th International Semantic Web Conference, 2008.
  • Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. WWW 2008
  • Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai. Automatic Labeling of Multinomial Topic Models, KDD 2007.
  • Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai. Semantic Annotation of Frequent Patterns, ACM TKDD, 1(3), 2007.

Nov 26, 2008

How Did Topic Models Evolve


             Finite models                       Infinite models
-------------------------------------------------------------------------------------
      Latent Dirichlet allocation -----> Hierarchical Dirichlet process   | mixtures
                   ^                                    ^
                   |                                    |
Start --> Finite mixture model -------------> Infinite mixture model      | mixture
                   |                                    |
                   v                                    v
          Hidden Markov model -----------> Infinite hidden Markov model   | temporal

Finite Mixture Model
This is the very basic modeling technique --- represent a complex thing by a weighted sum of simple things --- e.g., the Fourier transform represents a complex signal by a weighted sum of sine curves. A typical example of a finite mixture model is the Gaussian mixture, which is often learned using EM. Another example is the multinomial mixture with a Dirichlet prior, which can be learned using Gibbs sampling because the Dirichlet and the multinomial are conjugate.
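To make this concrete, here is a toy EM sketch for a one-dimensional Gaussian mixture (my own illustration; the function name, initialization, and fixed iteration count are arbitrary choices):

import numpy as np

def em_gaussian_mixture(x, K=2, n_iter=100):
    """Toy EM for a 1-D Gaussian mixture: alternate the E-step
    (posterior responsibilities) and the M-step (weighted MLE)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    pi = np.full(K, 1.0 / K)                     # mixing weights
    mu = np.random.choice(x, K, replace=False)   # initial component means
    var = np.full(K, np.var(x))                  # component variances
    for _ in range(n_iter):
        # E-step: responsibilities gamma[n, k] = p(z_n = k | x_n)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        gamma = pi * dens
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters using the responsibilities as weights
        Nk = gamma.sum(axis=0)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
        pi = Nk / n
    return pi, mu, var

For example, em_gaussian_mixture(np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])) should recover means near 0 and 5.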

Infinite Mixture Model
By modeling the component prior with a Dirichlet process, the finite mixture model is extended to have an infinite number of mixture components. If the likelihood is conjugate with the Dirichlet process prior, the model can be conveniently learned using Gibbs sampling [Neal 1998, Markov Chain Sampling Methods for Dirichlet Process Mixture Models]. Another interesting model is [Rasmussen 2000, The Infinite Gaussian Mixture Model].

Note that the Dirichlet process can be defined on any measure space, even a space in which each sample point is a component (a set of parameters). The generative process then becomes: a Dirichlet process G generates a component θ for each observed object x. Because of the discreteness-and-clustering property of the Dirichlet process, the components of multiple x's are often identical, that is, these x's belong to the same component/cluster. So we can say that G (of the infinite mixture model) encodes both the components and their weights (of the finite mixture model). We often use a Dirichlet process with base measure H as the prior of G. Because G~DP(H) and θ~G are conjugate, G can be integrated out. The likelihood x~F(θ) can be of any form (is this true? need to read the infinite Gaussian mixture model paper.).
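The discreteness-and-clustering property mentioned above can be illustrated with the Chinese restaurant process view of the Dirichlet process: each new object joins an existing cluster with probability proportional to the cluster size, or opens a new cluster with probability proportional to the concentration parameter alpha. Here is a toy sketch (my own illustration; the function name and defaults are arbitrary):

import numpy as np

def crp_assignments(n, alpha=1.0, rng=None):
    """Draw cluster assignments from a Chinese restaurant process:
    P(existing cluster k) = n_k / (i + alpha), P(new cluster) = alpha / (i + alpha)."""
    if rng is None:
        rng = np.random.default_rng(0)
    assignments = [0]          # the first object starts the first cluster
    counts = [1]
    for i in range(1, n):
        probs = np.array(counts + [alpha], dtype=float) / (i + alpha)
        k = rng.choice(len(probs), p=probs)
        if k == len(counts):   # open a new cluster
            counts.append(1)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments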

Latent Dirichlet Allocation
Mixture models represent the probability distribution of "a bag of" samples. However, in some cases, we need to model more than one bag, where the objects in each bag are represented by a mixture model. To capture the commonality between bags, we want the mixture models of these bags to share the same set of components, while each bag has its own set of weights. When the components are multinomial and the weight sets of the bags share a Dirichlet prior, this hierarchical mixture model is known as latent Dirichlet allocation.

Exact inference of LDA is intractable. Approximate inference can be done using variational methods. In particular, if a Dirichlet prior is assumed to generate the set of multinomial components (shared by mixture models of documents), these components can be integrated out, and the model inference can be done using Gibbs sampling.
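For reference, the generative process of LDA described above can be written down in a few lines (my own toy sketch; the function name and the symmetric hyperparameter values are arbitrary choices):

import numpy as np

def generate_lda_corpus(n_docs, doc_len, K, V, alpha=0.1, beta=0.01, rng=None):
    """Sample a toy corpus from the LDA generative process."""
    if rng is None:
        rng = np.random.default_rng(0)
    phi = rng.dirichlet([beta] * V, size=K)       # K topic-word multinomials
    corpus = []
    for _ in range(n_docs):
        theta = rng.dirichlet([alpha] * K)        # per-document topic weights
        z = rng.choice(K, size=doc_len, p=theta)  # topic assignment of each word
        w = [rng.choice(V, p=phi[k]) for k in z]  # word drawn from its topic
        corpus.append(w)
    return corpus, phi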

Hierarchical Dirichlet Process
In each bag j, each object xji is generated by its corresponding component θji. Within the bag, all θji's are samples of Gj, a Dirichlet process (which encodes the components and their weights within bag j). In the LDA section above, we explained that we want the Gj's of various documents to share components. This can be done by defining all Gj's on the same measure space, which covers all possible values of θ, and assuming that the Gj's are samples of a global Dirichlet process G0. Note that Gj~G0 mimics a multinomial sampling, and θji~Gj also mimics a multinomial sampling, so we put a Dirichlet process prior on G0 and make the whole generative process conjugate; thus G0 and the Gj's can all be integrated out. This is key to estimating HDP using Gibbs sampling.

Hidden Markov Model
In a mixture model, the components are exchangeable. If we assume a first-order temporal dependency between components, the mixture model becomes a hidden Markov model (HMM), where each output pdf of the HMM corresponds to a component of the mixture model. Decoding the most probable state sequence of an HMM can be done using the Viterbi algorithm. A common hierarchical extension of the HMM is to replace each output pdf by a mixture model.
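As a reference for the decoding step just mentioned, here is a generic Viterbi sketch over log-probabilities (my own sketch; the argument names log_pi, log_A, log_B for the initial, transition, and emission log-probability arrays are my own convention):

import numpy as np

def viterbi(obs, log_pi, log_A, log_B):
    """Most likely state sequence of an HMM via dynamic programming.
    log_pi[k], log_A[j, k], log_B[k, v] are log initial, transition,
    and emission probabilities; obs is a sequence of symbol indices."""
    T, K = len(obs), len(log_pi)
    delta = np.empty((T, K))            # best log-prob of any path ending in state k at time t
    psi = np.empty((T, K), dtype=int)   # backpointers
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A     # scores[j, k]: come from j, go to k
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # backtrack from the best final state
    states = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t, states[-1]]))
    return states[::-1]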

Infinite Hidden Markov Model
An HMM assumes that the state/component at time t, denoted by θt, depends only on θt-1. Some higher-order extensions of the HMM assume that θt depends on θt-1, ..., θt-L, where L is known as the order of the HMM and reflects the length of memory/history. Most higher-order HMMs assume L is a fixed positive integer, and can be represented by a first-order HMM in which each state corresponds to a combination of θt-1, ..., θt-L. In contrast to these fixed-order HMMs, the variable-length HMM (VLHMM) learns a set of combinations with variable lengths, under the intuition that sometimes we need only a short memory/history to predict the future accurately, while at other times we need a longer one. Compared with VLHMM, which uses a prefix-tree pruning approach to learn memories of length at most L, the infinite HMM seems much more elegant and can learn memories of unbounded length. (I need to read [Beal et al., The Infinite Hidden Markov Model] carefully to make sure the above description is not wrong.)

Understanding Expectation-Maximization

I have been using EM for years without knowing some details. Thanks to Chris Bishop for the general description of EM in Section 9.4 of Pattern Recognition and Machine Learning. Here follows my learning note.

EM is designed to maximize p(X|θ) in cases where direct optimization of p(X|θ) is difficult, but optimization of the complete-data likelihood function p(X,Z|θ) is significantly easier. When we consider p(X,Z|θ) in the context of p(X|θ), we have to consider a distribution over Z, denoted by q(Z). Given q(Z), we can decompose ln p(X|θ) into two terms:

\ln p(\mathbf{X}\mid\mathbf{\theta}) = \mathcal{L}(q,\mathbf{\theta}) + \operatorname{KL}(q||p)

where

\mathcal{L}(q,\mathbf{\theta}) = \sum_{\mathbf{Z}} { q(\mathbf{Z}) \ln \left\{ \frac{ p(\mathbf{X}, \mathbf{Z} | \mathbf{\theta}) }{ q( \mathbf{Z} ) } \right\} }
\operatorname{KL}(q||p) = - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{ p(\mathbf{Z} \mid \mathbf{X}, \mathbf{\theta} ) }{ q(\mathbf{Z}) } \right\}


Note that both terms are functionals of q(Z), and that KL(q||p) is greater than or equal to zero, so L(q,θ) is a lower bound on ln p(X|θ). When q is identical to the true posterior distribution p(Z|X,θ), KL(q||p) attains its minimum of zero and L(q,θ) attains its maximum of ln p(X|θ). The EM algorithm iteratively executes an E-step and an M-step. In the E-step, it computes p(Z|X,θold) and takes it as q(Z). This minimizes KL(q||p) to zero.

In the M-step, it fixes q(Z) and maximizes L(q,θ) with respect to θ. This maximization not only increases L(q,θ) but also increases KL(q||p), unless a local optimum has been reached. This is because the true posterior p changes with θ, and any such change makes q differ from p, increasing KL(q||p) from 0 to a positive number. (We had minimized KL(q||p) to 0 in the E-step.) As both terms increase, ln p(X|θ) increases. This is why EM converges monotonically.

This iterative process is similar to a "growing worm". (Describe the metaphor in more detail later.)

It is notable that

\mathcal{L}(q,\theta) = \sum_{Z} p(Z|X,\theta^{old}) \ln p(X,Z|\theta) - \sum_{Z} p(Z|X,\theta^{old}) \ln p(Z|X,\theta^{old})

Because the second term on the right-hand side is independent of θ, the M-step maximizes only the first term, which is usually denoted by Q(θ,θold) in the literature.
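In other words, each EM iteration can be summarized as:

\text{E-step:}\quad Q(\theta, \theta^{old}) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \theta^{old}) \ln p(\mathbf{X}, \mathbf{Z} \mid \theta)
\text{M-step:}\quad \theta^{new} = \arg\max_{\theta} Q(\theta, \theta^{old})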

The figure to the left illustrates the EM algorithm in the model parameter space.
When the model is an exponential family distribution or a mixture of exponential family components, L(q, θ) is convex (as shown by the blue curve). Because KL(q||p) is minimized to zero in the E-step, the curve of L(q, θ) touches the curve of ln p(X|θ) at θold. By maximizing L(q, θ) with respect to θ in the M-step, we find θnew, which corresponds to a larger value of ln p(X|θ). As explained previously, θnew increases not only L(q, θ) but also KL(q||p) = ln p(X|θ) - L(q, θ). (Note in the figure that KL(q||p) at θnew is larger than zero.)

Variational and Stochastic EM
Recall that in the E-step we need to minimize KL(q||p) by computing p(Z|X,θold). For some models, this is intractable and we have to use approximate methods. If we use variational methods, EM becomes Variational EM. If we use stochastic (sampling) methods, EM becomes Stochastic EM.

Generalized EM
In the M-step, we need to maximize L(q, θ). If L(q, θ) has multiple optima, or the maximization is intractable, we may only be able to increase L(q, θ) rather than fully maximize it. In such cases, EM becomes Generalized EM.

Learning Materials on Non-Parametric Bayesian Methods and Topic Modeling

In recent months, I have been learning non-parametric Bayesian
methods for topic modeling. Here follow some documents I found
helpful in the learning process, and what I want to do with the
learned knowledge. Any comments are highly appreciated.

0. Measure Theory and sigma-algebra
http://en.wikipedia.org/wiki/Measure_theory
http://en.wikipedia.org/wiki/Sigma-algebra
Measure theory is the basis of the Dirichlet process and many other
stochastic processes. The sigma-algebra is a foundation of measure theory.
They are key to generalizing finite latent factors to infinity.

1. Basics of Dirichlet process:
http://velblod.videolectures.net/2007/pascal/mlss07_tuebingen/teh_yee_whye/teh_yee_whye_dp_article.pdf
This introduction is a course note written by Yee Whye Teh, an author
of the hierarchical Dirichlet process (HDP).

2. Dirichlet Process Mixture Models
http://gs2040.sp.cs.cmu.edu/neal.pdf
This paper presents the Dirichlet process mixture model, which is a
mixture with a (potentially) infinite number of components. The paper
explains how to generalize traditional mixture models by using a
Dirichlet process as the prior over components. This generalization
makes it possible to estimate the number of components using Gibbs
sampling in a tractable amount of time.

3. Hierarchical Dirichlet Process
http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
Whereas LDA models each document as a mixture over a finite number of
topics, the hierarchical Dirichlet process (HDP) models each document as
a mixture over an infinite number of topics, where each per-document
mixture is a Dirichlet process mixture model.

What is blocked/collapsed Gibbs sampling?

I learned a little about Gibbs sampling from a textbook. However, I was confused when I learned Latent Dirichlet Allocation (LDA) and was told that LDA can be trained by a technique known as "collapsed" Gibbs sampling. Fortunately, this post on the well-known Natural Language Processing Blog answered my question concisely:

The standard setup for Gibbs sampling over a space of variables a,b,c (I'll assume there are no exploitable independences) is:
  1. Draw a conditioned on b,c
  2. Draw b conditioned on a,c
  3. Draw c conditioned on a,b
This is quite a simple story that can, in some cases, be "improved." For instance, it is often possible to jointly draw a and b, yielding:
  1. Draw a,b conditioned on c
  2. Draw c conditioned on a,b
This is the "blocked Gibbs sampler." Another variant, that is commonly used in our community, is when one of the variables (say, b) can be analytically integrated out, yielding:
  1. Draw a conditioned on c
  2. Draw c conditioned on a
For LDA, by introducing a Dirichlet prior, beta, over the model parameters (i.e., the word distributions of the topics), we can integrate out the model parameters. Therefore, the Gibbs sampling algorithm samples only the latent variables (i.e., the topic assignments of the words in the training corpus). To my understanding, this is how Gibbs sampling in LDA is "collapsed".
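To make the collapsed sampler concrete, here is a minimal sketch of the per-word update after the topic-word multinomials and per-document topic weights have been integrated out. The count-array names (n_dk, n_kw, n_k) and the symmetric hyperparameters alpha and beta are my own notation; the standard conditional P(z = k | rest) proportional to (n_dk + alpha)(n_kw + beta)/(n_k + V*beta) is assumed.

import numpy as np

def sample_topic(d, w, n_dk, n_kw, n_k, alpha, beta, V, rng):
    """One collapsed Gibbs update for the topic of word w in document d.
    n_dk[d, k]: topic counts in document d; n_kw[k, w]: word counts per topic;
    n_k[k]: total words per topic. All counts must already exclude the word
    being resampled."""
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    p /= p.sum()
    return rng.choice(len(p), p=p)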

What is a Noisy-OR Model

I am interested in the paper "Noisy-OR Component Analysis and its Application to Link Analysis", published by Tomas Singliar and Milos Hauskrecht in JMLR 7 (2006). A very basic prerequisite for understanding this paper is knowing the "noisy-OR" model. However, it seems that noisy-OR is an old topic and not much can be found via Google. Fortunately, I got a very brief description from an old paper, "Possibility theory and the generalized Noisy OR model". A snapshot of the section on Noisy OR is attached as an image:


From this paper, I also found the original paper that presents Noisy OR on the UAI website.
Unfortunately, I do not have access to the full text in PDF format... :-(
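For the record, the noisy-OR gate itself is simple to state: each present cause X_i independently fails to activate the effect with probability 1 - p_i, optionally combined with a leak probability p_0, so that P(Y = 1 | x) = 1 - (1 - p_0) * prod_i (1 - p_i)^x_i. A minimal sketch (my own illustration, not the component-analysis model from the JMLR paper):

import numpy as np

def noisy_or(x, p, leak=0.0):
    """P(Y = 1 | X = x) under a noisy-OR gate.
    x: binary vector of causes; p: per-cause activation probabilities;
    leak: probability the effect turns on with no cause present."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    return 1.0 - (1.0 - leak) * np.prod((1.0 - p) ** x)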

A test post experiencing Blogger.com


I created this post to try out Blogger.com. To the left of this post is a picture of sunflowers taken at Tebet.

I think Blogger.com is better than MSN Spaces. At least, I can save a draft post many times before I decide to publish it. Moreover, Blogger.com auto-saves my editing, as Google Docs does.


Often I would like to post images of math equations, like those copied from a scientific paper in PDF format. This is done using the gorgeous LaTeX editor hosted on CodeCogs.com. For example:
P(\boldsymbol{X}) = P(x_1,\ldots,x_N) = \int \prod_{i=1}^N P(x_i \mid \boldsymbol{\theta}) dP(\boldsymbol{\theta})