Nov 30, 2008

The Sum-Product Algorithm and Highly Scalable LDA?

Months before, Yuzhou Zhang notified me an interesting paper
Clustering by Passing Messages Between Data Points. I am very interested with this paper, because if an algorithm can be expressed in message-passing, it can be well scaled up using parallel computing frameworks like MPI and BSP.

Last week, Zhiyuan notified my that that basis of clustering by message passing is the sum-product algorithm, which is a well-studied probabilistic inference algorithm for tree models expressed by factor graphs. In the weekend, on my flight trip to Guangzhou, I learned the sum-product algorithm and its variant max-sum from Chapter 8 of Pattern Recognition and Machine Learning.

I am now had been curious about the fact that max-sum and sum-product are used for inference in tree or poly-tree models. But clustering is, in most cases, equivalent to learning a mixture model. In a mixture model, each observables variable has at least two parents --- the component index (label variable) and component parameters. This makes the graphical model no longer a tree (nor a poly-tree). Then the question is how could max-sum and sum-product used to do clustering?

After a discussion on the paper with Zhiyuan, I feel that the answer is as follows: It is true that the problem of "clustering of data points" can be modeled by a mixture model. However, the paper treats a problem of "graph partitioning" or "clustering of interconnected vertices". In this paper, the authors model the problem by a poly-tree --- roots are constraints like "if a vertex v is a centroid, it must be the centroid of itself"; the second level nodes in the tree correspond to the centroid assignments of vertices in the graph; and the third level nodes are the vertices in the graph. This poly-tree can be inferred using sum-product.

In P9 of the supporting material of Clustering by Passing Messages Between Data Points, the authors explained that:
The max-sum algorithm for the factor graph in Fig. S2A can be derived in a straightforward fashion (c.f. (S6)) and consists of sending messages from variables to functions and from functions to variables in a recursive fashion.
Where (S6) refers to an introductory paper Factor Graphs and the Sum-Product Algorithm. I read it and feels that it is a pretty good introduction to sum-product, but, to understand factor graphs, it would be better to refer to the book Pattern Recognition and Machine Learning.

My intention is, if max-sum/sum-product can be used in inference of a mixture model, I would like to make sure whether it is also applicable in inference of LDA? If so, we may have a novel highly-scalable solution to LDA.

As this paper uses sum-product to do graph partitioning instead of clustering a bag of words, even if we can generalize the sum-product to handle more than one graphs (as LDA generalizes mixture model to handle multiple bags of words), the result solution may be like frequent sub-graph mining, instead of a noval inference algorithm of LDA ...