Dec 27, 2008

Stochastic Sampling

The basic goal of stochastic sampling is to draw random samples from a given distribution p(x). Sometimes, p(x) is not exactly known, for example, knowing only f(x)=p(x)/K.

Given the ability of stochastic sampling, we may do the following things: (i) approximate p(x), (ii) integrate p(x), and (iii) optimize p(x).

A difficulty for stochastic sampling is that the form of p(x) or f(x) is usually complex, however, we can only sample directly from some easy-formed distributions, like discrete distribution, uniform distribution or unit Gaussian.

So, usually we sample from an easy-to-sample distribution q(x) instead of from p(x) or f(x), then we either assign each sample point a "weight", or an "acceptance probability" which decides whether we accept this point as a sample from p(x) or f(x).

Some basic sampling methods include:
• Monte Carlo: Sample from q(x). No weight or acceptance probability is calculated. Just pretends that samples from q(x) is from p(x), and use the samples to compute the integration of p(x).
• Weighted Sampling: Sample from q(x). Assign each sample x a weight w=p(x)/q(x). Use the weighted samples to compute the integration of p(x).
• Rejection Sampling: Sample from q(x), where q(x) > p(x) in the whole domain of p(x). Accept each sample x by an acceptance probability a=p(x)/q(x).
In above methods, we have to choose q(x) which is easy-to-sample and approximate p(x). These two requirements are often contradictory. In order not to be bothered by choosing q(x), we may resort to Markov Chain Monte Carlo methods, which constructively approach p(x) without q(x) along a Markov chain.
Few more words for simulated annealing: to take it as a sampling method, we decrease T down to 1; to make it an optimization method, we decrease T down to 0.

References:
1. B. Walsh. Markov Chain Monte Carlo and Gibbs Sampling. Lecture notes. 2004.
2. David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Chapter 29 and Chapter 30.

Dec 26, 2008

iPhone 和 gPhone 之比较 —— Android 恐怕将战胜 iPhone

（这是2008年3月11日，我在我的msn space上发表的一篇技术评论。转载到此，以便备份。同时改正了一些错误，补充了一些内容。）

（一） 操作系统

iPhone 上运行的是 Mac OS X，这个操作系统也运行在 Apple 的各类台式机和笔记本电脑上。这个操作系统是基于 BSD 开发的。Mac OS X的系统内核（Darwin）是“开源软件”（open source）。但实际上下载和编译都很困难，而且与 Mac OS X内核相关的中间件（middleware）等软件并不是 open source 的。更重要的是，“开源软件”和“自由软件”（free software）是不一样的两个概念——“开源”容许用户看到源码，但却不容许修改源码，所以并不能样形成一个开放的包括世界各地程序员的开发者群体。所以Mac OS X 的开发是牢牢掌握在 Apple 一家手里的。Steve Jobs 似乎是想通过这招来打造他唯一的“最好”。

（二） 开发工具和SDK

Android 的开发工具的可选项太多了 —— 在 Linux下工作的程序员都可以很方便的使用他们日常熟悉的工具为 Android 些程序，比如 Eclipse, Emacs, InteliJ 等等。

（三）编程语言

（四）三维图形

iPhone已经上市一年多了（到2008年12月），但是尚无有影响力的三维图形应用软件。而2008年2月我试用Android的开发机的时候，就能跑著名的三维图形电子游戏 Quake了。而且开发机上运行的 Google Map 的地球，也是用 OpenGL ES 渲染的。三维图形技术还被用来作 Google Map 的全景图。（参加附图：依次是 运行三维全景图的 Android、运行 Google Earth 的 Android、和运行 Quake 的 Android）

（五）回忆

Apple 制造 个人电脑（PC）的最早的厂家之一，比 IBM 设计 PC/AT 和 PC/XT 要早。当时 Apple 退出的 Apple I 和 Apple II 个人电脑曾经风靡全世界，其中包括刚从文革中苏醒的中国。我上小学三年级的时候，爸爸给买了一台当时中科院设计制造的 CEC-I（中华学习机），其实就是 Apple II 加上了一个汉卡，卡上有在没有硬盘的年代里用于存储汉字库的EPROM。（这个算不算技术侵权我就不知道了，就算算，我个人也是从中受益的）。Apple II 用的是6502处理器，现在用来控制空调和洗衣机，真是廉颇老矣。后来 Apple 开始基于 Motorola 的 680x0 CPU 推出 Macintosh 系列，这是和 IBM PC 并行发展的。记得我上初中的时候，在字典里查不到 Macintosh 的意思，后来一个同学问了我们中学（长沙雅礼中学）的一个外教，这小伙子当时是耶鲁大学的一个博士生，有一台 Macintosh。他告诉我们 Macintosh 是产于加州的一种大苹果——Steve Jobs到底还是没去种别的水果啊。

IBM后来想跟Microsoft学，搞 了个 OS/2想和 Windows 竞争。可是已经晚了——Microsoft 已经在开放的操作系统框架下拉拢了很多熟悉 DOS/Windows 的驱动程序开发者，并导致大量的硬件设备只有 Windows 版本的驱动程序，而不支持 OS/2。我上高中的时候曾经兴冲冲的从湖南省电力厅的一个办公室，用好几十张5吋软盘拷贝了一个 OS/2，结果不能驱动我当时新买的声霸卡，而且在 486 上的启动速度太慢了，于是放弃转而使用 Microsoft Window NT (3.5?)。

Steve Jobs 是个设计的天才，可是好像不愿意走大家共赢的道路。他又一次天才的挽救了Apple，但是似乎再次准备放弃开放式的道路。要是 Steve Jobs 怀孕要修产假去，Apple 可怎么办哟。。。。。

（六）闲话

Dec 24, 2008

Some Free But Excellent Books

1. David J.C. MacKay. Information Theory, Inference, and Learning Algorithms.
2. Radford M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods.

Dec 20, 2008

Some C++ Trickes

To output the content in a container:
copy(A.begin(), A.end(), ostream_iterator(cout, " "));

Dec 15, 2008

Reading Teh's Slides on Dirichlet Process

Gaussian distribution: Each draw G from a GP is a function over the real axis. Any finite marginalization of G is Gaussian distributed.

Finite mixture model: \pi ~ Dir(\alpha/K, ...) --- \alpha is the total number of observables in prior, /K distribute them evenly into K slots, 1/K, ... is the base distribution.

Agglomerative property of Dirichlet distributions: imaging the visualization of slots --- merging two slots sums their areas.

Decimative property of Dirichlet distributions: dividing a slot by (\beta, 1-\beta), the area is divided by the same ratio.

Conjugacy between Dirichlet and Multinomial: \pi ~ Dir(\alpha), z ~ Disc(\pi), then z ~ Disc(....), why? (P38/80)

Dec 12, 2008

Interesting Papers in NIPS 2008

1. E. Sudderth, M. Jordan. Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes.
I am interested in how Pitman-Yor process is able to model power-lay data.

2. Amit Gruber, Michal Rosen-Zvi and Yair Weiss. Latent Topic Models for Hypertext.
Topic model describes the generation of hyperlinks between document pairs.

3. Simon Lacoste-Julien, Fei Sha and Michael I. Jordan. DiscLDA: Discriminative Learning for
Dimensionality Reduction and Classification
.
I am not yet quiet clear with how the linear translations were introduced in this Bayesian modeling approach.

The Trace of China's Development

In an article of 《南方周末》Dec 11, 2008, the authors rank top-10 words in each report of national representative meeting of Chinese Communist Party（中国共产党全国代表大会）. Here attaches a scanned version of this article.
However, as the authors rank words by N(w,c)/N(c), the frequency of the word, we see that non-discriminative/non-descriptive words like "我们", "党" appear in most list. In short, word ranking by term frequency is not semantically discriminative, or, not characteristic.

Using our method presented in this post, Zhiyuan ranked words in each report and generated the following ranked lists. Compared with those published in《南方周末》, ours are much more descriptive.

 中共第七大报告，1945，延安，在抗日战争即将胜利的前夜 "中共第八大报告，1956，刘少奇政治报告，邓小平党章的报告，周恩来第二个五年计划" 中共第九大报告，1969，林彪，林彪作为“毛泽东同志的亲密战友和接班人” 中共第十大报告，1973，王洪文，《关于修改党章的报告》 中共第十一大报告，1977，叶剑英，《关于修改党的章程的报告》 中共第十二大报告，1982，胡耀邦，《全面开创社会主义现代化建设的新局面》 中共第十三大报告，1987，赵紫阳，《沿着有中国特色的社会主义道路前进》 中共第十四大报告，1992，江泽民，《加快改革开放和现代化建设步伐，夺取有中国特色社会主义事业的更大胜利》 中共第十五大报告，1997，江泽民，《高举邓小平理论伟大旗帜，把建设有中国特色社会主义事业全面推向二十一世纪》 中共第十六大报告，2002，江泽民，《全面建设小康社会，开创中国特色社会主义事业新局面》 中共第十七大报告，2007，胡锦涛，《高举中国特色社会主义伟大旗帜 为夺取全面建设小康社会新胜利而奋斗》 1 中国 主义 阶级 主义 主席 主义 经济 建设 经济 发展 发展 2 人民 阶级 革命 阶级 阶级 社会 改革 经济 主义 建设 社会 3 国民党 我国 无产 革命 革命 工作 社会 改革 社会 社会 建设 4 民主 国家 主席 主席 他们 经济 发展 主义 发展 文化 制度 5 解放区 工作 主义 斗争 无产 文明 领导 社会 建设 制度 改革 6 抗日 农业 斗争 无产 主义 建设 企业 发展 改革 经济 加强 7 共产党 方面 专政 人民 四人帮 精神 体制 开放 中国 加强 完善 8 政府 生产 修正 历史 专政 必须 必须 市场 加强 完善 坚持 9 侵略 必须 刘少奇 马克思 斗争 思想 制度 中国 理论 坚持 体系 10 纲领 资本 他们 路线 指示 问题 建设 坚持 坚持 思想 推进 11 日本 工业 大革命 专政 资产 组织 管理 加强 制度 改革 提高 12 军队 错误 指出 世界 反革命 我国 阶段 国家 文化 领导 中国 13 要求 他们 资产 国际 路线 错误 国家 体制 企业 工作 开放 14 政策 资产 马克思 列宁 大革命 领导 生产力 现代化 阶段 统一 科学 15 战争 产品 伟大 修正 阴谋 生产 开放 企业 邓小平 提高 文化 16 统治 合作社 群众 代表 亲自 生活 经营 思想 国家 监督 结构 17 自由 民族 列宁 共产党 分子 国家 中央 领导 市场 政治 管理 18 民族 政策 路线 他们 民主 中央 党员 我国 实现 教育 机制 19 法西斯 领导 胜利 刘少奇 伟大 斗争 技术 基础 体制 市场 体制 20 联合 改造 世界 大革命 政治 方面 机关 政治 提高 科学 创新

A Review on Attribute or Summarization Extraction

1. Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema and Andrew Fano. Text Mining to Extract Product Attributes. SIGKDD Explorations (2006).
2. Popescu, Ana-Maria, Etzioni, Oren. Extracting Product Features And Opinions From Reviews. EMNLP 2005.
3. Sujith Ravi, Marius Pasca. Using Structured Text for Large-Scale Attribute Extraction, CIKM 2008.
4. Patrick Nguyen, Milind Mahajan and Geoffrey Zweig. Summarization of Multiple User Reviews in the Restaurant Domain. MSR Technical Report. 2007.
A brief from this paper describing their system:
"..., we build a classifier which predicts, from each review, a numerical rating associated to the review. Then, each review is segmented into ... snippets. .... Each snippet is then assigned a bipolar relevance score, to mark it as characteristically disparaging (bad) or praising (good). Snippets are categorized into predefined types (such as food, or service). To produce the final review, we tally the most relevant snippets, both good and bad, in each category."
For the following reasons, I do NOT believe this method works nor I would cite this tech report: (1) Authors do NOT explain how they "tally the most relevant snippets". (2) I do not believe a classifier can well predict a rating from a review. (3) The usage of pLSA is wrong.

5. Rada Mihalcea and Hakan Ceylan. Explorations in Automatic Book Summarization. EMNLP 2007.
This paper presents segments a book into segments, and rank segments by TextRank, which is PageRank on textual vertices (ref [6]).

6. Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.
This paper addresses the Keyword Extraction problem. The solution is simple, construct a graph whose each vertex is a word from a corpus, and an edge indicate a co-occurrence relation between two words. Run PageRank on this graph to rank words. As this method does not employ latent topics, I guess words like "这些“, "非常" will in high priority in the rank list. It may be useful to compare this method with ranking by P(w|c), where w denotes words and c denotes domain.

7. Rada Mihalcea. Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization. ACL 2004.
This paper is very similar to [6]. Differences include: (1) rank sentences instead of words, (2) the similarity between two sentences is measured by Jaccard coefficient, and (3) use HITS in addition to PageRank. It does not explain how to use the authority and hub values respectively.

8. Gunes¸ Erkan and Dragomir R. Radev. LexPageRank: Prestige in Multi-Document Text Summarization. EMNLP 2004.
This is similar to [4], [5] and [6]. It ranks sentences from various documents using PageRank. Similarity between sentences is computed by cosine distance. And I do not see any explanation on what is the cosine distance between two sentences.

9. Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. WWW 2008.
Attachments:
1. A Python implementation of PageRank, which may be used in our experiments.

Dec 10, 2008

Hidden Markov Model, Maximum Entropy Hidden Markov Model and Conditional Random Field

Thanks Peng Xu for teaching me about these three models. The differences between them are as follows:
1. HMM: arrow from st-1 to st, and from st to ot
2. Maximum Entropy HMM: arrow from st-1 to st, but from ot to st
3. CRF: no arrow between st-1 and st, neither between st and ot
References:
1. Andrew Mccallum, Dayne Freitag, Fernando Pereira. Maximum entropy Markov models for information extraction and segmentation, ICML 2000.

Dec 8, 2008

A Latent Topic Model for Domain-Specific Attributes Extraction

It is difficult to estimate P(w|d) directly because of the high-dimensionality of d and w, and the sparsity of data.
1. Under the constraint that we consider a finite set of documents, for example, n pLSA, these are the training documents and may have up to 1M. Consider the vocabulary size is also a large number, P(w|d) has huge dimensionality and requires too much data to do sufficient statistics in estimation.
2. In practice, we prefer to realize the fact that it is an infinite set of documents, as modeled by LDA. If there is not latent topics, it would be intractable to estimate P(w|d) either theoretically or practically.
So, we introduce latent topics to factorize
$P(w|d)&space;=&space;\sum_{z=1}^K&space;P(w|z)&space;P(z|d)$
Note that the number of topics K is much more smaller than the vocabulary size V and the number of documents which is infinite.

In practice, latent topic modeling is particularly valuable in the case with sparse data. For an instance, let us consider the problem of understanding the meaning of a Web search query. Usually, a query contains only few words. If there is not latent topics and we understand the query (document) as a word distribution, a query would correspond to few non-zero values in a V-dimensional distribution --- this is too little information, difficult to understand the document --- we do not even have any confidence that some of the few words are "noise".

In contrast, given latent topic modeling, each word in the document can be associated with a topic distribution by Bayesian inference. Note that each topic is a set of weighted words learned from the potentially huge (sufficient) training corpus. This means that latent topic modeling understands documents by sufficient knowledge collected from training data, instead of a word distribution with few non-zeros values.

We are recently using the idea of latent topic modeling to identify domain-specific words. In this case, a domain is similar to a super-document. Using latent topic factorizing, we design a hierarchical latent topic model (ref. figure to the right) to estimate P(w|z), P(z|d) and P(d|c), where w denotes a word, z denotes a topic, d denotes a document and c denotes a domain (category).

Note that the model to the right can handle the case where a document is labeled belonging to multiple domains. An extension to it using ideas from Pachinko allocation could enhance it to handle the case where domains are organized into a hierarchy.

Given the learning result, we can use Gibbs sampling to infer the word distribution of a given domain
$p(w|c)&space;=&space;\sum_{z=1}^K&space;\left(&space;p(w|z)&space;\int&space;p(z|d)&space;P(d|c)&space;\;&space;\mathrm{d}d&space;\right)$
where the integration can be computed in close-form using the conjugacy between Dirichlet and multinational. Similarly, we can infer p(c|w). These two quantities allow us to compute the rank words in a domain by the "characteristic" measure (proposed by David Cohn in his probabilistic HITS algorithm):
$C(w|c)&space;=&space;P(w|c)&space;P(c|w)$
References:
Please refer to this post for details.

Dec 7, 2008

Tutorials on Conditional Random Fields by Andrew McCallum

1. Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania, 2004. (9 pages)
2. An Introduction to Conditional Random Fields for Relational Learning. Charles Sutton and Andrew McCallum. Book chapter in Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006. (35 pages)

Multi-vocabulary LDA and Feature Weighting

How do we model documents which contain more than one type of words, or, say, contain words from multiple vocabularies? Usually, vocabularies have various noise-ratio, sparsity, etc, and we would like to given them various trust values.

It is notable that the full conditional posterior distribution of latent topics contains two factors --- (1) one related with the word under current focus, and (2) one on the global (document-level) topic distribution. So, even if the current word has ambiguous senses, the global topic distribution of the document trends to assign the word a "correct" topic. This is how LDA (but NOT pLSA) do disambiguation.

If some words in a document is from a vocabulary that is more trustable than others, we should scale up their occurrence counts, in order to make they larger impact to the global topic distribution.

Dec 6, 2008

Inference in Finite Mixture Models by Gibbs Sampling and Sum-Product

By Gibbs Sampling
Click the thumbnail to the left to download a 3-page note on inference in finite mixture model using Gibbs sampling. This note uses an example of estimating a mixture of "circles" in images, which, I thought about on my way to office by the subway. LaTeX source of this note is here.

Indeed, the posterior distribution of latent variables are easy to estimate, and there is no need to do approximate inference using Gibbs sampling. (This is also the reason people learn finite mixture models using EM.) What I wrote in this note is a preliminary step before understanding inference in infinite mixture model using Gibbs sampling.

By Sum-Product
Denote the likelihood of a mixture model by

$p(x)&space;=&space;\sum_{k=1}^K&space;\theta_k&space;P_k(x)$

where θ={θ1, ... θK} is the prior of latent component indicators, and Pk(x) is the k-th component.
Given N data points X={x1, ... xN}, and denoting the latent component indicators by Z={z1, ... zN}, the factor graph contains N disjoint sub-graphs, each is a chain Ri--zi--Li, where Ri is a function representing the prior of zi: Ri(zi)=θzi; and Li is a function representing the likelihood given zi: Li(zi)=Pzi(xi).

Dec 5, 2008

A Constructive Derivation of the Sum-Product Algorithm

I wrote this note using LaTeX.
Click the thumbnail to the left to download the PDF file.

I have a previous post of some key points of sum-product.

Some Questions about LDA with Partial Answers

• Is the disambiguation ability of LDA related with mvLDA?
Yes. This makes the weighting of various vocabularies reasonable. For details, please refer to this post in this Blog.

• The feature weighting problem in mvLDA, sLDA and TOT.
1. TOT seems do not has the feature weighting problem, because the time-stamp is a sample from a Beta distribution and is thus constrained in the range [0,1].
2. DisLDA seems do not has the problem, as yd is a class indicator.
3. sLDA has the problem, as stated in our paper draft.
4. mvLDA also has the problem, which is related with the disambiguation ability of the full conditional posterior.

• How could LDA used to identify domain specific words?
Initial results look not bad. Progress, related data, trained model and action items are updated in this post.

• How would LDA becomes if the training corpus contains only one document?
This would degenerate LDA to a mixture of multinomial distributions. But, how the co-occurrences of words affect the training process and manifest in the learning result in this degenerated case?

• Could sum-product (belief propagation) be used for inference in mixture models, pLSA and LDA?
Yes. I believe it is applicable to all three models. A brief on inference in mixture models is in this post. I should write down how to do inference in pLSA and LDA with factor graphs.

• Could sum-product algorithm (belief propagation) and expectation propagation used with GBR to scale up the LDA training process?
No, they cannot scale up the LDA training process. Although they can be used with GBR to do inference, but practically, as documents are iid and can be sharded across various computers, we can use MapReduce to get all benefits that can be get using GBR in inference. A more serious problem is that after inference (E-step), however, we need an M-step which definitely requires AllReduce support. This answers the question posed in this previous post.

• How to evaluate the "weighted-sum inference algorithm" for LDA?
It is in fact the initialization step of Gibbs sampling / sum-product / expectation propagation.

Dec 4, 2008

Unoctify --- Decode Octals in Text

#! /usr/bin/rubySTDIN.sync = trueSTDOUT.sync = truebuf = ''while c = STDIN.read(1)if buf.empty? and c == '\\'  buf << length ="=" buf =" ''" buf =" ''">

The Highly Efficient AllReduce in MPICH2

The MPI specification defines a parallel computing API known as AllReduce. In MPICH2, the most famous implementation of MPI, AllReduce is implemented using a very smart algorithm. Thanks to Wenyen who explained this algorithm clearly in few minutes for me. To the right is the figure drawn by Wenyen on a white board.

Suppose we have n computers, and each computer has a data block with n cells. Denote the j-th data cell on the i-th computer by Dij, and all data cells on computer i by Di={Di1,...,Din}. The goal of AllReduce is to make each of the j-th computer has R(D)=\sum_i Di.

The AllReduce algorithm implemented in MPICH2 consists of two stages: (1) Reduce-scatter and (2) All-gather. Reduce-scatter makes computer i has R(Di)=\sum_j Dij. In all-gather stage, computers exchange their partial result to make each computer have R(D)={R(D1), ..., R(Dn)}.

References:
1. Rajeev Thakur, Rolf Rabenseifner, and William Gropp, Optimization of Collective Communication Operations in MPICH, Int'l Journal of High Performance Computing Applications, (19)1:49-66, Spring 2005.

Distributed Inference of LDA using Expectation-Propagation and GBR?

After learning expectation-propagation (EP) from PRML, I think it is possible to program EP algorithms designed for inference in factorized graphical models using GBR. However, the following issues are to be confirmed:
• There has no proof that EP converges to a (local or global) optima. However, the EP algorithm developed for LDA [1] seems converges.
• The EP for LDA [1] is in fact an extended EP algorithm, but not literally an EP algorithm. Need to make sure that this extended EP can be programmed using GBR.
• Need to estimate the number and size of messages need to passed between workers. This is an approximate estimate of the communication cost of the GBR program.
References:
1. Thomas Minka and John Lafferty, Expectation-propagation for the generative aspect model, UAI 2002

Identify Domain-Specific Words

We had a solution to the question posed in this post.

Recently, there is a discussion among Kaihua, Zhiyuan and I, where Kaihua introduced an interesting problem --- how to identify words that are specific to a certain domain. To make this problem well define, let us suppose that we have both domain-specific data, e.g., a set of reviews of IT products, as well background data, e.g., a (much) larger set of reviews of various products. The question is, how to find those words that are used specific in the domain of "IT products".

I think it is valuable to solve this problem. For example, given "IT product"-specific words like "screen size electric consumption weight response time", we can easily locate valuable phrases/sentences in reviews of IT products. This may lead to a new way of Web search --- given a query on an IT product, returns a summarized review that is an aggregation of many reviews.

Let us denote the domain-specific documents by D', the background documents by D. Denote the vocabulary of D and D' by V and V', where D'\in D and V'\in V.

Now, Zhiyuan has provided some preliminary results, which look promising. This preliminary results are about the domain of film in English Wikipedia corpus. Zhiyuan's method is basing on the following signals:
1. N(term|domain)
2. N(term|corpus)
3. JS{ p(term|domain) || p(term|corpus) }
given N(term) and N(corpus).

Now we need to do the following things:
1. (zkh) Figure out how to use above signals or need other signals.
2. (wyi) Bayesian modeling of domain-specific word identification, in particular, considering the hierarchy of domains:
background -> domain -> sub-domain -> entities --> reviews.
3. (lzy) More detailed experiments using Tianya.cn as background corpus, Dianping.com as domain specific data, restaurants in Dianping.com data as entities.
1. The Tianya.cn LDA model is at: /gfs/wf/home/hjbai/tianya/model2/word_distribution-00000-of-00001
2. The Dianping.com training documents are at: /home/chengxu/shenghuo_data/xmlfeed/reviews/dianping.com.tsv
3. Edit reviews in Dianping.com (which can act as ground-truth) are at: /home/chengxu/shenghuo_data/xmlfeed/dianping_data.tsv
Copied 2 and 3 to /home/wyi/dianping/ for convenient access of Zhiyuan.

Dec 1, 2008

Key Points about the Sum-Product Algorithm

• The sum-product algorithm is not limited with inference of probabilistic graphical model. It is a general method for marginalization of high-dimensional functions [1].
• It is required that the high-dimensional function can be factorized as a product of local functions with lower dimensionality. An intuitive representation of a factorization is factor graph.
• Both Bayesian networks and Markov Random Fields can be represented by factor graphs (ref. Equation (8.5) and Equation (8.39) in [2], and details in Section 8.4.3).
• Factor graphs are bipartite graphs, where edges connect variable nodes and function nodes. That is to say, neighbors of a variable node are all function nodes; neighbors of a function node are all variable nodes.
• If there is no loop in the factor graph, sum-product on this graph is to pass messages in both ways along each edge.
• If there are loops in the factor graph, sum-product should be done in an iterative fashion.
• It is a good start point to understand sum-product from the specific problem of inference in a chain model (which results in the forward/backward algorithm). Then extend the chain by branches to make it a tree (e.g, the latent topics Z in supervised LDA affected by the Dirichlet prior, document rates and content words). Then extend the tree to have loops.
• As sum-product is expressed by message-passing , it is suitable to be developed using parallel computing frameworks like MPI and BSP.
References:
1. Factor Graphs and the Sum-Product Algorithm, Frank R. Kschischang, Senior Member, Brendan J. Frey, and Hans-Andrea Loeliger, IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 47, NO. 2, FEBRUARY 2001
2. Pattern Recognition and Machine Learning, Christopher Bishop, 2004, Section 8.4.
3. Mixture Modeling by Affinity Propagation, NIPS 2006
4. Clustering by Passing Messages Between Data Points, Science, Vol 315, No 5814, pp 972-976, February 2007.