## Dec 27, 2008

### Stochastic Sampling

The basic goal of stochastic sampling is to draw random samples from a given distribution p(x). Sometimes p(x) is not exactly known; for example, we may know only f(x) = p(x)/K for an unknown constant K.

Given the ability to sample stochastically, we may do the following things: (i) approximate p(x), (ii) integrate p(x), and (iii) optimize p(x).

A difficulty for stochastic sampling is that p(x) or f(x) usually has a complex form, whereas we can only sample directly from a few simply-formed distributions, such as discrete distributions, the uniform distribution, or the unit Gaussian.

So, usually we sample from an easy-to-sample proposal distribution q(x) instead of from p(x) or f(x), and then either assign each sample point a "weight", or an "acceptance probability" that decides whether we accept the point as a sample from p(x) or f(x).

Some basic sampling methods include:
• Monte Carlo: sample from q(x). No weight or acceptance probability is computed; we simply pretend that the samples from q(x) come from p(x) and use them to compute integrals of p(x).
• Weighted (importance) sampling: sample from q(x) and assign each sample x a weight w = p(x)/q(x). Use the weighted samples to compute integrals of p(x).
• Rejection sampling: sample from q(x), where q(x) dominates p(x) over the whole domain of p(x); in general one uses an envelope c·q(x) ≥ p(x) for some constant c, since two normalized densities cannot dominate each other everywhere. Accept each sample x with acceptance probability a = p(x)/(c·q(x)).
In the above methods, we have to choose a q(x) that is both easy to sample from and a good approximation of p(x). These two requirements are often contradictory. To avoid being bothered by the choice of q(x), we may resort to Markov Chain Monte Carlo methods, which constructively approach p(x) along a Markov chain, without a global q(x).
A few more words on simulated annealing: to use it as a sampling method, we decrease the temperature T down to 1; to make it an optimization method, we decrease T down to 0.
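As an illustration of the rejection step above, here is a minimal Python sketch; the target f, the uniform proposal, and the envelope constant c are my own toy choices, not from the text. It draws samples from a density known only up to normalization, using c·q(x) ≥ f(x) on [0, 1]:

```python
import random

def rejection_sample(f, c, n):
    """Draw n samples from the density proportional to f on [0, 1],
    using a Uniform(0, 1) proposal q and an envelope constant c
    such that c * q(x) >= f(x) everywhere."""
    samples = []
    while len(samples) < n:
        x = random.random()             # propose x ~ q
        if random.random() < f(x) / c:  # accept with prob f(x) / (c * q(x))
            samples.append(x)
    return samples

# Toy target: the unnormalized Beta(2, 5) density; its maximum is
# f(0.2) ~= 0.082, so c = 0.1 dominates it on [0, 1].
f = lambda x: x * (1 - x) ** 4
xs = rejection_sample(f, 0.1, 10000)
print(sum(xs) / len(xs))  # close to the Beta(2, 5) mean 2/7 ~= 0.286
```

The acceptance rate is the area under f divided by c, so the tighter the envelope, the fewer proposals are wasted.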

References:
1. B. Walsh. Markov Chain Monte Carlo and Gibbs Sampling. Lecture notes. 2004.
2. David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Chapter 29 and Chapter 30.

## Dec 26, 2008

### Comparing the iPhone and the gPhone: Android Will Probably Beat the iPhone

(This is a technical commentary I posted on my MSN Space on March 11, 2008. I repost it here as a backup, with some errors corrected and some content added.)

(1) Operating systems

The iPhone runs Mac OS X, the same operating system that runs on Apple's desktop and laptop computers, and it is derived from BSD. The Mac OS X kernel (Darwin) is "open source", but in practice it is difficult to download and build, and the middleware around the kernel is not open source at all. More importantly, "open source" and "free software" are two different concepts: open source lets users see the source code but does not let them modify it, so it cannot form an open community of developers from around the world. Development of Mac OS X is thus held firmly in Apple's hands alone. Steve Jobs seems to want to forge his one and only "best" this way.

(2) Development tools and SDKs

Android offers far more choices of development tools: programmers working on Linux can comfortably use the tools they know, such as Eclipse, Emacs and IntelliJ, to write Android programs.

(3) Programming languages

(4) 3D graphics

The iPhone has been on the market for more than a year (as of December 2008), yet it still has no influential 3D graphics application. When I tried an Android developer phone in February 2008, it could already run the famous 3D game Quake; the globe in Google Maps on the developer phone was rendered with OpenGL ES; and 3D graphics were also used for Google Maps panoramas. (See the attached figures: an Android phone running the 3D panorama view, one running Google Earth, and one running Quake.)

(5) Reminiscences

Apple was one of the earliest makers of personal computers (PCs), earlier than IBM's PC/AT and PC/XT designs. Its Apple I and Apple II swept the world, including a China just waking up from the Cultural Revolution. When I was in the third grade of primary school, my father bought me a CEC-I (the "Zhonghua Learning Machine") designed and built by the Chinese Academy of Sciences. It was essentially an Apple II plus a Chinese-character card, whose EPROM stored the Chinese font in an era without hard disks. (Whether that counted as technology infringement I do not know; even if it did, I personally benefited from it.) The Apple II used the 6502 processor, which nowadays controls air conditioners and washing machines: an old general past his prime. Later, Apple launched the Macintosh line based on Motorola's 680x0 CPUs, which developed in parallel with the IBM PC. I remember that in middle school I could not find "Macintosh" in any dictionary. A classmate later asked a foreign teacher at our school (Yali Middle School in Changsha), a young man who was then a Ph.D. student at Yale and owned a Macintosh. He told us that a Macintosh is a kind of big apple grown in California, so in the end Steve Jobs never did grow any other fruit.

IBM later tried to learn from Microsoft and built OS/2 to compete with Windows. But it was too late: under an open operating-system framework, Microsoft had already won over many driver developers familiar with DOS/Windows, which left a great deal of hardware with Windows-only drivers and no OS/2 support. In high school I excitedly copied OS/2 from an office of the Hunan Provincial Electric Power Bureau onto dozens of 5.25-inch floppy disks, only to find that it could not drive my newly bought Sound Blaster card and booted far too slowly on a 486, so I gave up and switched to Microsoft Windows NT (3.5?).

Steve Jobs is a design genius, but he seems unwilling to take the road where everybody wins. He has brilliantly saved Apple once again, but he also seems ready to abandon the open road once again. If Steve Jobs were ever to get pregnant and go on maternity leave, whatever would Apple do...

(6) Asides

## Dec 24, 2008

### Some Free But Excellent Books

1. David J.C. MacKay. Information Theory, Inference, and Learning Algorithms.
2. Radford M. Neal. Probabilistic Inference Using Markov Chain Monte Carlo Methods.

## Dec 20, 2008

### Some C++ Tricks

To print the contents of a container (for example a vector&lt;int&gt; A; requires &lt;algorithm&gt; and &lt;iterator&gt;):
copy(A.begin(), A.end(), ostream_iterator&lt;int&gt;(cout, " "));

## Dec 15, 2008

### Reading Teh's Slides on Dirichlet Process

Gaussian process: each draw G from a GP is a function over the real axis. Any finite marginalization of G is Gaussian distributed.

Finite mixture model: \pi ~ Dir(\alpha/K, ..., \alpha/K), where \alpha is the total number of pseudo-observations in the prior, dividing by K spreads them evenly over the K slots, and (1/K, ..., 1/K) is the base distribution.

Agglomerative property of Dirichlet distributions: imagine the visualization of slots; merging two slots sums their areas.

Decimative property of Dirichlet distributions: dividing a slot by (\beta, 1-\beta), the area is divided by the same ratio.

Conjugacy between Dirichlet and multinomial: \pi ~ Dir(\alpha), z ~ Disc(\pi); then marginally z ~ Disc(....), why? (P38/80)
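A sketch of the answer, if I read the slide correctly: integrating out \pi gives the marginal of z,

$p(z=k) = \int \pi_k \, \mathrm{Dir}(\pi \mid \alpha) \, \mathrm{d}\pi = \mathbb{E}[\pi_k] = \alpha_k / \sum_{j=1}^K \alpha_j$

so marginally z ~ Disc(\alpha_1/\sum_j \alpha_j, ..., \alpha_K/\sum_j \alpha_j). Conversely, conditioning on an observed z gives the posterior \pi | z ~ Dir(\alpha + \delta_z), again a Dirichlet; this closure under conditioning is the conjugacy.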

## Dec 12, 2008

### Interesting Papers in NIPS 2008

1. E. Sudderth, M. Jordan. Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes.
I am interested in how the Pitman-Yor process is able to model power-law data.

2. Amit Gruber, Michal Rosen-Zvi and Yair Weiss. Latent Topic Models for Hypertext.
Topic model describes the generation of hyperlinks between document pairs.

3. Simon Lacoste-Julien, Fei Sha and Michael I. Jordan. DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification.
I am not yet quite clear on how the linear transformations are introduced in this Bayesian modeling approach.

### The Trace of China's Development

In an article in 《南方周末》 (Southern Weekly) of Dec 11, 2008, the authors rank the top-10 words in each report of the national congress of the Chinese Communist Party (中国共产党全国代表大会). A scanned version of the article is attached here.
However, as the authors rank words by N(w,c)/N(c), the term frequency, non-discriminative/non-descriptive words like "我们" ("we") and "党" ("the Party") appear in most of the lists. In short, ranking words by term frequency is not semantically discriminative, that is, not characteristic.

Using our method presented in this post, Zhiyuan ranked the words in each report and generated the following ranked lists. Compared with those published in 《南方周末》, ours are much more descriptive.

The columns, one per congress report:
• 7th: 1945, Yan'an, on the eve of victory in the War of Resistance against Japan
• 8th: 1956; political report by Liu Shaoqi, report on the Party constitution by Deng Xiaoping, second five-year plan by Zhou Enlai
• 9th: 1969, Lin Biao, as "Comrade Mao Zedong's close comrade-in-arms and successor"
• 10th: 1973, Wang Hongwen, 《关于修改党章的报告》
• 11th: 1977, Ye Jianying, 《关于修改党的章程的报告》
• 12th: 1982, Hu Yaobang, 《全面开创社会主义现代化建设的新局面》
• 13th: 1987, Zhao Ziyang, 《沿着有中国特色的社会主义道路前进》
• 14th: 1992, Jiang Zemin, 《加快改革开放和现代化建设步伐，夺取有中国特色社会主义事业的更大胜利》
• 15th: 1997, Jiang Zemin, 《高举邓小平理论伟大旗帜，把建设有中国特色社会主义事业全面推向二十一世纪》
• 16th: 2002, Jiang Zemin, 《全面建设小康社会，开创中国特色社会主义事业新局面》
• 17th: 2007, Hu Jintao, 《高举中国特色社会主义伟大旗帜 为夺取全面建设小康社会新胜利而奋斗》

| Rank | 7th | 8th | 9th | 10th | 11th | 12th | 13th | 14th | 15th | 16th | 17th |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 中国 | 主义 | 阶级 | 主义 | 主席 | 主义 | 经济 | 建设 | 经济 | 发展 | 发展 |
| 2 | 人民 | 阶级 | 革命 | 阶级 | 阶级 | 社会 | 改革 | 经济 | 主义 | 建设 | 社会 |
| 3 | 国民党 | 我国 | 无产 | 革命 | 革命 | 工作 | 社会 | 改革 | 社会 | 社会 | 建设 |
| 4 | 民主 | 国家 | 主席 | 主席 | 他们 | 经济 | 发展 | 主义 | 发展 | 文化 | 制度 |
| 5 | 解放区 | 工作 | 主义 | 斗争 | 无产 | 文明 | 领导 | 社会 | 建设 | 制度 | 改革 |
| 6 | 抗日 | 农业 | 斗争 | 无产 | 主义 | 建设 | 企业 | 发展 | 改革 | 经济 | 加强 |
| 7 | 共产党 | 方面 | 专政 | 人民 | 四人帮 | 精神 | 体制 | 开放 | 中国 | 加强 | 完善 |
| 8 | 政府 | 生产 | 修正 | 历史 | 专政 | 必须 | 必须 | 市场 | 加强 | 完善 | 坚持 |
| 9 | 侵略 | 必须 | 刘少奇 | 马克思 | 斗争 | 思想 | 制度 | 中国 | 理论 | 坚持 | 体系 |
| 10 | 纲领 | 资本 | 他们 | 路线 | 指示 | 问题 | 建设 | 坚持 | 坚持 | 思想 | 推进 |
| 11 | 日本 | 工业 | 大革命 | 专政 | 资产 | 组织 | 管理 | 加强 | 制度 | 改革 | 提高 |
| 12 | 军队 | 错误 | 指出 | 世界 | 反革命 | 我国 | 阶段 | 国家 | 文化 | 领导 | 中国 |
| 13 | 要求 | 他们 | 资产 | 国际 | 路线 | 错误 | 国家 | 体制 | 企业 | 工作 | 开放 |
| 14 | 政策 | 资产 | 马克思 | 列宁 | 大革命 | 领导 | 生产力 | 现代化 | 阶段 | 统一 | 科学 |
| 15 | 战争 | 产品 | 伟大 | 修正 | 阴谋 | 生产 | 开放 | 企业 | 邓小平 | 提高 | 文化 |
| 16 | 统治 | 合作社 | 群众 | 代表 | 亲自 | 生活 | 经营 | 思想 | 国家 | 监督 | 结构 |
| 17 | 自由 | 民族 | 列宁 | 共产党 | 分子 | 国家 | 中央 | 领导 | 市场 | 政治 | 管理 |
| 18 | 民族 | 政策 | 路线 | 他们 | 民主 | 中央 | 党员 | 我国 | 实现 | 教育 | 机制 |
| 19 | 法西斯 | 领导 | 胜利 | 刘少奇 | 伟大 | 斗争 | 技术 | 基础 | 体制 | 市场 | 体制 |
| 20 | 联合 | 改造 | 世界 | 大革命 | 政治 | 方面 | 机关 | 政治 | 提高 | 科学 | 创新 |

### A Review on Attribute or Summarization Extraction

1. Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema and Andrew Fano. Text Mining to Extract Product Attributes. SIGKDD Explorations (2006).
2. Popescu, Ana-Maria, Etzioni, Oren. Extracting Product Features And Opinions From Reviews. EMNLP 2005.
3. Sujith Ravi, Marius Pasca. Using Structured Text for Large-Scale Attribute Extraction, CIKM 2008.
4. Patrick Nguyen, Milind Mahajan and Geoffrey Zweig. Summarization of Multiple User Reviews in the Restaurant Domain. MSR Technical Report. 2007.
A brief from this paper describing their system:
"..., we build a classifier which predicts, from each review, a numerical rating associated to the review. Then, each review is segmented into ... snippets. .... Each snippet is then assigned a bipolar relevance score, to mark it as characteristically disparaging (bad) or praising (good). Snippets are categorized into predefined types (such as food, or service). To produce the final review, we tally the most relevant snippets, both good and bad, in each category."
For the following reasons, I do NOT believe this method works, nor would I cite this tech report: (1) the authors do NOT explain how they "tally the most relevant snippets"; (2) I do not believe a classifier can predict a rating well from a review; (3) the usage of pLSA is wrong.

5. Rada Mihalcea and Hakan Ceylan. Explorations in Automatic Book Summarization. EMNLP 2007.
This paper segments a book into segments and ranks the segments by TextRank, which is PageRank on textual vertices (ref. [6]).

6. Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.
This paper addresses the keyword extraction problem. The solution is simple: construct a graph in which each vertex is a word from the corpus and each edge indicates a co-occurrence relation between two words, then run PageRank on this graph to rank the words. As this method does not employ latent topics, I guess words like "这些" ("these") and "非常" ("very") will rank highly. It may be useful to compare this method with ranking by P(w|c), where w denotes a word and c denotes a domain.
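A minimal sketch of the idea in [6]; the tokenized "sentences", window size, and damping factor below are hypothetical toy choices, not taken from the paper:

```python
from collections import defaultdict

def textrank_words(sentences, window=2, d=0.85, iters=50):
    """Minimal TextRank sketch: build an undirected word co-occurrence
    graph (words within `window` positions are linked) and run PageRank
    on it; returns the words sorted by rank, best first."""
    graph = defaultdict(set)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    graph[w].add(v)
                    graph[v].add(w)
    rank = {w: 1.0 for w in graph}
    for _ in range(iters):
        rank = {w: (1 - d) + d * sum(rank[v] / len(graph[v])
                                     for v in graph[w])
                for w in graph}
    return sorted(rank, key=rank.get, reverse=True)

# Hypothetical toy "sentences" (already tokenized):
sents = [["topic", "model", "ranks", "words"],
         ["topic", "model", "of", "words"],
         ["ranks", "of", "words"]]
print(textrank_words(sents)[:3])  # the best-connected words come first
```

On an undirected graph the stationary ranks are roughly proportional to vertex degree, which is why frequent co-occurring stop words tend to rank highly without latent topics.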

7. Rada Mihalcea. Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization. ACL 2004.
This paper is very similar to [6]. The differences are: (1) it ranks sentences instead of words; (2) the similarity between two sentences is measured by the Jaccard coefficient; and (3) it uses HITS in addition to PageRank. It does not explain how the authority and hub values are used respectively.

8. Güneş Erkan and Dragomir R. Radev. LexPageRank: Prestige in Multi-Document Text Summarization. EMNLP 2004.
This is similar to [4], [5] and [6]. It ranks sentences from multiple documents using PageRank, with the similarity between sentences computed by cosine distance. I do not see any explanation of what the cosine distance between two sentences is.

9. Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. WWW 2008.
Attachments:
1. A Python implementation of PageRank, which may be used in our experiments.

## Dec 10, 2008

### Hidden Markov Model, Maximum Entropy Hidden Markov Model and Conditional Random Field

Thanks Peng Xu for teaching me about these three models. The differences between them are as follows:
1. HMM: arrows from s_{t-1} to s_t, and from s_t to o_t
2. Maximum entropy HMM: an arrow from s_{t-1} to s_t, but from o_t to s_t
3. CRF: no arrow between s_{t-1} and s_t, nor between s_t and o_t
References:
1. Andrew McCallum, Dayne Freitag, Fernando Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML 2000.

## Dec 8, 2008

### A Latent Topic Model for Domain-Specific Attributes Extraction

It is difficult to estimate P(w|d) directly because of the high dimensionality of d and w, and the sparsity of data.
1. Under the constraint that we consider a finite set of documents, as in pLSA, these are the training documents and may number up to 1M. Since the vocabulary size is also large, P(w|d) has huge dimensionality and requires too much data for sufficient statistics in estimation.
2. In practice, we prefer to reflect the fact that the set of documents is infinite, as modeled by LDA. Without latent topics, estimating P(w|d) would be intractable both theoretically and practically.
So, we introduce latent topics to factorize
$P(w|d) = \sum_{z=1}^K P(w|z) P(z|d)$
Note that the number of topics K is much smaller than the vocabulary size V, and than the number of documents, which is unbounded.
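A toy numerical check of this factorization (all probabilities below are made up): with K = 2 topics over a V = 3 word vocabulary,

```python
# Hypothetical toy numbers with K = 2 topics and V = 3 words.
P_w_given_z = [[0.7, 0.2, 0.1],   # topic z = 0
               [0.1, 0.3, 0.6]]   # topic z = 1
P_z_given_d = [0.5, 0.5]          # topic mixture of one document d

# P(w|d) = sum_z P(w|z) P(z|d)
P_w_given_d = [sum(P_z_given_d[z] * P_w_given_z[z][w] for z in range(2))
               for w in range(3)]
print(P_w_given_d)  # approximately [0.4, 0.25, 0.35]
```

The K(V + D) parameters of the factorized form replace the V·D parameters of a direct P(w|d) table, which is the dimensionality reduction the text describes.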

In practice, latent topic modeling is particularly valuable when data are sparse. For instance, consider the problem of understanding the meaning of a Web search query. A query usually contains only a few words, so if, without latent topics, we understood the query (as a document) through a word distribution, it would correspond to only a few non-zero values in a V-dimensional distribution. That is too little information to understand the document; we cannot even be confident that some of those few words are not "noise".

In contrast, with latent topic modeling, each word in the document can be associated with a topic distribution by Bayesian inference. Note that each topic is a set of weighted words learned from a potentially huge (sufficient) training corpus. This means that latent topic modeling understands documents through knowledge collected from the training data, instead of through a word distribution with few non-zero values.

We have recently been using the idea of latent topic modeling to identify domain-specific words. In this case, a domain is similar to a super-document. Using latent topic factorization, we design a hierarchical latent topic model (ref. figure to the right) to estimate P(w|z), P(z|d) and P(d|c), where w denotes a word, z a topic, d a document and c a domain (category).

Note that the model to the right can handle the case where a document is labeled as belonging to multiple domains. An extension using ideas from Pachinko allocation could enable it to handle domains organized into a hierarchy.

Given the learning result, we can use Gibbs sampling to infer the word distribution of a given domain
$p(w|c) = \sum_{z=1}^K \left( p(w|z) \int p(z|d) \, p(d|c) \, \mathrm{d}d \right)$
where the integration can be computed in closed form using the conjugacy between Dirichlet and multinomial. Similarly, we can infer p(c|w). These two quantities allow us to rank the words in a domain by the "characteristic" measure (proposed by David Cohn in his probabilistic HITS algorithm):
$C(w|c) = P(w|c) \, P(c|w)$
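A minimal sketch of ranking by the characteristic measure. For this toy example only, I assume a uniform prior over domains so that P(c|w) follows from P(w|c) by Bayes' rule; the domains and probabilities are invented:

```python
# Sketch of the "characteristic" ranking C(w|c) = P(w|c) P(c|w),
# with P(c|w) obtained from P(w|c) by Bayes' rule under an assumed
# uniform prior over domains. All numbers are made-up toy values.
P_w_given_c = {
    "it":   {"screen": 0.4, "battery": 0.4, "good": 0.2},
    "food": {"tasty": 0.5, "good": 0.3, "screen": 0.2},
}
domains = list(P_w_given_c)

def characteristic(c, w):
    p_wc = P_w_given_c[c].get(w, 0.0)
    total = sum(P_w_given_c[d].get(w, 0.0) for d in domains)
    p_cw = p_wc / total if total else 0.0   # Bayes with uniform P(c)
    return p_wc * p_cw

ranked = sorted(P_w_given_c["it"], key=lambda w: characteristic("it", w),
                reverse=True)
print(ranked)  # "battery" wins: frequent in "it" AND exclusive to it
```

The measure rewards words that are both frequent within the domain (high P(w|c)) and rarely seen elsewhere (high P(c|w)), which is exactly why it beats raw term frequency.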
References:
Please refer to this post for details.

## Dec 7, 2008

### Tutorials on Conditional Random Fields by Andrew McCallum

1. Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania, 2004. (9 pages)
2. An Introduction to Conditional Random Fields for Relational Learning. Charles Sutton and Andrew McCallum. Book chapter in Introduction to Statistical Relational Learning. Edited by Lise Getoor and Ben Taskar. MIT Press. 2006. (35 pages)

### Multi-vocabulary LDA and Feature Weighting

How do we model documents that contain more than one type of words, or, say, words from multiple vocabularies? Vocabularies usually differ in noise ratio, sparsity, etc., and we would like to give them different trust values.

It is notable that the full conditional posterior distribution of a latent topic contains two factors: (1) one related to the word currently in focus, and (2) one on the global (document-level) topic distribution. So even if the current word has ambiguous senses, the global topic distribution of the document tends to assign the word the "correct" topic. This is how LDA (but NOT pLSA) does disambiguation.

If some words in a document are from a vocabulary that is more trustworthy than the others, we should scale up their occurrence counts in order to give them a larger impact on the global topic distribution.

## Dec 6, 2008

### Inference in Finite Mixture Models by Gibbs Sampling and Sum-Product

By Gibbs Sampling
Click the thumbnail to the left to download a 3-page note on inference in a finite mixture model using Gibbs sampling. The note uses the example of estimating a mixture of "circles" in images, which I thought about on my subway ride to the office. The LaTeX source of the note is here.

Indeed, the posterior distribution of the latent variables is easy to estimate, and there is no real need for approximate inference by Gibbs sampling. (This is also why people learn finite mixture models using EM.) What I wrote in this note is a preliminary step toward understanding inference in infinite mixture models using Gibbs sampling.

By Sum-Product
Denote the likelihood of a mixture model by

$p(x) = \sum_{k=1}^K \theta_k P_k(x)$

where θ = {θ_1, ..., θ_K} is the prior over the latent component indicators, and P_k(x) is the k-th component.
Given N data points X = {x_1, ..., x_N}, and denoting the latent component indicators by Z = {z_1, ..., z_N}, the factor graph contains N disjoint sub-graphs, each a chain R_i -- z_i -- L_i, where R_i is a function representing the prior of z_i, R_i(z_i) = θ_{z_i}, and L_i is a function representing the likelihood given z_i, L_i(z_i) = P_{z_i}(x_i).
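On such a chain, sum-product reduces to multiplying the two messages arriving at z_i and normalizing, i.e. p(z_i | x_i) ∝ R_i(z_i) L_i(z_i). A minimal sketch with two hypothetical Gaussian components:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi))

# Toy 2-component Gaussian mixture (hypothetical parameters).
theta = [0.5, 0.5]                     # prior over components
components = [(0.0, 1.0), (5.0, 1.0)]  # (mean, std) of each component

def posterior(x):
    """Belief at z_i on the chain R_i -- z_i -- L_i: the normalized
    product of the two incoming messages, R_i(z) = theta[z] and
    L_i(z) = P_z(x)."""
    msgs = [theta[z] * normal_pdf(x, *components[z]) for z in range(2)]
    s = sum(msgs)
    return [m / s for m in msgs]

print(posterior(4.5))  # almost all mass on component 1
```

Because the N sub-graphs are disjoint, each data point's posterior is computed independently, which matches the note's observation that exact inference here is easy.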

## Dec 5, 2008

### A Constructive Derivation of the Sum-Product Algorithm

I wrote this note using LaTeX.

I have a previous post of some key points of sum-product.

• Is the disambiguation ability of LDA related to mvLDA?
Yes. This makes the weighting of the various vocabularies reasonable. For details, please refer to this post in this Blog.

• The feature weighting problem in mvLDA, sLDA and TOT.
1. TOT does not seem to have the feature weighting problem, because the time-stamp is a sample from a Beta distribution and is thus constrained to the range [0,1].
2. DiscLDA does not seem to have the problem, as y_d is a class indicator.
3. sLDA has the problem, as stated in our paper draft.
4. mvLDA also has the problem, which is related to the disambiguation ability of the full conditional posterior.

• How could LDA be used to identify domain-specific words?
Initial results look not bad. Progress, related data, trained model and action items are updated in this post.

• What would LDA become if the training corpus contained only one document?
It would degenerate to a mixture of multinomial distributions. But how do the co-occurrences of words affect the training process and manifest themselves in the learning result in this degenerate case?

• Could sum-product (belief propagation) be used for inference in mixture models, pLSA and LDA?
Yes. I believe it is applicable to all three models. A brief on inference in mixture models is in this post. I should write down how to do inference in pLSA and LDA with factor graphs.

• Could the sum-product algorithm (belief propagation) and expectation propagation be used with GBR to scale up the LDA training process?
No, they cannot. Although they can be used with GBR to do inference, in practice, since documents are i.i.d. and can be sharded across computers, we can use MapReduce to get all the benefits that GBR would bring to inference. A more serious problem is that after inference (the E-step) we still need an M-step, which definitely requires AllReduce support. This answers the question posed in this previous post.

• How to evaluate the "weighted-sum inference algorithm" for LDA?
It is in fact the initialization step of Gibbs sampling / sum-product / expectation propagation.

## Dec 4, 2008

### Unoctify --- Decode Octals in Text

    #!/usr/bin/ruby
    STDIN.sync = true
    STDOUT.sync = true
    buf = ''
    while c = STDIN.read(1)
      if buf.empty? and c == '\\'
        buf << c                      # start of a potential octal escape
      elsif !buf.empty? and c =~ /[0-7]/
        buf << c
        if buf.length == 4            # a backslash plus three octal digits
          print buf[1..3].to_i(8).chr # decode the escaped byte
          buf = ''
        end
      else
        print buf                     # not an escape after all; flush it
        buf = ''
        print c
      end
    end
    print buf

### The Highly Efficient AllReduce in MPICH2

The MPI specification defines a parallel computing API known as AllReduce. In MPICH2, the most famous implementation of MPI, AllReduce is implemented with a very smart algorithm. Thanks to Wenyen, who explained this algorithm to me clearly in a few minutes. To the right is the figure drawn by Wenyen on a whiteboard.

Suppose we have n computers, and each computer has a data block with n cells. Denote the j-th data cell on the i-th computer by D_ij, and all the data cells on computer i by D_i = {D_i1, ..., D_in}. The goal of AllReduce is for every computer to end up with the reduced block R(D) = \sum_i D_i (an element-wise sum).

The AllReduce algorithm implemented in MPICH2 consists of two stages: (1) reduce-scatter and (2) all-gather. Reduce-scatter leaves computer j holding the reduced j-th cell, R_j = \sum_i D_ij. In the all-gather stage, the computers exchange these partial results so that each computer ends up with the full reduced block R(D) = {R_1, ..., R_n}.
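The two stages can be illustrated with a toy single-process simulation. This only mimics the data flow; the real MPICH2 algorithm performs the reduce-scatter and all-gather through pairwise message exchanges, which this sketch does not model:

```python
def allreduce(blocks):
    """Toy simulation of two-stage AllReduce over n computers,
    each holding a block of n cells (blocks[i][j] = D_ij)."""
    n = len(blocks)
    # Stage 1, reduce-scatter: computer j ends up with the reduced
    # j-th cell, R_j = sum_i D_ij.
    partial = [sum(blocks[i][j] for i in range(n)) for j in range(n)]
    # Stage 2, all-gather: every computer collects all partial results,
    # ending up with the full reduced block R = (R_1, ..., R_n).
    return [list(partial) for _ in range(n)]

D = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # D[i][j]: cell j on computer i
print(allreduce(D)[0])  # [12, 15, 18] on every computer
```

Splitting the reduction this way lets every computer contribute bandwidth, instead of funneling all data through a single root as in a naive reduce-then-broadcast.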

References:
1. Rajeev Thakur, Rolf Rabenseifner, and William Gropp, Optimization of Collective Communication Operations in MPICH, Int'l Journal of High Performance Computing Applications, (19)1:49-66, Spring 2005.

### Distributed Inference of LDA using Expectation-Propagation and GBR?

After learning expectation propagation (EP) from PRML, I think it is possible to program EP algorithms designed for inference in factorized graphical models using GBR. However, the following issues need to be confirmed:
• There is no proof that EP converges to a (local or global) optimum. However, the EP algorithm developed for LDA [1] seems to converge.
• The EP for LDA [1] is in fact an extended EP algorithm, not literally an EP algorithm. We need to make sure that this extended EP can be programmed using GBR.
• We need to estimate the number and size of the messages that must be passed between workers, as an approximate estimate of the communication cost of the GBR program.
References:
1. Thomas Minka and John Lafferty, Expectation-propagation for the generative aspect model, UAI 2002

### Identify Domain-Specific Words

We now have a solution to the question posed in this post.

Recently, there was a discussion among Kaihua, Zhiyuan and me, in which Kaihua introduced an interesting problem: how to identify words that are specific to a certain domain. To make this problem well defined, suppose that we have both domain-specific data, e.g., a set of reviews of IT products, as well as background data, e.g., a (much) larger set of reviews of various products. The question is how to find the words that are specific to the domain of "IT products".

I think it is valuable to solve this problem. For example, given "IT product"-specific words like "screen size", "power consumption", "weight" and "response time", we can easily locate valuable phrases/sentences in reviews of IT products. This may lead to a new kind of Web search: given a query on an IT product, return a summarized review that aggregates many reviews.

Let us denote the domain-specific documents by D' and the background documents by D, and denote the vocabularies of D and D' by V and V', where D' ⊂ D and V' ⊂ V.

Zhiyuan has now provided some preliminary results, which look promising. They are for the film domain of the English Wikipedia corpus. Zhiyuan's method is based on the following signals:
1. N(term|domain)
2. N(term|corpus)
3. JS{ p(term|domain) || p(term|corpus) }
given N(term) and N(corpus).
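Signal 3, the Jensen-Shannon divergence between the two term distributions, can be sketched as follows (the two distributions are made-up toy values):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL against
    the mixture m = (p + q) / 2; bounded by ln 2 in natural log."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical term distributions over a 3-word vocabulary.
p_domain = [0.7, 0.2, 0.1]
p_corpus = [0.3, 0.4, 0.3]
print(js(p_domain, p_corpus))
```

Unlike raw KL, JS is symmetric and stays finite even when a term has zero probability on one side, which matters for sparse domain vocabularies.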

Now we need to do the following things:
1. (zkh) Figure out how to use the above signals, or whether other signals are needed.
2. (wyi) Bayesian modeling of domain-specific word identification, in particular considering the hierarchy of domains:
background -> domain -> sub-domain -> entities -> reviews.
3. (lzy) More detailed experiments using Tianya.cn as the background corpus, Dianping.com as the domain-specific data, and the restaurants in the Dianping.com data as entities.
1. The Tianya.cn LDA model is at: /gfs/wf/home/hjbai/tianya/model2/word_distribution-00000-of-00001
2. The Dianping.com training documents are at: /home/chengxu/shenghuo_data/xmlfeed/reviews/dianping.com.tsv
3. Edit reviews in Dianping.com (which can act as ground-truth) are at: /home/chengxu/shenghuo_data/xmlfeed/dianping_data.tsv
I copied 2 and 3 to /home/wyi/dianping/ for Zhiyuan's convenient access.

## Dec 1, 2008

### Key Points about the Sum-Product Algorithm

• The sum-product algorithm is not limited to inference in probabilistic graphical models. It is a general method for marginalizing high-dimensional functions [1].
• It requires that the high-dimensional function factorize into a product of local functions of lower dimensionality. An intuitive representation of such a factorization is the factor graph.
• Both Bayesian networks and Markov Random Fields can be represented by factor graphs (ref. Equation (8.5) and Equation (8.39) in [2], and details in Section 8.4.3).
• Factor graphs are bipartite graphs, where edges connect variable nodes and function nodes. That is to say, neighbors of a variable node are all function nodes; neighbors of a function node are all variable nodes.
• If there is no loop in the factor graph, sum-product on this graph is to pass messages in both ways along each edge.
• If there are loops in the factor graph, sum-product should be done in an iterative fashion.
• A good starting point for understanding sum-product is the specific problem of inference in a chain model (which yields the forward/backward algorithm). Then extend the chain with branches to make it a tree (e.g., the latent topics Z in supervised LDA, affected by the Dirichlet prior, the document ratings and the content words). Then extend the tree to have loops.
• As sum-product is expressed by message passing, it is well suited to development on parallel computing frameworks like MPI and BSP.
References:
1. Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor Graphs and the Sum-Product Algorithm. IEEE Transactions on Information Theory, 47(2), February 2001.
2. Christopher Bishop. Pattern Recognition and Machine Learning. 2006. Section 8.4.
3. Mixture Modeling by Affinity Propagation. NIPS 2006.
4. Clustering by Passing Messages Between Data Points. Science, 315(5814):972-976, February 2007.

## Nov 30, 2008

### The Sum-Product Algorithm and Highly Scalable LDA?

Months ago, Yuzhou Zhang pointed me to an interesting paper, Clustering by Passing Messages Between Data Points. I am very interested in this paper, because if an algorithm can be expressed by message passing, it can be scaled up well on parallel computing frameworks like MPI and BSP.

Last week, Zhiyuan told me that the basis of clustering by message passing is the sum-product algorithm, a well-studied probabilistic inference algorithm for tree models expressed as factor graphs. Over the weekend, on my flight to Guangzhou, I learned the sum-product algorithm and its variant max-sum from Chapter 8 of Pattern Recognition and Machine Learning.

I have been curious about the fact that max-sum and sum-product are used for inference in tree or poly-tree models, while clustering is, in most cases, equivalent to learning a mixture model. In a mixture model, each observable variable has at least two parents, the component indicator (label variable) and the component parameters, which makes the graphical model no longer a tree (or even a poly-tree). The question, then, is how max-sum and sum-product can be used to do clustering.

After discussing the paper with Zhiyuan, I feel the answer is as follows. It is true that the problem of "clustering of data points" can be modeled by a mixture model. However, the paper treats a problem of "graph partitioning", or "clustering of interconnected vertices". The authors model the problem by a poly-tree: the roots are constraints like "if a vertex v is a centroid, it must be the centroid of itself"; the second-level nodes correspond to the centroid assignments of the vertices in the graph; and the third-level nodes are the vertices themselves. This poly-tree can be inferred using sum-product.

On page 9 of the supporting material of Clustering by Passing Messages Between Data Points, the authors explain that:
The max-sum algorithm for the factor graph in Fig. S2A can be derived in a straightforward fashion (c.f. (S6)) and consists of sending messages from variables to functions and from functions to variables in a recursive fashion.
Here (S6) refers to an introductory paper, Factor Graphs and the Sum-Product Algorithm. I read it and feel that it is a pretty good introduction to sum-product; to understand factor graphs, however, it is better to refer to the book Pattern Recognition and Machine Learning.

My intention is this: if max-sum/sum-product can be used for inference in a mixture model, I would like to find out whether it is also applicable to inference in LDA. If so, we may have a novel, highly scalable solution to LDA.

Since this paper uses sum-product to do graph partitioning rather than to cluster a bag of words, even if we could generalize its use of sum-product to handle more than one graph (as LDA generalizes the mixture model to handle multiple bags of words), the resulting solution might look like frequent sub-graph mining rather than a novel inference algorithm for LDA ...

## Nov 27, 2008

### Some Questions In Reading PRML

Here are some questions that I met while reading Pattern Recognition and Machine Learning. I hope to get through them in the next week:
1. Why must the Markov blanket of a node in a Bayesian network include the co-parents, whereas the Markov blanket of a node in a Markov random field includes only the direct neighbors? (ref. P383, Figure 8.26 and P385, Figure 8.28)
2. Lagrange multiplier with constraint g(X)>0. "Now, however, the sign of the Lagrange multiplier is crucial, because the function f(x) will only be at a maximum if its gradient is oriented away from the region g(x) > 0, as illustrated in Figure E.3. We therefore have ∇f(x) = −λ∇g(x) for some value of λ > 0."
3. On variational inference (P465, Equation 10.6): how to dissect out the dependence on one of the factors qj(Zj).
4. Factor graphs and the sum-product algorithm (also known as belief propagation).
5. Variational inference and expectation propagation.

### How to Label/Tag Documents?

I am interested in labeling an arbitrary document with predefined or learned labels. I prefer using LDA for this problem, because we have a highly scalable implementation of LDA, and LDA can explain each document by topics. Nevertheless, I am doing a survey before programming. I have read the following papers. Would anyone please provide hints?
• Simon Lacoste-Julien, Fei Sha, and Michael I. Jordan. DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification. NIPS 2008.
• D. Blei and M. Jordan. Modeling Annotated Data. SIGIR 2003.
• Chemudugunta, C., Holloway, A., Smyth, P., & Steyvers, M. Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning. In: 7th International Semantic Web Conference, 2008.
• Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. WWW 2008
• Qiaozhu Mei, Xuehua Shen, ChengXiang Zhai. Automatic Labeling of Multinomial Topic Models, KDD 2007.
• Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai. Semantic Annotation of Frequent Patterns, ACM TKDD, 1(3), 2007.

## Nov 26, 2008

### How Did Topic Models Evolve

             Finite models                       Infinite models
     ------------------------------------------------------------------------------------
             Latent Dirichlet allocation ---> Hierarchical Dirichlet process  | mixtures
                       ^                                 ^                    |
                       |                                 |                    |
     Start --> Finite mixture model --------> Infinite mixture model          | mixture
                       |                                 |                    |
                       v                                 v                    |
             Hidden Markov model -----------> Infinite hidden Markov model    | temporal

Finite Mixture Model
This is the most basic modeling technique: represent a complex thing by a weighted sum of simple things, e.g., the Fourier transform represents a complex signal by a weighted sum of sine curves. A typical example of a finite mixture model is the Gaussian mixture, which is often learned using EM. Another example is the multinomial mixture with a Dirichlet prior, which can be learned using Gibbs sampling because the Dirichlet and the multinomial are conjugate.

Infinite Mixture Model
By modeling the component prior using a Dirichlet process, the finite mixture model is extended to have an infinite number of mixture components. If the likelihood is conjugate with the Dirichlet process prior, the model can be conveniently learned using Gibbs sampling [Neal 1998, Markov Chain Sampling Methods for Dirichlet Process Mixture Models]. Another interesting model is [Rasmussen 2000, The Infinite Gaussian Mixture Model].

Note that the Dirichlet process can be defined on any measure space, even a space in which each sample point is a component (a set of parameters). So the generative process becomes: a Dirichlet process G generates a component θ for each observed object x. Because of the discreteness and clustering property of the Dirichlet process, the components of multiple x's are often identical, that is, those x's belong to the same component/cluster. So we can say that G (of the infinite mixture model) encodes both the components and their weights (of the finite mixture model). We often use another Dirichlet process parameterized by H as the prior of G. Because G~DP(H) and θ~G are conjugate, G can be integrated out. The likelihood x~F(θ) can be of any form (is this true? I need to read the infinite Gaussian mixture model paper.).
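The discreteness and clustering property can be seen in the Chinese restaurant process view of the Dirichlet process: customer n joins an existing table k with probability proportional to its occupancy, or opens a new table with probability proportional to the concentration parameter alpha. A sketch, with illustrative alpha and customer count:

```python
import random

random.seed(0)

def crp(n_customers, alpha):
    """Chinese restaurant process: returns table occupancy counts."""
    tables = []  # tables[k] = number of customers at table k
    for n in range(n_customers):
        # Join table k with prob n_k/(n+alpha); new table with prob alpha/(n+alpha).
        probs = [nk / (n + alpha) for nk in tables] + [alpha / (n + alpha)]
        k = random.choices(range(len(probs)), weights=probs)[0]
        if k == len(tables):
            tables.append(0)  # open a new table
        tables[k] += 1
    return tables

tables = crp(1000, alpha=1.0)
# Although infinitely many tables are possible, only a handful get used;
# the expected number of tables grows like alpha * log(n).
print(len(tables), tables)
```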

Latent Dirichlet Allocation
Mixture models represent the probability distribution of "a bag of" samples. However, in some cases we need to model more than one bag, where the objects in each bag are represented by a mixture model. To capture the commonality between bags, we want the mixture models of these bags to share the same set of components, while each bag has its own set of weights. When the components are multinomial and the weight sets of the bags share a Dirichlet prior, this hierarchical mixture model is known as latent Dirichlet allocation.

Exact inference in LDA is intractable. Approximate inference can be done using variational methods. In particular, if a Dirichlet prior is assumed to generate the set of multinomial components (shared by the mixture models of all documents), these components can be integrated out, and model inference can be done using Gibbs sampling.

Hierarchical Dirichlet Process
In each bag j, each object xji is generated by its corresponding component θji. Within the bag, all θji's are samples of Gj, a Dirichlet process (which encodes the components and their weights within bag j). In the introduction of LDA, we explained that we hope the Gj's of various documents share components. This can be done by defining all Gj's on the same measure space, which covers all possible values of θ, and assuming that the Gj's are samples of a global Dirichlet process G0. Note that Gj~G0 mimics a multinomial sampling, and θji~Gj also mimics a multinomial sampling, so we put a Dirichlet process prior on G0 and make the whole generative process conjugate; thus G0 and the Gj's can all be integrated out. This is the key to estimating HDP using Gibbs sampling.

Hidden Markov Model
In a mixture model, the components are exchangeable. If we assume a first-order temporal dependency between components, the mixture model becomes a hidden Markov model (HMM), where each output pdf of the HMM corresponds to a component of the mixture model. Decoding an HMM (finding the most likely state sequence) can be done using the Viterbi algorithm. A usual hierarchical extension of the HMM is to extend each output pdf into a mixture model.
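Viterbi decoding can be sketched on a toy discrete-output HMM; the states, transition table, and emission table below are invented purely for illustration.

```python
import math

states = ["Rainy", "Sunny"]
start = {"Rainy": 0.6, "Sunny": 0.4}
trans = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def viterbi(obs):
    # delta[s] = log-probability of the best state path ending in s.
    delta = {s: math.log(start[s] * emit[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            delta[s] = prev[best] + math.log(trans[best][s] * emit[s][o])
            ptr[s] = best
        back.append(ptr)
    # Trace the best final state back through the pointers.
    path = [max(states, key=lambda s: delta[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["walk", "shop", "clean"]))  # -> ['Sunny', 'Rainy', 'Rainy']
```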

Infinite Hidden Markov Model
The HMM assumes that the state/component at time t, denoted by θt, depends only on θt-1. Some high-order extensions of the HMM assume that θt depends only on θt-1, ..., θt-L, where L is known as the order of the HMM and reflects the length of the memory/history. Most high-order HMMs assume L is a fixed positive integer, and can be represented by a first-order HMM in which each state corresponds to a combination of θt-1, ..., θt-L. In contrast to these fixed-order HMMs, the variable-length HMM (VLHMM) learns a set of combinations with variable lengths. The intuition is that sometimes we need only a short memory/history to predict the future accurately, while at other times we need a longer one. Compared with the VLHMM, which uses a prefix-tree pruning approach to learn memories of maximum length L, the infinite HMM seems much more elegant and can learn memories of unbounded length. (I need to read [Beal et al., The Infinite Hidden Markov Model] carefully to ensure that the above description is not wrong.)

### Understanding Expectation-Maximization

I have been using EM for years without knowing some of the details. Thanks to Chris Bishop for the general description of EM in Section 9.4 of Pattern Recognition and Machine Learning. Here follow my learning notes.

EM is designed to maximize p(X|θ) in the case that direct optimization of p(X|θ) is difficult, but optimization of the complete-data likelihood function p(X,Z|θ) is significantly easier. When we consider p(X,Z|θ) in the context of p(X|θ), we have to consider the distribution of Z, denoted by q(Z). Given q(Z), we can decompose ln p(X|θ) into two terms:

$\ln p(\mathbf{X}\mid\boldsymbol{\theta}) = \mathcal{L}(q,\boldsymbol{\theta}) + \operatorname{KL}(q\|p)$

where

$\mathcal{L}(q,\boldsymbol{\theta}) = \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{ p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol{\theta}) }{ q(\mathbf{Z}) } \right\}$
$\operatorname{KL}(q\|p) = - \sum_{\mathbf{Z}} q(\mathbf{Z}) \ln \left\{ \frac{ p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol{\theta}) }{ q(\mathbf{Z}) } \right\}$

Note that both terms are functionals of q(Z), and that KL(q||p) is always larger than or equal to zero, so L(q,θ) is a lower bound on ln p(X|θ). When q is identical to the true posterior distribution p(Z|X,θ), KL(q||p) attains its minimum of zero and L(q,θ) attains its maximum of ln p(X|θ). The EM algorithm iteratively executes an E-step and an M-step. In the E-step, it estimates p(Z|X,θold) and takes it as q(Z). This minimizes KL(q||p) to zero.

In the M-step, it fixes q(Z) and maximizes L(q,θ) with respect to θ. This maximization not only increases L(q,θ), but also increases KL(q||p), unless a local optimum has been reached. This is because the true posterior p changes with θ, and any change increases KL(q||p) from 0 to a positive number. (We had minimized KL(q||p) to 0 in the E-step.) As both terms increase, ln p(X|θ) increases. This is why EM increases the likelihood monotonically.

This iterative process is similar to a "growing worm". (Describe the metaphor in more detail later.)

It is notable that

$\mathcal{L}(q,\boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol{\theta}^{old}) \ln p(\mathbf{X},\mathbf{Z}\mid\boldsymbol{\theta}) - \sum_{\mathbf{Z}} p(\mathbf{Z}\mid\mathbf{X},\boldsymbol{\theta}^{old}) \ln p(\mathbf{Z}\mid\mathbf{X},\boldsymbol{\theta}^{old})$

Because the second term on the right of the equation is independent of θ, the M-step maximizes only the first term, which is usually denoted by Q(θ,θold) in the literature.
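The monotone behavior described above can be checked numerically. Below is a sketch of EM for a two-component 1-D Gaussian mixture with fixed unit variances (the data, initialization, and iteration count are illustrative): the E-step computes the responsibilities p(Z|X,θold), the M-step maximizes Q(θ,θold) in closed form, and the observed-data log-likelihood never decreases.

```python
import math
import random

random.seed(0)

# Synthetic data: two well-separated unit-variance Gaussian clusters.
data = [random.gauss(-2, 1) for _ in range(200)] + \
       [random.gauss(3, 1) for _ in range(200)]

def gauss_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

pi, mu = [0.5, 0.5], [-1.0, 1.0]  # initial guesses
log_liks = []
for _ in range(30):
    # E-step: q(Z) = p(Z | X, theta_old), the responsibilities.
    resp = []
    for x in data:
        p = [pi[k] * gauss_pdf(x, mu[k]) for k in range(2)]
        s = sum(p)
        resp.append([pk / s for pk in p])
    # M-step: closed-form maximization of Q(theta, theta_old).
    nk = [sum(r[k] for r in resp) for k in range(2)]
    pi = [nk[k] / len(data) for k in range(2)]
    mu = [sum(r[k] * x for r, x in zip(resp, data)) / nk[k]
          for k in range(2)]
    # Track ln p(X | theta); EM guarantees it never decreases.
    log_liks.append(sum(math.log(sum(pi[k] * gauss_pdf(x, mu[k])
                                     for k in range(2)))
                        for x in data))

print([round(m, 1) for m in mu])  # means recovered near -2 and 3
```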

The figure to the left illustrates the EM algorithm in the model parameter space.
When the model is a exponential distribution or a mixture of exponential distribution, L(q, θ) is convex (as shown by the blue curve). Because in the E-step, K(q||p) is minimized to be zero, so the curve of L(q, θ) touches the curve of ln p(X|θ) at θold. By maximizing L(q, θ) with respect to θ in the M-step, we find θnew, which corresponds to a larger ln p(X|θ) value. As explained previously, θnew increases not only L(q, θ) but also KL(q||p)=ln p(X|θ) - L(q, θ). (Note in the figure that KL(q||p) at θnew is larger than zero.)

Variational and Stochastic EM
Recall that in the E-step we minimize KL(q||p) by estimating p(Z|X,θold). For some models this is intractable, and we have to use approximate methods. If we use variational methods, EM becomes variational EM. If we use stochastic (sampling) methods, EM becomes stochastic EM.

Generalized EM
In the M-step, we need to maximize L(q, θ). If L(q, θ) has multiple optima, or the maximization is intractable, we can only find an approximate solution. In such cases, EM becomes Generalized EM.

### Learning Materials on Non-Parametric Bayesian Methods and Topic Modeling

In recent months, I have been learning non-parametric Bayesian
methods for topic modeling. Here follow some documents I found
helpful in the learning process, and what I want to do using them.
0. Measure Theory and sigma-algebra
http://en.wikipedia.org/wiki/Measure_theory
http://en.wikipedia.org/wiki/Sigma-algebra
Measure theory is the basis of the Dirichlet process and many other
stochastic processes. A sigma-algebra is the collection of sets on
which a measure is defined. They are the keys to generalizing finite
latent factors to infinity.

1. Basics of Dirichlet process:
http://velblod.videolectures.net/2007/pascal/mlss07_tuebingen/teh_yee_whye/teh_yee_whye_dp_article.pdf
This introduction is a course note written by Yee Whye Teh, the first
author of the hierarchical Dirichlet process (HDP) paper.

2. Dirichlet Process Mixture Models
http://gs2040.sp.cs.cmu.edu/neal.pdf
This paper presents the Dirichlet process mixture model, which is a
mixture with a (potentially) infinite number of components. The paper
explains how to generalize traditional mixture models by using a
Dirichlet process as the prior over components. This generalization
makes it possible to estimate the number of components using Gibbs
sampling with tractable runtime.

3. Hierarchical Dirichlet Process
http://www.cs.berkeley.edu/~jordan/papers/hdp.pdf
Whereas LDA models each document as a mixture of a finite number of
topics, the hierarchical Dirichlet process (HDP) models each document
as a mixture of a potentially infinite number of topics, where each
document-level mixture is a Dirichlet process mixture model.

### What are blocked/collapsed Gibbs sampling

I learned a little about Gibbs sampling from textbooks. However, I was confused when I learned about latent Dirichlet allocation (LDA) and was told that LDA can be trained by a technique known as "collapsed" Gibbs sampling. Fortunately, this post on the well-known Natural Language Processing Blog answered my question concisely:

The standard setup for Gibbs sampling over a space of variables a,b,c (I'll assume there are no exploitable independences) is:
1. Draw a conditioned on b,c
2. Draw b conditioned on a,c
3. Draw c conditioned on a,b
This is quite a simple story that can, in some cases, be "improved." For instance, it is often possible to jointly draw a and b, yielding:
1. Draw a,b conditioned on c
2. Draw c conditioned on a,b
This is the "blocked Gibbs sampler." Another variant, that is commonly used in our community, is when one of the variables (say, b) can be analytically integrated out, yielding:
1. Draw a conditioned on c
2. Draw c conditioned on a
For LDA, by introducing a Dirichlet prior, beta, on the model parameters (i.e., the word distributions of topics), we can integrate out the model parameters. Therefore, the Gibbs sampling algorithm samples only the latent variables (i.e., the topic assignments of words in the training corpus). To my understanding, this is how Gibbs sampling for LDA is "collapsed".
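A minimal sketch of what this looks like in code, on a tiny invented corpus (the documents, vocabulary size, number of topics, and hyperparameters alpha and beta are all illustrative): only the topic assignments z are sampled, and the collapsed word distributions and document proportions appear only through count tables.

```python
import random

random.seed(0)

# Toy corpus: each document is a list of word ids from a vocabulary of V.
docs = [[0, 0, 1, 1], [0, 1, 1, 0], [2, 3, 3, 2], [3, 2, 2, 3]]
V, K, alpha, beta = 4, 2, 0.1, 0.1

# Count tables: n_dk = topics per document, n_kw = words per topic.
n_dk = [[0] * K for _ in docs]
n_kw = [[0] * V for _ in range(K)]
n_k = [0] * K
z = []  # z[d][i] = topic of word i in document d
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:  # random initialization of topic assignments
        k = random.randrange(K)
        z[d].append(k)
        n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

for _ in range(200):  # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]  # remove the current assignment from the counts
            n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
            # Full conditional p(z=k | z_-di, w) with theta, phi collapsed.
            p = [(n_dk[d][k2] + alpha) * (n_kw[k2][w] + beta)
                 / (n_k[k2] + V * beta) for k2 in range(K)]
            k = random.choices(range(K), weights=p)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1

# Words {0,1} and {2,3} typically end up dominated by different topics.
print(n_kw)
```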

### What is a Noisy-OR Model

I am interested in the paper "Noisy-OR Component Analysis and its Application to Link Analysis", published by Tomas Singliar and Milos Hauskrecht in JMLR 7 (2006). A very basic prerequisite for understanding this paper is to know the "noisy-OR" model. However, it seems that noisy-OR is an old topic, and not much can be found via Google. Fortunately, I got a very brief description from an old paper, "Possibility theory and the generalized Noisy OR model". A snapshot of its section on noisy-OR is attached as an image:
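For reference, the standard noisy-OR model says that each active cause xi independently fails to produce the effect y with probability 1 - pi, optionally with a leak term for unmodeled causes. A tiny sketch (the cause strengths are illustrative):

```python
def noisy_or(x, p, leak=0.0):
    """P(y=1 | x) = 1 - (1 - leak) * prod_i (1 - p_i)^(x_i)."""
    q = 1.0 - leak  # probability that y stays off so far
    for xi, pi in zip(x, p):
        if xi:  # each active cause independently fails with prob 1 - p_i
            q *= (1.0 - pi)
    return 1.0 - q

# Two active causes with strengths 0.8 and 0.5:
# P(y=1) = 1 - (1-0.8)*(1-0.5) = 0.9
print(noisy_or([1, 1], [0.8, 0.5]))
```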

From this paper, I also found the original paper presenting noisy-OR on the UAI website.
Unfortunately, I do not have access to the full text in PDF format... :-(

### A test post experiencing Blogger.com

I created this post to experiment with Blogger.com. To the left of this post is a picture of sunflowers taken at Tebet.

I think Blogger.com is better than MSN Spaces. At least, I can save a draft many times before I decide to publish it. Moreover, Blogger.com auto-saves my editing as Google Docs does.

Often I would like to post images of math equations that are copy-and-pasted from scientific papers in PDF format. This is done using the gorgeous LaTeX editor hosted on CodeCogs.com. For example:
$P(\boldsymbol{X}) = P(x_1,\ldots,x_N) = \int \prod_{i=1}^N P(x_i \mid \boldsymbol{\theta}) \, dP(\boldsymbol{\theta})$