Dec 27, 2008

Stochastic Sampling

The basic goal of stochastic sampling is to draw random samples from a given distribution p(x). Sometimes p(x) is not known exactly; for example, we may know only an unnormalized f(x) = p(x)/K.

Given the ability to sample, we can do the following: (i) approximate p(x), (ii) integrate p(x), and (iii) optimize p(x).

A difficulty for stochastic sampling is that the form of p(x) or f(x) is usually complex, whereas we can only sample directly from distributions of simple form, such as a discrete distribution, a uniform distribution or a unit Gaussian.

So we usually sample from an easy-to-sample distribution q(x) instead of from p(x) or f(x), and then assign each sample point either a "weight" or an "acceptance probability" that decides whether we accept the point as a sample from p(x) or f(x).

Some basic sampling methods include:
  • Monte Carlo: Sample from q(x). No weight or acceptance probability is computed; we simply pretend that samples from q(x) come from p(x) and use them to compute integrals of p(x).
  • Weighted Sampling: Sample from q(x). Assign each sample x a weight w = p(x)/q(x) (this is importance sampling). Use the weighted samples to compute integrals of p(x).
  • Rejection Sampling: Sample from q(x), where c q(x) >= p(x) over the whole domain of p(x) for some constant c >= 1. Accept each sample x with probability a = p(x) / (c q(x)).
In the above methods, we have to choose a q(x) that is both easy to sample from and a good approximation to p(x). These two requirements often conflict. To avoid the trouble of choosing q(x), we may resort to Markov chain Monte Carlo methods, which approach p(x) constructively along a Markov chain without a global proposal q(x). (A minimal code sketch of Metropolis sampling follows this list.)
  • Metropolis Sampling: Sample xt from a transition proposal q(xt|xt-1). Accept the sample with probability
    \alpha = \min\left[ \frac{f(x_t)}{f(x_{t-1})}, 1 \right]
    The intuition is to always accept xt if f(xt) >= f(xt-1), and otherwise accept it with probability f(xt)/f(xt-1). This requires a symmetric proposal, q(a|b) = q(b|a).
  • Metropolis-Hastings Sampling: Change the acceptance probability to
    \alpha = \min\left[ \frac{f(x_t)\,q(x_{t-1}|x_t)}{f(x_{t-1})\,q(x_t|x_{t-1})}, 1 \right]
    This trick removes the requirement that q(a|b) = q(b|a).
  • Gibbs Sampling: When x is very high-dimensional, computing f(x) is expensive. Gibbs sampling instead samples each dimension in turn from its full conditional distribution; with this proposal the acceptance probability is exactly 1, so every sample is accepted.
  • Simulated Annealing: Change the acceptance probability to
    \alpha = \min\left[ \left(\frac{f(x_t)}{f(x_{t-1})}\right)^{1/T(t)}, 1 \right]
    where T(t) decreases from a large value to 1. The intuition is that, initially, a large T increases the probability of accepting an xt that does not increase f.
A few more words on simulated annealing: to use it as a sampling method, we decrease T down to 1; to use it as an optimization method, we decrease T down to 0.
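To make this concrete, here is a minimal sketch (mine, not from the references) of Metropolis sampling from an unnormalized density f(x), using a symmetric Gaussian random-walk proposal so that the simple acceptance ratio above applies; the example target and the step size are arbitrary choices.

import math
import random

def f(x):
    # Unnormalized target density: two Gaussian bumps.
    return math.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * math.exp(-0.5 * (x + 2.0) ** 2)

def metropolis(f, x0=0.0, n=10000, step=1.0):
    # Draw n correlated samples from the density proportional to f,
    # using a symmetric random-walk proposal q(a|b) = q(b|a).
    samples, x = [], x0
    for _ in range(n):
        x_new = x + random.gauss(0.0, step)       # propose
        alpha = min(f(x_new) / f(x), 1.0)         # acceptance probability
        if random.random() < alpha:               # accept, otherwise keep x
            x = x_new
        samples.append(x)
    return samples

samples = metropolis(f)
print(sum(samples) / len(samples))                # approximates E[x] under p(x)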

References:
  1. B. Walsh. Markov Chain Monte Carlo and Gibbs Sampling. Lecture notes. 2004.
  2. David J.C. MacKay. Information Theory, Inference, and Learning Algorithms. Chapter 29 and Chapter 30.

Dec 26, 2008

A Comparison of the iPhone and the gPhone: Android Will Probably Beat the iPhone

(This is a technical commentary that I posted on my MSN Space on March 11, 2008. I am reposting it here as a backup, with some errors corrected and some content added.)

Two brand-new kinds of phone have been extremely hot lately: Apple's iPhone and Google's Android. Strictly speaking, Google Android is not a phone but a software stack for phones (an operating system, middleware, applications, and an SDK) that hundreds of handset makers can use to build hundreds of kinds of Google phones, whereas the iPhone is developed entirely by Apple, from software to hardware.

I was more or less among the first iPhone users in China (a jailbroken unit brought back from a business trip to the US), and I was deeply impressed by the iPhone's revolutionary design. I now use it every day to read novels and PDF papers, browse the Web, play games, take photos, chat over SMS, watch movies, and listen to music. Recently (at the end of February 2008) I also tried the Android SDK. Since Google is not in the business of consumer electronics, the Android development device is not nearly as polished as the iPhone. From that angle, Android and the iPhone do not yet seem comparable at all.

But as an engineer, my recent attempts to write programs for both the iPhone and Android left me feeling that the two technologies take completely opposite paths: Google Android stands for openness and freedom, while Apple's iPhone stands for refinement behind closed doors. The more detailed comparison below gives me more confidence in Android.

(1) Operating System

The iPhone runs Mac OS X, the same operating system that runs on Apple's desktops and laptops; it is derived from BSD. The Mac OS X kernel (Darwin) is "open source", but in practice it is hard to download and build, and the middleware around the kernel is not open source at all. More importantly, "open source" and "free software" are two different concepts: this kind of openness lets users see the source code but not modify it, so it cannot grow an open, worldwide community of developers. Development of Mac OS X is therefore held firmly in the hands of Apple alone. Steve Jobs seems to want to use this to craft his one and only "best".

Android runs Linux 2.6 (see the Android official page), which is beyond doubt one of the most widely used pieces of free software: its installed base on desktops and laptops is second only to Microsoft Windows, and it dominates servers and supercomputers. As free software, users around the world can not only obtain the source code but also modify and redistribute it. The programmers among those users have thus formed a worldwide developer community; the GNU/Linux they built together with the programmers of the Free Software Foundation, Inc. underpins today's entire free software culture. It therefore has the largest community of professional developers, countless companies and individuals included, and the largest body of application software. Because anyone can easily download, use, and modify free software, the software that succeeds usually succeeds because it genuinely attracts users, not because it is backed and marketed by some big company.

(2) Development Tools and SDKs

Besides the operating system, the other thing programmers pay close attention to is the development tools. The mainstream tool for iPhone development is Apple's Xcode suite, the tool engineers working on Mac OS X generally use. It is a free download as long as you register for Apple's developer program, and the registration itself costs nothing unless you opt into paid offerings.

Android offers far more choices of development tools: programmers working on Linux can conveniently use the tools they already know, such as Eclipse, Emacs, IntelliJ and so on, to write Android programs.

Although the development tools for both the iPhone and Android are free, their SDKs are not both free. The Android SDK is free, but the iPhone SDK is expensive: US$99 for the individual version and US$299 for the enterprise version. I use the unofficial iPhone SDK, which some hackers built before Apple released its own iPhone SDK. Sadly, one of the leaders of that group was hired away by Apple last September to work there as an engineer. (This was widely reported; see, for example, the coverage at iphoneatlas.com.)

(3) Programming Languages

Next, compare the programming languages. Android uses Java, which today has one of the largest user bases of any language, alongside C++ and Python. The mainstream language for the iPhone and other Mac OS X programs is Objective-C, a transitional language that appeared between C and C++ and that hardly anyone uses any more. As for why Apple supports this language, there is a story behind it:

Once upon a time (in the 1980s) there was a company called NeXT that made graphical user interface development tools for various Unix workstations. It had a system that was famous in its day, NeXTSTEP. Because a graphical user interface is a classic example of an object-oriented system, and NeXT was among the first to build on Objective-C, then the newest object-oriented language, NeXT was enormously popular for a while. But the good times did not last: in the 1990s, as Microsoft Windows on high-performance PCs defeated the traditional Unix workstations, NeXT had to port NeXTSTEP to Windows in 1996 under the name OpenStep. Even that could not save NeXTSTEP, because Microsoft Windows already had a full set of C-based development tools, programmers using C++ could turn to Borland's OWL class library, and Microsoft later launched its own MFC class library, beat Borland, and backed C++ even more strongly. By then NeXTSTEP had lost both the workstation and the PC markets; it was sold off cheaply and changed hands several times until it landed at Apple. Luckily, Apple took a liking to NeXTSTEP, renamed it Cocoa, and made it the main GUI development framework of Mac OS X. That is why Mac OS X programmers, including iPhone programmers, still have to use the ancient Objective-C. (For the details of this story, see the ADC document "Cocoa / A Bit of History".)

(4) 3D Graphics

Having mentioned graphical user interfaces, we can hardly leave out 3D graphics. The iPhone and Android are products of the same era, and both are equipped with chips designed for 3D graphics on mobile devices. From a programmer's point of view, however, the way to use the graphics chip differs: iPhone programmers rely on Mac OS X's own 3D graphics APIs, while Android supports OpenGL ES, a standard API developed jointly by the major computing and mobile electronics companies so that programmers can conveniently use the 3D graphics processors in mobile devices. OpenGL ES 1.x is based on OpenGL 1.5, which evolved from the ancient SGI graphics workstations; it has stood the test of time and is supported by every major computing platform, past and present (Linux, Windows, BeOS, BSD, ...).

The iPhone has been on the market for more than a year (as of December 2008), yet there is still no influential 3D graphics application for it. When I tried the Android development device in February 2008, it could already run the famous 3D game Quake, and the globe of Google Maps on that device was rendered with OpenGL ES. 3D graphics is also used for the panorama view in Google Maps. (See the attached figures: an Android device running the 3D panorama view, one running Google Earth, and one running Quake.)

(5) Looking Back

Apple was one of the earliest makers of personal computers, earlier than IBM's PC/AT and PC/XT designs. The Apple I and Apple II personal computers it launched swept the world, including a China just waking up from the Cultural Revolution. When I was in the third grade of primary school, my father bought me a CEC-I ("Zhonghua Learning Machine"), designed and built by the Chinese Academy of Sciences; it was essentially an Apple II plus a Chinese character card whose EPROMs stored the Chinese font in an era without hard disks. (Whether that counted as infringement I cannot say; even if it did, I personally benefited from it.) The Apple II used the 6502 processor, which nowadays controls air conditioners and washing machines; the old soldier has truly grown old. Later Apple started the Macintosh line on Motorola's 680x0 CPUs, developing in parallel with the IBM PC. I remember that in junior high I could not find "Macintosh" in any dictionary. A classmate finally asked a foreign teacher at our school (Yali High School in Changsha), a young Yale PhD student who owned a Macintosh. He told us that a Macintosh is a kind of big apple grown in California. So Steve Jobs never did take up growing some other fruit.

But the Macintosh, built behind closed doors, ultimately could not beat the openly designed IBM PC family. One important reason is that the hardware architecture of the IBM PC was completely public, so every vendor could independently develop all kinds of expansion hardware; if you go to buy a graphics card for your PC today, Zhongguancun offers hundreds of models. Moreover, in an open ecosystem everyone wins: the designers of the architecture, the independent vendors, the programmers, and the users. So the number of developers and users exploded while product costs fell rapidly. When I was in junior high, Yali High School gave up on buying Macintoshes and instead imported from Italy a batch of MR40s, a cheap IBM PC compatible. Back then the "IBM PC Compatible" badge was everywhere, just like "Intel Inside" today.

Meanwhile, Apple's Macintosh operating system also lost to Microsoft DOS/Windows: the latter ran on every IBM PC compatible machine, while the former ran only on Apple's own hardware. And Microsoft learned IBM's open style: even today the Windows Vista SDK is a free download. (And to think that Apple still charges for the iPhone SDK. Steve really is a stubborn man.)

IBM later tried to copy Microsoft and built OS/2 to compete with Windows, but it was too late: within the open operating-system ecosystem Microsoft had already won over a great many driver developers familiar with DOS/Windows, so a lot of hardware had only Windows drivers and no OS/2 support. In high school I excitedly copied OS/2 from an office of the Hunan Provincial Electric Power Bureau onto dozens of 5.25-inch floppy disks, only to find that it could not drive my newly bought Sound Blaster card and booted far too slowly on a 486, so I gave up and switched to Microsoft Windows NT (3.5?).

So far the most open system of all is GNU/Linux together with the development of the applications on top of it, what Richard Stallman (founder of the Free Software Foundation) calls the free software development model. It is tremendously powerful and has seeped into everything, growing even faster than the IBM PC and Microsoft Windows did in their day. This very blog post is being written on a Dell Precision 390 running Ubuntu, a GNU/Linux distribution; right now it is driving an Nvidia Quadro GPU and two 23-inch monitors, so I can write this post while watching the output of my machine learning algorithm, and that algorithm is running in parallel on 2,500 computers, all driven by GNU/Linux.

Steve Jobs is a design genius, but he does not seem willing to take the road where everybody wins. He has once again brilliantly saved Apple, yet he seems once again ready to abandon openness. If Steve Jobs ever went on maternity leave, whatever would Apple do...

(6) Asides

Although the iPhone is not yet officially on sale in China because of the wrangling between Apple and China Mobile, reportedly well over a hundred thousand units have already flowed into the Chinese market. These phones cannot connect to China's mobile networks directly, but methods for jailbreaking the iPhone are all over the Web, and mine is jailbroken. If you would rather not try it yourself, you can buy an already-jailbroken one in Zhongguancun: an 8 GB iPhone goes for about 4,800 RMB. (This paragraph was written in March 2008; I do not know what a jailbroken iPhone costs now.)

Dec 20, 2008

Some C++ Tricks

To print the contents of a container:

copy(A.begin(), A.end(), ostream_iterator<int>(cout, " "));  // assumes A holds ints; needs <algorithm>, <iterator>, <iostream> and using namespace std

Dec 15, 2008

Reading Teh's Slides on Dirichlet Process

Gaussian process: each draw G from a GP is a function over the real axis. Any finite marginalization of G is Gaussian distributed.

Finite mixture model: \pi ~ Dir(\alpha/K, ..., \alpha/K) --- \alpha plays the role of the total number of observations in the prior, dividing by K distributes them evenly over the K slots, and (1/K, ..., 1/K) is the base distribution.

Agglomerative property of Dirichlet distributions: picture the slots as areas --- merging two slots sums their areas (and the corresponding parameters).

Decimative property of Dirichlet distributions: splitting a slot in proportions (\beta, 1-\beta) divides its area in the same ratio.

Conjugacy between the Dirichlet and the multinomial: if \pi ~ Dir(\alpha) and z | \pi ~ Disc(\pi), then marginally z ~ Disc(...) --- why? (P38/80; a short derivation is sketched below.)
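A sketch of the answer (my own filling-in, not from the slides), using the standard Dirichlet-multinomial results: integrating out \pi gives the marginal of z, and conditioning on z keeps \pi Dirichlet:

p(z = k) = \int \pi_k \, \mathrm{Dir}(\pi \mid \alpha_1, \dots, \alpha_K) \, \mathrm{d}\pi = \frac{\alpha_k}{\sum_{k'} \alpha_{k'}}, \qquad \text{i.e.}\quad z \sim \mathrm{Disc}\!\left( \frac{\alpha_1}{\sum_{k'} \alpha_{k'}}, \dots, \frac{\alpha_K}{\sum_{k'} \alpha_{k'}} \right)

p(\pi \mid z = k) \propto \pi_k \, \mathrm{Dir}(\pi \mid \alpha) = \mathrm{Dir}(\pi \mid \alpha_1, \dots, \alpha_k + 1, \dots, \alpha_K)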

Dec 12, 2008

Interesting Papers in NIPS 2008

  1. E. Sudderth, M. Jordan. Shared Segmentation of Natural Scenes Using Dependent Pitman-Yor Processes.
    I am interested in how the Pitman-Yor process is able to model power-law data.

  2. Amit Gruber, Michal Rosen-Zvi and Yair Weiss. Latent Topic Models for Hypertext.
    The topic model describes the generation of hyperlinks between pairs of documents.

  3. Simon Lacoste-Julien, Fei Sha and Michael I. Jordan. DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification.
    I am not yet quite clear on how the class-dependent linear transformations are introduced in this Bayesian modeling approach.

The Trace of China's Development

In an article in Southern Weekly (《南方周末》), Dec 11, 2008, the authors rank the top-10 words in each report to the National Congress of the Chinese Communist Party (中国共产党全国代表大会). A scanned version of the article is attached.
However, since the authors rank words by N(w,c)/N(c), the term frequency, non-discriminative and non-descriptive words such as "我们" (we) and "党" (the Party) appear in most of the lists. In short, ranking words by term frequency is not semantically discriminative, or, not characteristic.

Using our method presented in this post, Zhiyuan ranked the words in each report and generated the ranked lists below. Compared with those published in Southern Weekly, ours are much more descriptive.

The eleven reports (congress, year, speaker or occasion, title):
  • 7th CPC National Congress report, 1945, Yan'an, delivered on the eve of victory in the War of Resistance Against Japan
  • 8th CPC National Congress report, 1956, political report by Liu Shaoqi, report on the Party constitution by Deng Xiaoping, report on the second five-year plan by Zhou Enlai
  • 9th CPC National Congress report, 1969, Lin Biao, delivered as "Comrade Mao Zedong's close comrade-in-arms and successor"
  • 10th CPC National Congress report, 1973, Wang Hongwen, 《关于修改党章的报告》
  • 11th CPC National Congress report, 1977, Ye Jianying, 《关于修改党的章程的报告》
  • 12th CPC National Congress report, 1982, Hu Yaobang, 《全面开创社会主义现代化建设的新局面》
  • 13th CPC National Congress report, 1987, Zhao Ziyang, 《沿着有中国特色的社会主义道路前进》
  • 14th CPC National Congress report, 1992, Jiang Zemin, 《加快改革开放和现代化建设步伐,夺取有中国特色社会主义事业的更大胜利》
  • 15th CPC National Congress report, 1997, Jiang Zemin, 《高举邓小平理论伟大旗帜,把建设有中国特色社会主义事业全面推向二十一世纪》
  • 16th CPC National Congress report, 2002, Jiang Zemin, 《全面建设小康社会,开创中国特色社会主义事业新局面》
  • 17th CPC National Congress report, 2007, Hu Jintao, 《高举中国特色社会主义伟大旗帜 为夺取全面建设小康社会新胜利而奋斗》
Rank | 7th | 8th | 9th | 10th | 11th | 12th | 13th | 14th | 15th | 16th | 17th
1 | 中国 | 主义 | 阶级 | 主义 | 主席 | 主义 | 经济 | 建设 | 经济 | 发展 | 发展
2 | 人民 | 阶级 | 革命 | 阶级 | 阶级 | 社会 | 改革 | 经济 | 主义 | 建设 | 社会
3 | 国民党 | 我国 | 无产 | 革命 | 革命 | 工作 | 社会 | 改革 | 社会 | 社会 | 建设
4 | 民主 | 国家 | 主席 | 主席 | 他们 | 经济 | 发展 | 主义 | 发展 | 文化 | 制度
5 | 解放区 | 工作 | 主义 | 斗争 | 无产 | 文明 | 领导 | 社会 | 建设 | 制度 | 改革
6 | 抗日 | 农业 | 斗争 | 无产 | 主义 | 建设 | 企业 | 发展 | 改革 | 经济 | 加强
7 | 共产党 | 方面 | 专政 | 人民 | 四人帮 | 精神 | 体制 | 开放 | 中国 | 加强 | 完善
8 | 政府 | 生产 | 修正 | 历史 | 专政 | 必须 | 必须 | 市场 | 加强 | 完善 | 坚持
9 | 侵略 | 必须 | 刘少奇 | 马克思 | 斗争 | 思想 | 制度 | 中国 | 理论 | 坚持 | 体系
10 | 纲领 | 资本 | 他们 | 路线 | 指示 | 问题 | 建设 | 坚持 | 坚持 | 思想 | 推进
11 | 日本 | 工业 | 大革命 | 专政 | 资产 | 组织 | 管理 | 加强 | 制度 | 改革 | 提高
12 | 军队 | 错误 | 指出 | 世界 | 反革命 | 我国 | 阶段 | 国家 | 文化 | 领导 | 中国
13 | 要求 | 他们 | 资产 | 国际 | 路线 | 错误 | 国家 | 体制 | 企业 | 工作 | 开放
14 | 政策 | 资产 | 马克思 | 列宁 | 大革命 | 领导 | 生产力 | 现代化 | 阶段 | 统一 | 科学
15 | 战争 | 产品 | 伟大 | 修正 | 阴谋 | 生产 | 开放 | 企业 | 邓小平 | 提高 | 文化
16 | 统治 | 合作社 | 群众 | 代表 | 亲自 | 生活 | 经营 | 思想 | 国家 | 监督 | 结构
17 | 自由 | 民族 | 列宁 | 共产党 | 分子 | 国家 | 中央 | 领导 | 市场 | 政治 | 管理
18 | 民族 | 政策 | 路线 | 他们 | 民主 | 中央 | 党员 | 我国 | 实现 | 教育 | 机制
19 | 法西斯 | 领导 | 胜利 | 刘少奇 | 伟大 | 斗争 | 技术 | 基础 | 体制 | 市场 | 体制
20 | 联合 | 改造 | 世界 | 大革命 | 政治 | 方面 | 机关 | 政治 | 提高 | 科学 | 创新

A Review on Attribute or Summarization Extraction

  1. Rayid Ghani, Katharina Probst, Yan Liu, Marko Krema and Andrew Fano. Text Mining to Extract Product Attributes. SIGKDD Explorations (2006).
  2. Popescu, Ana-Maria, Etzioni, Oren. Extracting Product Features And Opinions From Reviews. EMNLP 2005.
  3. Sujith Ravi, Marius Pasca. Using Structured Text for Large-Scale Attribute Extraction, CIKM 2008.
  4. Patrick Nguyen, Milind Mahajan and Geoffrey Zweig. Summarization of Multiple User Reviews in the Restaurant Domain. MSR Technical Report. 2007.
    A brief excerpt from this paper describing their system:
    "..., we build a classifier which predicts, from each review, a numerical rating associated to the review. Then, each review is segmented into ... snippets. .... Each snippet is then assigned a bipolar relevance score, to mark it as characteristically disparaging (bad) or praising (good). Snippets are categorized into predefined types (such as food, or service). To produce the final review, we tally the most relevant snippets, both good and bad, in each category."
    For the following reasons, I do NOT believe this method works, nor would I cite this tech report: (1) the authors do NOT explain how they "tally the most relevant snippets"; (2) I do not believe a classifier can predict a rating well from a review; (3) the usage of pLSA is wrong.

  5. Rada Mihalcea and Hakan Ceylan. Explorations in Automatic Book Summarization. EMNLP 2007.
    This paper segments a book into segments and ranks the segments by TextRank, which is PageRank on textual vertices (ref. [6]).

  6. Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.
    This paper addresses the keyword extraction problem. The solution is simple: construct a graph in which each vertex is a word from the corpus and each edge indicates a co-occurrence relation between two words, then run PageRank on this graph to rank the words. As this method does not employ latent topics, I guess words like "这些" and "非常" would rank highly. It may be useful to compare this method with ranking by P(w|c), where w denotes a word and c denotes a domain. (A minimal sketch of PageRank on a word graph appears after the attachment list below.)

  7. Rada Mihalcea. Graph-based Ranking Algorithms for Sentence Extraction, Applied to Text Summarization. ACL 2004.
    This paper is very similar to [6]. The differences are: (1) it ranks sentences instead of words, (2) the similarity between two sentences is measured by the Jaccard coefficient, and (3) it uses HITS in addition to PageRank. It does not explain how the authority and hub values are used respectively.

  8. Gunes¸ Erkan and Dragomir R. Radev. LexPageRank: Prestige in Multi-Document Text Summarization. EMNLP 2004.
    This is similar to [4], [5] and [6]. It ranks sentences from multiple documents using PageRank. Similarity between sentences is computed by cosine distance, yet I do not see any explanation of what the cosine distance between two sentences is.

  9. Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai. Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs. WWW 2007.
Attachments:
  1. A Python implementation of PageRank, which may be used in our experiments.
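For quick experiments, here is a minimal sketch of the TextRank idea from [6] (my own toy code, not the attached implementation): build an undirected word co-occurrence graph over a sliding window and rank the words by PageRank computed with power iteration; the window size, damping factor and iteration count are arbitrary choices.

from collections import defaultdict

def textrank_keywords(words, window=2, damping=0.85, iters=50):
    # Build an undirected co-occurrence graph over a sliding window.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for u in words[i + 1:i + 1 + window]:
            if u != w:
                neighbors[w].add(u)
                neighbors[u].add(w)
    # PageRank by power iteration (the unnormalized variant used in TextRank).
    rank = dict.fromkeys(neighbors, 1.0)
    for _ in range(iters):
        rank = {w: (1 - damping) + damping * sum(rank[u] / len(neighbors[u])
                                                 for u in neighbors[w])
                for w in neighbors}
    return sorted(rank, key=rank.get, reverse=True)

print(textrank_keywords("the quick brown fox jumps over the lazy dog".split()))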

Dec 10, 2008

Hidden Markov Model, Maximum Entropy Markov Model and Conditional Random Field

Thanks to Peng Xu for teaching me about these three models. The differences between them are as follows (the factorizations sketched after the list make these differences concrete):
  1. HMM: arrows from s_{t-1} to s_t, and from s_t to o_t
  2. Maximum Entropy Markov Model (MEMM): an arrow from s_{t-1} to s_t, but from o_t to s_t
  3. CRF: no directed arrows; the connections between s_{t-1} and s_t, and between s_t and o_t, are undirected
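Sketched as factorizations (standard textbook forms, with s the state sequence and o the observation sequence; the feature functions f_k and weights \lambda_k are the usual linear-chain CRF notation, not anything from the discussion above):

p_{\mathrm{HMM}}(s_{1:T}, o_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}) \, p(o_t \mid s_t)

p_{\mathrm{MEMM}}(s_{1:T} \mid o_{1:T}) = \prod_{t=1}^{T} p(s_t \mid s_{t-1}, o_t), \quad \text{each factor a maximum-entropy (logistic) model}

p_{\mathrm{CRF}}(s_{1:T} \mid o_{1:T}) = \frac{1}{Z(o)} \prod_{t=1}^{T} \exp\!\left( \sum_k \lambda_k f_k(s_{t-1}, s_t, o, t) \right)

The CRF is normalized globally by Z(o) rather than per time step, which is what removes the directed arrows and avoids the label bias problem of the MEMM.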
References:
  1. Andrew McCallum, Dayne Freitag, Fernando Pereira. Maximum Entropy Markov Models for Information Extraction and Segmentation. ICML 2000.

Dec 8, 2008

A Latent Topic Model for Domain-Specific Attribute Extraction

It is difficult to estimate P(w|d) directly because of the high dimensionality of d and w and the sparsity of the data.
  1. If we restrict ourselves to a finite set of documents, as in pLSA, these are the training documents and there may be up to 1M of them. Since the vocabulary size is also large, P(w|d) has a huge dimensionality and requires far too much data for sufficient statistics in estimation.
  2. In practice, we prefer to acknowledge that the set of documents is infinite, as LDA does. Without latent topics, estimating P(w|d) would be intractable both theoretically and practically.
So we introduce latent topics to factorize
P(w|d) = \sum_{z=1}^K P(w|z) P(z|d)
Note that the number of topics K is much smaller than the vocabulary size V and than the number of documents, which is unbounded.

In practice, latent topic modeling is particularly valuable when data are sparse. For instance, consider the problem of understanding the meaning of a Web search query. A query usually contains only a few words. Without latent topics, understanding the query (as a document) by its word distribution leaves us with only a few non-zero values in a V-dimensional distribution --- too little information to understand the document; we cannot even be confident that some of those few words are not "noise".

In contrast, with latent topic modeling, each word in the document can be associated with a topic distribution by Bayesian inference. Note that each topic is a set of weighted words learned from a potentially huge (sufficient) training corpus. This means that latent topic modeling understands documents through knowledge collected from the training data, instead of through a word distribution with only a few non-zero values.

We are currently using the idea of latent topic modeling to identify domain-specific words. In this case, a domain resembles a super-document. Using the latent topic factorization, we design a hierarchical latent topic model (see the figure to the right) to estimate P(w|z), P(z|d) and P(d|c), where w denotes a word, z a topic, d a document and c a domain (category).

Note that the model to the right can handle the case where a document is labeled as belonging to multiple domains. An extension using ideas from Pachinko allocation could further handle domains organized into a hierarchy.

Given the learning result, we can use Gibbs sampling to infer the word distribution of a given domain
p(w|c) = \sum_{z=1}^K \left( p(w|z) \int p(z|d) P(d|c) \; \mathrm{d}d \right)
where the integral can be computed in closed form using the conjugacy between the Dirichlet and the multinomial. Similarly, we can infer p(c|w). These two quantities allow us to rank the words in a domain by the "characteristic" measure (proposed by David Cohn in his probabilistic HITS algorithm):
C(w|c) = P(w|c) P(c|w)
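As a minimal sketch of the ranking step (my own toy code, not our actual implementation), suppose the inference results are available as dense matrices p_w_given_z (K x V), p_z_given_d (D x K) and p_d_given_c (C x D), that the integral over documents is approximated by a sum over the training documents, and that the prior over domains is uniform:

import numpy as np

def rank_domain_words(p_w_given_z, p_z_given_d, p_d_given_c, vocab, top=20):
    # P(z|c) = sum_d P(z|d) P(d|c), a C x K matrix.
    p_z_given_c = p_d_given_c.dot(p_z_given_d)
    # P(w|c) = sum_z P(w|z) P(z|c), a C x V matrix.
    p_w_given_c = p_z_given_c.dot(p_w_given_z)
    # P(c|w) by Bayes' rule with a uniform prior P(c), also C x V.
    p_c_given_w = p_w_given_c / p_w_given_c.sum(axis=0, keepdims=True)
    # Characteristic measure C(w|c) = P(w|c) P(c|w); return the top words per domain.
    score = p_w_given_c * p_c_given_w
    return [[vocab[i] for i in np.argsort(-s)[:top]] for s in score]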
References:
Please refer to this post for details.

Dec 7, 2008

Tutorials on Conditional Random Fields by Andrew McCallum

  1. Hanna M. Wallach. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania, 2004. (9 pages)
  2. Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning. Book chapter in Introduction to Statistical Relational Learning, edited by Lise Getoor and Ben Taskar. MIT Press, 2006. (35 pages)

Multi-vocabulary LDA and Feature Weighting

How do we model documents that contain more than one type of words, or, say, words from multiple vocabularies? Vocabularies usually differ in noise ratio, sparsity, etc., and we would like to assign them different trust values.

It is notable that the full conditional posterior distribution of a latent topic contains two factors --- (1) one related to the word currently being sampled, and (2) one related to the global (document-level) topic distribution. So even if the current word has ambiguous senses, the global topic distribution of the document tends to assign the word a "correct" topic. This is how LDA (but NOT pLSA) does disambiguation. (A sketch of this full conditional appears at the end of this post.)

If some words in a document come from a vocabulary that is more trustworthy than the others, we should scale up their occurrence counts in order to give them a larger impact on the global topic distribution.
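For reference, a sketch of that full conditional in the standard collapsed Gibbs sampler for LDA (standard notation; the superscript -i means counts excluding the current token i):

p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \underbrace{\frac{n_{k, w_i}^{-i} + \beta}{n_{k, \cdot}^{-i} + V\beta}}_{\text{current-word factor}} \;\times\; \underbrace{\left( n_{d_i, k}^{-i} + \alpha \right)}_{\text{document-level factor}}

Scaling the count contribution of tokens from a trusted vocabulary inflates n_{d,k} for the topics those tokens favor, which is exactly how the weighting influences the document-level factor.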

Dec 6, 2008

Inference in Finite Mixture Models by Gibbs Sampling and Sum-Product

By Gibbs Sampling
Click the thumbnail to the left to download a 3-page note on inference in finite mixture models using Gibbs sampling. The note uses the example of estimating a mixture of "circles" in images, which I thought up on my way to the office on the subway. The LaTeX source of this note is here.

Indeed, the posterior distribution of the latent variables is easy to estimate, so there is no real need for approximate inference by Gibbs sampling. (This is also why people learn finite mixture models using EM.) What I wrote in this note is a preliminary step toward understanding inference in infinite mixture models using Gibbs sampling.

By Sum-Product
Denote the likelihood of a mixture model by

p(x) = \sum_{k=1}^K \theta_k P_k(x)

where θ={θ1, ..., θK} is the prior over the latent component indicators, and Pk(x) is the k-th component.
Given N data points X={x1, ..., xN}, and denoting the latent component indicators by Z={z1, ..., zN}, the factor graph contains N disjoint sub-graphs, each a chain Ri--zi--Li, where Ri is a factor representing the prior of zi, Ri(zi)=θzi, and Li is a factor representing the likelihood given zi, Li(zi)=Pzi(xi).
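A minimal sketch of sum-product on one of these disjoint chains, assuming one-dimensional Gaussian components (my own example): the prior message R_i(z_i) = \theta_{z_i} and the likelihood message L_i(z_i) = P_{z_i}(x_i) meet at z_i, and their normalized product is the posterior responsibility p(z_i | x_i).

import math

def responsibilities(x, theta, means, sigmas):
    # Product of the prior message theta_k and the likelihood message N(x | mean_k, sigma_k).
    msg = [t * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
           for t, m, s in zip(theta, means, sigmas)]
    z = sum(msg)                     # normalizer = p(x)
    return [m / z for m in msg]      # posterior p(z = k | x)

print(responsibilities(1.0, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))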

Dec 5, 2008

A Constructive Derivation of the Sum-Product Algorithm


I wrote this note using LaTeX.
Click the thumbnail to the left to download the PDF file.

I have a previous post on some key points of sum-product.

Some Questions about LDA with Partial Answers

  • Is the disambiguation ability of LDA related to mvLDA?
    Yes. It is what makes weighting the various vocabularies reasonable. For details, please refer to this post in this Blog.

  • The feature weighting problem in mvLDA, sLDA and TOT.
    1. TOT does not seem to have the feature weighting problem, because the time-stamp is a sample from a Beta distribution and is thus constrained to the range [0,1].
    2. DiscLDA does not seem to have the problem, as yd is a class indicator.
    3. sLDA has the problem, as stated in our paper draft.
    4. mvLDA also has the problem, which is related to the disambiguation ability of the full conditional posterior.

  • How could LDA be used to identify domain-specific words?
    Initial results do not look bad. Progress, related data, the trained model and action items are updated in this post.

  • What would LDA become if the training corpus contained only one document?
    It would degenerate into a mixture of multinomial distributions. But how do the co-occurrences of words affect the training process and manifest themselves in the learning result in this degenerate case?

  • Could sum-product (belief propagation) be used for inference in mixture models, pLSA and LDA?
    Yes. I believe it is applicable to all three models. A brief note on inference in mixture models is in this post. I should write down how to do inference in pLSA and LDA with factor graphs.

  • Could the sum-product algorithm (belief propagation) and expectation propagation be used with GBR to scale up LDA training?
    No, they cannot scale up LDA training. Although they can be used with GBR to do inference, in practice documents are i.i.d. and can be sharded across computers, so MapReduce already gives us all the benefits that GBR would give us in inference. A more serious problem is that after inference (the E-step) we still need an M-step, which definitely requires AllReduce support. This answers the question posed in this previous post.

  • How to evaluate the "weighted-sum inference algorithm" for LDA?
    It is in fact the initialization step of Gibbs sampling / sum-product / expectation propagation.

Dec 4, 2008

Unoctify --- Decode Octals in Text


#! /usr/bin/ruby
STDIN.sync = true
STDOUT.sync = true
buf = ''
while c = STDIN.read(1)
  if buf.empty? and c != '\\'
    STDOUT.print c                  # ordinary character: pass it through
  else
    buf << c                        # accumulating an octal escape "\ddd"
    if buf.length == 4              # a backslash plus three octal digits
      STDOUT.print buf[1, 3].to_i(8).chr
      buf = ''
    end
  end
end

The Highly Efficient AllReduce in MPICH2

The MPI specification defines a collective operation known as AllReduce. In MPICH2, the best-known implementation of MPI, AllReduce is implemented with a very smart algorithm. Thanks to Wenyen, who explained this algorithm clearly to me in a few minutes. To the right is the figure Wenyen drew on a whiteboard.

Suppose we have n computers, and each computer has a data block of n cells. Denote the j-th data cell on the i-th computer by Dij, and all the data cells on computer i by Di={Di1, ..., Din}. The goal of AllReduce is for every computer to end up with the element-wise sum R(D) = \sum_i Di.

The AllReduce algorithm implemented in MPICH2 consists of two stages: (1) reduce-scatter and (2) all-gather. The reduce-scatter stage leaves computer i holding the i-th cell of the result, Ri = \sum_j Dji. In the all-gather stage, the computers exchange these partial results so that every computer ends up with R(D) = {R1, ..., Rn}.
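A toy simulation of the two stages (pure Python, no MPI; the function name and data layout are only illustrative), just to make the data movement concrete: data[i][j] plays the role of Dij.

def allreduce(data):
    # data[i][j] is cell j on worker i; every worker should end up with the
    # element-wise sum over workers.
    n = len(data)
    # Reduce-scatter: worker i ends up owning the reduced i-th cell,
    # R_i = sum_j data[j][i].
    partial = [sum(data[j][i] for j in range(n)) for i in range(n)]
    # All-gather: every worker collects all the partial results R_1, ..., R_n.
    return [list(partial) for _ in range(n)]

print(allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))   # every worker gets [12, 15, 18]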

References:
  1. Rajeev Thakur, Rolf Rabenseifner, and William Gropp, Optimization of Collective Communication Operations in MPICH, Int'l Journal of High Performance Computing Applications, (19)1:49-66, Spring 2005.

Distributed Inference of LDA using Expectation-Propagation and GBR?

After learning expectation propagation (EP) from PRML, I think it is possible to program EP algorithms for inference in factorized graphical models using GBR. However, the following issues need to be confirmed:
  • There is no proof that EP converges to a (local or global) optimum. However, the EP algorithm developed for LDA [1] does seem to converge.
  • The EP for LDA [1] is in fact an extended EP algorithm, not literally an EP algorithm. We need to make sure this extended EP can be programmed using GBR.
  • We need to estimate the number and size of the messages that have to be passed between workers; this gives an approximate estimate of the communication cost of the GBR program.
References:
  1. Thomas Minka and John Lafferty, Expectation-propagation for the generative aspect model, UAI 2002

Identify Domain-Specific Words

We had a solution to the question posed in this post.

Recently there was a discussion among Kaihua, Zhiyuan and me, in which Kaihua introduced an interesting problem --- how to identify words that are specific to a certain domain. To make the problem well defined, suppose we have both domain-specific data, e.g., a set of reviews of IT products, and background data, e.g., a (much) larger set of reviews of all kinds of products. The question is how to find the words that are used specifically in the "IT products" domain.


I think this problem is worth solving. For example, given "IT product"-specific words like "screen size", "power consumption", "weight" and "response time", we can easily locate valuable phrases/sentences in reviews of IT products. This may lead to a new kind of Web search --- given a query about an IT product, return a summarized review that aggregates many reviews.

Let us denote the domain-specific documents by D' and the background documents by D, and denote the vocabularies of D and D' by V and V', where D' \subset D and V' \subset V.

Now Zhiyuan has provided some preliminary results, which look promising. These preliminary results are for the film domain in the English Wikipedia corpus. Zhiyuan's method is based on the following signals (a rough sketch of computing them appears below, after the list):
  1. N(term|domain)
  2. N(term|corpus)
  3. JS{ p(term|domain) || p(term|corpus) }
given N(term) and N(corpus).
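A rough sketch of how the three signals might be computed from raw counts (my own reading; in particular, I interpret signal 3 as the per-term contribution to the Jensen-Shannon divergence, which may differ from Zhiyuan's exact definition):

import math

def term_signals(domain_counts, corpus_counts):
    # domain_counts / corpus_counts map term -> raw count.
    n_domain = float(sum(domain_counts.values()))
    n_corpus = float(sum(corpus_counts.values()))
    signals = {}
    for term, n_td in domain_counts.items():
        p = n_td / n_domain                          # p(term|domain)
        q = corpus_counts.get(term, 0) / n_corpus    # p(term|corpus)
        m = 0.5 * (p + q)
        js = 0.5 * p * math.log(p / m) + (0.5 * q * math.log(q / m) if q > 0 else 0.0)
        signals[term] = (n_td, corpus_counts.get(term, 0), js)
    return signals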

Now we need to do the following things:
  1. (zkh) Figure out how to use above signals or need other signals.
  2. (wyi) Bayesian modeling of domain-specific word identification, in particular, considering the hierarchy of domains:
    background -> domain -> sub-domain -> entities -> reviews.
  3. (lzy) More detailed experiments using Tianya.cn as the background corpus, Dianping.com as the domain-specific data, and restaurants in the Dianping.com data as entities.
    1. The Tianya.cn LDA model is at: /gfs/wf/home/hjbai/tianya/model2/word_distribution-00000-of-00001
    2. The Dianping.com training documents are at: /home/chengxu/shenghuo_data/xmlfeed/reviews/dianping.com.tsv
    3. Edit reviews in Dianping.com (which can act as ground-truth) are at: /home/chengxu/shenghuo_data/xmlfeed/dianping_data.tsv
    Items 2 and 3 were copied to /home/wyi/dianping/ for Zhiyuan's convenient access.

Dec 1, 2008

Key Points about the Sum-Product Algorithm

  • The sum-product algorithm is not limited to inference in probabilistic graphical models; it is a general method for marginalizing high-dimensional functions [1].
  • It requires that the high-dimensional function factorizes as a product of local functions of lower dimensionality. An intuitive representation of such a factorization is a factor graph.
  • Both Bayesian networks and Markov random fields can be represented by factor graphs (cf. Equations (8.5) and (8.39) in [2], and the details in Section 8.4.3).
  • Factor graphs are bipartite graphs whose edges connect variable nodes and function nodes. That is to say, the neighbors of a variable node are all function nodes, and the neighbors of a function node are all variable nodes.
  • If there is no loop in the factor graph, sum-product amounts to passing messages in both directions along each edge.
  • If there are loops in the factor graph, sum-product has to be run iteratively (loopy belief propagation).
  • A good starting point for understanding sum-product is the specific problem of inference in a chain model (which yields the forward/backward algorithm); then extend the chain with branches to make it a tree (e.g., the latent topics Z in supervised LDA, affected by the Dirichlet prior, the document ratings and the content words); then extend the tree to have loops. (A minimal sketch of the chain case follows this list.)
  • As sum-product is expressed in terms of message passing, it is well suited to parallel computing frameworks like MPI and BSP.
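As mentioned in the chain-model bullet above, here is a minimal sketch (my own, with arbitrary example potentials) of sum-product on a chain with unary factors f_t and pairwise factors g_t; the forward and backward recursions are exactly the two message directions along each edge.

import numpy as np

def chain_marginals(unary, pairwise):
    # unary:    T x S array, unary[t, s] = f_t(x_t = s)
    # pairwise: (T-1) x S x S array, pairwise[t, s, s2] = g_t(x_t = s, x_{t+1} = s2)
    # Returns the T x S array of normalized marginals p(x_t).
    T, S = unary.shape
    fwd = np.zeros((T, S))
    bwd = np.ones((T, S))
    fwd[0] = unary[0]
    for t in range(1, T):                        # forward messages
        fwd[t] = unary[t] * (fwd[t - 1] @ pairwise[t - 1])
    for t in range(T - 2, -1, -1):               # backward messages
        bwd[t] = pairwise[t] @ (unary[t + 1] * bwd[t + 1])
    marg = fwd * bwd
    return marg / marg.sum(axis=1, keepdims=True)

unary = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
pairwise = np.array([[[0.8, 0.2], [0.2, 0.8]]] * 2)
print(chain_marginals(unary, pairwise))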
References:
  1. Frank R. Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor Graphs and the Sum-Product Algorithm. IEEE Transactions on Information Theory, 47(2), February 2001.
  2. Christopher Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Section 8.4.
  3. Mixture Modeling by Affinity Propagation. NIPS 2006.
  4. Clustering by Passing Messages Between Data Points. Science, 315(5814):972-976, February 2007.
  4. Clustering by Passing Messages Between Data Points, Science, Vol 315, No 5814, pp 972-976, February 2007.