Feb 27, 2010

Could Latent Dirichlet Allocation Handle Documents with Various Lengths?

I heard that some of my colleagues, who are working on another latent topic model different from LDA, complain that LDA likes documents of similar lengths. I agree with this, but I feel it can be fixed easily. Here is what I think.

The Gibbs sampling algorithm of LDA samples each latent topic assignment from a conditional distribution of the following form,
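In the standard collapsed Gibbs sampler, this conditional is usually written as below. The count notation N(.), the number of topics K, and the word-prior name beta are assumptions of this sketch, chosen to agree with the symbols V and L_d used in the text; all counts exclude the token currently being resampled.

```latex
% Standard collapsed Gibbs update for LDA (notation assumed, see lead-in).
% First term:  P(w_i | z_i = k), smoothed by beta over a vocabulary of size V.
% Second term: P(z_i = k | d),   smoothed by alpha over K topics.
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w})
  \;\propto\;
  \frac{N(w_i, k) + \beta}{N(k) + V\beta}
  \cdot
  \frac{N(k, d) + \alpha}{L_d - 1 + K\alpha}
```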

where V is the vocabulary size and Ld is the length of document d.

The second term depends on the document length. Consider an example document about two topics, A and B, where half of its words are assigned to topic A and the other half to topic B. Then the P(z|d) distribution should have two high bins (height proportional to L/2 + alpha), and all other bins are short (height proportional to alpha). So if the document has 1000 words, alpha has a negligible effect on the shape of P(z|d); but if the document contains only 2 words, alpha has a much larger effect on the shape of P(z|d).

An intuitive solution to the above problem is to use a small alpha for short documents (and vice versa). But would this break the mathematical assumptions of LDA? No, because it is equivalent to using different symmetric Dirichlet priors for documents of different lengths. This does not break the Dirichlet-multinomial conjugacy required by LDA's Gibbs sampling algorithm; it just expresses a little more prior knowledge than using one symmetric prior for all documents. Let us set

for each document. Users then need to specify the parameter k, just as they needed to specify alpha before.
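The effect can be checked numerically with a small sketch. The number of topics, the value of alpha, and in particular the scaling rule alpha_d = k * L_d are assumptions made for illustration; the rule is one plausible reading of "small alpha for short documents", not necessarily the exact formula intended above.

```python
def topic_distribution(counts, alpha):
    """Smoothed P(z|d): (n_z + alpha) / (L + K * alpha) for each topic z."""
    total = sum(counts) + len(counts) * alpha
    return [(n + alpha) / total for n in counts]

K = 10        # number of topics (assumed for illustration)
alpha = 0.1   # one fixed symmetric prior shared by all documents

long_doc = [500, 500] + [0] * (K - 2)   # 1000 words: half topic A, half topic B
short_doc = [1, 1] + [0] * (K - 2)      # 2 words: one topic A, one topic B

# With a fixed alpha, the short document's two "high bins" are flattened.
p_long = topic_distribution(long_doc, alpha)
p_short = topic_distribution(short_doc, alpha)

# ASSUMPTION: hypothetical length-proportional prior alpha_d = k * L_d.
k = 1e-4
p_long_scaled = topic_distribution(long_doc, k * sum(long_doc))
p_short_scaled = topic_distribution(short_doc, k * sum(short_doc))

print("fixed alpha :", p_long[0], p_short[0])
print("scaled alpha:", p_long_scaled[0], p_short_scaled[0])
```

Under this assumed rule the bin heights become (1/2 + k) / (1 + K*k) regardless of L, so the two documents get the same shape of P(z|d).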

Feb 21, 2010

S-shaped Functions

The logistic sigmoid function (the inverse of the logit):

The tanh function:

The probit function:


For more on this approximation, look at Figure 4.9 of Pattern Recognition and Machine Learning.
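The three functions can be compared numerically with a short sketch. It assumes the usual definitions: the logistic sigmoid 1/(1+exp(-a)), tanh, and the standard normal CDF for the function called probit here. The scaling factor sqrt(pi/8) is the one used in that figure to match the slopes of the two curves at the origin; the evaluation grid is an arbitrary choice.

```python
import math

def sigmoid(a):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-a))

def normal_cdf(a):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

lam = math.sqrt(math.pi / 8.0)                 # slope-matching scale factor
grid = [i / 100.0 for i in range(-800, 801)]   # a in [-8, 8]

# Largest gap between sigmoid(a) and the rescaled normal CDF on the grid.
max_gap = max(abs(sigmoid(a) - normal_cdf(lam * a)) for a in grid)

# tanh is a shifted and rescaled sigmoid: tanh(a) = 2 * sigmoid(2a) - 1.
tanh_gap = max(abs(math.tanh(a) - (2.0 * sigmoid(2.0 * a) - 1.0)) for a in grid)

print("max |sigmoid - Phi(lam * a)| =", max_gap)
print("max |tanh - (2*sigmoid(2a) - 1)| =", tanh_gap)
```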

Feb 20, 2010

Cavendish Experiment and Modern Machine Learning

This is just a joke, so do not take it seriously...

A very common case in modern machine learning is as follows:
  1. Design a model to describe the data. For example, suppose some 2D points are generated along a quadratic curve y = a*x^2 + b*x + c.
  2. Design an algorithm that estimates the model parameters (in our case a, b, and c) given a set of data (observations) (x_1,y_1), (x_2,y_2), ..., (x_n,y_n).
  3. Use the model parameters in some way, say, to predict the corresponding y for a new x.
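The three steps can be sketched in a few lines. Least squares via the normal equations is my choice of estimation algorithm here, and the sample points are made up for illustration; neither comes from the discussion above.

```python
def fit_quadratic(xs, ys):
    """Least-squares estimate of (a, b, c) in y = a*x^2 + b*x + c."""
    # Power sums S[k] = sum(x^k) and moments T[k] = sum(x^k * y).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented normal equations for the unknowns (a, b, c).
    M = [
        [S[4], S[3], S[2], T[2]],
        [S[3], S[2], S[1], T[1]],
        [S[2], S[1], S[0], T[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [mr - f * mc for mr, mc in zip(M[r], M[col])]
    return tuple(M[i][3] / M[i][i] for i in range(3))

# Step 1: the model is y = a*x^2 + b*x + c; here data come from a=2, b=-3, c=1.
xs = [-2, -1, 0, 1, 2, 3]
ys = [2 * x * x - 3 * x + 1 for x in xs]

# Step 2: estimate the parameters from the observations.
a, b, c = fit_quadratic(xs, ys)

# Step 3: use the parameters to predict y for a new x.
x_new = 4
y_new = a * x_new ** 2 + b * x_new + c
print(a, b, c, y_new)
```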
So, when I was reading Prof. Feynman's lecture notes, which mention the Cavendish experiment, I thought of this experiment as a kind of "learning using machines": using specially designed equipment (a machine), Cavendish measured the gravitational constant G in Newton's law of universal gravitation:
F = G * m1 * m2 / r^2
And, using the estimated model parameter G, we can do some interesting things. For example, we can measure the mass of the earth: measure the gravitational force F on a small ball of known mass m1, take r to be the radius of the earth, and put these values back into the equation to get m2, the mass of the earth.
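As a rough worked example of putting the numbers back into the equation, the sketch below solves F = G * m1 * m2 / r^2 for m2. The numeric values are commonly quoted approximations that I am assuming here; they do not come from the lecture notes.

```python
# Assumed approximate values (not from the post):
G = 6.674e-11   # gravitational constant, N * m^2 / kg^2
m1 = 1.0        # mass of the small ball, kg
F = 9.81        # gravitational force on the 1 kg ball at the surface, N
r = 6.371e6     # mean radius of the earth, m

# Rearranging F = G * m1 * m2 / r^2 for the unknown mass m2.
m2 = F * r ** 2 / (G * m1)
print("estimated mass of the earth: %.3e kg" % m2)
```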

However, as I said, this is a joke, so you cannot use it in your lecture notes on machine learning. In fact, Cavendish did not measure G, contrary to what many textbooks state. Instead, he measured the mass of the earth directly by comparing (1) the force with which a big ball of known mass attracts a small ball with (2) the force with which the earth attracts the small ball. If the ratio (2)/(1) is N, then the earth is N times as heavy as the big ball.

Feb 14, 2010

Highlights in LaTeX

To highlight part of the text in LaTeX, load the following two packages:
\usepackage{color,soul}
And in the text, use the macro \hl:
The authors use \hl{$x=100$ in their demonstration}.
Note that if you use only soul without color, \hl just falls back to underlining.

Feb 4, 2010

Google Puts New Focus on Outside Research

It was recently reported that Google is stepping up its funding to support research in the following four areas:
  • machine learning
  • the use of cellphones as data collection devices in science
  • energy efficiency
  • privacy
Among these four areas, machine learning is at the top. "Three years ago, three of the four research areas would not have been on the company’s priority list," Mr. Spector said. "The only one that was a priority then and now is machine learning, a vital ingredient in search technology."

Feb 3, 2010

Reduce Data Correlation by Recentering and Rescaling

In the MATLAB statistics demo document, the training data (a set of car weights) are recentered and rescaled as follows:
% A set of car weights
weight = [2100 2300 2500 2700 2900 3100 3300 3500 3700 3900 4100 4300]';
weight = (weight-2800)/1000; % recenter and rescale
And the document explains the reason for recentering and rescaling as follows:
The data include observations of weight, number of cars tested, and number failed. We will work with a transformed version of the weights to reduce the correlation in our estimates of the regression parameters.
Could anyone tell me why recentering and rescaling can reduce the correlation?
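One possible explanation can be sketched under a simplifying assumption: an ordinary least-squares line y = b0 + b1*x, which is not necessarily the model fitted in the demo. In that setting the covariance of the estimates is proportional to inv(X'X), so the correlation between the intercept and slope estimates depends only on the x values, and moving their mean toward zero shrinks it.

```python
import math

def intercept_slope_correlation(xs):
    """Correlation between the OLS estimates of b0 and b1 in y = b0 + b1*x.

    With X'X = [[n, sum(x)], [sum(x), sum(x^2)]], inverting the 2x2 matrix
    gives corr(b0, b1) = -sum(x) / sqrt(n * sum(x^2)).
    """
    n = len(xs)
    return -sum(xs) / math.sqrt(n * sum(x * x for x in xs))

weight = [2100 + 200 * i for i in range(12)]          # same values as the demo
rescaled_only = [w / 1000 for w in weight]            # rescale, no recentering
transformed = [(w - 2800) / 1000 for w in weight]     # recenter and rescale

corr_raw = intercept_slope_correlation(weight)
corr_rescaled = intercept_slope_correlation(rescaled_only)
corr_transformed = intercept_slope_correlation(transformed)
print(corr_raw, corr_rescaled, corr_transformed)
```

In this simplified setting the rescaling alone leaves the correlation unchanged; it is the recentering that moves it away from -1.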

Feb 2, 2010

Using aMule with VeryCD


Making Videos Playable on Android and iPhone

You might want to convert your home-made video (no pirated video :-!) into a format that your Android phone can play. The video formats that Android supports are listed on the Android developers' site:

Among the listed formats, H.264 (part of the MPEG-4 standard, also known as MPEG-4 AVC) has been well accepted by the industry. Companies including Apple have switched to it. In the following, I will show how to use open source software on a Linux box to convert your video into an .mp4 file with H.264/AVC video and AAC audio. I took the following post as a reference, but with updates.

First of all, you need to install the following software packages:
  1. mplayer: a multimedia player
  2. mencoder: MPlayer's movie encoder
  3. faac: an AAC audio encoder
  4. gpac: a small and flexible implementation of the MPEG-4 system standard
  5. x264: video encoder for the H.264/MPEG-4 AVC standard
Then we do the following steps to convert video.avi into an .mp4 file in H.264 format.
  1. Extract the audio information from video.avi using MPlayer:
    mplayer -ao pcm -vc null -vo null video.avi
    This will generate an audiodump.wav file.

  2. Encode audiodump.wav into AAC format
    faac --mpeg-vers 4 audiodump.wav
    This generates an audiodump.aac file.

  3. Use mencoder to convert the video content of video.avi into YUV 4:2:0 format, and use x264 to encode the output into AVC format
    mkfifo tmp.fifo.yuv
    mencoder -vf scale=800:450,format=i420 \
    -nosound -ovc raw -of rawvideo \
    -ofps 23.976 -o tmp.fifo.yuv video.avi > /dev/null 2>&1 &
    x264 -o max-video.mp4 --fps 23.976 --bframes 2 \
    --progress --crf 26 --subme 6 \
    --analyse p8x8,b8x8,i4x4,p4x4 \
    --no-psnr tmp.fifo.yuv 800x450
    rm tmp.fifo.yuv
    We created a named pipe to pass data from mencoder to x264. These command lines generate H.264 content that is also QuickTime-compatible, because Apple QuickTime can now hold H.264 content. Be sure to specify the same video size to mencoder and x264. In the above example, the size is 800x450.

  4. Merge the AAC audio and AVC video into a .mp4 file using gpac
    MP4Box -add max-video.mp4 -add audiodump.aac \
    -fps 23.976 max-x264.mp4
    MP4Box is a tool in the gpac package.
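Since the same size and frame rate must be given to both mencoder and x264, one way to avoid a mismatch is to generate all command lines from a single set of values. The sketch below only builds the argument lists, using the flags from the steps above; the function name and the dictionary keys are hypothetical, and nothing is executed.

```python
def build_commands(src, width, height, fps="23.976", crf=26):
    """Return argv lists for steps 1-4, sharing one size and one frame rate."""
    size = f"{width}x{height}"
    fifo = "tmp.fifo.yuv"
    return {
        "extract_audio": ["mplayer", "-ao", "pcm", "-vc", "null",
                          "-vo", "null", src],
        "encode_audio": ["faac", "--mpeg-vers", "4", "audiodump.wav"],
        "make_fifo": ["mkfifo", fifo],
        # Must run in the background while x264 reads from the pipe.
        "raw_video": ["mencoder", "-vf", f"scale={width}:{height},format=i420",
                      "-nosound", "-ovc", "raw", "-of", "rawvideo",
                      "-ofps", fps, "-o", fifo, src],
        "encode_video": ["x264", "-o", "max-video.mp4", "--fps", fps,
                         "--bframes", "2", "--progress", "--crf", str(crf),
                         "--subme", "6", "--analyse", "p8x8,b8x8,i4x4,p4x4",
                         "--no-psnr", fifo, size],
        "mux": ["MP4Box", "-add", "max-video.mp4", "-add", "audiodump.aac",
                "-fps", fps, "max-x264.mp4"],
    }

cmds = build_commands("video.avi", 800, 450)
for name, argv in cmds.items():
    print(name + ":", " ".join(argv))
```

The lists could be handed to subprocess, with the mencoder step started first and left running while x264 consumes the named pipe.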

How to Set Up a Web Server on Mac OS X


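Starting Apache on Mac OS X

Mac OS X ships with Apache, and starting it is easy. As shown in the figure below: open System Preferences, choose Sharing in the Internet & Wireless category, and check Web Sharing. Apache is now running.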



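My home connects to the Internet through China Telecom's ADSL service. To let several computers at home get online, I attached a wireless router to the ADSL modem. So when a computer at home starts up, the wireless router assigns it an IP address dynamically, while the router's external IP address is assigned by China Telecom through the ADSL service. To let users on the Internet reach my web service through my external IP, I need to configure port forwarding on the wireless router, so that it forwards requests from Internet users to the web server program on my iMac.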





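  1. On the home page of www.3322.org, register a free account. My user name is cxwangyi.

  2. Go to the "My Console" page and, under "Dynamic Domain Names", click "New". Then choose a domain suffix. 3322.org offers several choices; I chose 7766.org. My domain name is my 3322.org user name plus the suffix I chose, that is, cxwangyi.7766.org. This configuration page detects our external IP address automatically, so we do not need to enter it by hand. The other options can be left at their default values. A screenshot follows: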


lynx -mime_header -auth=cxwangyi:123456 \

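Many systems (including Mac OS X) do not ship with lynx, but do include a simpler standard program called curl. The command line that reports the IP address to 3322.org with curl is:
curl -u cxwangyi:123456 \
"http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"
To report the address periodically, the command can be wrapped in a loop:
while true; do \
sleep 10000; \
curl -u cxwangyi:123456 \
"http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"; \
done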

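Creating Technical Content with Emacs Muse

With a web service running, we still need content. There are countless tools that help with making web pages. I use Emacs Muse, an Emacs plug-in that lets users write content in a simple wiki syntax (including figures and even complex math formulas) and publish the result as HTML (or PDF and other formats). For downloading and installing Emacs Muse, see the instructions on its home page. After installation, I added the following settings to my .emacs file: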
(add-to-list 'load-path "~/.emacs.d/lisp/muse")
(require 'muse-mode) ; load authoring mode
(require 'muse-html) ; load publishing styles I use
(require 'muse-latex)
(require 'muse-texinfo)
(require 'muse-docbook)
(require 'muse-latex2png) ; display LaTeX math equations
(setq muse-latex2png-scale-factor 1.4) ; the scaling of equation images.
(require 'muse-project) ; publish files in projects
;; Derive an HTML publishing style that links my own style sheet.
(muse-derive-style "technotes-html" "html"
:style-sheet "<link rel=\"stylesheet\" type=\"text/css\" media=\"all\" href=\"../css/wangyi.css\" />")
(setq muse-project-alist
'(("technotes" ("~/TechNotes" :default "index")
(:base "technotes-html" :path "~/Sites/TechNotes"))))

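Here, ~/.emacs.d/lisp/muse is the directory where my Muse is installed, and ~/TechNotes is the directory that stores my technical notes. Each note is a text file with the .muse suffix in this directory (for example, HowToSetup.muse). When I edit such a file in Emacs, I just press Control-c Control-p, and Emacs Muse automatically publishes the document as HTML into the ~/Sites/TechNotes directory (HowToSetup.html).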