Tech Notes of Yi Wang: 02/2010

Feb 27, 2010

Could Latent Dirichlet Allocation Handle Documents with Various Lengths?

I heard some of my colleagues, who are working on another latent topic model different from LDA, complain that LDA prefers documents of similar lengths. I agree with this, but I feel it can be fixed easily. Here is what I think.

The Gibbs sampling algorithm of LDA samples each latent topic assignment as follows:
$z_i \sim P(w|z)P(z|d) \propto \frac{N(w,z) + \beta}{N(z) + V\beta} \frac{N(z,d) + \alpha}{L_d + K\alpha}$
where V is the vocabulary size, K is the number of topics, and L_d is the length of document d.

The second term depends on the document length. Consider an example document about two topics, A and B, where half of its words are assigned to topic A and the other half to topic B. The P(z|d) distribution then has two high bins (height proportional to L_d/2 + alpha), and all other bins are short (height proportional to alpha). So if the document has 1000 words, alpha has a trivial effect on the shape of P(z|d); but if the document contains only 2 words, alpha has a much larger effect on the shape of P(z|d).

An intuitive solution to the above problem is to use a small alpha for short documents (and a large alpha for long ones). But would this break the mathematical assumptions behind LDA? No, because it is equivalent to using different symmetric Dirichlet priors for documents of different lengths. This does not break the Dirichlet-multinomial conjugacy required by LDA's Gibbs sampling algorithm; it just expresses a little more prior knowledge than using one symmetric prior for all documents. Let us set
$\alpha_d = k L_d$
for each document. Users then need to specify the parameter k, just as they needed to specify alpha before.
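
To make the effect concrete, here is a small Python sketch of the two-topic example above. The number of topics, the fixed alpha, and the value of k are hypothetical numbers picked only for illustration; the function simply evaluates the smoothed document-topic proportions (N(z,d) + alpha) / (L_d + K*alpha).

```python
def topic_proportions(counts, alpha):
    """Smoothed P(z|d) = (N(z,d) + alpha) / (L_d + K * alpha)."""
    length = sum(counts)
    num_topics = len(counts)
    return [(n + alpha) / (length + num_topics * alpha) for n in counts]


NUM_TOPICS = 10   # hypothetical number of topics K
FIXED_ALPHA = 0.5  # hypothetical alpha shared by all documents
SCALE_K = 0.005    # hypothetical k in alpha_d = k * L_d


def two_topic_counts(length, num_topics=NUM_TOPICS):
    """Half of the words on topic 0, the other half on topic 1."""
    counts = [0] * num_topics
    counts[0] = length // 2
    counts[1] = length - length // 2
    return counts


short_doc = two_topic_counts(2)
long_doc = two_topic_counts(1000)

# With one alpha for all documents, the short document looks much flatter.
short_fixed = topic_proportions(short_doc, FIXED_ALPHA)
long_fixed = topic_proportions(long_doc, FIXED_ALPHA)

# With alpha_d = k * L_d, both documents get the same shape.
short_scaled = topic_proportions(short_doc, SCALE_K * sum(short_doc))
long_scaled = topic_proportions(long_doc, SCALE_K * sum(long_doc))

print("fixed alpha :", short_fixed[0], long_fixed[0])
print("scaled alpha:", short_scaled[0], long_scaled[0])
```

In this toy setting the length-proportional prior removes the dependence on document length entirely, because both the counts and the prior scale with L_d.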

Feb 21, 2010

S-shaped Functions

The logistic sigmoid function (the inverse of the logit):
$\sigma(x) = \frac{1}{1 + \exp(-x)}$

The tanh function:
$\tanh(x) = 2\sigma(2x) - 1$

The inverse probit function (the standard normal CDF):
$\Phi(x) = \int_{-\infty}^x \mathcal{N}(\theta;0,1) \;\mathrm{d}\theta$
which, after rescaling its argument, closely approximates the logistic sigmoid:
$\Phi\left(\sqrt{\frac{\pi}{8}}x\right) \approx \sigma(x)$
For more on this approximation, see Figure 4.9 of Pattern Recognition and Machine Learning.
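
As a quick numerical check, the following Python sketch evaluates the three functions on a grid. It verifies the identity tanh(x) = 2σ(2x) − 1 and measures how far Φ(sqrt(π/8) x) is from σ(x). The grid range and step are arbitrary choices made for this example.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def normal_cdf(x):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Arbitrary evaluation grid: -8.0, -7.9, ..., 8.0
xs = [i / 10.0 for i in range(-80, 81)]

# Largest deviation from the identity tanh(x) = 2*sigmoid(2x) - 1.
tanh_gap = max(abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) for x in xs)

# Largest deviation between the rescaled normal CDF and the sigmoid.
scale = math.sqrt(math.pi / 8.0)
probit_gap = max(abs(normal_cdf(scale * x) - sigmoid(x)) for x in xs)

print("max |tanh(x) - (2*sigmoid(2x) - 1)| =", tanh_gap)
print("max |Phi(sqrt(pi/8)*x) - sigmoid(x)| =", probit_gap)
```

The first gap is at the level of floating-point rounding, while the second is small but clearly nonzero, which is what "approximately equal" means here.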

Feb 20, 2010

Cavendish Experiment and Modern Machine Learning

This is just a joke, so do not take it seriously...

A very common setup in modern machine learning is as follows:

design a model to describe the data; for example, suppose some 2D points are generated along a quadratic curve y = a*x^2 + b*x + c, and
design an algorithm that estimates the model parameters (in our case a, b, and c) from a set of data (observations) (x_1,y_1), (x_2,y_2), ..., (x_n,y_n).

The model parameters can then be used in some way, say, to predict the corresponding y for a new x.
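
The quadratic example can be sketched in a few lines of Python. This is only one possible estimator (ordinary least squares, solved through the normal equations), and the observations are hypothetical noise-free points generated from y = 2x^2 - 3x + 1.

```python
def fit_quadratic(points):
    """Least-squares estimate of (a, b, c) in y = a*x^2 + b*x + c."""
    # Accumulate the normal equations (A^T A) p = A^T y,
    # where each row of A is [x^2, x, 1].
    ata = [[0.0] * 3 for _ in range(3)]
    aty = [0.0] * 3
    for x, y in points:
        row = [x * x, x, 1.0]
        for i in range(3):
            aty[i] += row[i] * y
            for j in range(3):
                ata[i][j] += row[i] * row[j]

    # Solve the 3x3 system by Gauss-Jordan elimination with partial pivoting.
    m = [ata[i] + [aty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def predict(params, x):
    """Use the estimated parameters to predict y for a new x."""
    a, b, c = params
    return a * x * x + b * x + c


# Hypothetical observations generated from y = 2x^2 - 3x + 1.
observations = [(x, 2.0 * x * x - 3.0 * x + 1.0)
                for x in [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]]
params = fit_quadratic(observations)
print("estimated (a, b, c):", params)
print("prediction at x = 4:", predict(params, 4.0))
```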

So, when I was reading Prof. Feynman's lecture notes, which mention the Cavendish experiment, I thought this experiment was a kind of "learning using machines": using specially designed equipment (a machine), Cavendish measured the gravitational constant G in Newton's law of universal gravitation:

F = G * m1 * m2 / r^2

And, using the estimated model parameter G, we can do something interesting. For example, we can measure the mass of the earth: measure the weight (gravity) F of a small ball of known mass m1, and put these values back into the equation, with r being the radius of the earth, to get m2, the mass of the earth.
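
In the same joking spirit, here is the "prediction" step as a tiny Python sketch. The numbers are rounded textbook values that I am assuming for illustration (G of about 6.674e-11 N m^2/kg^2, a mean earth radius of about 6.371e6 m, and a 1 kg ball weighing about 9.81 N); they are not measurements from the experiment.

```python
# Rounded textbook values, assumed for illustration only.
G = 6.674e-11          # gravitational constant, N * m^2 / kg^2
EARTH_RADIUS = 6.371e6  # mean radius of the earth, m

small_ball_mass = 1.0   # m1, kg
small_ball_weight = 9.81  # F, N (weight of the 1 kg ball at the surface)

# Rearranging F = G * m1 * m2 / r^2 for m2.
earth_mass = small_ball_weight * EARTH_RADIUS ** 2 / (G * small_ball_mass)
print("estimated mass of the earth (kg):", earth_mass)
```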

However, as I said, this is a joke, so you should not use it in your lecture notes on machine learning. In fact, Cavendish did not measure G, contrary to what many textbooks state. Instead, he measured the earth directly by comparing (1) the force with which a big ball of known mass attracts a small ball with (2) the force with which the earth attracts the small ball. If the ratio (2)/(1) is N, then, once the difference in distances is accounted for, the earth is N times as massive as the big ball.

Feb 14, 2010

Highlights in LaTeX

To highlight part of the text in LaTeX, use the following two packages:

\usepackage{color}
\usepackage{soul}

And in the text, use macro \hl:

The authors use \hl{$x=100$ in their demonstration}.

Note that if you use only soul without color, \hl falls back to underlining.

Feb 8, 2010

A Tutorial on Network Traffic Analysis

http://www.chrissanders.org/?p=47

Feb 4, 2010

Google Puts New Focus on Outside Research

It was recently reported that Google is stepping up its funding to support research in the following four areas:

machine learning
the use of cellphones as data collection devices in science
energy efficiency
privacy

Among these four areas, machine learning is at the top. "Three years ago, three of the four research areas would not have been on the company’s priority list", Mr. Spector said. "The only one that was a priority then and now is machine learning, a vital ingredient in search technology."

Feb 3, 2010

Reduce Data Correlation by Recentering and Rescaling

In the MATLAB statistics demo document, the training data (a set of car weights) are recentered and rescaled as follows:

% A set of car weights
weight = [2100 2300 2500 2700 2900 3100 3300 3500 3700 3900 4100 4300]';
weight = (weight-2800)/1000;     % recenter and rescale

And the document explains the reason for recentering and rescaling as follows:

The data include observations of weight, number of cars tested, and number failed. We will work with a transformed version of the weights to reduce the correlation in our estimates of the regression parameters.

Could anyone tell me why recentering and rescaling can reduce the correlation?
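
One possible explanation, sketched here under a simplifying assumption: for ordinary least-squares regression with an intercept and a slope, the covariance of the two estimates is proportional to the inverse of X'X, which gives a correlation of -sum(x) / sqrt(n * sum(x^2)). The MATLAB demo may fit a different model, so treat this only as an illustration of the general idea. The function name below is hypothetical.

```python
import math


def intercept_slope_correlation(xs):
    """Correlation between the OLS intercept and slope estimates.

    Derived from the off-diagonal entry of inverse(X'X) for X = [1, x]:
    corr = -sum(x) / sqrt(n * sum(x^2)).
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_xx = sum(x * x for x in xs)
    return -sum_x / math.sqrt(n * sum_xx)


# The car weights from the demo: 2100, 2300, ..., 4300.
weight = [2100 + 200 * i for i in range(12)]
transformed = [(w - 2800) / 1000.0 for w in weight]

raw_corr = intercept_slope_correlation(weight)
new_corr = intercept_slope_correlation(transformed)
print("correlation with raw weights        :", raw_corr)
print("correlation with transformed weights:", new_corr)
```

Under this assumption the improvement comes from the recentering: moving the data closer to zero shrinks sum(x) relative to sum(x^2). Rescaling alone leaves this correlation unchanged, although it does put the two parameters on comparable numerical scales.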

Feb 2, 2010

Using aMule with VeryCD

http://hi.baidu.com/linsir/blog/item/c4b54839805a9af73a87cea2.html

Making Videos Playable on Android and iPhone

You might want to convert your home-made video (no pirated video :-!) into a format that your Android phone can play. The video formats that Android supports are listed on the Android developers' site:

http://developer.android.com/guide/appendix/media-formats.html

Among the listed formats, H.264 (part of the MPEG-4 standard) has been well accepted by the industry, and companies including Apple have switched to it. In the following, I will show how to use open source software on a Linux box to convert your video into an .mp4 file with H.264/AVC video and AAC audio. I took an earlier post as a reference, but with updates.

First of all, you need to install the following software packages:

mplayer: a multimedia player
mencoder: MPlayer's movie encoder
faac: an AAC audio encoder
gpac: a small and flexible implementation of the MPEG-4 Systems standard
x264: video encoder for the H.264/MPEG-4 AVC standard

Then we take the following steps to convert video.avi into an .mp4 file in H.264 format.

Extract the audio information from video.avi using MPlayer:
```
mplayer -ao pcm -vc null -vo null video.avi
```
This will generate an audiodump.wav file.

Encode audiodump.wav into AAC format:
```
faac --mpeg-vers 4 audiodump.wav
```
This generates an audiodump.aac file.

Use mencoder to convert the video content of video.avi into YUV 4:2:0 format, and use x264 to encode the output into AVC format:
```
mkfifo tmp.fifo.yuv
mencoder -vf scale=800:450,format=i420 \
 -nosound -ovc raw -of rawvideo \
 -ofps 23.976 -o tmp.fifo.yuv video.avi > /dev/null 2>&1 &
x264 -o max-video.mp4 --fps 23.976 --bframes 2 \
 --progress --crf 26 --subme 6 \
 --analyse p8x8,b8x8,i4x4,p4x4 \
 --no-psnr tmp.fifo.yuv 800x450
rm tmp.fifo.yuv
```
We create a named pipe as a buffer between mencoder and x264. These command lines generate content that is compatible with both QuickTime and H.264, because Apple QuickTime can now hold H.264 content. Be sure to specify the same video size to both mencoder and x264; in the above example, the size is 800x450.

Merge the AAC audio and AVC video into a .mp4 file using gpac:
```
MP4Box -add max-video.mp4 -add audiodump.aac \
  -fps 23.976 max-x264.mp4
```
MP4Box is a tool in the gpac package.
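
Because the video size has to match in two places, it can help to generate the command lines from a single set of parameters. The Python sketch below only builds and prints the commands from the steps above; it does not run them, and the function name and its parameters are hypothetical. Note that mencoder still has to run in the background while x264 reads from the named pipe.

```python
import shlex


def build_commands(source, width, height, fps="23.976", fifo="tmp.fifo.yuv"):
    """Return the command lines of the steps above as argument lists."""
    size_filter = "scale=%d:%d,format=i420" % (width, height)
    size_arg = "%dx%d" % (width, height)
    return [
        # 1. Dump the audio track to audiodump.wav.
        ["mplayer", "-ao", "pcm", "-vc", "null", "-vo", "null", source],
        # 2. Encode the audio to AAC (audiodump.aac).
        ["faac", "--mpeg-vers", "4", "audiodump.wav"],
        # 3. Named pipe between mencoder and x264.
        ["mkfifo", fifo],
        # 4. Raw YUV 4:2:0 video into the pipe (run this in the background).
        ["mencoder", "-vf", size_filter, "-nosound", "-ovc", "raw",
         "-of", "rawvideo", "-ofps", fps, "-o", fifo, source],
        # 5. H.264 encoding from the pipe, using the same size as step 4.
        ["x264", "-o", "max-video.mp4", "--fps", fps, "--bframes", "2",
         "--progress", "--crf", "26", "--subme", "6",
         "--analyse", "p8x8,b8x8,i4x4,p4x4", "--no-psnr", fifo, size_arg],
        # 6. Merge video and audio into the final .mp4 file.
        ["MP4Box", "-add", "max-video.mp4", "-add", "audiodump.aac",
         "-fps", fps, "max-x264.mp4"],
    ]


commands = build_commands("video.avi", 800, 450)
for command in commands:
    print(" ".join(shlex.quote(arg) for arg in command))
```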

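How to Set Up a Web Server on Mac OS X

I recently set up a web server on my iMac to manage my own technical notes. Like many people, my machine goes online through a wireless router connected to ADSL, so the problems I ran into should be fairly typical. I am therefore writing down the setup process to share with everyone.

Starting Apache on Mac OS X

Mac OS X ships with Apache, and starting it is easy: open System Preferences, choose Sharing in the Internet & Wireless category, and check Web Sharing. Apache is now running.

You can now enter http://localhost/~wangyi in the address bar of a web browser to visit your own home page. [Note that wangyi is my user name on my iMac; replace it with your own.] The HTML file behind this page is /Users/wangyi/Sites/index.html, and you can edit it to customize your home page.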

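Making My Home Page Accessible to Everyone

My home goes online through China Telecom's ADSL service. To let several computers at home share the connection, I attached a wireless router to the ADSL modem. When a computer at home starts up, the wireless router dynamically assigns it an IP address, while the router's external IP address is assigned by China Telecom through the ADSL service. For users on the Internet to reach my web service through my external IP, I need to configure port forwarding on the wireless router, so that it forwards requests from Internet users to the web server program on my iMac.

I use a TP-LINK TL-WR541G+ wireless router. To reach its configuration interface, just enter http://192.168.1.1 in a browser. [192.168.1.1 is the internal IP address of my wireless router, and 192.168.1.100 is the internal IP address of my iMac. These are the defaults of the TP-LINK wireless router.]

In the "Virtual Applications" item under "Forwarding Rules", set both the "Trigger Port" and the "Open Port" to 80 (the standard port number of HTTP). Internet users can then enter my external IP in a browser to reach my web service.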

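Setting Up a Domain Name

The problem now is that people usually do not know my external IP, because it is assigned to my wireless router by my ISP (China Telecom). That is not a problem, though: I know my external IP, so I only need to register a domain name that points to it and then publish the domain name.

Registering a domain name usually costs money. In China, only one company, called 3322.org, offers free domain name registration. To save money, I went with 3322.org. To do so, visit www.3322.org:

On the home page of www.3322.org, register a user account for free. My user name is cxwangyi.

Go to the "My Console" page and click "New" under "Dynamic Domain Names". Then choose a domain suffix; 3322.org offers several choices, and I picked 7766.org. My domain name is my 3322.org user name followed by the suffix I chose, that is, cxwangyi.7766.org. The configuration page detects our external IP address automatically, so there is no need to type it in by hand. All other options can be left at their defaults.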

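Updating the External IP Address

Most ISPs (including China Telecom) assign IP addresses with DHCP. This means that our IP address changes every once in a while, so the IP address bound to cxwangyi.7766.org in the previous step may be assigned to someone else's machine after some time. We therefore need to notify 3322.org of our latest IP address from time to time. A clumsy way is to visit the settings page from the previous step periodically and update the IP address by hand. A smarter way is to download the 3322.org client program, which automatically reports our IP address to 3322.org while it is running. A third way is to use a standard tool to visit a URL that 3322.org reserves for this purpose; our IP address then travels to 3322.org along with the HTTP request and gets recorded. The 3322.org site suggests using lynx; the corresponding command line is: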

lynx -mime_header -auth=cxwangyi:123456 \
"http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"

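Here cxwangyi is the user name I registered at 3322.org, 123456 is the corresponding password, and cxwangyi.7766.org is the domain name we registered at 3322.org in the previous step. Replace all of these with your own.

Many systems (including Mac OS X) do not ship with lynx, but they do come with a simpler standard program called curl. The command line for reporting the IP address to 3322.org with curl is: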

curl -u cxwangyi:123456 \
"http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"

Using curl (or lynx), the following very simple Bash script reports the current IP address to 3322.org every 10000 seconds (a little under three hours):

while true; do \
sleep 10000; \
curl -u cxwangyi:123456 \
"http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"; \
done
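
For readers who prefer a script over a shell loop, the same update request can be sketched in Python with only the standard library. The function name is hypothetical, the URL and query parameters are copied from the commands above, and the code only builds the request; the line that would actually send it is left as a comment.

```python
import base64
import urllib.parse
import urllib.request

# Update URL as used in the lynx and curl commands above.
UPDATE_URL = "http://www.3322.org/dyndns/update"


def build_update_request(user, password, hostname):
    """Build the HTTP request that reports our current IP for hostname."""
    query = urllib.parse.urlencode({"system": "dyndns", "hostname": hostname})
    request = urllib.request.Request(UPDATE_URL + "?" + query)
    # HTTP basic authentication, the same scheme that `curl -u` uses.
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request.add_header("Authorization", "Basic " + token)
    return request


request = build_update_request("cxwangyi", "123456", "cxwangyi.7766.org")
print(request.full_url)

# To actually send the update, uncomment the following line:
# urllib.request.urlopen(request, timeout=30)
```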

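Creating Technical Content with Emacs Muse

With a web service in place, we still need content. There are countless tools for building web pages. I use Emacs Muse, an Emacs plugin that lets users write content (including figures and even complex math formulas) in a simple wiki syntax and publish the result as HTML (or PDF and other formats). For downloading and installing Emacs Muse, see the instructions on its home page. After installation, I added the following settings to my .emacs file: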

(add-to-list 'load-path "~/.emacs.d/lisp/muse")
(require 'muse-mode) ; load authoring mode
(require 'muse-html) ; load publishing styles I use
(require 'muse-latex)
(require 'muse-texinfo)
(require 'muse-docbook)
(require 'muse-latex2png) ; display LaTeX math equations
(setq muse-latex2png-scale-factor 1.4) ; the scaling of equation images.
(require 'muse-project) ; publish files in projects
(muse-derive-style
"technotes-html" "html"
:style-sheet "<link rel=\"stylesheet\" type=\"text/css\" media=\"all\" href=\"../css/wangyi.css\" />")
(setq muse-project-alist
'(("technotes" ("~/TechNotes" :default "index")
(:base "technotes-html" :path "~/Sites/TechNotes"))))

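Here, ~/.emacs.d/lisp/muse is the directory where Muse is installed, and ~/TechNotes is the directory that stores my technical documents. Each document is a text file with the .muse extension in this directory (for example, HowToSetup.muse). When I edit such a file in Emacs, pressing the key combination control-c control-p makes Emacs Muse publish the document as HTML into the ~/Sites/TechNotes directory (HowToSetup.html).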
