## Feb 27, 2010

### Could Latent Dirichlet Allocation Handle Documents with Various Lengths?

I heard that some of my colleagues, who are working on another latent topic model different from LDA, complain that LDA prefers documents of similar length. I agree with this, but I feel it can be fixed easily. Here is what I think.

The Gibbs sampling algorithm of LDA samples latent topic assignments as follows:
$z_i \sim P(w|z)P(z|d) \propto \frac{N(w,z) + \beta}{N(z) + V\beta} \frac{N(z,d) + \alpha}{L_d + K\alpha}$
where V is the vocabulary size, K is the number of topics, and L_d is the length of document d.

The second term depends on the document length. Consider an example document about two topics, A and B, in which half of the words are assigned to topic A and the other half to topic B. The P(z|d) distribution should then have two high bins (height proportional to L_d/2 + alpha) and short bins everywhere else (height proportional to alpha). So if the document has 1000 words, alpha has a trivial effect on the shape of P(z|d); but if the document contains only 2 words, alpha has a much larger effect on the shape of P(z|d).

An intuitive solution to the above problem is to use a small alpha for short documents (and vice versa). But would this break the mathematical assumptions of LDA? No, because it is equivalent to using different symmetric Dirichlet priors on documents of different lengths. This does not break the Dirichlet-multinomial conjugacy required by LDA's Gibbs sampling algorithm; it just expresses a little more prior knowledge than using one symmetric prior for all documents. Let us set
$\alpha_d = k L_d$
for each document. Users then need to specify the parameter k, just as they needed to specify alpha before.
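
To make the effect concrete, here is a minimal Python sketch. It assumes that P(z|d) is normalized over K topics, and the values of K, alpha, and k are made up purely for illustration. It compares a 1000-word and a 2-word document, each split evenly between two topics, first with a fixed alpha and then with alpha_d = k * L_d.

```python
def doc_topic_distribution(topic_counts, alpha):
    """Smoothed P(z|d) = (N(z,d) + alpha) / (L_d + K*alpha)."""
    length = sum(topic_counts)        # L_d
    num_topics = len(topic_counts)    # K
    denom = length + num_topics * alpha
    return [(n + alpha) / denom for n in topic_counts]


def length_scaled_alpha(length, k):
    """Per-document prior alpha_d = k * L_d."""
    return k * length


NUM_TOPICS = 10   # illustrative value, not from the post
ALPHA = 0.5       # illustrative value, not from the post
K_SCALE = 0.0005  # illustrative value for the user-specified k

# Both documents are split evenly between topics A and B.
long_doc = [500, 500] + [0] * (NUM_TOPICS - 2)
short_doc = [1, 1] + [0] * (NUM_TOPICS - 2)

# Fixed alpha: the prior flattens the short document far more than the long one.
p_long_fixed = doc_topic_distribution(long_doc, ALPHA)
p_short_fixed = doc_topic_distribution(short_doc, ALPHA)

# Length-scaled alpha: both documents get the same shape of P(z|d).
p_long_scaled = doc_topic_distribution(
    long_doc, length_scaled_alpha(sum(long_doc), K_SCALE))
p_short_scaled = doc_topic_distribution(
    short_doc, length_scaled_alpha(sum(short_doc), K_SCALE))

print("fixed alpha :", p_long_fixed[0], p_short_fixed[0])
print("scaled alpha:", p_long_scaled[0], p_short_scaled[0])
```

With alpha_d = k * L_d, the height of a high bin becomes (1/2 + k) / (1 + K*k), which no longer depends on the document length.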

## Feb 20, 2010

### Cavendish Experiment and Modern Machine Learning

This is just a joke, so do not take it seriously...

A very common case in modern machine learning is as follows:
1. Design a model to describe the data. For example, suppose some kind of 2D points are generated along a quadratic curve y = a*x^2 + b*x + c.
2. Design an algorithm that estimates the model parameters (in our case a, b, and c) given a set of data (observations) (x_1, y_1), (x_2, y_2), ..., (x_n, y_n).
3. Use the model parameters in some way, say, to predict the corresponding y for a new x.
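
The three steps above can be sketched in a few lines of Python. This is only an illustration: ordinary least squares is my choice of estimation algorithm here, and the sample points are made up.

```python
def fit_quadratic(points):
    """Least-squares estimate of (a, b, c) in y = a*x^2 + b*x + c."""
    # Normal equations (X^T X) theta = X^T y, where each row of X is [x^2, x, 1].
    s = [sum(x ** k for x, _ in points) for k in range(5)]      # sums of x^0 .. x^4
    t = [sum(y * x ** k for x, y in points) for k in range(3)]  # sums of y * x^0 .. x^2
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def predict(params, x):
    """Step 3: use the estimated parameters to predict y for a new x."""
    a, b, c = params
    return a * x * x + b * x + c


# Made-up observations generated from y = 2*x^2 - 3*x + 1.
observations = [(x, 2 * x * x - 3 * x + 1) for x in range(-3, 4)]
params = fit_quadratic(observations)
print(params, predict(params, 5))
```
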
So, when I was reading Prof. Feynman's lecture notes, which mention the Cavendish experiment, I thought this experiment was a kind of "learning using machines": using specially designed equipment (a machine), Cavendish measured the gravitational constant G in Newton's law of universal gravitation:
F = G * m1 * m2 / r^2
And, using the estimated model parameter G, we can do some interesting things. For example, we can measure the mass of the earth: measure the weight (gravitational force) F of a small ball of known mass m1, and put these back into the equation, with r the radius of the earth, to get m2, the mass of the earth.
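
As a worked example of that arithmetic, here is a small Python sketch. The numbers are standard textbook values that I am plugging in for illustration (they are not from Feynman's notes), and the 1 kg ball is hypothetical.

```python
# Rearranging F = G * m1 * m2 / r^2 gives m2 = F * r^2 / (G * m1).
G = 6.674e-11   # gravitational constant, N*m^2/kg^2 (textbook value)
R_EARTH = 6.371e6  # mean radius of the earth in meters (textbook value)
g = 9.81        # gravitational acceleration at the surface, m/s^2

m1 = 1.0        # mass of the hypothetical small ball, kg
F = m1 * g      # its measured weight, N

m2 = F * R_EARTH ** 2 / (G * m1)
print(f"estimated mass of the earth: {m2:.3e} kg")
```

The result is close to 6e24 kg, and it does not depend on the choice of m1, since m1 cancels out.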

However, as I said, this is a joke, so you cannot use it in your lecture notes on machine learning. Contrary to what many textbooks state, Cavendish did not measure G. Instead, he measured the earth directly by comparing (1) the force with which a big ball of known mass attracts a small ball with (2) the force with which the earth attracts the small ball. If the ratio (2)/(1) is N, then the earth is N times as heavy as the big ball.

## Feb 14, 2010

### Highlights in LaTeX

To highlight part of the text in LaTeX, use the following two packages:
\usepackage{color}
\usepackage{soul}
Then, in the text, use the \hl macro:
The authors use \hl{$x=100$ in their demonstration}.
Note that if you use only soul without color, \hl merely underlines the text instead of highlighting it.

## Feb 4, 2010

### Google Puts New Focus on Outside Research

It was recently reported that Google is stepping up its funding to support research in the following four areas:
• machine learning
• the use of cellphones as data collection devices in science
• energy efficiency
• privacy
Among these four areas, machine learning is at the top. "Three years ago, three of the four research areas would not have been on the company’s priority list", Mr. Spector said. "The only one that was a priority then and now is machine learning, a vital ingredient in search technology."

## Feb 3, 2010

### Reduce Data Correlation by Recentering and Rescaling

In the MATLAB statistics demo document, the training data (a set of car weights) are recentered and rescaled as follows:
% A set of car weights
weight = [2100 2300 2500 2700 2900 3100 3300 3500 3700 3900 4100 4300]';
weight = (weight-2800)/1000;     % recenter and rescale
And the document explains the reason for recentering and rescaling as follows:
The data include observations of weight, number of cars tested, and number failed. We will work with a transformed version of the weights to reduce the correlation in our estimates of the regression parameters.
Could anyone tell me why recentering and rescaling can reduce the correlation?
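
One possible way to look at it, as a hedged sketch rather than an answer from the MATLAB document: assume for simplicity an ordinary least-squares line y = b0 + b1*x (the demo's own model may well differ). The covariance of the estimates is proportional to the inverse of X^T X, so the correlation between the intercept and slope estimates is -sum(x) / sqrt(n * sum(x^2)), which depends only on how far the x values sit from zero relative to their spread. The Python below evaluates this for the raw and the transformed weights.

```python
import math


def intercept_slope_correlation(xs):
    """Correlation between the OLS estimates of b0 and b1 in y = b0 + b1*x.

    Derived from the off-diagonal entry of the inverse of X^T X,
    where each row of X is [1, x].
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return -sum_x / math.sqrt(n * sum_x2)


# The car weights from the demo, before and after the transformation.
weight = list(range(2100, 4301, 200))
transformed = [(w - 2800) / 1000 for w in weight]

raw_corr = intercept_slope_correlation(weight)
new_corr = intercept_slope_correlation(transformed)
print(f"raw weights:         {raw_corr:.3f}")
print(f"transformed weights: {new_corr:.3f}")
```

Under this assumption the correlation drops from roughly -0.98 to roughly -0.50. The recentering is what matters here; subtracting the exact mean (3200) would bring it to zero, while rescaling alone leaves it unchanged.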

## Feb 2, 2010

### Using aMule with VeryCD

http://hi.baidu.com/linsir/blog/item/c4b54839805a9af73a87cea2.html

### Making Videos Playable on Android and iPhone

You might want to convert your home-made video (no pirated video :-!) into a format that your Android phone can play. The video formats that Android supports are listed on the Android developers' site.

Among the listed formats, H.264 (part of the MPEG-4 standard) has been well accepted by the industry. Companies including Apple have switched to it. In the following, I will show how to use open source software on a Linux box to convert your video into an .mp4 file with H.264/AVC video and AAC audio. I took the following post as a reference, but with updates.

First of all, you need to install the following software packages:
1. mplayer: a multimedia player
2. mencoder: MPlayer's movie encoder
3. faac: an AAC audio encoder
4. gpac: a small and flexible implementation of the MPEG-4 system standard
5. x264: a video encoder for the H.264/MPEG-4 AVC standard
Then we do the following steps to convert video.avi into a .mp4 file in H.264 format.
1. Extract the audio information from video.avi using MPlayer:
mplayer -ao pcm -vc null -vo null video.avi
This will generate an audiodump.wav file.

2. Encode audiodump.wav into AAC format
faac --mpeg-vers 4 audiodump.wav
This generates an audiodump.aac file.

3. Use mencoder to convert the video content of video.avi into YUV 4:2:0 format, and use x264 to encode the output into AVC format:
mkfifo tmp.fifo.yuv
mencoder -vf scale=800:450,format=i420 \
  -nosound -ovc raw -of rawvideo \
  -ofps 23.976 -o tmp.fifo.yuv video.avi > /dev/null 2>&1 &
x264 -o max-video.mp4 --fps 23.976 --bframes 2 \
  --progress --crf 26 --subme 6 \
  --analyse p8x8,b8x8,i4x4,p4x4 \
  --no-psnr tmp.fifo.yuv 800x450
rm tmp.fifo.yuv
We created a named pipe to act as a buffer between mencoder and x264; mencoder runs in the background and writes into the pipe while x264 reads from it. These command lines generate content that is both QuickTime-compatible and H.264-compatible, because Apple QuickTime can now hold H.264 content. Be sure to specify the same video size to both mencoder and x264. In the above example, the size is 800x450.

4. Merge the AAC audio and AVC video into a .mp4 file using gpac:
MP4Box -add max-video.mp4 -add audiodump.aac \
  -fps 23.976 max-x264.mp4
MP4Box is a tool in the gpac package.
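
Since the video size and frame rate must agree across several commands, it can help to generate the command lines from one set of parameters. The following is a minimal Python sketch of that idea; the function name `build_pipeline` and its parameters are my own invention, and it only prints the commands from the steps above rather than running them.

```python
import shlex


def build_pipeline(src, width, height, fps="23.976", fifo="tmp.fifo.yuv"):
    """Return the shell command lines for the four steps, in order.

    The flags are the ones used in this post; only the input file,
    the size, and the frame rate are parameterized.
    """
    size = f"{width}x{height}"
    mencoder = shlex.join([
        "mencoder", "-vf", f"scale={width}:{height},format=i420",
        "-nosound", "-ovc", "raw", "-of", "rawvideo",
        "-ofps", fps, "-o", fifo, src,
    ])
    commands = [
        ["mplayer", "-ao", "pcm", "-vc", "null", "-vo", "null", src],
        ["faac", "--mpeg-vers", "4", "audiodump.wav"],
        ["mkfifo", fifo],
        None,  # placeholder for mencoder, which must run in the background
        ["x264", "-o", "max-video.mp4", "--fps", fps, "--bframes", "2",
         "--progress", "--crf", "26", "--subme", "6",
         "--analyse", "p8x8,b8x8,i4x4,p4x4", "--no-psnr", fifo, size],
        ["rm", fifo],
        ["MP4Box", "-add", "max-video.mp4", "-add", "audiodump.aac",
         "-fps", fps, "max-x264.mp4"],
    ]
    lines = []
    for cmd in commands:
        if cmd is None:
            # mencoder writes into the named pipe while x264 reads from it.
            lines.append(mencoder + " > /dev/null 2>&1 &")
        else:
            lines.append(shlex.join(cmd))
    return lines


for line in build_pipeline("video.avi", 800, 450):
    print(line)
```

The printed lines can be pasted into a shell or saved as a script; changing the size in one place keeps mencoder and x264 consistent.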

## Starting Apache on Mac OS X

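Mac OS X ships with Apache, and starting it is easy. As the screenshot below shows: open System Preferences, select Sharing under the Internet & Wireless category, and then check Web Sharing. Apache is now running.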

## Setting Up a Domain Name

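1. On the home page of www.3322.org, register a free account. My username is cxwangyi.

2. Go to the "My Console" page and, under the "Dynamic DNS" section, click "New". Then choose a domain suffix. 3322.org offers several choices; I chose 7766.org. My domain name is my 3322.org username plus the suffix I chose, that is, cxwangyi.7766.org. This configuration page automatically detects our external IP address, so we do not need to enter it by hand. The default values are fine for all the other options. A screenshot follows: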

## Updating the External IP Address

lynx -mime_header -auth=cxwangyi:123456 \
  "http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"

curl -u cxwangyi:123456 \
  "http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"

while true; do
  sleep 10000
  curl -u cxwangyi:123456 \
    "http://www.3322.org/dyndns/update?system=dyndns&hostname=myhost.3322.org"
done
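
The same periodic update can also be sketched in Python using only the standard library. This is an illustration under assumptions: the endpoint and query parameters are copied from the commands above, HTTP basic authentication is assumed to be what `curl -u` sends, and the function names are hypothetical. The loop is defined but not started.

```python
import base64
import time
import urllib.parse
import urllib.request

# Endpoint taken from the commands above.
UPDATE_ENDPOINT = "http://www.3322.org/dyndns/update"


def build_update_url(hostname):
    """Build the update URL with the same query parameters as the curl call."""
    query = urllib.parse.urlencode({"system": "dyndns", "hostname": hostname})
    return f"{UPDATE_ENDPOINT}?{query}"


def build_request(hostname, user, password):
    """Create a request carrying HTTP basic authentication, like `curl -u`."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        build_update_url(hostname),
        headers={"Authorization": f"Basic {token}"},
    )


def update_forever(hostname, user, password, interval_seconds=10000):
    """Sleep, then send an update, forever (mirrors the shell loop)."""
    while True:
        time.sleep(interval_seconds)
        with urllib.request.urlopen(build_request(hostname, user, password)) as resp:
            print(resp.read().decode())


print(build_update_url("cxwangyi.7766.org"))
```

Calling `update_forever("cxwangyi.7766.org", "cxwangyi", "123456")` would start the loop.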

## Creating Technical Content with Emacs Muse

(require 'muse-mode) ; load authoring mode
(require 'muse-html) ; load publishing styles I use
(require 'muse-latex)
(require 'muse-texinfo)
(require 'muse-docbook)
(require 'muse-latex2png) ; display LaTeX math equations
(setq muse-latex2png-scale-factor 1.4) ; the scaling of equation images.
(require 'muse-project) ; publish files in projects
(muse-derive-style
 "technotes-html" "html"
 :style-sheet "<link rel=\"stylesheet\" type=\"text/css\" media=\"all\" href=\"../css/wangyi.css\" />")
(setq muse-project-alist
      '(("technotes" ("~/TechNotes" :default "index")
         (:base "technotes-html" :path "~/Sites/TechNotes"))))