## Aug 17, 2010

### Parallel LDA Gibbs Sampling Using OpenMP and MPI

I created an open source project, ompi-lda, on Google Code. This project is inspired by another LDA project, which I initiated at Google, and a recent parallel programming effort using OpenMP by xlvector.

## Jun 28, 2010

### I Moved My Blog to Wordpress

Since I find it convenient to insert code and LaTeX math into posts using Wordpress, I have moved this blog to

I may no longer update this blog.

## Jun 21, 2010

### Google Protocol Buffers in Practice: Parsing .proto Files and Arbitrary Data Files

Google Protocol Buffers are a very convenient and efficient data serialization mechanism, used in almost every product at Google. This post introduces an advanced usage of protocol buffers that is not covered on the Google Protocol Buffers homepage.

Protocol Buffers are normally used as follows: our program depends on a limited set of protocol messages. We define these messages in one or more .proto files, and then use the compiler protoc to translate the .proto files into C++ (.h and .cc files). This translation turns each message into a C++ class, which our program can then use.
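For concreteness, here is a hypothetical sketch of such a .proto file; the message below mirrors the KeyValuePair message used later in this post, but the exact file contents are my assumption, not the original source:

```proto
// mpimr.proto (hypothetical reconstruction for illustration)
message KeyValuePair {
  required string key = 1;
  required string value = 2;
}
```

Running protoc --cpp_out=. mpimr.proto then generates mpimr.pb.h and mpimr.pb.cc, in which KeyValuePair becomes a C++ class.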

1. Implement the basic functionality of codex (for example, reading data files and printing their contents) in a .cc file (say, codex-base.cc).
2. For a given .proto file, invoke protoc to generate the corresponding .pb.h and .pb.cc files.
3. Compile codex-base.cc together with protoc's output to produce a codex program specialized for printing the contents of that particular protocol message.

```cpp
#include <google/protobuf/descriptor.h>
#include <google/protobuf/dynamic_message.h>
#include <google/protobuf/io/zero_copy_stream_impl.h>
#include <google/protobuf/io/tokenizer.h>
#include <google/protobuf/compiler/parser.h>

//-----------------------------------------------------------------------------
// Parses the given .proto file for the Descriptor of the given message
// (by name).  The returned message descriptor can be used with a
// DynamicMessageFactory in order to create prototype and mutable
// messages.  For example:
/*
  DynamicMessageFactory factory;
  const Message* prototype_msg = factory.GetPrototype(message_descriptor);
  Message* mutable_msg = prototype_msg->New();
*/
//-----------------------------------------------------------------------------
void GetMessageTypeFromProtoFile(const string& proto_filename,
                                 FileDescriptorProto* file_desc_proto) {
  using namespace google::protobuf;
  using namespace google::protobuf::io;
  using namespace google::protobuf::compiler;

  FILE* proto_file = fopen(proto_filename.c_str(), "r");
  if (proto_file == NULL) {
    LOG(FATAL) << "Cannot open .proto file: " << proto_filename;
  }
  {
    FileInputStream proto_input_stream(fileno(proto_file));
    Tokenizer tokenizer(&proto_input_stream, NULL);
    Parser parser;
    if (!parser.Parse(&tokenizer, file_desc_proto)) {
      LOG(FATAL) << "Cannot parse .proto file: " << proto_filename;
    }
  }
  fclose(proto_file);

  // Here we work around a bug in protocol buffers: |Parser::Parse|
  // does not set the name (.proto filename) in file_desc_proto.
  if (!file_desc_proto->has_name()) {
    file_desc_proto->set_name(proto_filename);
  }
}
```

```cpp
//-----------------------------------------------------------------------------
// Prints the contents of a record file with the following format:
//
//   { <int record_size> <KeyValuePair> }
//
// where KeyValuePair is a proto message defined in mpimr.proto, and
// consists of two string fields: key and value.  The key is printed
// as a text string, and the value is parsed into a proto message
// described by |file_desc_proto|.
//-----------------------------------------------------------------------------
void PrintDataFile(const string& data_filename,
                   const FileDescriptorProto& file_desc_proto,
                   const string& message_name) {
  const int kMaxReceiveBufferSize = 32 * 1024 * 1024;  // 32MB
  static char buffer[kMaxReceiveBufferSize];

  ifstream input_stream(data_filename.c_str());
  if (!input_stream.is_open()) {
    LOG(FATAL) << "Cannot open data file: " << data_filename;
  }

  google::protobuf::DescriptorPool pool;
  const google::protobuf::FileDescriptor* file_desc =
      pool.BuildFile(file_desc_proto);
  if (file_desc == NULL) {
    LOG(FATAL) << "Cannot build file descriptor from file descriptor proto: "
               << file_desc_proto.DebugString();
  }

  const google::protobuf::Descriptor* message_desc =
      file_desc->FindMessageTypeByName(message_name);
  if (message_desc == NULL) {
    LOG(FATAL) << "Cannot get message descriptor of message: " << message_name;
  }

  google::protobuf::DynamicMessageFactory factory;
  const google::protobuf::Message* prototype_msg =
      factory.GetPrototype(message_desc);
  if (prototype_msg == NULL) {
    LOG(FATAL) << "Cannot create prototype message from message descriptor";
  }
  google::protobuf::Message* mutable_msg = prototype_msg->New();
  if (mutable_msg == NULL) {
    LOG(FATAL) << "Failed in prototype_msg->New() to create mutable message";
  }

  uint32 proto_msg_size;  // uint32 is the type used in record files.
  for (;;) {
    input_stream.read((char*)&proto_msg_size, sizeof(proto_msg_size));
    if (!input_stream)
      break;
    if (proto_msg_size > kMaxReceiveBufferSize) {
      LOG(FATAL) << "Failed to read a proto message with size = "
                 << proto_msg_size
                 << ", which is larger than kMaxReceiveBufferSize ("
                 << kMaxReceiveBufferSize << ").  "
                 << "You can modify kMaxReceiveBufferSize defined in "
                 << __FILE__;
    }
    input_stream.read(buffer, proto_msg_size);
    if (!input_stream)
      break;
    if (!mutable_msg->ParseFromArray(buffer, proto_msg_size)) {
      LOG(FATAL) << "Failed to parse a proto message of size "
                 << proto_msg_size;
    }
    cout << mutable_msg->DebugString();
  }

  delete mutable_msg;
}
```

```cpp
int main(int argc, char** argv) {
  string proto_filename, message_name;
  vector<string> data_filenames;
  FileDescriptorProto file_desc_proto;

  ParseCmdLine(argc, argv, &proto_filename, &message_name, &data_filenames);
  GetMessageTypeFromProtoFile(proto_filename, &file_desc_proto);

  for (int i = 0; i < data_filenames.size(); ++i) {
    PrintDataFile(data_filenames[i], file_desc_proto, message_name);
  }
  return 0;
}
```

## Jun 10, 2010

### Intel’s Single-chip Cluster Computer

I just read a post on Intel's single-chip cluster computer. It is a unique design: multiple cores on one chip, where each core has its own private, fast memory (its cache), and main memory plays a role more like NFS. Programmers write programs for this chip using MPI.

So this chip is effectively a cluster with a super-fast network connection (64GB/s) and very limited per-node memory (because it is implemented as cache).

## Jun 5, 2010

### How to Install Dropbox in Ubuntu Using SOCKS Proxy

Living in China behind the G.F.W., I have to access the Internet through an SSH tunnel. However, the Dropbox installation procedure supports neither SOCKS5 nor HTTP proxies. (Yes, the documentation says it works with the http_proxy environment variable, but it seems not to.) Fortunately, two great tools, proxychains and tsocks, can launch any application and route its network traffic through a pre-configured proxy. With their help, I can install Dropbox on a brand new Ubuntu machine behind the G.F.W. Here are the steps:
1. Install tsocks (or proxychains) under Ubuntu using synaptic. Thanks to Ubuntu, who made these tools standard packages.
2. Get an SSH tunnel (in any way you like). I paid for an account, so I can set up a proxy tunnel with the following command line on my computer:
ssh -D 7070 my-username@my-service-provider-url
3. Configure your Web browser to use the tunnel. In Firefox, select the SOCKS5 proxy: localhost:7070. This lets you reach Dropbox's homepage and download the Ubuntu package.
4. Install the package by clicking it in Nautilus. To check the installation, type the following command in a shell:
dropbox start -i
If you see error messages complaining about restricted network access, the installation worked.
5. Add the following lines to /etc/tsocks.conf:
```
server = 127.0.0.1
server_type = 5
server_port = 7070
```
If you are using proxychains, you need to modify /etc/proxychains.conf or make your own ~/.proxychains.conf.
6. This time, use tsocks to launch the Dropbox online install procedure:
tsocks dropbox start -i
Cross your fingers and wait while it downloads and installs Dropbox, until the Dropbox icon appears in the top-right corner of your screen.
7. Right-click the Dropbox icon, select "Preferences", and set the SOCKS5 proxy as you did for Firefox. Hopefully, Dropbox now starts to sync your files.
Good luck!

## Jun 4, 2010

### Building Distributed Programs Using GCC

In distributed computing, a program is built (into a binary) and distributed to multiple computers to run. Often we do not want to install the libraries our program depends on on every worker machine. Instead, we can instruct GCC to link static libraries by setting the environment variable:
LDFLAGS=-static -static-libgcc

I have tried this method under Cygwin and Ubuntu Linux. However, under Darwin (Mac OS X 10.6 Snow Leopard), the linker complains that
ld: library not found for -lcrt0.o
In this Technical Q&A, Apple explains that to make Mac OS X upgrades easier, they do not provide crt0.o, thereby encouraging dynamic linking.

## Jun 3, 2010

### Uint64 Constants

If we write uint64 a = 0xffffffffffffffff;, the compiler often complains:
error: integer constant is too large for ‘long’ type
What we need is to add the LLU suffix: uint64 a = 0xffffffffffffffffLLU;

## Jun 2, 2010

### Launch a Series of Hadoop Pipes Task

It is OK to call runTask multiple times from within main() in a Hadoop Pipes program to launch a series of MapReduce tasks. One thing to remember, though: the tasks must not have conflicting output directories, because the Hadoop runtime refuses to start a MapReduce task whose output directory already exists.

### Incremental Parsing Using Boost Program Options Library

I have to say I fell in love with boost::program_options the first time I used it while developing my own C++ MapReduce implementation. It can parse command line parameters as well as configuration files, which is convenient for programs that support and expect a bunch of options; a MapReduce program is a typical example.

A special use case of a command line parser is that one function needs to parse some options out of the command line parameters, and the remaining parameters are passed to another function, which parses other options. For example, the MapReduce runtime needs options like "num_map_workers" and "num_reduce_workers", while the rest of the program (user-customized map and reduce functions) needs to parse application-specific options like "topic_dirichlet_prior" and "num_lda_topics". boost::program_options supports this kind of multi-round parsing; the key is boost::program_options::allow_unregistered(). Here is a sample program. (For more explanation, please refer to the official documentation of boost::program_options.)
```cpp
#include <iostream>
#include <string>
#include <vector>

#include <boost/program_options/option.hpp>
#include <boost/program_options/options_description.hpp>
#include <boost/program_options/variables_map.hpp>
#include <boost/program_options/parsers.hpp>

using namespace std;
namespace po = boost::program_options;

int g_num_map_workers;
int g_num_reduce_workers;

vector<string> foo(int argc, char** argv) {
  po::options_description desc("Supported options");
  desc.add_options()
      ("num_map_workers", po::value<int>(&g_num_map_workers), "# map workers")
      ("num_reduce_workers", po::value<int>(&g_num_reduce_workers), "# reduce workers")
      ;
  po::variables_map vm;
  po::parsed_options parsed =
      po::command_line_parser(argc, argv).options(desc).allow_unregistered().run();
  po::store(parsed, vm);
  po::notify(vm);
  cout << "The following options were parsed by foo:\n";
  if (vm.count("num_map_workers")) {
    cout << "num_map_workers = " << g_num_map_workers << "\n";
  }
  if (vm.count("num_reduce_workers")) {
    cout << "num_reduce_workers = " << g_num_reduce_workers << "\n";
  }
  return po::collect_unrecognized(parsed.options, po::include_positional);
}

void bar(vector<string>& rest_args) {
  po::options_description desc("Supported options");
  desc.add_options()
      ("apple", po::value<int>(), "# apples")
      ;
  po::variables_map vm;
  po::parsed_options parsed =
      po::command_line_parser(rest_args).options(desc).allow_unregistered().run();
  po::store(parsed, vm);
  po::notify(vm);
  cout << "The following options were parsed by bar:\n";
  if (vm.count("apple")) {
    cout << "apple = " << vm["apple"].as<int>() << "\n";
  }
}

int main(int argc, char** argv) {
  vector<string> rest_options = foo(argc, argv);
  cout << "The following cmd args cannot be recognized by foo:\n";
  for (int i = 0; i < rest_options.size(); ++i) {
    cout << rest_options[i] << "\n";
  }
  bar(rest_options);
}
```

Finally, I have to mention that early boost versions (e.g., 1.33.1 as packed in Cygwin) have bugs in program_options that lead to core dumps on unknown options. The solution is to download and build your own boost libraries. I just built 1.43.0 on Cygwin on my Windows computer.

## May 23, 2010

### The Efficiency of AWK Associative Array

I did a little experiment comparing the C++ STL map with the AWK associative array for counting word frequencies in large text files. The result is astonishing: the AWK associative array is about 6 times faster than the C++ code!

At first, I suspected the C++ code that splits a line into words. I tried STL istringstream, C strtok_r, and a string-splitting trick I learned at Google. However, these choices do not affect the efficiency noticeably.

Then I recalled the lesson I learned from parallel LDA (a machine learning method that has been a key part of my research for over three years): map with string keys is about 10 times slower than map with integer keys. This issue was thoroughly explained by Lev Kochubeevsky, principal engineer at Netscape, in 2003; unfortunately, it seems no improvement to the STL string has emerged since then.

On the other hand, I strongly suspect that AWK, though an interpreted language, implements a trie-based data structure for maps with string keys.

### SSHFS, Poor Guys' Network File-system

SSHFS is good: as long as you have SSH access to a machine, you can mount its directories, even without the administrative access required by traditional network filesystems like NFS or Samba.

## May 22, 2010

### Install and Configure MPICH2 on Ubuntu

The general procedure is described in
http://developer.amd.com/documentation/articles/pages/HPCHighPerformanceLinpack.aspx#four

I have encountered two cases where MPD daemons on multiple nodes cannot communicate with each other:
1. The firewall prevents such communication, and
2. There is ambiguity in /etc/hosts.
For 1., we can disable iptables (the firewall commonly used in various Linux distributions) and ufw (the Ubuntu firewall) with the following commands:
sudo iptables -P INPUT ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo ufw disable

For 2., I just re-edit /etc/hosts to ensure all nodes are referred to by their real IP addresses instead of loopback-style addresses.
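As an illustration, the problematic case on Ubuntu often looks like the first fragment below, since the installer maps the machine's own hostname to 127.0.1.1 by default; the node names and LAN addresses here are hypothetical:

```
# Problematic: the node's own hostname resolves to a loopback address,
# so MPD daemons on other nodes cannot reach it.
127.0.0.1       localhost
127.0.1.1       node1

# Fixed: every node is listed with its real LAN address.
127.0.0.1       localhost
192.168.0.101   node1
192.168.0.102   node2
```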

## May 18, 2010

### Hadoop Pipes Is Incompatible with Protocol Buffers

I just found another reason to dislike Hadoop Pipes --- I cannot use a serialized Google protocol buffer message as a map output key or value.

For those who are scratching their heads over weird bugs in Hadoop Pipes programs that use Google protocol buffers, please have a look at the following sample program:
```cpp
#include <string>

#include <hadoop/Pipes.hh>
#include <hadoop/TemplateFactory.hh>
#include <hadoop/StringUtils.hh>

using namespace std;

class LearnMapOutputMapper : public HadoopPipes::Mapper {
 public:
  LearnMapOutputMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    context.emit("", "apple\norange\0banana\tpapaya");
  }
};

class LearnMapOutputReducer : public HadoopPipes::Reducer {
 public:
  LearnMapOutputReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue()) {
      string value = context.getInputValue();  // Copy content
      context.emit(context.getInputKey(), HadoopUtils::toString(value.size()));
    }
  }
};

int main(int argc, char* argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<LearnMapOutputMapper,
                                   LearnMapOutputReducer>());
}
```

The reducer outputs the size of each map output value, which contains special characters: newline, null, and tab. If Hadoop Pipes allowed such special characters, we would see the reducer output 26, the length of the string
"apple\norange\0banana\tpapaya".

Unfortunately, we see 12 in the output, which is the length of the string
"apple\norange"

This shows that map outputs in Hadoop Pipes cannot contain the null character, which, however, may well appear in a serialized protocol buffer message, as explained in the protocol buffers encoding scheme at:
http://code.google.com/apis/protocolbuffers/docs/encoding.html

I hate Hadoop Pipes, a totally incomplete but released MapReduce API.

## May 16, 2010

### MPI-based MapReduce Implementation

http://www.sandia.gov/~sjplimp/mapreduce/doc/Manual.html

Many advantages of MapReduce have been emphasized in papers and forums --- it can process massive amounts of data and supports automatic fault recovery. But I think the most important point is that the MapReduce API is simple: it lets a programmer get a parallel program done by defining just a map function and a reduce function, with none of the distributed IO, communication, task synchronization, load balancing, or fault recovery left for the user to worry about.

BTW, speaking of this, the API of Sphere/Sector (http://sector.sourceforge.net/doc.html) is not a MapReduce API either. Here is a demo program pasted from the Sphere/Sector tutorial slides:
```cpp
Sector::init();
Sector::login(...);
SphereStream input;
SphereStream output;
SphereProcess myProc;
myProc.loadOperator("func.so");
myProc.run(input, output, func, 0);
myProc.read(result);
myProc.close();
Sector::logout();
Sector::close();
```

## Apr 19, 2010

### Running Hadoop on Mac OS X (Single Node)

I installed Hadoop, built its C++ components, and built and ran Pipes programs on my iMac running Snow Leopard.

Installation and Configuration

Basically, I followed Michael G. Noll's guide, Running Hadoop On Ubuntu Linux (Single-Node Cluster), with two things different from the guide.

In Mac OS X, we need to choose Sun's JVM, which can be done in System Preferences. Then, in both .bash_profile and $HADOOP_HOME/conf/hadoop-env.sh, set the JAVA_HOME environment variable:

export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

I did not create a special account for running Hadoop. (I should, for security reasons, but I am lazy and my iMac is only for personal development, not real computing...) So I need to chmod a+rwx /tmp/hadoop-yiwang, where yiwang is my account name, as well as what ${user.name} refers to in core-site.xml.

After finishing installation and configuration, we should be able to start all Hadoop services, build and run Hadoop Java programs, and monitor their activities.

Building C++ Components

Because I do not do Java, I write Hadoop programs using Pipes. The following steps build the Pipes C++ library on Mac OS X:
1. Install XCode and open a terminal window
2. cd $HADOOP_HOME/src/c++/utils
3. ./configure
4. make install
5. cd $HADOOP_HOME/src/c++/pipes
6. ./configure
7. make install
Note that you must build utils before pipes.

Build and Run Pipes Programs

The following command shows how to link to Pipes libraries:
```
g++ -o wordcount wordcount.cc \
  -I${HADOOP_HOME}/src/c++/install/include \
  -L${HADOOP_HOME}/src/c++/install/lib \
  -lhadooputils -lhadooppipes -lpthread
```
To run the program, we need a configuration file, as shown on the Apache Hadoop Wiki page.

Build libHDFS

There are some bugs in libHDFS of Apache Hadoop 0.20.2, but they are easy to fix:

```
cd hadoop-0.20.2/src/c++/libhdfs
./configure
# Remove '#include "error.h"' from hdfsJniHelper.c
# Remove '-Dsize_t=unsigned int' from the Makefile
make
cp hdfs.h ../install/include/hadoop
cp libhdfs.so ../install/lib
```
Since Mac OS X uses DYLD to manage shared libraries, you need to specify the directory holding libhdfs.so in the environment variable DYLD_LIBRARY_PATH. (LD_LIBRARY_PATH does not work.):
export DYLD_LIBRARY_PATH=$HADOOP_HOME/src/c++/install/lib:$DYLD_LIBRARY_PATH
You might want to add the above line to your shell configuration file (e.g., ~/.bash_profile).

## Apr 11, 2010

### Get Through GFW on Mac OS X Using IPv6

In my previous post, I explained how to get through the GFW on Mac OS X using Tor. Unfortunately, it seems that Tor has been blocked by the GFW in recent months. However, some blog posts and mailing lists claim that the GFW cannot yet filter IPv6 packets, so I resorted to the IPv6 tunneling protocol Teredo. A well-known implementation of Teredo on Linux and BSD is Miredo. Thanks to darco, who recently ported Miredo to Mac OS X, in particular 10.4, 10.5, and 10.6 with 32-bit kernels. You can drop by darco's Miredo for Mac OS X page or just download the universal installer directly. After downloading, click to install, and IPv6 tunneling over IPv4 is set up on your Mac.

Before you can use IPv6 to get through the GFW, you need to know the IPv6 addresses of the sites you want to visit. You must add these addresses to your /etc/hosts file, so the Web browser does not need to resolve them via IPv4 (which is monitored by the GFW). This Google Doc contains IPv6 addresses for most Google services (including Youtube).

## Mar 27, 2010

### Customizing Mac OS X 10.6 For a Linux User

I have been a Linux user for years and switched to Mac OS X 10.6 Snow Leopard in recent months. Here are the things I have done to make Snow Leopard suit my work habits.
• Emacs.
I prefer Aquamacs version 20.1preview5 to the stable version 19.x as of this writing. Aquamacs comes with many useful Emacs plugins, including AUCTeX for LaTeX editing.
• PdfTk.
Under Unix, Emacs/AUCTeX invokes pdf2dsc (a component of the pdftk package) to do inline preview in PDFLaTeX mode. Under Mac OS X, thanks to Frédéric Wenzel, who created a DMG of PdfTk for us.
• LaTeX/PDF Preview.
There is a free PDF viewer for Mac OS X, Skim, which works like the ActiveDVI viewer under Linux but displays PDF files instead of DVI. Whenever you edit your LaTeX source and recompile, Skim automatically refreshes what it displays.
• Terminal.
Like many others, I use iTerm. To support bash shortcut keys like Alt-f/b/d, you need to customize iTerm as suggested by many Google search results. In particular, remember to select "High interception priority" when doing this customization under Snow Leopard.

### Chrome's Security Mechanism

Chrome starts two kinds of processes, targets and brokers:
1. Target processes do the work that attackers can most easily influence for harm, mainly (1) interpreting JavaScript programs and (2) HTML rendering. A target process is created by the broker process, which strips away the target's access privileges at creation time. So although a target process can make system calls like any user-mode process, the OS kernel treats it as having no privileges and refuses its requests. [Note: in a modern operating system, a user process accesses any system resource by asking the OS kernel to do it on its behalf.] As a result, a target can really only get things done by asking the broker process via inter-process calls.

2. The broker process plays the role of the OS kernel: because the code the broker runs was written by the browser's authors and is hard for attackers to inject bad code into, we can rely on it to check whether the requests from target processes are legitimate, and to refuse them if they are not.

## Mar 26, 2010

### Data-Intensive Text Processing with MapReduce

A book draft, Data-Intensive Text Processing with MapReduce, on parallel text algorithms with MapReduce can be found here. The book has chapters covering graph algorithms (breadth-first traversal and PageRank) and learning HMMs with EM. The authors do a great job of presenting concepts with figures, which are comprehensive and intuitive.

Indeed, there is much other interesting material one could put into a book on MapReducing text processing algorithms: for example, parallel latent topic models like latent Dirichlet allocation, and tree pruning/learning algorithms for various purposes.

### Stochastic Gradient Tree Boosting

The basic idea of boosting as functional gradient descent, with stages/steps as trees, known as gradient boosting, is presented in a Stanford paper:
• Jerome Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics. 2001
The same author wrote a note extending gradient boosting to its stochastic version, stochastic gradient boosting:
• Jerome Friedman. Stochastic Gradient Boosting. 1999.
(Stochastic) gradient boosting uses regression/classification trees as base learners and needs to learn trees during training. If you are interested in distributed learning of trees using MapReduce, you might want to refer to a recent Google paper:
• PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. VLDB 2009
A recent Yahoo paper shows how to implement stochastic boosted decision trees using MPI and Hadoop:
• Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009

## Mar 24, 2010

### Fwd: Ten Commands Every Linux Developer Should Know

I like this article in Linux Journal, which reveals some very useful Linux commands that I had never used in my years of experience with Unix systems.

## Feb 27, 2010

### Could Latent Dirichlet Allocation Handle Documents of Various Lengths?

I heard some of my colleagues, who work on another latent topic model different from LDA, complain that LDA favors documents with similar lengths. I agree, but I think this can be fixed easily. Here is what I think.

The Gibbs sampling algorithm of LDA samples latent topic assignments as follows:
$z_i \sim P(w|z)P(z|d) \propto \frac{N(w,z) + \beta}{N(z) + V\beta} \cdot \frac{N(z,d) + \alpha}{L_d + K\alpha}$
where V is the vocabulary size, K is the number of topics, and L_d is the length of document d.

The second term depends on the document length. Consider an example document about two topics, A and B, with half of its words assigned topic A and the other half topic B. The P(z|d) distribution should then have two high bins (with height proportional to L_d/2 + alpha) and short bins everywhere else (with height proportional to alpha). So if the document has 1000 words, alpha has a trivial effect on the shape of P(z|d); but if the document contains only 2 words, alpha has a much larger effect on the shape of P(z|d).

An intuitive solution to the above problem is to use a small alpha for short documents (and vice versa). Would this break the mathematical assumptions behind LDA? No, because it is equivalent to using a different symmetric Dirichlet prior for documents of different lengths. This does not break the Dirichlet-multinomial conjugacy required by LDA's Gibbs sampling algorithm; it just expresses a little more prior knowledge than a single symmetric prior for all documents. Let us set
$\alpha_d = k L_d$
for each document; users then specify the parameter k, as they used to specify alpha.

## Feb 20, 2010

### Cavendish Experiment and Modern Machine Learning

This is just a joke, so do not follow it seriously...

The very usual case in modern machine learning is as follows:
1. design a model to describe the data; for example, suppose some kind of 2D points are generated along a quadratic curve y = a*x^2 + b*x + c;
2. design an algorithm that estimates the model parameters (in our case a, b, and c) given a set of observations x_1,y_1,x_2,y_2,...x_n,y_n;
3. use the model parameters in some way, say, to predict the y corresponding to a new x.
So when I was reading Prof. Feynman's lecture notes, which mention the Cavendish experiment, I thought of this experiment as a kind of "learning using machines" --- using specially designed equipment (the machine), Cavendish measured the gravitational constant G in Newton's law of universal gravitation:
F = G * m1 * m2 / r^2
And, using the estimated model parameter G, we can do something interesting; for example, weigh the earth (by measuring the weight/gravity F of a small ball of known mass m1, and putting everything back into the equation to get m2, the mass of the earth).

However, as I said, this is a joke, so you cannot use it in your lecture notes on machine learning. The fact is that Cavendish did not measure G, as stated in many textbooks. Instead, he weighed the earth directly by comparing (1) the force with which a big ball of known mass attracts a small ball against (2) the force with which the earth attracts the small ball. If the ratio (2)/(1) is N, then the earth is N times the weight of the big ball.

## Feb 14, 2010

### Highlights in LaTeX

To make part of the text highlighted in LaTeX, use the following two packages:
\usepackage{color}
\usepackage{soul}
And in the text, use macro \hl:
The authors use \hl{$x=100$ in their demonstration}.
Note that if you use only soul without color, \hl falls back to underlining.

## Feb 4, 2010

### Google Puts New Focus on Outside Research

It was recently reported that Google is stepping up its funding to support research in the following four areas:
• machine learning
• the use of cellphones as data collection devices in science
• energy efficiency
• privacy
Among these four areas, machine learning is on the top. "Three years ago, three of the four research areas would not have been on the company’s priority list", Mr. Spector said. "The only one that was a priority then and now is machine learning, a vital ingredient in search technology."

## Feb 3, 2010

### Reduce Data Correlation by Recenter and Rescale

In the MATLAB statistics demo document, the training data (a set of car weights) are recentered and rescaled as follows:
```matlab
% A set of car weights
weight = [2100 2300 2500 2700 2900 3100 3300 3500 3700 3900 4100 4300]';
weight = (weight-2800)/1000;     % recenter and rescale
```
And the document explains the reason for recentering and rescaling as follows:
The data include observations of weight, number of cars tested, and number failed. We will work with a transformed version of the weights to reduce the correlation in our estimates of the regression parameters.
Could anyone tell me why recentering and rescaling can reduce the correlation?

## Feb 2, 2010

### Using aMule with VeryCD

http://hi.baidu.com/linsir/blog/item/c4b54839805a9af73a87cea2.html

### Making Videos Playable on Android and iPhone

You might want to convert your home-made videos (no pirated videos :-!) into a format that your Android phone can play. The video formats Android supports are listed on the Android developers' site:

Among the listed formats, H.264 (a realization of the MPEG-4 standard) is well accepted by the industry; companies including Apple have switched to it. In the following, I will show how to use open source software on a Linux box to convert your video into H.264, with AVC video and AAC audio. I took the following post as a reference, but with updates.

First of all, you need to install the following software packages:
1. mplayer: a multimedia player
2. mencoder: MPlayers's movie encoder
3. faac: an AAC audio encoder
4. gpac: a small and flexible implementation of the MPEG-4 systems standard
5. x264: video encoder for the H.264/MPEG-4 AVC standard
Then we do the following steps to convert the video.avi into a .mp4 file in H.264 format.
1. Extract the audio information from video.avi using MPlayer:
mplayer -ao pcm -vc null -vo null video.avi
This will generate an audiodump.wav file.

2. Encode audiodump.wav into AAC format
faac --mpeg-vers 4 audiodump.wav
This generates an audiodump.aac file.

3. Use mencoder to convert the video content of video.avi into YUV 4:2:0 format, and use x264 to encode the output into AVC format
```
mkfifo tmp.fifo.yuv
mencoder -vf scale=800:450,format=i420 \
  -nosound -ovc raw -of rawvideo \
  -ofps 23.976 -o tmp.fifo.yuv video.mp4 2>&1 > /dev/null &
x264 -o max-video.mp4 --fps 23.976 --bframes 2 \
  --progress --crf 26 --subme 6 \
  --analyse p8x8,b8x8,i4x4,p4x4 \
  --no-psnr tmp.fifo.yuv 800x450
rm tmp.fifo.yuv
```
We create a named pipe to buffer between mencoder and x264. These command lines generate content compatible with both Quicktime and H.264, because Apple Quicktime can now hold H.264 content. Be sure to specify the same video size to both mencoder and x264; in the above example, the size is 800x450.

4. Merge the AAC audio and AVC video into a .mp4 file using gpac
```
MP4Box -add max-video.mp4 -add audiodump.aac \
  -fps 23.976 max-x264.mp4
```
MP4Box is a tool in the gpac package.

## Starting Apache on Mac OS X

Mac OS X ships with Apache, and starting it is easy: open System Preferences, select Sharing in the Internet & Wireless category, and check Web Sharing. Apache is then running.

## Setting Up a Domain Name

1. Register a free account on the homepage of www.3322.org. My username is cxwangyi.

2. Go to the "My Console" page and click "New" under "Dynamic Domain Names". Then choose a domain suffix; 3322.org provides several, and I chose 7766.org. My domain name is my 3322.org username plus the chosen suffix, i.e., cxwangyi.7766.org. The configuration page detects the external IP address automatically, so we do not have to type it in, and the default values are fine for the other options.

## Updating the External IP Address

```
lynx -mime_header -auth=cxwangyi:123456 \
  "http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"
```

curl -u cxwangyi:123456 \
  "http://www.3322.org/dyndns/update?system=dyndns&hostname=cxwangyi.7766.org"

while true; do
  sleep 10000
  curl -u cxwangyi:123456 \
    "http://www.3322.org/dyndns/update?system=dyndns&hostname=myhost.3322.org"
done

## Creating Technical Content with Emacs Muse

(add-to-list 'load-path "~/.emacs.d/lisp/muse")
(require 'muse-mode) ; load authoring mode
(require 'muse-html) ; load publishing styles I use
(require 'muse-latex)
(require 'muse-texinfo)
(require 'muse-docbook)
(require 'muse-latex2png) ; display LaTeX math equations
(setq muse-latex2png-scale-factor 1.4) ; the scaling of equation images.
(require 'muse-project) ; publish files in projects
(muse-derive-style "technotes-html" "html"
  :style-sheet "<link rel=\"stylesheet\" type=\"text/css\" media=\"all\" href=\"../css/wangyi.css\" />")
(setq muse-project-alist
      '(("technotes" ("~/TechNotes" :default "index")
         (:base "technotes-html" :path "~/Sites/TechNotes"))))

## Jan 20, 2010

### Unix Philosophy

Comparing Windows and Linux has long been difficult. My long-held view is that in the Unix world, people write small, simple programs that work together via standard linkages like pipes and other inter-process communication mechanisms, whereas under Windows, people tend to write one huge program that does everything.

An example is word processing. In the Windows world, we have Microsoft Word, which packs a huge number of functions into one program: editing, rendering (WYSIWYG), spell checking, printing, and much more. In the Unix world, we use the TeX system, which consists of many programs, each doing one simple thing: the TeX macro language defines basic typesetting functions, LaTeX defines more, Emacs (or any other editor) edits, pdfLaTeX (and other converters) converts sources into PDF or other formats, AUCTeX or LyX provides WYSIWYG, and so on.

Having written the above, I can at least claim to see the Separation and Composition philosophy of the Unix world. There is much more that I have not been able to summarize. Luckily, a master has summarized it for us: please refer to the great book The Art of Unix Programming.

## Jan 19, 2010

### Hierarchical Classification

A 2005 paper, Learning Classifiers Using Hierarchically Structured Class Taxonomies, discusses classification into a taxonomy. My general understanding is that this problem can be solved by training a set of binary classifiers, as in the multi-label classification problem. More details from the paper:

Types of classification:
1. Traditional: classify instances into mutually exclusive class labels,
2. Multi-label: an instance may have more than one label,
3. Taxonomic: multi-label, with labels drawn from a hierarchical taxonomy.
Solutions proposed:
1. Binarized: train a set of binary classifiers, one per label in the taxonomy. At classification time, if an instance does not belong to class C, there is no need to check it against classifiers for descendants of C.
2. Split-based: I need to read more to understand this solution.
From the experimental results, the two solutions seem to perform similarly, and both differ from the bottom-up solution that I saw at Google.
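To make the binarized, top-down pruning idea concrete, here is a minimal Java sketch. It is not code from the paper; the class names, the tiny two-level taxonomy, and the threshold-based classifiers are all made up for illustration. Each node carries its own binary classifier, and we descend into a subtree only when the instance belongs to the node's class:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class TaxonomyClassifier {
  // A node in the class taxonomy with its own (hypothetical) binary classifier.
  static class Node {
    final String label;
    final Predicate<double[]> classifier;
    final List<Node> children = new ArrayList<>();
    Node(String label, Predicate<double[]> classifier) {
      this.label = label;
      this.classifier = classifier;
    }
  }

  // Binarized approach: test the instance at a node; only if it belongs do we
  // descend to the children, so classifiers of pruned subtrees are never run.
  static List<String> classify(Node node, double[] x) {
    List<String> labels = new ArrayList<>();
    if (!node.classifier.test(x)) return labels;  // prune the whole subtree
    labels.add(node.label);
    for (Node child : node.children) labels.addAll(classify(child, x));
    return labels;
  }

  public static void main(String[] args) {
    Node sports = new Node("sports", x -> x[0] > 0.5);
    sports.children.add(new Node("soccer", x -> x[1] > 0.5));
    sports.children.add(new Node("tennis", x -> x[1] <= 0.5));
    System.out.println(classify(sports, new double[] {0.9, 0.8}));  // [sports, soccer]
  }
}
```

If the root classifier rejects the instance, none of the descendant classifiers run at all, which is the efficiency argument made above.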

### Something New about LDA and HDP

The UC Irvine team has updated their work and published a JMLR 2009 paper: Distributed Algorithms for Topic Models. For HDP, they propose a greedy approach to matching new topics. I also like the way they visualize the training process.

Diane Hu, a PhD student working on latent topic models for musical analysis, recently wrote a tutorial/survey on LDA, Latent Dirichlet Allocation for Text, Images, and Music, which introduced LDA basics as well its extension models for images and music.

### SSH on Mac OS X

This article shows how to start an SSH server on Mac OS X, how to set up password-less login, and how to tunnel insecure protocols over SSH.

The article also explains how to change the SSH port number on Mac OS X. Note that starting with Mac OS X 10.4, the mechanism for launching sshd changed from xinetd to launchd, so changing the port number became a little harder.

## Jan 18, 2010

### Get Across GFW Using Tor and Bridges

1. Send an email to bridges@torproject.org; the content does not matter. A reply will arrive shortly, listing several usable bridges.
2. In the Vidalia interface, choose Settings; in the Settings dialog, switch to the Network page; check "My ISP blocks connections to the Tor network"; then add the bridges from the email one by one via the "Add a Bridge" input box.

## Jan 11, 2010

### Maximum Entropy Modeling

http://homepages.inf.ed.ac.uk/lzhang10/maxent.html

## Jan 8, 2010

### Generate Core Dump Files

If you want your program to generate core dump files (including the stack trace) when it encounters a segmentation fault, remember to set the following shell option
ulimit -c unlimited
before running your program.

Once the core file is generated (say core), we can check the stack trace using GDB:
gdb program_file core
Then type the GDB command
where
which will show you the stack trace.

Special Notes for Cygwin

When a process (e.g., foo.exe) cores, a default stackdump file foo.exe.stackdump is generated. This stackdump contains (among other things) stack frames and function addresses. You can make some sense of it using the addr2line utility, but it is not as convenient as using a debugger.

Which brings me to the actually useful bit of information in this post: you can instruct Cygwin to start the gdb debugger just in time when a fault occurs, or have Cygwin generate a real core dump.

To achieve this, add error_start=action to the CYGWIN environment variable:

# start gdb
export CYGWIN="$CYGWIN error_start=gdb -nw %1 %2"
# generate core dump
export CYGWIN="$CYGWIN error_start=dumper -d %1 %2"

### A Step-by-Step Tutorial on Autotools

Autotools are complicated for new users; however, I got lucky this evening and found an excellent step-by-step tutorial. By following it, I packed my Hadoop Streaming wrapper for C++ in a few minutes! I would like to donate to the author if s/he wants. :-)

## Jan 7, 2010

### A C++ MapReduce "Implementation" Based on Hadoop Streaming

Hadoop has two mechanisms for supporting languages other than Java:
1. Hadoop Pipes, which provides a C++ library pair supporting Hadoop programs written in C/C++ only, and
2. Hadoop Streaming, which launches arbitrary executable files as map/reduce worker processes and thus supports any language.
However, in Hadoop 0.20.1, the support for Pipes, namely the Java code in package org.apache.hadoop.mapred.pipes, has been marked deprecated. So I guess Hadoop 0.20.1 no longer fully supports Pipes. Some posts in forums also discuss this issue.

So I would like to use Streaming with C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoking your map and reduce programs in the following shell command:
cat input_file | map_program | sort | reduce_program
Of course, as you know, Hadoop runs this shell pipeline on a computing cluster in parallel.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
  -file ./word_count_mapper -mapper ./word_count_mapper \
  -file ./word_count_reducer -reducer ./word_count_reducer \
  -input ./input/*.txt -output
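To make the pipe analogy concrete, here is a small self-contained Java sketch (not my wrapper's code; the class name and sample data are made up) that simulates the cat | map | sort | reduce semantics in memory, using word counting as the job:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StreamingWordCount {
  // The "map" stage: emit a (word, 1) pair for every word in every input line.
  static List<String[]> map(List<String> lines) {
    List<String[]> pairs = new ArrayList<>();
    for (String line : lines)
      for (String word : line.split("\\s+"))
        if (!word.isEmpty()) pairs.add(new String[] {word, "1"});
    return pairs;
  }

  // The "sort | reduce" stages: order pairs by key, then sum counts per key.
  static Map<String, Integer> reduce(List<String[]> pairs) {
    pairs.sort(Comparator.comparing(p -> p[0]));  // the "sort" step
    Map<String, Integer> counts = new TreeMap<>();
    for (String[] p : pairs)
      counts.merge(p[0], Integer.parseInt(p[1]), Integer::sum);
    return counts;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a b a", "b c");
    System.out.println(reduce(map(input)));  // {a=2, b=2, c=1}
  }
}
```

In real Streaming, map and reduce are separate executables reading stdin and writing key/tab/value lines to stdout, and Hadoop supplies the sort and the parallelism in between.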

Based on Hadoop Streaming, I wrote a C++ MapReduce wrapper (more precisely, it should be called a MapReduce implementation, but the code is so simple when built on Hadoop Streaming that I feel embarrassed to call it an "implementation"). Anyway, I find it interesting that this simple wrapper supports secondary keys, whereas org.apache.hadoop.mapreduce does not yet. :-)

I have created a Google Code project to host this simple implementation: Hadoop Streaming MapReduce, and imported the code using the following command line:
svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'
So you should be able to check out the code now.

## Jan 6, 2010

### Map-Reduce-Merge for Join Operation

In this SIGMOD 2007 paper: Map-reduce-merge: simplified relational data processing on large clusters, the authors add to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules, and demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.

However, I think the newly added merge stage could be implemented as another MapReduce job: the mapper scans over key-value pairs of all lineages and outputs them unchanged; the shuffling stage merges values of different lineages but with the same key into reduce inputs; finally, the reducer does whatever the merger is supposed to do.

Other people have told me that there are more ways to do the merge using the MapReduce model, but the simple solution above seems to be among the most scalable. In short, if I were to implement a parallel programming model with the objective of supporting joins of relational data, I would just implement MapReduce rather than Map-Reduce-Merge.
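The idea above can be sketched in a few lines of plain Java. This is a toy in-memory simulation, not Hadoop code: the class name, the "L"/"R" lineage tags, and the inner-join reducer are all illustrative assumptions. The mapper tags each record with its lineage, the shuffle groups all lineages by key, and the reducer plays the role of the merger:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class JoinViaMapReduce {
  // Simulates one MapReduce job that joins two relations keyed by String.
  static Map<String, List<String>> join(Map<String, String> left,
                                        Map<String, String> right) {
    // Map stage: emit each record unchanged, tagged with its lineage.
    // Shuffle stage: group records of all lineages under the same key.
    Map<String, List<String[]>> shuffled = new TreeMap<>();
    left.forEach((k, v) -> shuffled.computeIfAbsent(k, x -> new ArrayList<>())
                                   .add(new String[] {"L", v}));
    right.forEach((k, v) -> shuffled.computeIfAbsent(k, x -> new ArrayList<>())
                                    .add(new String[] {"R", v}));
    // Reduce stage: the "merger" -- here an inner join of the two lineages.
    Map<String, List<String>> joined = new TreeMap<>();
    shuffled.forEach((k, records) -> {
      for (String[] l : records)
        for (String[] r : records)
          if (l[0].equals("L") && r[0].equals("R"))
            joined.computeIfAbsent(k, x -> new ArrayList<>())
                  .add(l[1] + "," + r[1]);
    });
    return joined;
  }

  public static void main(String[] args) {
    Map<String, String> users = Map.of("u1", "Alice", "u2", "Bob");
    Map<String, String> orders = Map.of("u1", "book");
    System.out.println(join(users, orders));  // {u1=[Alice,book]}
  }
}
```

Keys that appear in only one relation (u2 above) produce no output, exactly as an inner join should; a different reducer body would give an outer join instead.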

## Jan 2, 2010

### Compare GNU GCJ with Sun's JVM

On this Wikipedia page, there is a link to Alex Ramos's experiment, which compares the performance of native binaries generated from Java programs by GNU's GCJ with bytecode generated by Sun's JDK and run on the JIT JVM. As Alex did the comparison on an AMD CPU, I did some additional ones. Here are the results.

| System | Java version | Sum Mflops | Sqrt Mflops | Exp Mflops |
|---|---|---|---|---|
| 2x AMD 64 5000+, Ubuntu | JIT 1.6.0_14 | 99 | 43 | 10 |
| | GCJ 4.3.2 | 64 | 65 | 13 |
| 2x Intel Core2 2.4GHz, Ubuntu | JIT 1.6.0_0 | 87.4 | 36.9 | 16.6 |
| | GCJ 4.2.4 | 150.6 | 39.3 | 30 |
| Intel T2600 2.16GHz, Cygwin | JIT 1.6.0_17 | 45.4 | 34.8 | 10.4 |
| | GCJ 3.4.4 | 84.1 | 23.7 | 12.1 |

The first comparison was done by Alex; I just copy-n-pasted his results. The second was done on my workstation. The third on my IBM T60p notebook computer. I also tried to do the comparison on my MacBook Pro, but MacPorts cannot build and install GCJ correctly.

Generally, GCJ beats JIT on numerical computing. However, I have to mention that it takes a lot more time to start the binary generated by GCJ. (I do not know why...)

Here is the Java source code (VectorMultiplication.java), which is almost identical to Alex's but uses much shorter vectors (1M vs. 20M), so more computers can run it.
import java.util.Random;

public class VectorMultiplication {
  public static double vector_mul(double a[], double b[], int n, double c[]) {
    double s = 0;
    for (int i = 0; i < n; ++i)
      s += c[i] = a[i] * b[i];
    return s;
  }
  public static void vector_sqrt(double a[], double b[], int n) {
    for (int i = 0; i < n; ++i)
      b[i] = Math.sqrt(a[i]);
  }
  public static void vector_exp(double a[], double b[], int n) {
    for (int i = 0; i < n; ++i)
      b[i] = Math.exp(a[i]);
  }
  public static void main(String[] args) {
    final int MEGA = 1000 * 1000;
    Random r = new Random(0);
    double a[], b[], c[];
    int n = 1 * MEGA;
    a = new double[n];
    b = new double[n];
    c = new double[n];
    for (int i = 0; i < n; ++i) {
      a[i] = r.nextDouble();
      b[i] = r.nextDouble();
      c[i] = r.nextDouble();
    }
    long start = System.currentTimeMillis();
    vector_mul(a, b, n, c);
    System.out.println("MULT MFLOPS: " +
                       n/((System.currentTimeMillis() - start)/1000.0)/MEGA);
    start = System.currentTimeMillis();
    vector_sqrt(c, a, n);
    System.out.println("SQRT MFLOPS: " +
                       n/((System.currentTimeMillis() - start)/1000.0)/MEGA);
    start = System.currentTimeMillis();
    vector_exp(c, a, n);
    System.out.println("EXP MFLOPS: " +
                       n/((System.currentTimeMillis() - start)/1000.0)/MEGA);
  }
}
On my Core2 workstation, the way I invoked GCJ is identical to that used in Alex's experiment:
gcj -O3 -fno-bounds-check -mfpmath=sse -ffast-math -march=native \
  --main=VectorMultiplication -o vec-mult VectorMultiplication.java
On my notebooks, I use
gcj -O3 -fno-bounds-check -ffast-math \
  --main=VectorMultiplication -o vec-mult VectorMultiplication.java

## Jan 1, 2010

### Learning Java as a C++ Programmer

Primitive Data Types
• char is 16-bit. byte is 8-bit. boolean corresponds to bool in C++.

• All Java primitive types are signed. Why?
Casting
• Java is more strongly typed than C++. No way to convert between boolean and integer types.
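For instance (a minimal illustrative example; the names are made up), a narrowing numeric conversion requires an explicit cast, and there is no cast at all between boolean and int:

```java
public class CastDemo {
  public static void main(String[] args) {
    int i = 300;
    byte b = (byte) i;  // narrowing needs an explicit cast; 300 wraps to 44
    System.out.println(b);
    // boolean f = (boolean) i;  // does not compile: no boolean<->int conversion
    boolean flag = (i != 0);     // the idiom replacing C++'s implicit truth test
    System.out.println(flag);
  }
}
```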
Operators
• Java has two new operators, >>> and >>>=. Each of these performs a right shift with zero fill.

• Java operators cannot be overloaded, in order to prevent unnecessary bugs.
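A short example of the two right-shift operators (the class name is illustrative). Since Java has no unsigned integers, >>> is how you shift the bit pattern of a negative number without sign extension:

```java
public class ShiftDemo {
  public static void main(String[] args) {
    int x = -8;
    // >> is an arithmetic shift: the sign bit is copied in from the left.
    System.out.println(x >> 1);   // -4
    // >>> is a logical shift: zeros are filled in from the left.
    System.out.println(x >>> 1);  // 2147483644
  }
}
```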
Struct and union
Java has no struct or union.
Arrays
• Arrays are objects, defined by Type [].

• Out-of-bounds access throws an ArrayIndexOutOfBoundsException.
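Unlike C++, where out-of-bounds access is undefined behavior, the exception can be caught like any other (class name illustrative):

```java
public class BoundsDemo {
  public static void main(String[] args) {
    int[] a = new int[3];  // arrays are objects; a.length is 3
    try {
      int v = a[3];        // index 3 is out of bounds for length 3
      System.out.println(v);
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```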

Classes

• You can set default values for class data members.
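A field initializer sets the default; fields without one get the type's zero value (unlike C++, where uninitialized members of a local object hold garbage). The class names here are illustrative:

```java
public class FieldDefaults {
  static class Point {
    int x = 1;    // field initializer sets a default value
    int y;        // without one, numeric fields default to 0
    String name;  // object references default to null
  }

  public static void main(String[] args) {
    Point p = new Point();
    System.out.println(p.x + " " + p.y + " " + (p.name == null));  // 1 0 true
  }
}
```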