Dec 29, 2009

Comparing Technologies Inside and Outside Google

Leading positions in several key technological fields have made Google the giant of the Internet industry. Now the open source community and other companies are catching up with Google in these fields. Below is a comparison of published Google technologies and their counterparts developed outside of Google. I will update this list as I learn more.

Development Tools

  Field         Inside Google   Outside Google   Remark
  Code review   Mondrian        Rietveld         Both Mondrian and Rietveld were written by the inventor of Python.

Infrastructure

  Field                     Inside Google      Outside Google   Remark
  Distributed file system   GFS                HDFS             Hadoop's distributed file system.
                                               CloudStore       Formerly known as the Kosmos file system.
  File format               SSTable            String Table     Hadoop's file format for entries of key-value pairs.
  Distributed storage       Bigtable           Hypertable       Baidu is a main sponsor of Hypertable.
                                               HBase            A Hadoop sub-project, as an alternative to Bigtable.
  Parallel computing        MapReduce          Hadoop           Hadoop was inspired by Google's MapReduce and GFS papers.
  Remote procedure call     Protocol Buffers   Thrift           Thrift was developed by Facebook and is now an Apache project.
  Data warehouse            Dremel             Hive             Hive was developed by Facebook and is now an Apache Hadoop project.

API

  Field        Inside Google     Outside Google   Remark
  Networking   (I do not know)   boost.asio       boost.asio provides a C++ abstraction for network programming.

Dec 28, 2009

Learning Boosting

  1. The Boosting Approach to Machine Learning: An Overview, R. E. Schapire, 2001.
    Schapire is one of the inventors of AdaBoost. This article starts with the pseudocode of AdaBoost, which is helpful for understanding the basic procedure of boosting algorithms.
What is Boosting?
Boosting is a machine learning meta-algorithm for performing supervised learning. Boosting is based on the question posed by Kearns: can a set of weak learners create a single strong learner? (From Wikipedia)

Boosting Algorithms
Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified.

The pseudocode of AdaBoost can be found in [1].
As we can see from this algorithm:
  1. The weight distribution over training examples changes in each iteration, and the change ratio is determined by alpha.
  2. The choice of alpha is not arbitrary; instead, it is based on the error of the weak learner. Refer to [1] for details.
  3. The aggregation of weak learners uses alpha to weight each learner.
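The procedure described above can be sketched as a short Java program. This is only a minimal sketch of discrete AdaBoost: the decision-stump weak learner and the toy one-dimensional data are my own simplifications for illustration, not part of [1].

```java
import java.util.Arrays;

public class AdaBoostSketch {
    // A weak learner: a decision stump h(x) = sign * (x > threshold ? +1 : -1).
    static class Stump {
        double threshold; int sign; double alpha;
        int predict(double x) { return sign * (x > threshold ? 1 : -1); }
    }

    // Train T stumps on 1-D examples x with labels y in {-1, +1}.
    static Stump[] train(double[] x, int[] y, int T) {
        int n = x.length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);                 // initial uniform distribution D_1
        Stump[] ensemble = new Stump[T];
        for (int t = 0; t < T; t++) {
            Stump best = null; double bestErr = Double.MAX_VALUE;
            // Exhaustively pick the stump with the smallest weighted error.
            for (double thr : x) for (int s : new int[]{-1, 1}) {
                Stump cand = new Stump(); cand.threshold = thr; cand.sign = s;
                double err = 0;
                for (int i = 0; i < n; i++) if (cand.predict(x[i]) != y[i]) err += w[i];
                if (err < bestErr) { bestErr = err; best = cand; }
            }
            // alpha_t = 0.5 * ln((1 - err) / err); clamp err away from zero.
            double err = Math.max(bestErr, 1e-10);
            best.alpha = 0.5 * Math.log((1 - err) / err);
            ensemble[t] = best;
            // Reweight: misclassified examples gain weight, correct ones lose weight.
            double z = 0;
            for (int i = 0; i < n; i++) {
                w[i] *= Math.exp(-best.alpha * y[i] * best.predict(x[i]));
                z += w[i];
            }
            for (int i = 0; i < n; i++) w[i] /= z;  // normalize back to a distribution
        }
        return ensemble;
    }

    // Final strong classifier: sign of the alpha-weighted vote.
    static int classify(Stump[] ensemble, double x) {
        double vote = 0;
        for (Stump h : ensemble) vote += h.alpha * h.predict(x);
        return vote >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        int[] y = {-1, -1, -1, 1, 1, 1};          // separable at x > 3
        Stump[] model = train(x, y, 5);
        System.out.println(classify(model, 2.5));  // expect -1
        System.out.println(classify(model, 4.5));  // expect 1
    }
}
```

Note how the three observations above show up directly in the code: the weight update is scaled by alpha, alpha is computed from the weighted error, and the final vote is alpha-weighted.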

Emacs Moves to Bazaar

Noticed at Solidot that Emacs has moved its version control to Bazaar, a version control system that can work in the traditional centralized way or in a brand-new distributed way. After going over Bazaar's user guide, I feel that the distributed way suits developers who are spread all over the world but working on the same project: each developer commits code into his local repository (the reason Bazaar is called distributed) to record his own work, and publishes his work via email or SFTP. It then becomes the responsibility of the maintainer of the main repository to merge each individual's work back into the main branch. I may need to read more about this, but anyway, it is interesting.

A Small Computing Cluster on Board

In my previous post, I mentioned a newly developed GPU-based parallel Gibbs sampling algorithm for inference of LDA. Of course, as you know, there are many other GPU-based parallel algorithms that can solve many interesting applications efficiently using NVidia's CUDA programming framework.

Moreover, by Googling "CUDA MapReduce", you will find MapReduce implementations based on CUDA and GPUs, developed by researchers at UC Berkeley, U Texas, Hong Kong Univ. of Sci. & Tech., etc.

As for the supporting hardware, I recently noticed NVidia's Tesla processor board, which contains a 240-core Tesla 10 GPU and 4GB of on-board memory. This card can be installed in workstations like the Dell Precision T7500. At the time of writing, the price of such a system is about RMB 43,000.

Last but not least, there are significant differences between GPU clusters and traditional computer clusters. A few are listed here:
  1. There is no mature load-balancing mechanism on GPU clusters. GPU-based parallel computing is currently at the early stage that CPU-based parallel computing once went through: there is no automatic balancing over the processors used by a task, and no scheduling and balancing across tasks. This prevents multiple projects from sharing a GPU cluster.
  2. A GPU cluster is based on a shared-memory architecture, so it is suitable only for the class of computation-intensive but data-sparse tasks. I do not see more than a few real-world problems that fit in this class.
Anyway, it is not fair to compare GPU-based parallel computing directly with multi-core CPU-based solutions, because the latter can be naturally incorporated into multi-computer parallel computing and thus achieve much higher scalability.

Using Facebook Thrift

While I was looking for a general RPC solution, Thrift came to my eyes. It is now an Apache project, although it was developed at Facebook. Thrift is very similar to Google's Protocol Buffers, which was open-sourced and is hosted on Google Code. To me, Thrift seems to support more languages and ships with a whole set of surrounding support, including a thread manager and server/client stubs. More interestingly, Thrift supports exceptions; that is, exceptions thrown by remote methods can be caught by client code. (I do not remember that Google Protocol Buffers supports this ....)

However, Thrift does not yet have an official tutorial, so here is a very brief one. (I will switch to an official one once it is published.)
  1. Download Thrift from
  2. Unpack the .tar.gz file to create /tmp/thrift-0.2.0
  3. Configure, build, and install
    ./configure --prefix=~/wyi/thrift-0.2.0 CXXFLAGS='-g -O2'
    make
    make install
  4. Generate source code from tutorial.thrift
    cd tutorial
    ~wyi/thrift/bin/thrift -r --gen cpp tutorial.thrift
    Note that the -r flag makes Thrift also generate code for included .thrift files. The resulting source code will be placed into a sub-directory named gen-cpp.
  5. Compile the example C++ server and client programs in tutorial/cpp
    cd cpp
    make
    Note that you might need to edit the Makefile to point the lib and include directories at where Thrift was installed.

Dec 27, 2009

A WordCount Tutorial for Hadoop 0.20.1

Because the documentation of Hadoop 0.20.1 describes a tutorial program that uses out-of-date APIs, I decided to write the following tutorial for Hadoop 0.20.1. It is notable that in 0.20.1, org.apache.hadoop.mapred.* is deprecated and it is recommended to use org.apache.hadoop.mapreduce.*. This tutorial is based on the new API.

For how to install and configure Hadoop, you might want to refer to my previous post. After Hadoop is installed, let us create a source code directory and put the following Java source file in it:
package org.sogou;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {
  /** The map class of WordCount: emits (token, 1) for every token. */
  public static class TokenCounterMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  /** The reducer class of WordCount: sums the counts of each token. */
  public static class TokenCounterReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  /** The main entry point. */
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenCounterMapper.class);
    job.setReducerClass(TokenCounterReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Then, we build this file and pack the result into a jar file:
mkdir classes
javac -classpath /Users/wyi/hadoop-0.20.1/hadoop-0.20.1-core.jar:/Users/wyi/hadoop-0.20.1/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf wordcount.jar -C classes/ .
Finally, we run the jar file in the standalone mode of Hadoop:
mkdir -p /Users/wyi/tmp/in
echo "hello world bye world" > /Users/wyi/tmp/in/0.txt
echo "hello hadoop goodbye hadoop" > /Users/wyi/tmp/in/1.txt
hadoop jar wordcount.jar org.sogou.WordCount /Users/wyi/tmp/in /Users/wyi/tmp/out

Install and Configure Hadoop on Mac OS X

Download and Install
  1. Download Hadoop (at the time of writing this essay, it is version 0.20.1) and unpack it into, say, ~wyi/hadoop-0.20.1.
  2. Install JDK 1.6 for Mac OS X.
  3. Edit your ~/.bash_profile to add the following lines
    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
    export HADOOP_HOME=~wyi/hadoop-0.20.1
    export PATH=$HADOOP_HOME/bin:$PATH
  4. Edit ~wyi/hadoop-0.20.1/conf/ to define the JAVA_HOME variable as
    export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home 
  5. Try to run the command hadoop
Run An Example Program
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging. The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
    cd ~/wyi/hadoop-0.20.1
    mkdir input
    cp conf/*.xml input
    bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
    cat output/*
Note that before you re-run this example, you need to delete the directory output; otherwise, Hadoop will complain that the directory already exists.

Some Interesting Ant External Tools/Tasks


Ant Pretty Build is a tool to easily show and run Ant buildfiles directly from within a browser window. It consists of a single XSL file that generates on the fly, in the browser, from the .xml buildfile, a pretty interface showing the project name, description, properties, and targets, sorted or unsorted. It allows you to load/modify/add properties, run the whole project, or run a selected set of targets in a specific order, with the ability to modify the logger/logfile and mode, and to add more libs or command-line arguments.


Checkstyle is a development tool to help programmers write Java code that adheres to a coding standard. Its purpose is to automate the process of checking Java code, and to spare humans of this boring (but important) task.

Checkstyle can be run via an Ant task or a command line utility.


Java code review tool. Performs automated code review. Contains 111 inspectors which check different aspects of code quality including coding standards, EJB, threading, ...


ProGuard is a free Java class file shrinker and obfuscator. It can detect and remove unused classes, fields, methods, and attributes. It can then rename the remaining classes, fields, and methods using short meaningless names.


Removes unneeded imports. Formats your import sections. Flags ambiguous imports.

Dec 24, 2009

High-dimensional Data Processing

The three top performing classes of algorithms for high-dimensional data sets are
  1. logistic regression,
  2. Random Forests and
  3. SVMs.
Although logistic regression can be inferior to non-linear algorithms, e.g., kernel SVMs, on low-dimensional data sets, it often performs equally well in high dimensions, when the number of features goes over 10,000, because most data sets become linearly separable when the number of features is very large.

Given the fact that logistic regression is often faster to train than more complex models like Random Forests and SVMs, in many situations it is the preferable method to deal with high dimensional data sets.

Dec 22, 2009

Native Multitouch for Linux

A research institute in France worked with Linux kernel developers and created a native multi-touch technology for Linux. I like the following demo video on YouTube:

More details can be found at:

Skyfire: Mobile Web Browsing over GFW

Just tried Skyfire on my Nokia E71 mobile phone. It is not only able to get over the GFW but is also able to play YouTube videos (the E71 does not have a fully functional Flash player). A colleague told me that Skyfire renders Web pages (including embedded video) on the server and sends the resulting images to the phone. It is a real thin client, but I will pay the network carrier a lot for the data traffic.

Dec 20, 2009

Collapsed Gibbs Sampling of LDA on GPU

Thanks to Feng Yan who sent me his newly published work on parallel inference of LDA on GPU.

The basic motivation is that on a GPU, the display card memory is too small to maintain a copy of the nwk matrix for each GPU core. So the very basic requirement is to keep one global nwk matrix for all cores. This brings a new requirement: when multiple cores sample in parallel, they should not update the same element of nwk simultaneously. Feng gives a solution that partitions the training data not only by documents but also by words. This is viable due to the observation that:
  • for word w1 in document j1 and word w2 in document j2, if w1 != w2 and j1 != j2, simultaneous updates of their topic assignments have no read/write conflicts on either the document-topic matrix njk or the word-topic matrix nwk.
Feng also presents a preprocessing algorithm that computes an optimal data partition with the goal of load balancing.
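To illustrate why the observation enables parallelism, here is a small Java sketch of a diagonal schedule: if both documents and words are split into P blocks, then in epoch t, processor p can sample document block p against word block (p + t) mod P, so no two processors ever touch the same document block or word block at the same time. (The diagonal schedule is my own illustrative choice, not necessarily the exact algorithm of the paper.)

```java
public class DiagonalSchedule {
    // For P processors and a P x P grid of (document block, word block)
    // cells, epoch t assigns processor p the cell
    // (docBlock = p, wordBlock = (p + t) % P).
    static int wordBlock(int p, int t, int P) { return (p + t) % P; }

    // Verify that within any epoch, no two processors share a document
    // block or a word block, so their updates to njk and nwk never
    // conflict; after P epochs every cell of the grid has been visited.
    static boolean conflictFree(int P) {
        for (int t = 0; t < P; t++)
            for (int p1 = 0; p1 < P; p1++)
                for (int p2 = p1 + 1; p2 < P; p2++)
                    if (wordBlock(p1, t, P) == wordBlock(p2, t, P))
                        return false;
        return true;
    }

    public static void main(String[] args) {
        System.out.println(conflictFree(8));  // expect true
    }
}
```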

A Nice Introduction to Logistic Regression

Among the many textbooks and tutorials on logistic regression, the very elementary one given by the above link explains how the logistic regression model arises:

In the binary classification problem, it is intuitive to determine whether an instance x belongs to class 0 or class 1 by the ratio P(c=1|x) / P(c=0|x). Denoting P = P(c=1|x) and 1-P = P(c=0|x), this ratio becomes the odds P/(1-P).

However, a bad property of the odds is that it is asymmetric w.r.t. P. For example, swapping the values of P and 1-P does not negate the value of P/(1-P). However, the swap does negate the logit ln P/(1-P). So it becomes reasonable to make the logit, instead of the odds, our dependent variable.

By modeling the dependent variable by a linear form, we get:
ln P/(1-P) = a + bx
which is equivalent to
P = e^(a+bx) / (1 + e^(a+bx))

Above tutorial also compares linear regression with logistic regression:
"If you use linear regression, the predicted values will become greater than one and less than zero if you move far enough on the X-axis. Such values are theoretically inadmissible."

This explains that logistic regression does not estimate the relation between x and c directly; instead, it estimates the relation between x and P(c|x), and uses P(c|x) to decide whether x belongs to c=1 or c=0. So logistic regression is not really regression; it is a classifier.
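The model and its decision rule can be sketched in a few lines of Java; the coefficients a and b below are made-up numbers for illustration, not fitted values:

```java
public class LogisticSketch {
    // P(c=1|x) = e^(a+bx) / (1 + e^(a+bx)), i.e. the logistic function
    // applied to the linear predictor a + bx.
    static double prob(double a, double b, double x) {
        double z = Math.exp(a + b * x);
        return z / (1 + z);
    }

    // Classify x as 1 when P(c=1|x) >= 0.5, which happens exactly
    // when the logit a + bx >= 0.
    static int classify(double a, double b, double x) {
        return prob(a, b, x) >= 0.5 ? 1 : 0;
    }

    public static void main(String[] args) {
        double a = -3.0, b = 1.0;                 // made-up coefficients
        System.out.println(prob(a, b, 3.0));      // logit 0 -> probability 0.5
        System.out.println(classify(a, b, 5.0));  // logit 2 -> class 1
        System.out.println(classify(a, b, 1.0));  // logit -2 -> class 0
    }
}
```

Note that prob() always lies strictly between 0 and 1, which is exactly the property the quoted tutorial says linear regression lacks.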

Additional information: