May 23, 2010

The Efficiency of AWK Associative Array

I did a little experiment comparing C++ STL map with AWK associative array in counting word frequency of large text files. The result is astonishing: AWK associative array is about 6 times faster than the C++ code!

At first, I put my eyes on the C++ code that splits a line into words. I tried STL istringstream, C strtok_r, and a string splitting trick that I learned in Google. However, these choices do not affect the efficiency saliently.

Then I realized the lesson I learned from parallel LDA (a machine learning method, which has been a key part of my research for over three years) --- map is about 10 times slower than map. I found that this issue has been thoroughly explained by Lev Kochubeevsky, principle engineer at Netscape, in 2003. Unfortunately, seems no improvement to STL string emerged since then.

On the other hand, I highly suspect that AWK, an interpreted language, implements a trie-based data structure for maps with string-keys.

SSHFS, Poor Guys' Network File-system

SSHFS is good: as long as one has SSH access, he can mount remote directories, even if he does not have administrative accesses required by traditional network filesystems like NFS or Samba.

May 22, 2010

Install and Configure MPICH2 on Ubuntu

The general procedure is described in

I have encountered two cases where MPD on multiple nodes cannot communicate with each other:
  1. The firewall prevents such communication, and
  2. There are ambiguity in /etc/hosts.
For 1., we can disable iptables (the firewall commonly used in various Linux versions) and ufw (Ubuntu firewall) by the following commands:
sudo iptables -P INPUT ACCEPT
sudo iptables -P OUTPUT ACCEPT
sudo ufw disable

For 2., I just re-edit /etc/hosts to ensure all nodes are referred by their real IP addresses, instead of loop-back style addresses.

May 18, 2010

Hadoop Pipes Is Incompatible with Protocol Buffers

I just found another reason that I do not like Hadoop Pipes --- I cannot use a serialization of Google protocol buffer as map output key or value.

For those who are scratching your heads for weird bugs from your Hadoop Pipes programs using Google protocol buffers, please have a look at the following sample program:

#include <string>
#include <hadoop/Pipes.hh>
#include <hadoop/TemplateFactory.hh>
#include <hadoop/StringUtils.hh>

using namespace std;

class LearnMapOutputMapper: public HadoopPipes::Mapper {
LearnMapOutputMapper(HadoopPipes::TaskContext& context){}
void map(HadoopPipes::MapContext& context) {
context.emit("", "apple\norange\0banana\tpapaya");

class LearnMapOutputReducer: public HadoopPipes::Reducer {
LearnMapOutputReducer(HadoopPipes::TaskContext& context){}
void reduce(HadoopPipes::ReduceContext& context) {
while (context.nextValue()) {
string value = context.getInputValue(); // Copy content
context.emit(context.getInputKey(), HadoopUtils::toString(value.size()));

int main(int argc, char *argv[]) {
return HadoopPipes::runTask(HadoopPipes::TemplateFactory<LearnMapOutputMapper,

The reducer outputs the size of the map output values, which contains special characters: new-line, null-term and tab. If Hadoop Pipes allows such special characters, then we should see reduce outputs 26, the length of string

However, unfortunately, we see 12 in the output, which is the length of string

This shows that map outputs in Hadoop Pipes cannot contain the null-term character, which, however, may appear in a serialization of protocol buffer, as explained in the protocol buffers encoding scheme at:

I hate Hadoop Pipes, a totally incomplete but released MapReduce API.

May 16, 2010

MPI-based MapReduce Implementation

离开Google之后就没有那么好用的MapReduce实现了。在拼命寻找替代品的时候,发现已经有人在用 MPI 实现 MapReduce。一个 open source 实现是:

下载之后用了一下,发现对 MapReduce API 的封装很不到位。因此编程方法根本不是在Google 使用 MapReduce 时候的那一套。相比较而言,Hadoop Pipes 对 MapReduce API 的封装更便于使用。

这就牵扯到一个问题:MapReduce 到底强在哪儿?为什么用过 Google MapReduce 的人都会喜欢它?

MapReduce 的很多优势在论文里和论坛里都有人强调过了 —— 它可以处理海量的数据,可以支持 auto fault-recovery。但是我觉得最重要的一点是 MapReduce 的 API 很简单 —— 它容许程序员通过定义一个 map 函数和一个 reduce 函数就搞定一个并行程序。所有的 distributed IO、communications、task synchronization、load balancing、fault recovery都不用用户操心。

很多人在骂 Hadoop 慢。作为一个 Java implementation,Hadoop 确实是我见过的诸多 MapReduce 实现中最慢的(实际使用起来往往比 的 Bash MapReduce 还要慢),但是用Hadoop的人很多。难不成原因之一就是 API 好用?

我的感觉是:如果一个 MapReduce implementation 失去了 MapReduce 给程序员带来的便利,它的其他各种优势恐怕都要大打折扣了。(离开Google一个多月,我已经记不得我写过哪些不是用 MapReduce 的程序了。)

BTW:说到这里,顺便说一下, Sphere/Sector ( 的 API 也不是 MapReduce API 。从 Sphere Sector 的 tutorial slides 里贴一下一个demo program:
Sector::init(); Sector::login(…)
SphereStream input;
SphereStream output;
SphereProcess myProc;
myProc.loadOperator(“”);, output, func, 0);
Sector::logout(); Sector::close();
可以看到各种initialization、finalization、operator-loading 之类的操作都是需要用户来写的。其实把这些封装成 MapReduce API 并没有技术难度。而封装一下可以给用户省去很多麻烦。