Jan 7, 2010

A C++ MapReduce "Implementation" Basing on Hadoop Streaming

Hadoop has two mechanisms to support using languages other than Java:
  1. Hadoop Pipes, which provides a C++ library pair to support Hadoop programs in C/C++ only, and
  2. Hadoop Streamining, which languages any executable files in map/reduce worker processes, and thus support any languages.
However, in Hadoop 0.20.1, the support to Pipes, known as Java code in package org.apache.hadoop.mapred.pipes have been marked deprecated. So I guess Hadoop 0.20.1 has not port to fully support Pipes. Some other posts in forums also discussed this issue.

So, I would like to turn to use Streaming and C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoke your map and reduce program using the following shell command:
cat input_file | map_program | sort | reduce_program
Of couse, as you know, Hadoop runs the shell pipes on a computing cluster in parallel.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-file ./word_count_mapper -mapper ./word_count_mapper \
-file ./word_count_reducer -reducer ./word_count_reducer \
-input ./input/*.txt -output

Basing on Hadoop Streamming, I wrote a C++ MapReduce wrapper (more precisely, it should be called a MapReduce implementation, but the code is simple when built on Hadoop Streaming, that I feel embarrassed to call it an "implementation"). Anyway, I found it is interesting that this simple wrapper support secondary keys, whereas org.apache.hadoop.mapreduce does not yet. :-)

I have created a Google Code project to host this simple implementation: Hadoop Streaming MapReduce, and imported the code using the following command line:
svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'
. So you should be able to checkout the code now.

2 comments:

Unknown said...

Yi Wang,
Thanks for the instructions. Is there a README file to explain how I actually compile this code using the configure.ac and makefile.am? I've never used the output of automake before. Thanks.

Yeggstry said...

Just wanted to say thanks for this blog post, this has helped a lot for a proof-of-concept I am working on :)