Apr 19, 2010

Running Hadoop on Mac OS X (Single Node)

I installed Hadoop, built its C++ components, and built and ran Pipes programs on my iMac running Snow Leopard.

Installation and Configuration

Basically, I followed Michael G. Noll's guide, Running Hadoop On Ubuntu Linux (Single-Node Cluster), with two differences.

On Mac OS X, we need to choose to use Sun's JVM; this can be done in System Preferences. Then, in both .bash_profile and $HADOOP_HOME/conf/hadoop-env.sh, set the JAVA_HOME environment variable:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home

I did not create a special account for running Hadoop. (I should, for security reasons, but I am lazy and my iMac is only for personal development, not real computing...) So I need to chmod a+rwx /tmp/hadoop-yiwang, where yiwang is my account name and also what ${user.name} refers to in core-site.xml.

After finishing installation and configuration, we should be able to start all Hadoop services, build and run Hadoop Java programs, and monitor their activities.

Building C++ Components

Because I do not work in Java, I write Hadoop programs using Pipes. The following steps build the Pipes C++ library on Mac OS X:
  1. Install XCode and open a terminal window
  2. cd $HADOOP_HOME/src/c++/utils
  3. ./configure
  4. make install
  5. cd $HADOOP_HOME/src/c++/pipes
  6. ./configure
  7. make install
Note that you must build utils before pipes.

Building and Running Pipes Programs

The following command shows how to link to Pipes libraries:
g++ -o wordcount wordcount.cc \
-I${HADOOP_HOME}/src/c++/install/include \
-L${HADOOP_HOME}/src/c++/install/lib \
-lhadooputils -lhadooppipes -lpthread
To run the program, we need a configuration file, as shown on the Apache Hadoop Wiki page.
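
For reference, here is a minimal sketch of what a wordcount.cc like the one compiled above might look like. It only uses the Mapper, Reducer, and TemplateFactory classes declared in the Pipes headers installed earlier; the whitespace-only tokenization is a simplification, so treat it as an illustration rather than a drop-in program.

#include <string>
#include <vector>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Mapper: emit (word, "1") for every whitespace-separated word in the input line.
class WordCountMapper : public HadoopPipes::Mapper {
 public:
  WordCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (size_t i = 0; i < words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

// Reducer: sum the counts emitted for each word.
class WordCountReducer : public HadoopPipes::Reducer {
 public:
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char** argv) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}

It compiles with the g++ command shown above.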

Building libHDFS

There are some bugs in libHDFS in Apache Hadoop 0.20.2, but they are easy to fix:
cd hadoop-0.20.2/src/c++/libhdfs
./configure
Remove #include "error.h" from hdfsJniHelper.c
Remove -Dsize_t=unsigned int from Makefile
make
cp hdfs.h ../install/include/hadoop
cp libhdfs.so ../install/lib
Since Mac OS X uses DYLD to manage shared libraries, you need to specify the directory holding libhdfs.so using the DYLD_LIBRARY_PATH environment variable (LD_LIBRARY_PATH does not work):
export DYLD_LIBRARY_PATH=$HADOOP_HOME/src/c++/install/lib:$DYLD_LIBRARY_PATH
You might want to add the above line to your shell configuration file (e.g., ~/.bash_profile).
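
To check that the library works, a small test program along the following lines can be compiled against the installed headers and library. This is only a sketch: the namenode address localhost:9000 and the output path are assumptions, so use whatever fs.default.name is set to in your core-site.xml.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

#include "hadoop/hdfs.h"  // the header copied into install/include/hadoop above

int main() {
  // Connect to the single-node HDFS instance. The host and port are
  // assumptions and must match fs.default.name in core-site.xml.
  hdfsFS fs = hdfsConnect("localhost", 9000);
  if (fs == NULL) {
    fprintf(stderr, "cannot connect to HDFS\n");
    return 1;
  }

  // Write a short message into a new file.
  const char* path = "/tmp/libhdfs-test.txt";
  hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 0, 0);
  if (out == NULL) {
    fprintf(stderr, "cannot open %s for writing\n", path);
    hdfsDisconnect(fs);
    return 1;
  }
  const char* msg = "hello from libhdfs\n";
  hdfsWrite(fs, out, (void*)msg, strlen(msg));
  hdfsFlush(fs, out);
  hdfsCloseFile(fs, out);
  hdfsDisconnect(fs);
  return 0;
}

Link it with -lhdfs from the install/lib directory (plus the JVM, e.g. -framework JavaVM on Mac OS X), make sure CLASSPATH includes the Hadoop jars and conf directory, and set DYLD_LIBRARY_PATH as above before running.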

Apr 11, 2010

Get Through GFW on Mac OS X Using IPv6

In my previous post, I explained how to get through the GFW on Mac OS X using Tor. Unfortunately, it seems that Tor has been blocked by the GFW in recent months. However, some blog posts and mailing list threads claim that the GFW is not yet able to filter IPv6 packets, so I resorted to the IPv6 tunneling protocol Teredo. A well-known software implementation of Teredo on Linux and BSD is Miredo. Thanks to darco, who recently ported Miredo to Mac OS X, in particular 10.4, 10.5, and 10.6 with a 32-bit kernel. You can drop by darco's Miredo for Mac OS X page or just download the universal installer directly. After downloading, click to install, and IPv6 tunneling over IPv4 is set up on your Mac.

Before you can use IPv6 to get through the GFW, you need to know the IPv6 addresses of the sites you want to visit. You must add these addresses to your /etc/hosts file, so the Web browser does not need to resolve them via IPv4 DNS (which is monitored by the GFW). This Google Doc contains IPv6 addresses for most Google services (including YouTube).