Tech Notes of Yi Wang: 01/2010

Jan 20, 2010

Unix Philosophy

It had been a difficult problem to compare Windows and Linux. I had been holding the idea that in the Unix world, people write small and simple programs which work together via standard linkages like pipe and other inter-process communication mechanisms. However, under Windows, people tend to write a huge program which can do everything.

An example is word processing. In the Windows world, we have Microsoft Word, which has a huge number of functions: editing, rendering (WYSIWYG), spell checking, printing, and much more. However, in the Unix world, we use the TeX system, consisting of many programs, each does one simple thing -- TeX macro defines basic typesetting functions, LaTeX defines more, Emacs (or any other editor) edits, pdfLaTeX (and other converters) converts sources into PDF or other formats, AUCTeX or Lyx implements WYSIWYG, and etc.

Well, by mentioning above, I think I am not so bad as I see at least the Separation and Composition philosophy of the Unix world. However, there are many more that I have not been able to summarize. Anyway, the lucky thing is a master had summarized them for us, so, please refer to the great book The Art of Unix Programming.

Jan 19, 2010

Hierarchical Classification

A 2005 paper, Learning Classifiers Using Hierarchically Structured Class Taxonomies, discussed classification into a taxonomy. My general understand is that this problem can be solved by training a set of binary classifiers as the multi-label classification problem. More details delivered by this paper:

Types of Classification:

Traditional: classify instances into mutually exclusive class labels,
Multi-label: an instance may have more than one labels,
Taxonomic: multi-label and labels are from a hierarchical taxonomy.

Solutions Proposed:

Binarized: train a set of binary classifiers, each for a label in the taxonomy. In classification time, if an instance does not belongs to class C, then no need to check it with classifiers belonging to descendants of C.
Split-based: need to read more to understand this solution.

From the experiment results, it seems that above two solutions have similar performance. And both differs from the bottom-up solution that I saw in Google.

Something New about LDA and HDP

The UC Irvin team have updated their work and published a JMLR 2009 paper: Distributed Algorithms for Topic Models. For HDP, they proposed a greedy approach to matching of new topics. I also like their ways to visualize the training process.

Diane Hu, a PhD student working on latent topic models for musical analysis, recently wrote a tutorial/survey on LDA, Latent Dirichlet Allocation for Text, Images, and Music, which introduced LDA basics as well its extension models for images and music.

SSH on Mac OS X

This article shows how to start an SSH server on Mac OS X, how to set up loginless, and how to tunnelling unsecure protocols over SSH.

The article explains how to change the port number of SSH on Mac OS X. Note that from Mac OS X 10.4, the mechanism for launching sshd changed from using xinetd to launchd, so changing the port number becomes a little harder.

Jan 18, 2010

Get Across GFW Using Tor and Bridges

在这个技术博客上，我很少用中文写文章。但是此文似乎只有中国用户需要看到。

作为一个中国网络用户，为了访问我的这个技术博客，我不得不翻墙。在我的iMac和MacBook Pro上，Tor是个很不错的工具。可惜最近不好用了。今晚稍微研究了一下，发现可以通过加入网桥来解决这个问题。

具体的做法是：

给bridges@torproject.org写信，内容无所谓；稍后，对方会回复一封邮件，其中有几个可用的网桥。
在Vidalia的界面中选择Settings；在Settings对话框里切换到Network页；选中“My ISP blocks connections to the Tor network”；然后通过“Add a Bridge”输入框一条一条加入邮件中的网桥。可以参见以下抓图：

Jan 11, 2010

Maximum Entropy Modeling

http://homepages.inf.ed.ac.uk/lzhang10/maxent.html

Jan 8, 2010

Generate Core Dump Files

If you want your program generates core dump files (including stack trace) when it encounters a segmentation fault, remember to set the following shell option

ulimit -c unlimited

Before running your program.

Once the core file is generated (say core), we can check the stack trace using GDB:

gdb program_file core

Then type GDB command

where

which will show you the stack trace.

Special Notes for Cygwin

When a process (i.e. foo.exe) cores, a default stackdump foo.exe.stackdump is generated. This stackdump contains (among other things) stack frames and functions addresses. You can make some sense of it by using the `addr2line' utility, but it's not as convenient as using a debugger.

Which takes me to the actual useful bit on information in this post. You can instruct Cygwin to start your gdb debugger just in time when an fault occurs or have Cygwin generate a real core dump.

To achieve this, add `error_start=action' to the Cygwin environment variable:

# start gdb
export CYGWIN="$CYGWIN error_start=gdb -nw %1 %2"

# generate core dump
export CYGWIN="$CYGWIN error_start=dumper -d %1 %2"

A Step-by-Step Tutorial on Autotools

Autotools are so complicated for new users, however, I am lucky this evening and found an excellent step-by-step tutorial. By following it, I packed my Hadoop Streaming wrapper for C++ in few minutes! I would like to donate to the author if s/he wants. :-)

Jan 7, 2010

A C++ MapReduce "Implementation" Basing on Hadoop Streaming

Hadoop has two mechanisms to support using languages other than Java:

Hadoop Pipes, which provides a C++ library pair to support Hadoop programs in C/C++ only, and
Hadoop Streamining, which languages any executable files in map/reduce worker processes, and thus support any languages.

However, in Hadoop 0.20.1, the support to Pipes, known as Java code in package org.apache.hadoop.mapred.pipes have been marked deprecated. So I guess Hadoop 0.20.1 has not port to fully support Pipes. Some other posts in forums also discussed this issue.

So, I would like to turn to use Streaming and C++. Michael G. Noll wrote an excellent tutorial on Streaming using Python, which shows that Streaming is equivalent to invoke your map and reduce program using the following shell command:

cat input_file | map_program | sort | reduce_program

Of couse, as you know, Hadoop runs the shell pipes on a computing cluster in parallel.

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.20.1-streaming.jar \
-file ./word_count_mapper -mapper ./word_count_mapper \
-file ./word_count_reducer -reducer ./word_count_reducer \
-input ./input/*.txt -output

Basing on Hadoop Streamming, I wrote a C++ MapReduce wrapper (more precisely, it should be called a MapReduce implementation, but the code is simple when built on Hadoop Streaming, that I feel embarrassed to call it an "implementation"). Anyway, I found it is interesting that this simple wrapper support secondary keys, whereas org.apache.hadoop.mapreduce does not yet. :-)

I have created a Google Code project to host this simple implementation: Hadoop Streaming MapReduce, and imported the code using the following command line:

svn import hadoop-streaming-mapreduce/ https://hadoop-stream-mapreduce.googlecode.com/svn/trunk -m 'Initial import'

. So you should be able to checkout the code now.

Jan 6, 2010

Map-Reduce-Merge for Join Operation

In this SIGMOD 2007 paper: Map-reduce-merge: simplified relational data processing on large clusters, the authors add to Map-Reduce a Merge phase that can efficiently merge data already partitioned and sorted (or hashed) by map and reduce modules, and demonstrate that this new model can express relational algebra operators as well as implement several join algorithms.

However, I think the newly added merge stage could be implemented using another MapReduce job -- the mapper scans over key-value pairs of all lineage and output them identically; the shuffling stage will merge values of different lineage but the same key into reduce inputs; finally, the reducer can do whatever supposed to be done by merger.

Other people told me that there are more ways to do merge using MapReduce model. But above simple solution seems one of the most scalable. In short, if I am going to implement a parallel programming model given the objective to support joining of relational data, I would just implement MapReduce, rather than MapReduce-Merge.

Jan 2, 2010

Compare GNU GCJ with Sun's JVM

On this Wikipedia page, there is a link to Alex Ramos's experiment, which compares the performance of native binary generated by GNU's GCJ from Java program and bytecode binary generated by Sun's JDK and runs on JIT JVM. As Alex did the comparison on AMD CPU, I did more additional ones. Here are the results.

System	Java version	Sum Mflops	Sqrt Mflops	Exp Mflops
2x AMD 64 5000+, Ubuntu	JIT 1.6.0_14	99	43	10
	GCJ 4.3.2	64	65	13

2x Intel Core2 2.4GHz, Ubuntu	JIT 1.6.0_0	87.4	36.9	16.6
	GCJ 4.2.4	150.6	39.3	30

Intel T2600 2.16GHz, Cygwin	JIT 1.6.0_17	45.4	34.8	10.4
	GCJ 3.4.4	84.1	23.7	12.1

The first comparison was done by Alex; I just copy-n-pasted his results. The second was done on my workstation. The third on my IBM T60p notebook computer. I also tried to do the comparison on my MacBook Pro, but MacPorts cannot build and install GCJ correctly.

Generally, GCJ beats JIT on numerical computing. However, I have to mention that it takes a lot more time to start the binary generated by GCJ. (I do not know why...)

Here attaches the Java source code (VectorMultiplication.java), which is almost identical to Alex's, but use much shorter vectors (1M v.s. 20M), so more computer can run it.


import java.util.Random;

public class VectorMultiplication {

  public static double vector_mul(double a[], double b[], int n, double c[]) {
    double s = 0;
    for (int i = 0; i < n; ++i)
      s += c[i] = a[i] * b[i];
    return s;
  }

  public static void vector_sqrt(double a[], double b[], int n) {
    for (int i = 0; i < n; ++i)
      b[i] = Math.sqrt(a[i]);
  }

  public static void vector_exp(double a[], double b[], int n) {
    for (int i = 0; i < n; ++i) 
      b[i] = Math.exp(a[i]);
  }

  public static void main(String[] args) {
    final int MEGA = 1000 * 1000;
    Random r = new Random(0);
    double a[], b[], c[];
    int n = 1 * MEGA;
    a = new double[n];
    b = new double[n];
    c = new double[n];

    for (int i = 0; i < n; ++i) {
      a[i] = r.nextDouble();
      b[i] = r.nextDouble();
      c[i] = r.nextDouble();
    }

    long start = System.currentTimeMillis();
    vector_mul(a, b, n, c);
    System.out.println("MULT MFLOPS: " +
                       n/((System.currentTimeMillis() - start)/1000.0)/MEGA);

    start = System.currentTimeMillis();
    vector_sqrt(c, a, n);
    System.out.println("SQRT MFLOPS: " +
                       n/((System.currentTimeMillis() - start)/1000.0)/MEGA);

    start = System.currentTimeMillis();
    vector_exp(c, a, n);
    System.out.println("EXP MFLOPS: " +
                       n/((System.currentTimeMillis() - start)/1000.0)/MEGA);
  }
}

On my Core2 workstation, the way I invoked GCJ is identical to that used in Alex's experiment:

gcj -O3 -fno-bounds-check -mfpmath=sse -ffast-math -march=native \
 --main=VectorMultiplication -o vec-mult VectorMultiplication.java

On my notebooks, I use

gcj -O3 -fno-bounds-check -ffast-math \
 --main=VectorMultiplication -o vec-mult VectorMultiplication.java

Jan 1, 2010

Learning Java as a C++ Programmer

Primitive Data Types

char is 16-bit. byte is 8-bit. boolean corresponds to bool in C++.

All Java primitive types are signed.Why?

Casting

Java is more strongly typed than C++. No way to convert between boolean and integer types.

Operators

Java has two new operators, >>> and >>>=. Each of these performs a right shift with zero fill.

Java operators cannot be overloaded, in order to prevent unnecessary bugs

Struct and union

No struct or union.Arrays

Arrays are objects, defined by Type [].

Out of index accessing causes ArrayIndexOutOfBoundsException exception.

Classes

Can set default value of class data members

Tech Notes of Yi Wang