Dec 29, 2009
Comparing Technologies Inside and Outside Google
Development Tools | Inside Google | Outside Google | Remark |
Code review | Mondrian | Rietveld | Both Mondrian and Rietveld were written by the inventor of Python (Guido van Rossum). |
Infrastructure | Inside Google | Outside Google | Remark |
Distributed File System | GFS | HDFS | Hadoop's distributed file system. |
Distributed File System | | CloudStore | Formerly known as the Kosmos file system. |
File Format | SSTable | String Table | Hadoop's file format of entries of key-value pairs. |
Distributed Storage | Bigtable | Hypertable | Baidu is a main sponsor of Hypertable. |
Distributed Storage | | HBase | A Hadoop sub-project, an alternative to Bigtable. |
Parallel Computing | MapReduce | Hadoop | Hadoop was initiated by people who formerly worked on Google's MapReduce team. |
Remote Procedure Call | Protocol Buffers | Thrift | Thrift was developed by Facebook and is now an Apache project. |
Data Warehouse | Dremel | Hive | Hive was developed by Facebook and is now an Apache Hadoop project. |
API | Inside Google | Outside Google | Remark |
Networking | (I do not know) | boost.asio | boost.asio provides a C++ abstraction for network programming. |
Dec 28, 2009
Learning Boosting
- The Boosting Approach to Machine Learning: An Overview, R. E. Schapire, 2001.
Schapire is one of the inventors of AdaBoost. This article starts with the pseudocode of AdaBoost, which is helpful for understanding the basic procedure of boosting algorithms.
Boosting is a machine learning meta-algorithm for performing supervised learning. Boosting is based on the question posed by Kearns: can a set of weak learners create a single strong learner? (From Wikipedia)
Boosting Algorithms
Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified.
AdaBoost
The pseudocode of AdaBoost is as follows.
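For concreteness, here is a minimal sketch of the procedure in Python (NumPy), using single-feature decision stumps as the weak learners. The stump helpers are illustrative assumptions; the choice of alpha and the weight updates follow the standard AdaBoost formulas.
import numpy as np

def train_stump(X, y, w):
    # Weak learner: pick the single-feature threshold stump with the lowest weighted error.
    best = None
    for j in range(X.shape[1]):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thresh, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thresh, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(X, stump):
    _, j, thresh, sign = stump
    return sign * np.where(X[:, j] > thresh, 1, -1)

def adaboost(X, y, T=10):
    # Labels y must be in {-1, +1}.
    n = len(y)
    w = np.full(n, 1.0 / n)                    # D_1: start from the uniform distribution
    ensemble = []
    for t in range(T):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-12), 1 - 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # alpha_t is determined by the weak learner's error
        pred = stump_predict(X, stump)
        w *= np.exp(-alpha * y * pred)         # misclassified examples gain weight, correct ones lose weight
        w /= w.sum()                           # renormalize to get D_{t+1}
        ensemble.append((alpha, stump))
    return ensemble

def predict(X, ensemble):
    # Final strong classifier: sign of the alpha-weighted vote of the weak learners.
    return np.sign(sum(a * stump_predict(X, s) for a, s in ensemble))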
As we can see from this algorithm:
- The weight distribution over the training examples changes in each iteration, and the amount of change is determined by alpha.
- The choice of alpha is not arbitrary; instead, it is based on the error of the weak learner. Refer to [1] for details.
- The aggregation of weak learners uses alpha to weight each learner.
Emacs Moves to Bazaar
A Small Computing Cluster on Board
- There is no mature load-balancing mechanism for GPU clusters. GPU-based parallel computing is currently at the stage CPU-based parallel computing was in its early days; by that I mean there is no automatic balancing over the processors used by a task, and no scheduling or balancing across tasks. This prevents multiple projects from sharing a GPU cluster.
- A GPU cluster is based on a shared-memory architecture, so it is suitable only for the class of computation-intensive but data-sparse tasks. I do not see more than a few real-world problems that fit in this class.
Using Facebook Thrift
- Download Thrift from http://incubator.apache.org/thrift
- Unpack the .tar.gz file to create /tmp/thrift-0.2.0
- Configure, build and install
./configure --prefix=~/wyi/thrift-0.2.0 CXXFLAGS='-g -O2'
make
make install
- Generate source code from tutorial.thrift
cd tutorial
~wyi/thrift/bin/thrift -r --gen cpp tutorial
Note that the -r flag also generates code for the files included by tutorial.thrift. The resulting source code is placed into a sub-directory named gen-cpp.
- Compile the example C++ server and client programs in tutorial/cpp
cd cpp
Note that you might need to edit the Makefile so that its lib and include directories point to where Thrift was installed.
make
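To see what the generated code buys you, here is roughly what a client of the tutorial's Calculator service looks like. This sketch uses Thrift's Python bindings purely for brevity (it assumes the code was also generated with --gen py and that the server listens on port 9090, as in the stock Thrift tutorial); the C++ client in tutorial/cpp follows the same socket/transport/protocol/stub pattern.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from tutorial import Calculator   # generated by: thrift -r --gen py tutorial.thrift

# Plug a binary protocol on top of a buffered socket transport.
socket = TSocket.TSocket('localhost', 9090)
transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Calculator.Client(protocol)          # stub class generated from tutorial.thrift

transport.open()
client.ping()                                 # void ping() declared in tutorial.thrift
print(client.add(1, 1))                       # i32 add(1: i32 num1, 2: i32 num2)
transport.close()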
Dec 27, 2009
A WordCount Tutorial for Hadoop 0.20.1
First, we write WordCount.java as follows:
package org.sogou;
import java.io.IOException;
import java.lang.InterruptedException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
/**
* The map class of WordCount.
*/
public static class TokenCounterMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
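// Emit (token, 1) for every whitespace-delimited token in the input line.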
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
/**
* The reducer class of WordCount
*/
public static class TokenCounterReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
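// Sum all the counts received for a key (a word) and emit the total.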
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
/**
* The main entry point.
*/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenCounterMapper.class);
job.setReducerClass(TokenCounterReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Then, we build this file and pack the result into a jar file:
mkdir classes
javac -classpath /Users/wyi/hadoop-0.20.1/hadoop-0.20.1-core.jar:/Users/wyi/hadoop-0.20.1/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf wordcount.jar -C classes/ .
Finally, we run the jar file in Hadoop's standalone mode:
echo "hello world bye world" > /Users/wyi/tmp/in/0.txt
echo "hello hadoop goodebye hadoop" > /Users/wyi/tmp/in/1.txt
hadoop jar wordcount.jar org.sogou.WordCount /Users/wyi/tmp/in /Users/wyi/tmp/out
Install and Configure Hadoop on Mac OS X
- Download Hadoop (at the time of writing this essay, it is version 0.20.1) and unpack it into, say, ~wyi/hadoop-0.20.1.
- Install JDK 1.6 for Mac OS X.
- Edit your ~/.bash_profile to add the following lines
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_HOME=~wyi/hadoop-0.20.1
export PATH=$HADOOP_HOME/bin:$PATH
- Edit ~wyi/hadoop-0.20.1/conf/hadoop-env.sh to define the JAVA_HOME variable as
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
- Try to run the command hadoop
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging. The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
cd ~/wyi/hadoop-0.20.1
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
Some Interesting Ant External Tools/Tasks
Ant Pretty Build is a tool to show and run Ant buildfiles directly from within a browser window. It consists of a single XSL file that generates, on the fly in the browser, a pretty interface from the .xml buildfile, showing the project name, description, properties, targets, and so on, sorted or unsorted. It lets you load, modify, or add properties, run the whole project or a selected set of targets in a specific order, change the logger, logfile, and mode, and add more libraries or command-line arguments.
Checkstyle is a development tool to help programmers write Java code that adheres to a coding standard. Its purpose is to automate the process of checking Java code, and to spare humans of this boring (but important) task.
Checkstyle can be run via an Ant task or a command line utility.
Hammurapi is a Java code review tool. It performs automated code review and contains 111 inspectors that check different aspects of code quality, including coding standards, EJB, threading, and more.
ProGuard is a free Java class file shrinker and obfuscator. It can detect and remove unused classes, fields, methods, and attributes, and then rename the remaining classes, fields, and methods using short, meaningless names.
CleanImports removes unneeded imports, formats your import sections, and flags ambiguous imports.
Dec 24, 2009
High-dimensional Data Processing
- logistic regression,
- Random Forests and
- SVMs.
Dec 22, 2009
Native Multitouch for Linux
Skyfire: Mobile Web Browsing over GFW
Dec 20, 2009
Collapsed Gibbs Sampling of LDA on GPU
- For word w1 in document j1 and word w2 in document j2, if w1 != w2 and j1 != j2, simultaneous updates of the topic assignments have no read/write conflicts on the document-topic matrix njk or the word-topic matrix nwk.
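One way this observation is commonly exploited (a sketch of the partitioning idea, not the paper's actual code): split the documents into P shards and the vocabulary into P shards, then sample the resulting P x P blocks in P rounds, where each round processes P blocks that share neither a document shard nor a vocabulary shard and can therefore run in parallel.
def conflict_free_rounds(P):
    # Yield, round by round, the sets of (doc_shard, word_shard) blocks that can be
    # sampled concurrently.  Within a round, no two blocks share a document shard or
    # a word shard, so their updates never touch the same row of the document-topic
    # counts njk or of the word-topic counts nwk.  (The global topic totals are still
    # shared and need separate handling, e.g. per-block copies merged after each round.)
    for r in range(P):
        yield [(p, (p + r) % P) for p in range(P)]

for round_blocks in conflict_free_rounds(3):
    print(round_blocks)
# [(0, 0), (1, 1), (2, 2)]
# [(0, 1), (1, 2), (2, 0)]
# [(0, 2), (1, 0), (2, 1)]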
A Nice Introduction to Logistic Regression
- A C++ implementation of large-scale logistic regression (together with a tech-report) can be found at:
http://stat.rutgers.edu/~madigan/BBR
- A Mahout slide deck shows that they received a proposal to implement logistic regression on Hadoop through Google Summer of Code, but I have not seen the result yet.
- Two papers on large-scale logistic regression were published in 2009:
1. Parallel Large-scale Feature Selection for Logistic Regression, and
2. Large-scale Sparse Logistic Regression
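For readers who want the baseline that these papers scale up, here is a minimal sketch of binary logistic regression trained by batch gradient ascent in Python (NumPy). The learning rate, iteration count, and L2 penalty are arbitrary illustration values and have nothing to do with the BBR code or the two papers above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, iters=1000, l2=0.0):
    # X: n x d feature matrix, y: labels in {0, 1}; returns the weight vector w.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ w)                  # P(y = 1 | x) under the current weights
        grad = X.T @ (y - p) / n - l2 * w   # gradient of the penalized mean log-likelihood
        w += lr * grad                      # gradient ascent step
    return w

def predict_proba(X, w):
    return sigmoid(X @ w)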