Dec 29, 2009
Comparing Technologies Inside and Outside Google
Development Tools | Inside Google | Outside Google | Remark |
Code review | Mondrian | Rietveld | Both Mondrian and Rietveld were written by the creator of Python. |
Infrastructure | Inside Google | Outside Google | Remark |
Distributed File System | GFS | HDFS | Hadoop's distributed file system. |
Distributed File System | GFS | CloudStore | Formerly known as the Kosmos file system. |
File Format | SSTable | String Table | Hadoop's file format for entries of key-value pairs. |
Distributed Storage | Bigtable | Hypertable | Baidu is a main sponsor of Hypertable. |
Distributed Storage | Bigtable | HBase | A Hadoop sub-project that provides an alternative to Bigtable. |
Parallel Computing | MapReduce | Hadoop | Hadoop's MapReduce implementation was inspired by Google's MapReduce paper. |
Remote Procedure Call | Protocol Buffers | Thrift | Thrift was developed by Facebook and is now an Apache project. |
Data Warehouse | Dremel | Hive | Hive was developed by Facebook and is now an Apache Hadoop project. |
API | Inside Google | Outside Google | Remark |
Networking | (I do not know) | boost.asio | boost.asio provides a C++ abstraction for network programming. |
Dec 28, 2009
Learning Boosting
- The Boosting Approach to Machine Learning: An Overview, R. E. Schapire, 2001.
Schapire is one of the inventors of AdaBoost. This article starts with the pseudo code of AdaBoost, which is helpful for understanding the basic procedure of boosting algorithms.
Boosting is a machine learning meta-algorithm for performing supervised learning. Boosting is based on the question posed by Kearns: can a set of weak learners create a single strong learner? (From Wikipedia)
Boosting Algorithms
Most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are typically weighted in some way that is usually related to the weak learners' accuracy. After a weak learner is added, the data is reweighted: examples that are misclassified gain weight and examples that are classified correctly lose weight (some boosting algorithms actually decrease the weight of repeatedly misclassified examples, e.g., boost by majority and BrownBoost). Thus, future weak learners focus more on the examples that previous weak learners misclassified.
AdaBoost
The pseudo code of AdaBoost is as follows
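Below is a minimal, self-contained C++ sketch of the procedure for binary labels y[i] in {-1, +1}. To keep it short, the candidate weak learners are represented simply by their precomputed predictions on the training set, and each round picks the candidate with the smallest weighted error; treat it as an illustration of the reweighting and of the role of alpha, not as a production implementation.
// AdaBoost sketch: h[j][i] is the prediction (-1 or +1) of candidate weak
// learner j on training example i; y[i] is the true label of example i.
#include <cmath>
#include <cstdio>
#include <vector>
struct BoostedModel {
  std::vector<int> chosen;     // index of the weak learner picked in round t
  std::vector<double> alpha;   // its weight alpha_t
};
BoostedModel AdaBoost(const std::vector<std::vector<int> >& h,
                      const std::vector<int>& y, int rounds) {
  const int m = y.size();
  std::vector<double> d(m, 1.0 / m);                // D_1(i) = 1/m
  BoostedModel model;
  for (int t = 0; t < rounds; ++t) {
    // 1. Pick the weak learner with the smallest weighted error epsilon_t.
    int best = 0;
    double best_err = 2.0;
    for (size_t j = 0; j < h.size(); ++j) {
      double err = 0.0;
      for (int i = 0; i < m; ++i)
        if (h[j][i] != y[i]) err += d[i];
      if (err < best_err) { best_err = err; best = (int)j; }
    }
    // 2. alpha_t is determined by epsilon_t: alpha_t = 0.5 * ln((1 - eps) / eps).
    double a = 0.5 * std::log((1.0 - best_err) / best_err);
    // 3. Reweight: misclassified examples gain weight, correct ones lose weight.
    double z = 0.0;
    for (int i = 0; i < m; ++i) {
      d[i] *= std::exp(-a * y[i] * h[best][i]);
      z += d[i];
    }
    for (int i = 0; i < m; ++i) d[i] /= z;           // renormalize to a distribution
    model.chosen.push_back(best);
    model.alpha.push_back(a);
  }
  return model;   // final classifier: H(x) = sign(sum_t alpha_t * h_t(x))
}
int main() {
  // A toy run: 4 examples, 3 candidate weak learners.
  int yy[] = {+1, +1, -1, -1};
  int hh[3][4] = {{+1, +1, +1, -1}, {+1, -1, -1, -1}, {-1, +1, -1, -1}};
  std::vector<int> y(yy, yy + 4);
  std::vector<std::vector<int> > h(3);
  for (int j = 0; j < 3; ++j) h[j].assign(hh[j], hh[j] + 4);
  BoostedModel model = AdaBoost(h, y, 3);
  for (size_t t = 0; t < model.alpha.size(); ++t)
    std::printf("round %d: weak learner %d, alpha = %.3f\n",
                (int)t, model.chosen[t], model.alpha[t]);
  return 0;
}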
As we can see from this algorithm:
- The weight distribution over the training examples changes in each iteration, and the change ratio is determined by alpha.
- The choice of alpha is not arbitrary; instead, it is based on the error of the weak learner. Refer to [1] for details.
- The aggregation of weak learners uses alpha to weight each learner.
Emacs Moves to Bazaar
A Small Computing Cluster on Board
- There is no mature load-balancing mechanism for GPU clusters. GPU-based parallel computing is currently at the early stage that CPU-based parallel computing once went through: there is no automatic balancing over the processors used by a task, and no scheduling or balancing across tasks. This prevents multiple projects from sharing a GPU cluster.
- A GPU cluster is based on a shared-memory architecture, so it is suitable only for the class of computation-intensive but data-sparse tasks. I do not see more than a few real-world problems that fit in this class.
Using Facebook Thrift
- Download Thrift from http://incubator.apache.org/thrift
- Unpack the .tar.gz file to create /tmp/thrift-0.2.0
- Configure, build and install
./configure --prefix=~/wyi/thrift-0.2.0 CXXFLAGS='-g -O2'
make
make install
- Generate source code from tutorial.thrift:
cd tutorial
~wyi/thrift/bin/thrift -r --gen cpp tutorial.thrift
Note that the -r flag tells Thrift to also generate code for .thrift files included by tutorial.thrift. The resulting source code will be placed in a sub-directory named gen-cpp.
- Compile the example C++ server and client programs in tutorial/cpp:
cd cpp
Note that you might want to change the Makefile so that its lib and include directories point to where Thrift was installed.
make
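For reference, the generated client code is used roughly as follows. This is a sketch from memory of the tutorial's Calculator service; the exact include paths, namespaces, and method names depend on the Thrift version, so compare it against tutorial/cpp/CppClient.cpp rather than copying it verbatim.
// Minimal Thrift C++ client sketch for the tutorial's Calculator service.
#include <boost/shared_ptr.hpp>
#include <protocol/TBinaryProtocol.h>
#include <transport/TSocket.h>
#include <transport/TTransportUtils.h>
#include "gen-cpp/Calculator.h"   // generated by the thrift compiler above
using boost::shared_ptr;
using namespace apache::thrift;
using namespace apache::thrift::protocol;
using namespace apache::thrift::transport;
int main() {
  shared_ptr<TTransport> socket(new TSocket("localhost", 9090));
  shared_ptr<TTransport> transport(new TBufferedTransport(socket));
  shared_ptr<TProtocol> protocol(new TBinaryProtocol(transport));
  tutorial::CalculatorClient client(protocol);
  transport->open();
  client.ping();                    // declared in tutorial.thrift
  int32_t sum = client.add(1, 1);   // remote call served by the example server
  transport->close();
  return sum == 2 ? 0 : 1;
}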
Dec 27, 2009
A WordCount Tutorial for Hadoop 0.20.1
package org.sogou;
import java.io.IOException;
import java.lang.InterruptedException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
/**
* The map class of WordCount.
*/
public static class TokenCounterMapper
extends Mapper<Object, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
/**
* The reducer class of WordCount
*/
public static class TokenCounterReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterable<IntWritable> values, Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
context.write(key, new IntWritable(sum));
}
}
/**
* The main entry point.
*/
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
Job job = new Job(conf, "Example Hadoop 0.20.1 WordCount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenCounterMapper.class);
job.setReducerClass(TokenCounterReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
Then, we build this file and pack the result into a jar file:
mkdir classes
javac -classpath /Users/wyi/hadoop-0.20.1/hadoop-0.20.1-core.jar:/Users/wyi/hadoop-0.20.1/lib/commons-cli-1.2.jar -d classes WordCount.java && jar -cvf wordcount.jar -C classes/ .
Finally, we run the jar file in the standalone mode of Hadoop:
echo "hello world bye world" > /Users/wyi/tmp/in/0.txt
echo "hello hadoop goodebye hadoop" > /Users/wyi/tmp/in/1.txt
hadoop jar wordcount.jar org.sogou.WordCount /Users/wyi/tmp/in /Users/wyi/tmp/out
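If the job succeeds, the tab-separated word counts written to /Users/wyi/tmp/out should look like the following for the two input files above (the exact name of the part file may differ between Hadoop versions):
cat /Users/wyi/tmp/out/part-r-00000
bye 1
goodebye 1
hadoop 2
hello 2
world 2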
Install and Configure Hadoop on Mac OS X
- Download Hadoop (at the time of writing this essay, it is version 0.20.1) and unpack it into, say, ~wyi/hadoop-0.20.1.
- Install JDK 1.6 for Mac OS X.
- Edit your ~/.bash_profile to add the following lines
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
export HADOOP_HOME=~wyi/hadoop-0.20.1
export PATH=$HADOOP_HOME/bin:$PATH
- Edit ~wyi/hadoop-0.20.1/conf/hadoop-env.sh to define the JAVA_HOME variable as
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home
- Try to run the command hadoop
By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging. The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.
cd ~/wyi/hadoop-0.20.1
mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
cat output/*
Some Interesting Ant External Tools/Tasks
Ant Pretty Build is a tool to easily show and run Ant buildfiles directly from within a browser window. It consists of a single XSL file that generates, on the fly in the browser, a pretty interface from the .xml buildfile, showing the project name, description, properties, targets, etc., sorted or unsorted. It allows you to load, modify, and add properties, run the whole project, or run a selected set of targets in a specific order, with the ability to change the logger/logfile and the mode, and to add more libraries or command-line arguments.
Checkstyle is a development tool to help programmers write Java code that adheres to a coding standard. Its purpose is to automate the process of checking Java code, and to spare humans of this boring (but important) task.
Checkstyle can be run via an Ant task or a command line utility.
Hammurapi is a Java code review tool. It performs automated code review and contains 111 inspectors that check different aspects of code quality, including coding standards, EJB, threading, ...
ProGuard is a free Java class file shrinker and obfuscator. It can detect and remove unused classes, fields, methods, and attributes. It can then rename the remaining classes, fields, and methods using short meaningless names.
CleanImports removes unneeded imports, formats your import sections, and flags ambiguous imports.
Dec 24, 2009
High-dimensional Data Processing
- logistic regression,
- Random Forests and
- SVMs.
Dec 22, 2009
Native Multitouch for Linux
Skyfire: Mobile Web Browsing over GFW
Dec 20, 2009
Collapsed Gibbs Sampling of LDA on GPU
- For word w1 in document j1 and word w2 in document j2, if w1 != w2 and j1 != j2, simultaneous updates of their topic assignments have no read/write conflicts on the document-topic matrix njk or the word-topic matrix nwk (see the update rule below).
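For reference, this is the standard collapsed Gibbs update in which those count matrices appear (assuming symmetric Dirichlet priors \alpha and \beta, vocabulary size V, and counts that exclude the token currently being resampled):
P(z_{ji} = k \mid \mathbf{z}^{\neg ji}, \mathbf{w}) \;\propto\; (n_{jk}^{\neg ji} + \alpha) \, \frac{n_{w_{ji} k}^{\neg ji} + \beta}{n_{k}^{\neg ji} + V\beta}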
A Nice Introduction to Logistic Regression
- A C++ implementation of large-scale logistic regression (together with a tech report) can be found at http://stat.rutgers.edu/~madigan/BBR
- A Mahout slide deck shows that they received a Google Summer of Code proposal to implement logistic regression on Hadoop, but I have not seen the result yet.
- Two papers on large-scale logistic regression were published in 2009:
1. Parallel Large-scale Feature Selection for Logistic Regression, and
2. Large-scale Sparse Logistic Regression
Nov 23, 2009
Cloud Computing Using GPUs
Nov 2, 2009
How to Write a Spelling Correction Program
Nov 1, 2009
Emacs — Tab vs. Space
M-x set-variable indent-tabs-mode nil
Or in your .emacs file:
(setq-default indent-tabs-mode nil)
Oct 20, 2009
C++ digraphs and additional keywords
A digraph is a keyword or combination of keys that lets you produce a character that is not available on all keyboards.
The digraph key combinations are:
Key Combination    Character Produced
<%                 {
%>                 }
<:                 [
:>                 ]
%:                 #
Additional keywords, valid in C++ programs only, are:
Keyword    Character Produced
bitand     &
and        &&
bitor      |
or         ||
xor        ^
compl      ~
and_eq     &=
or_eq      |=
xor_eq     ^=
not        !
not_eq     !=
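A tiny program exercising both; g++ accepts digraphs and these alternative tokens by default, so the following is a quick way to check them on your own toolchain:
#include <iostream>
int main() {
  int v<:3:> = <% 1, 2, 3 %>;                       // digraphs for [ ] and { }
  bool ok = (v[0] == 1) and not (v[1] bitand 4);    // alternative tokens
  std::cout << (ok ? "ok" : "oops") << std::endl;   // prints "ok"
  return 0;
}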
Oct 1, 2009
To Make Firefox Display PDF on Mac OS X
Sep 29, 2009
VLHMM for Web Applications
Sep 27, 2009
Aug 31, 2009
Posting Code into Blogger Posts
Bloom Filter
Practical applications of Bloom filters include quickly testing whether a request can be served by a given server instance, and testing whether a data element is present in a particular replica of a redundant system.
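A minimal C++ sketch of the data structure itself (C++11 for std::hash): the k probe positions are derived from two base hashes by double hashing, and the bit-array size and hash choices below are illustrative only.
#include <bitset>
#include <functional>
#include <string>
class BloomFilter {
 public:
  enum { kBits = 1 << 20,   // size of the bit array; tune to the expected load
         kProbes = 5 };     // number of bits set/tested per element
  void Insert(const std::string& key) {
    size_t h1 = std::hash<std::string>()(key);
    size_t h2 = std::hash<std::string>()(key + "#");  // crude second hash, for illustration
    for (int i = 0; i < kProbes; ++i)
      bits_.set((h1 + i * h2) % kBits);
  }
  // false means "definitely absent"; true means "possibly present"
  // (false positives are possible, false negatives are not).
  bool MayContain(const std::string& key) const {
    size_t h1 = std::hash<std::string>()(key);
    size_t h2 = std::hash<std::string>()(key + "#");
    for (int i = 0; i < kProbes; ++i)
      if (!bits_.test((h1 + i * h2) % kBits)) return false;
    return true;
  }
 private:
  std::bitset<kBits> bits_;
};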
Aug 12, 2009
Be Careful with std::accumulate
accumulate(timbre_topic_dist.begin(), timbre_topic_dist.end(), 0.0);
You do not want to be lazy and write 0.0 as 0, which the compiler will interpret as an integer and use to infer the type of the intermediate and final results of accumulate. For your reference, here is one of the several prototypes of accumulate:
template <typename _InputIterator, typename _Tp>
_Tp accumulate(_InputIterator __first, _InputIterator __last, _Tp __init) {
  for (; __first != __last; ++__first)
    __init = __init + *__first;
  return __init;
}
Note that the partial result is stored in _Tp __init, which means that even if we explicitly pass plus<double>() as the binary operation, an integer __init still truncates every partial sum:
accumulate(timbre_topic_dist.begin(), timbre_topic_dist.end(), 0,  // Wrong: 0 is an int
           plus<double>());
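The following self-contained snippet demonstrates the pitfall; the vector name and values are arbitrary:
#include <iostream>
#include <numeric>
#include <vector>
int main() {
  std::vector<double> dist(3, 0.5);
  double wrong = std::accumulate(dist.begin(), dist.end(), 0);    // int init: every partial sum is truncated, result 0
  double right = std::accumulate(dist.begin(), dist.end(), 0.0);  // double init: result 1.5
  std::cout << wrong << " vs " << right << std::endl;
  return 0;
}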
Jul 21, 2009
A Paper on ISMIR 2008
- Oh Oh Oh Whoah! Towards Automatic Topic Detection In Song Lyrics, Florian Kleedorfer et al.
- Relation between PLSA and NMF and implications, SIGIR 2006.
- On the equivalence between Non-negative Matrix Factorization and Probabilistic Latent Semantic Indexing. Computational Statistics and Data Analysis 52 (2008) 3913–3927.
Jul 15, 2009
A Non-Sense Braille Translator
Jul 8, 2009
Lock Screen in Mac OS X
Jun 29, 2009
Fixing Garbled Chinese Characters in the English Version of Windows
- Open "Control Panel"
- Switch to "Category View"
- Open "Regional and Language Options"
- Switch to "Languages" tab
- Click "Install files for East Asian languages", and click "Apply".
- Switch to "Advanced" tab
- In the combo box "Language for non-Unicode programs", select "Chinese(PRC)"
- Switch to "Regional Options" tab
- Select "Chinese(PRC)" and "China" respectively for each of the two combo boxes.
- Click "OK"
Jun 28, 2009
DocView Mode for Emacs
Shift-+ to enlarge the displayed page (.pdf)
C-c C-c to switch between docview mode and text mode
Jun 23, 2009
Make Emacs Warn About Long Lines
(add-hook 'c++-mode-hook
          '(lambda () (font-lock-set-up-width-warning 80)))
(add-hook 'java-mode-hook
          '(lambda () (font-lock-set-up-width-warning 80)))
(add-hook 'python-mode-hook
          '(lambda () (font-lock-set-up-width-warning 80)))
On Mac OS X, the above method fails. However, with Carbon Emacs or Aquamacs, we can do as follows:
; for Carbon Emacs (Mac OS X)
(defun font-lock-width-keyword (width)
  "Return a font-lock style keyword for a string beyond width WIDTH
that uses 'font-lock-warning-face'."
  `((,(format "^%s\\(.+\\)" (make-string width ?.))
     (1 font-lock-warning-face t))))
(font-lock-add-keywords 'c++-mode (font-lock-width-keyword 80))
(font-lock-add-keywords 'objc-mode (font-lock-width-keyword 80))
(font-lock-add-keywords 'python-mode (font-lock-width-keyword 80))
(font-lock-add-keywords 'java-mode (font-lock-width-keyword 80))
An easier solution for the 80-column rule is lineker. The usage is pretty simple (tested on my IBM T60p, Emacs for Windows): add the following to your .emacs file.
(require 'lineker)
(add-hook 'c-mode-hook 'lineker-mode)
(add-hook 'c++-mode-hook 'lineker-mode)
GCC Does Not Support Mutable Set/MultiSet Iterator
In GCC's libstdc++, both iterator types of set (and multiset) are typedef'ed to const_iterator:
// _GLIBCXX_RESOLVE_LIB_DEFECTS
// DR 103. set::iterator is required to be modifiable,
// but this allows modification of keys.
typedef typename _Rep_type::const_iterator iterator;
typedef typename _Rep_type::const_iterator const_iterator;
This makes many STL algorithms incompatible with set and multiset. For example, the following code does not compile in GCC 4.0.x:
set<int> myset;  // or multiset<int> myset;
*myset.begin() = 100;  // fails because begin() returns a const_iterator
remove_if(myset.begin(), myset.end(), Is71());  // remove_if invokes remove_copy_if, which requires a mutable myset.begin()
It is notable that Microsoft Visual C++ 7.0 and later versions follow the standard less strictly on this point: the above code compiles with Visual C++.
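If you do need to replace an element under GCC, a portable workaround is to erase the old key and insert the new one; the element type and values below are just illustrative:
std::set<int>::iterator it = myset.find(42);
if (it != myset.end()) {
  myset.erase(it);     // keys in a set are immutable, so replace by erase + insert
  myset.insert(100);
}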
Jun 22, 2009
Fast Approximation of 2D Water Ripples
Jun 21, 2009
Useful Documents for CUDA Development
- Official Programming Guide from NVidia.
- CUDA Programming, a slide deck by Johan Seland.
Make CUDA Work on a MacBook Pro
- Download and install CUDA toolkit and CUDA SDK.
- When installing the CUDA toolkit, click the Customize button on the Installation Type panel of the installer. Then be sure that CUDAKext is selected for installation. If we do not do this, CUDA applications will complain "no CUDA capable device".
- After installing, add the following to .bash_profile:
export PATH=/usr/local/cuda/bin:$PATH
export DYLD_LIBRARY_PATH=/usr/local/cuda/lib:$DYLD_LIBRARY_PATH
- After installing the CUDA SDK,
cd /Developer/CUDA/lib
ranlib *.lib
Otherwise, we will get the following linker error when building CUDA applications:
ld: in ../../lib/libcutil.a, archive has no table of contents
- Build the CUDA sample applications:
cd /Developer/CUDA
make
The resulting application binaries will be installed to /Developer/CUDA/bin/darwin/release.
A Good Xcode/C++/QuickDraw Tutorial
Jun 20, 2009
Mix Intel IPP with OpenCV
Build OpenCV under Mac OS X
- Install Xcode on Mac OS X computer.
- Download OpenCV source package (for Linux) from SourceForge.
- Unpack the source package
- Generate Makefile.in/am:
autoreconf -i --force
- Configure:
./configure --prefix=/usr/local --with-python --with-swig
- Build and test the build result:
make
make check
- Install:
make install
- Build applications:
g++ -o capcam main.cc -I /usr/local/include/opencv -L/usr/local/lib -lcxcore -lcv -lcvaux -lhighgui -lml
Package Management under Mac OS X
sudo port install cmake
MacPorts will download the source package and compile it for you.
Note: To make MacPorts aware of the most recent package list, run the following command regularly:
sudo port -v selfupdate
Note: After installing MacPorts, open a new terminal window, which will pick up the environment variables newly set by the installation program. Using a terminal window that was opened before the MacPorts installation will lead to an error complaining that 'port' cannot be found.
Jun 18, 2009
ActionScript 3.0 Mode for Emacs
(load-file "~/.emacs.d/actionscript-mode.el")
(autoload 'actionscript-mode "javascript" nil t)
(add-to-list 'auto-mode-alist '("\\.as\\'" . actionscript-mode))
Jun 14, 2009
Gmsh: a three-dimensional finite element mesh generator
Jun 13, 2009
Fullscreen Mode of Aquamacs
Robust Ray Intersect with Triangle Test
Alt and Meta in Aquamacs
Jun 10, 2009
Jun 9, 2009
Password-less Login Using SSH
To generate the pair of keys, on mbp, type
ssh-keygen -t rsa
Accept all the default answers, and we get two files:
~/.ssh/id_rsa --- the private key
~/.ssh/id_rsa.pub --- the public key
Now, copy the public key file to tsingyi by typing the following command on mbp:
scp ~/.ssh/id_rsa.pub wyi@tsingyi:/home/wyi/.ssh/id_rsa-mbp.pub
and append the public key of mbp to ~/.ssh/authorized_keys of tsingyi by typing the following command on tsingyi:
cat ~/.ssh/id_rsa-mbp.pub >> ~/.ssh/authorized_keys
Here we are: we should now be able to ssh to tsingyi from mbp without typing a password.
Build OpenGL/GLUT Applications under Mac OS X
Writing Code
A GLUT C/C++ program for Mac OS X should include three header files:
#include <OpenGL/gl.h>   // Header File For The OpenGL32 Library
#include <OpenGL/glu.h>  // Header File For The GLu32 Library
#include <GLUT/glut.h>   // Header File For The GLut Library
Note that the locations of these header files on Mac OS X differ from where they are on Linux and Windows (e.g., GL/glut.h).
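To make this concrete, here is a minimal main.c (valid C and C++) that the gcc command in the next subsection can build; the window size, title, and the empty display callback are arbitrary choices for illustration:
#include <OpenGL/gl.h>
#include <GLUT/glut.h>
static void display(void) {
  glClear(GL_COLOR_BUFFER_BIT);  // clear the window to the background color
  glutSwapBuffers();             // present the (empty) frame
}
int main(int argc, char** argv) {
  glutInit(&argc, argv);
  glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
  glutInitWindowSize(640, 480);
  glutCreateWindow("learning");
  glutDisplayFunc(display);
  glutMainLoop();                // enters the event loop and never returns
  return 0;
}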
Building Using GCC
The command line using GCC to build a program main.c is as follows:
gcc -framework GLUT -framework OpenGL -framework Cocoa main.c -o learning
It is notable here that Mac OS X uses the concept of so-called frameworks. Instead of adding include paths and library names yourself, you add a framework to your compiler call. This is a Mac OS X specific extension to gcc.
Building Using Xcode IDE
We can also manage our OpenGL/GLUT projects using the Xcode IDE. To create a project, select the project type "Cocoa Application". To add or edit the code, remove the auto-generated main.m, add a new main.c, and write our code in main.c. To specify the frameworks in the IDE, right-click the project and choose "Add Existing Frameworks" to add OpenGL and GLUT.
Jun 8, 2009
Open Source Software on Mac OS X
http://www.freemacware.com/
http://www.linuxbeacon.com/doku.php?id=opensourcemac
Jun 7, 2009
Crystallization
- Stochastic and Deterministic Simulation of Nonisothermal Crystallization of Polymers
- The Structure of Crystals.
May 27, 2009
To Get Forward-Word (Alt-f), Backward-Word (Alt-b) and Delete-Word (Alt-d) Working in iTerm
- Open iTerm.
- Go to Bookmarks > Manage Profiles
- Choose Keyboard Profiles on the left and edit the Global Profile
- Next to Mapping, click the + sign.
- For Key, choose hex code.
- In the text box next to hex code, enter 0x62 for b, 0x64 for d, or 0x66 for f. Note that 0x62, 0x64, and 0x66 are the ASCII codes of the characters b, d, and f respectively.
- For Modifier, check the Option box.
- For Action, choose send escape sequence.
- Write b, d or f in the input field.
May 14, 2009
Mar 18, 2009
MATLAB Code for Sampling from a Gaussian Distribution
% Draw N samples from a multivariate Gaussian N(mu, Sigma);
% mu is the mean vector, Sigma the covariance matrix, and each row of M is one sample.
mu = mu(:);
n = length(mu);
[U, D, V] = svd(Sigma);
M = randn(n, N);
M = (U * sqrt(D)) * M + mu * ones(1, N);
M = M';
To Generate Random Numbers from a Dirichlet Distribution
function r = drchrnd(a,n)
% Take n samples from a Dirichlet distribution with parameter vector a;
% each row of r is one sample.
p = length(a);
r = gamrnd(repmat(a,n,1),1,n,p);
r = r ./ repmat(sum(r,2),1,p);
The following is an example that generates three discrete distributions from a symmetric Dirichlet distribution Dir( \theta ; [ 1 1 1 1 ] ):
>> A = drchrnd([1 1 1 1], 3)
A =
0.3889 0.1738 0.0866 0.3507
0.0130 0.0874 0.6416 0.2579
0.0251 0.0105 0.2716 0.6928
>> sum(A, 2)
ans =
1
1
1
Exponential, Power-Law and Log-normal Distributions
- For the standard form of exponential and power-law, refer to Wikipedia.
- For an intuitive introduction to power laws, refer to the Chinese article 幂律分布研究简史 (a brief history of research on power-law distributions).
- For generative models for power-law and log-normal, refer to "A Brief History of Generative Models for Power Law and Lognormal Distributions".
- For how power-law distributions can be generated from random processes, refer to "Producing power-law distributions and damping word frequencies with two-stage language models".
Dirichlet Processes and Nonparametric Bayesian Modelling
Mar 4, 2009
Questions on DP, CRP and PYP
From the definition of the DP as a generalization of the Dirichlet distribution, a draw from a DP should itself be a (discrete) distribution, just as a draw from a Dirichlet gives the parameters of a multinomial.
But what, then, is the relation between the DP and the CRP?
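For reference, the CRP is the distribution over partitions obtained by integrating out a draw G ~ DP(\alpha, G_0) and recording only which observations share an atom ("sit at the same table"); its predictive rule, in the usual notation, is
P(z_i = k \mid z_1, \dots, z_{i-1}) = \frac{n_k}{i - 1 + \alpha}, \qquad P(z_i = \text{new table} \mid z_1, \dots, z_{i-1}) = \frac{\alpha}{i - 1 + \alpha}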
References:
- Producing power-law distributions and damping word frequencies with two-stage language models.
Feb 3, 2009
LDA IR
LDA has been applied to information retrieval and shown to significantly outperform, in terms of precision-recall, alternative methods such as latent semantic analysis.