Mar 27, 2010

Customizing Mac OS X 10.6 For a Linux User

I have been a Linux user for years, and switched to Mac OS X 10.6 Snow Leopard a few months ago. Here is what I have done to make Snow Leopard suit my work habits.
  • Emacs.
    At the time of writing, I prefer Aquamacs 20.1preview5 to the stable 19.x release. Aquamacs comes with many useful Emacs plugins already packaged, including AUCTeX for LaTeX editing.
  • PdfTk.
    Under Unix, Emacs/AUCTeX invokes pdf2dsc (a component in the pdftk package) to do inline preview in PDFLaTeX mode. For Mac OS X, Frédéric Wenzel has kindly created a DMG of PdfTk.
  • LaTeX/PDF Preview.
    Skim is a free PDF viewer for Mac OS X that works like the ActiveDVI viewer under Linux, but displays PDF files instead of DVI. Whenever you edit your LaTeX source and recompile, Skim automatically refreshes the document it is displaying.
  • Terminal.
    Like many others, I use iTerm. To support bash shortcut keys such as Alt-f/b/d, you need to customize iTerm's key mappings, as many Google search results suggest. In particular, remember to select "High interception priority" when you do this customization under Snow Leopard.

Chrome's Security Mechanism

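Today I read several news reports that, at the Pwn2Own 2010 hacking contest, Google Chrome was the only browser left standing among those that were attacked. A quick Google search shows that many hackers attribute Chrome's security to its sandbox mechanism. Out of curiosity, I looked through the documentation of Chromium (the open source project behind Google Chrome) and got a rough idea of the basic principle behind the Chrome sandbox.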

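Chrome starts two kinds of processes, target and broker:
  1. The target process does the work that attackers can most easily influence and abuse, mainly (1) interpreting JavaScript programs and (2) HTML rendering. A target process is created by the broker process, and at creation time the broker strips it of all access rights. In other words, although the target can still issue system calls to ask the kernel to do things, like any user-mode process, the kernel considers it unprivileged and refuses. [Note: in a modern operating system, a user process can access any system resource only by asking the kernel to do so on its behalf.] So in practice the target can only get things done by asking the broker through inter-process calls.

  2. The broker process plays the role of the operating system kernel. Because the code it runs was written by the browser's authors and is hard for an attacker to inject malicious code into, we can rely on it to check whether what the target asks for is reasonable, and to refuse if it is not.
In short, Chrome's sandbox replicates the operating system's two-level security concept: the user process (target) holds no real power, and the real power stays with the kernel (broker). In effect, this installs a second lock: when misconfiguration of the operating system by users or third-party software defeats the OS's own security mechanisms, the second lock shows its value.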
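
The broker/target split described above can be illustrated with a toy model. This is only a sketch of the idea, not Chromium's actual code or API: the class names `Broker` and `Target`, the request format, and the in-memory "file system" are all hypothetical, and a plain function call stands in for the inter-process channel.

```python
# Toy model of the broker/target split. All names are hypothetical;
# this is not Chromium's API, and a function call stands in for IPC.
from dataclasses import dataclass, field


@dataclass
class Broker:
    """Privileged side: owns the real resources and enforces a policy."""
    allowed_paths: set = field(default_factory=set)
    files: dict = field(default_factory=dict)

    def handle(self, request: dict) -> dict:
        # Every request from a target is checked against the policy first.
        if request.get("op") != "read":
            return {"ok": False, "error": "operation not permitted"}
        path = request.get("path")
        if path not in self.allowed_paths:
            return {"ok": False, "error": "path not permitted"}
        return {"ok": True, "data": self.files.get(path, "")}


class Target:
    """Unprivileged side: has no resources, can only ask the broker."""

    def __init__(self, ipc):
        self._ipc = ipc  # stands in for an inter-process channel

    def read(self, path: str) -> dict:
        return self._ipc({"op": "read", "path": path})


broker = Broker(
    allowed_paths={"/cache/page.html"},
    files={"/cache/page.html": "<html></html>", "/etc/passwd": "secret"},
)
target = Target(broker.handle)

print(target.read("/cache/page.html"))  # granted by the policy
print(target.read("/etc/passwd"))       # refused by the policy
```

The point of the design is that even if the target is fully compromised, it holds nothing worth stealing; everything it wants must pass through the broker's check.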

Mar 26, 2010

Data-Intensive Text Processing with MapReduce

A draft of the book Data-Intensive Text Processing with MapReduce, on parallel text algorithms with MapReduce, can be found here. The book has chapters covering graph algorithms (breadth-first traversal and PageRank) and learning HMMs using EM. The authors do a great job of presenting concepts with figures, which are comprehensive and intuitive.

Indeed, there is plenty of other interesting material that could go into a book on MapReduce text processing algorithms: for example, parallel latent topic models such as latent Dirichlet allocation, and tree pruning/learning algorithms for various purposes.

Stochastic Gradient Tree Boosting

The basic idea of treating boosting as functional gradient descent, with trees as the stages/steps, is known as gradient boosting and is presented in a paper from Stanford:
  • Jerome Friedman. Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics. 2001
The same author wrote a note on extending gradient boosting into its stochastic version, stochastic gradient boosting:
  • Jerome Friedman. Stochastic Gradient Boosting. 1999.
(Stochastic) gradient boosting uses regression/classification trees as base learners, so it needs to learn trees during training. If you are interested in distributed learning of trees using MapReduce, you might want to refer to a recent Google paper:
  • PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce. VLDB 2009
A recent Yahoo paper shows how to implement stochastic boosted decision trees using MPI and Hadoop:
  • Stochastic Gradient Boosted Distributed Decision Trees. CIKM 2009
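
To make the idea concrete, here is a minimal single-machine sketch of stochastic gradient boosting for squared-error loss, where the negative gradient is simply the residual. It is not the algorithm from any of the papers above: it uses depth-1 trees (stumps) on one feature, and the function names and the parameters `n_stages`, `learning_rate`, and `subsample` are my own choices for illustration.

```python
import random


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree to the residuals by least squares."""
    values = sorted(set(xs))
    best = None
    # Candidate thresholds are midpoints between consecutive distinct values.
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    if best is None:  # all x identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def boost(xs, ys, n_stages=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with stumps and squared-error loss."""
    rng = random.Random(seed)
    n = len(xs)
    k = max(2, int(subsample * n))
    f0 = sum(ys) / n            # stage 0: the best constant model
    preds = [f0] * n            # current model output on the training set
    stumps = []
    for _ in range(n_stages):
        # The "stochastic" part: each stage sees a random subsample only.
        idx = rng.sample(range(n), k)
        # For squared error, the negative gradient is the residual y - F(x).
        stump = fit_stump([xs[i] for i in idx],
                          [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(s(x) for s in stumps)


# Toy data: a step function on [0, 4).
xs = [i / 10 for i in range(40)]
ys = [1.0 if x < 2 else 3.0 for x in xs]
model = boost(xs, ys)
print(model(0.5), model(3.5))
```

Setting `subsample=1.0` recovers plain (non-stochastic) gradient boosting; the distributed variants in the papers above differ mainly in how the tree fitting step is parallelized.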

Mar 24, 2010

Fwd: Ten Commands Every Linux Developer Should Know

I like this article in Linux Journal, which reveals some very useful Linux commands that I had never used in my years of experience with Unix systems.

Mar 15, 2010

Some Interesting Data Sets