May 23, 2010

The Efficiency of AWK Associative Array

I did a little experiment comparing C++ STL map with AWK associative array in counting word frequency of large text files. The result is astonishing: AWK associative array is about 6 times faster than the C++ code!

At first, I put my eyes on the C++ code that splits a line into words. I tried STL istringstream, C strtok_r, and a string splitting trick that I learned in Google. However, these choices do not affect the efficiency saliently.

Then I realized the lesson I learned from parallel LDA (a machine learning method, which has been a key part of my research for over three years) --- map is about 10 times slower than map. I found that this issue has been thoroughly explained by Lev Kochubeevsky, principle engineer at Netscape, in 2003. Unfortunately, seems no improvement to STL string emerged since then.

On the other hand, I highly suspect that AWK, an interpreted language, implements a trie-based data structure for maps with string-keys.

No comments: