May 18, 2010

Hadoop Pipes Is Incompatible with Protocol Buffers

I just found another reason that I do not like Hadoop Pipes --- I cannot use the serialization of a Google protocol buffer as a map output key or value.

For those of you scratching your heads over weird bugs in Hadoop Pipes programs that use Google protocol buffers, please have a look at the following sample program:

#include <string>
#include <hadoop/Pipes.hh>
#include <hadoop/TemplateFactory.hh>
#include <hadoop/StringUtils.hh>

using namespace std;

class LearnMapOutputMapper: public HadoopPipes::Mapper {
 public:
  LearnMapOutputMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // Emit a value containing a newline, a null character, and a tab.
    context.emit("", "apple\norange\0banana\tpapaya");
  }
};

class LearnMapOutputReducer: public HadoopPipes::Reducer {
 public:
  LearnMapOutputReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    while (context.nextValue()) {
      string value = context.getInputValue();  // Copy the value content.
      // Emit the byte length of each map output value.
      context.emit(context.getInputKey(),
                   HadoopUtils::toString(value.size()));
    }
  }
};

int main(int argc, char *argv[]) {
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<LearnMapOutputMapper,
                                   LearnMapOutputReducer>());
}

The reducer outputs the size of each map output value. The value emitted by the mapper contains three special characters: a newline, a null character, and a tab. If Hadoop Pipes allowed such special characters, we would see the reducer output 26, the length of the string
"apple\norange\0banana\tpapaya".

However, unfortunately, we see 12 in the output, which is the length of the string
"apple\norange".

This shows that map outputs in Hadoop Pipes cannot contain the null character, which, however, may well appear in the serialization of a protocol buffer, as explained in the protocol buffer encoding documentation:
http://code.google.com/apis/protocolbuffers/docs/encoding.html
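
To see why null bytes are unavoidable, consider the wire format: a varint-encoded field produces a 0x00 byte whenever a seven-bit chunk is all zeros. Here is a minimal sketch in plain C++; the two-byte encoding below is for a hypothetical message with a single field, int32 id = 1, set to 0, so no protobuf library is needed to spell out the bytes:

#include <iostream>
#include <string>

int main() {
  // Wire encoding of the hypothetical field `int32 id = 1;` with value 0:
  // tag byte 0x08 = (field number 1 << 3) | wire type 0 (varint),
  // followed by the varint 0x00, which is a null character.
  const char encoded[] = {0x08, 0x00};
  std::string serialized(encoded, sizeof(encoded));

  std::cout << serialized.size() << std::endl;  // 2: both bytes kept.
  // Converting through a C string truncates at the null byte:
  std::cout << std::string(serialized.c_str()).size() << std::endl;  // 1
  return 0;
}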

I hate Hadoop Pipes, a totally incomplete but released MapReduce API.

2 comments:

Christopher Smith said...

Actually, this isn't Hadoop's fault. This is a bug in your code.

In your mapper you have the line:

context.emit("", "apple\norange\0banana\tpapaya");

The signature for the emit method is:

TaskContext::emit(const std::string&, const std::string&);

Since you pass the value as a string literal, C++ must convert it to an std::string. Following C++'s implicit conversion rules, it does so using std::string::string(const char*), which, as defined by the standard, looks for the first occurrence of a null character to determine the length of the string. So, when emit gets called, it is being invoked with an std::string whose last character is the 'e' in 'orange'. Not Hadoop's fault at all. You'd get the same result if you passed that literal to any function that expects an std::string (or, for that matter, pretty much any function where all you passed was that literal). Try this:

#include <iostream>
#include <string>

void print(const std::string& aString) {
  std::cout << aString;
}

int main(int argc, char** argv) {
  print("apple\norange\0banana\tpapaya");
  std::cout << std::endl;
}

You'll get the same disappointing result.

If you change your code as follows it'll work fine:

void map(HadoopPipes::MapContext& context) {
  const char valueString[] = "apple\norange\0banana\tpapaya";
  context.emit("", std::string(valueString, sizeof(valueString) - 1));
}

All this does is make sure you hand emit an std::string constructed with an explicit length, so the embedded null byte is preserved instead of being treated as a terminator.
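
The same principle applies when emitting a protocol buffer serialization: SerializeToString fills an std::string with an explicit length, so embedded null bytes survive as long as you never pass the data around as a bare const char*. A sketch, assuming a hypothetical generated message class Record with a string field name:

void map(HadoopPipes::MapContext& context) {
  Record record;  // Hypothetical protobuf-generated class.
  record.set_name("papaya");

  std::string serialized;
  record.SerializeToString(&serialized);  // Length-aware; may contain '\0'.

  // std::string carries its own length, so emit sees every byte.
  context.emit("", serialized);
}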

Yi Wang said...

Hi Christopher,

Many thanks for pointing out my mistake and posting the solution here! I have corrected my post following your comment. The updated post is here. I moved my blog from Blogger to WordPress, since the latter provides convenient code and LaTeX insertion.