Dec 4, 2008

The Highly Efficient AllReduce in MPICH2

The MPI specification defines a parallel computing API known as AllReduce. In MPICH2, the most famous implementation of MPI, AllReduce is implemented using a very smart algorithm. Thanks to Wenyen who explained this algorithm clearly in few minutes for me. To the right is the figure drawn by Wenyen on a white board.

Suppose we have n computers, and each computer has a data block with n cells. Denote the j-th data cell on the i-th computer by Dij, and all data cells on computer i by Di={Di1,...,Din}. The goal of AllReduce is to make each of the j-th computer has R(D)=\sum_i Di.

The AllReduce algorithm implemented in MPICH2 consists of two stages: (1) Reduce-scatter and (2) All-gather. Reduce-scatter makes computer i has R(Di)=\sum_j Dij. In all-gather stage, computers exchange their partial result to make each computer have R(D)={R(D1), ..., R(Dn)}.

  1. Rajeev Thakur, Rolf Rabenseifner, and William Gropp, Optimization of Collective Communication Operations in MPICH, Int'l Journal of High Performance Computing Applications, (19)1:49-66, Spring 2005.


Wen-Yen said...

Niu bee! I don't even know you have a blog on blogger. I was trying to maintain blog but couldn't continue on the writing cuz I'm lazy :p

Yi Wang said...

I just feel too much ideas in my head. Need to write them down and choose which is good to be research topics. :-)