This work was done in close collaboration with Professor Nail A. Gumerov and Professor Ramani Duraiswami. We developed a new distributed FMM algorithm for both single heterogeneous workstations and clusters, which divides the computation between CPU and GPU to achieve high performance. The key idea is that the large but highly parallelizable particle-related computations (direct sums) are assigned to the GPU, while the extensive and complex spatial-box-related computations (translations) are assigned to the CPU. This division takes full advantage of both CPU and GPU hardware architectures and achieves state-of-the-art performance. Using this algorithm, a single workstation with 2 Tesla C1060 GPUs can compute the N-body interactions for 1 million particles in 0.24 sec, and 4 workstations with 8 GPUs can achieve performance comparable to that of the computation that won the 2009 Gordon Bell Prize using 256 GPUs (Hamada et al., 2009). Using Chimera, a 32-node cluster at UMIACS, we can run the N-body computation for up to 1 billion particles.
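
To illustrate why the particle-related part maps well to a GPU, the following is a minimal sketch of a direct sum, assuming a Coulomb-type kernel 1/r. The function name `direct_sum` and its arguments are hypothetical and are not taken from our code. Each target's potential is independent of all the others, so the outer loop can be spread across many GPU threads. In the FMM this sum is evaluated only over nearby boxes, not over all particle pairs as done here.

```python
import math

def direct_sum(sources, charges, targets):
    """Evaluate phi_i = sum_j q_j / |x_i - y_j| for every target x_i.

    Illustrative only: assumes a 1/r kernel and skips coincident points.
    """
    potentials = []
    for tx, ty, tz in targets:          # independent per target: parallelizable
        phi = 0.0
        for (sx, sy, sz), q in zip(sources, charges):
            r = math.sqrt((tx - sx) ** 2 + (ty - sy) ** 2 + (tz - sz) ** 2)
            if r > 0.0:                 # skip self-interaction
                phi += q / r
        potentials.append(phi)
    return potentials

if __name__ == "__main__":
    src = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    q = [1.0, 2.0]
    tgt = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
    print(direct_sum(src, q, tgt))
```

The cost of this loop grows as the number of targets times the number of sources, which is why it is restricted to neighboring boxes and handed to the GPU, while the box-to-box translations are left to the CPU.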

Scalable fast multipole methods on distributed heterogeneous clusters