Scalable fast multipole methods on distributed heterogeneous clusters

This work was done in collaboration with Professor Nail A. Gumerov and Professor Ramani Duraiswami. We developed a new distributed FMM algorithm for both single heterogeneous workstations and clusters, dividing the computation between the CPU and the GPU to achieve high performance. The key idea of our algorithm is this:
the massive but highly parallelizable particle-related computations (direct sums) are assigned to the GPU, while the extensive and complex spatial-box-related computations (translations) are assigned to the CPU. This division takes advantage of the strengths of both the CPU and GPU hardware architectures and achieves state-of-the-art performance. Using this algorithm, a single workstation with 2 Tesla C1060 GPUs can compute 1 million N-body interactions in 0.24 s, and 4 workstations with 8 GPUs achieve performance comparable to that of the 2009 Gordon Bell Prize winner, which used 256 GPUs (Hamada et al., 2009). Using Chimera, a 32-node cluster at UMIACS, we can run the N-body computation for up to 1 billion particles.
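To illustrate why the direct sum suits the GPU, here is a minimal sketch of that step in plain Python. It is not our implementation: it assumes the Laplace (1/r) kernel, runs on the CPU, and the function name `direct_sum` is hypothetical. The point is only that the potential at each target is computed independently of every other target, so each one can be handed to its own GPU thread.

```python
import math


def direct_sum(targets, sources, charges):
    """Evaluate phi(y_i) = sum_j q_j / |y_i - x_j| by brute force.

    Assumes the 1/r (Laplace) kernel. Coincident source/target pairs
    (r == 0) are skipped so a particle does not interact with itself.
    """
    potentials = []
    for y in targets:  # each target is independent: one GPU thread each
        total = 0.0
        for x, q in zip(sources, charges):
            r = math.dist(y, x)
            if r > 0.0:
                total += q / r
        potentials.append(total)
    return potentials


if __name__ == "__main__":
    points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    charges = [1.0, 3.0]
    # Each particle feels only the other one, at distance 2.
    print(direct_sum(points, points, charges))  # prints [1.5, 0.5]
```

The cost is proportional to the number of targets times the number of sources. This is why, in the FMM, the direct sum is restricted to nearby particles and the rest of the interaction is handled through the box translations.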
Our paper:
The details of our algorithm and its performance evaluation can be found in this paper, which will be presented at SC11 in Seattle. The paper is one of the four SC11 best student paper finalists.