Yi Qing
www.cs.rice.edu/~qingyi
Transforming Loop Structures
Over the past 20 years, the speed of processors in computer
systems has increased much faster than the speed of memories.
To achieve high performance, applications must exploit enough
locality so that data can be reused before being evicted from
the cache memories between the processor and main memory.
Many compiler techniques have been developed to automatically
transform applications to enhance locality. These
techniques are both effective and efficient when the loop structures
of the applications are simple, preferably perfectly nested loops.
However, on more complicated loop structures, existing techniques
are either ineffective or require too much computation time to be
practical for a commercial compiler.
This talk introduces techniques to transform arbitrary loop
structures both effectively and efficiently. Building on
traditional {\em unimodular} transformation
techniques on perfectly nested loops, we have developed a
technique called {\em computation slicing} to re-arrange the loop
structure of arbitrary nested loops. First, we adopt a
loop transformation model that is not limited by the original
nesting structure of the program and extend traditional dependence
representations to model dependence relations between iterations of
arbitrary loops. We then apply transitive analysis on the
dependence graph and compute complete transitive dependence
information between these loops. These transitive dependence
edges are used to decide the feasibility of a set of loop
transformations independent of their original nesting
structure. The original program is then transformed
directly according to the accumulated information.
The complexity of a slicing transformation is $N^2 + N*L*D$,
where $N$ is the number of statements, $L$ is the maximum
depth of loop nests, $D$ is the size of the dependence
graph. This complexity is comparable to that of standard
unimodular loop transformations on perfectly nested loops.
We also developed a transformation framework which systematically
applies computation slicing to block arbitrary loop nests for
better locality. The complexity of this framework is
$N^2*L + N*L^2*D$. We believe this framework is efficient
enough to be used in most commercial compilers.
We have implemented the framework as a source-to-source
translator and applied it to successfully transform four
numerical benchmark kernels: Cholesky, QR, LU factorization
without pivoting and LU with partial pivoting.
We choose these benchmarks because they are considered very
hard to block automatically. To our knowledge,
no previous compiler implementation has completely automated
the blocking of QR and LU with pivoting. Our implementation
has successfully blocked all these benchmarks and
achieved performance improvement comparable to those code
manually blocked by experienced algorithm designers~\cite{LAPACK3}.
This implies that this framework is in practice comparable in power
and effectiveness to most previous general transformation frameworks.