Yi Qing <qingyi@cs.rice.edu>
www.cs.rice.edu/~qingyi

Transforming Loop Structures

Over the past 20 years, the speed of processors in computer 
systems has increased much faster than the speed of memories. 
To achieve high performance, applications must exploit enough 
locality so that data can be reused before being evicted from 
the cache memories between the processor and main memory. 
Many compiler techniques have been developed to automatically
transform applications to enhance locality.   These
techniques are both effective and efficient when the loop structures
of the applications are simple, preferably perfectly nested loops.
However, on more complicated loop structures, existing techniques 
are either ineffective or require too much computation time to be
practical for a commercial compiler.  

This talk introduces techniques to transform arbitrary loop 
structures both effectively and efficiently. Building on 
traditional {\em unimodular} transformation
techniques on perfectly nested loops, we have developed a 
technique called {\em computation slicing} to re-arrange the loop 
structure of arbitrary nested loops. First, we adopt a
loop transformation model that is not limited by the original 
nesting structure of the program and extend traditional dependence 
representations to model dependence relations between iterations of 
arbitrary loops.  We then apply transitive analysis on the 
dependence graph and compute complete transitive dependence 
information between these loops.  These transitive dependence 
edges are used to decide the feasibility of a set of loop 
transformations independent of their original nesting 
structure.  The original program is then transformed 
directly according to the accumulated information. 
The complexity of a slicing transformation is $N^2 + N*L*D$, 
where $N$ is the number of statements, $L$ is the maximum 
depth of loop nests, $D$ is the size of the dependence 
graph. This complexity is comparable to that of standard 
unimodular loop transformations on perfectly nested loops.

We also developed a  transformation framework which systematically 
applies computation slicing to block arbitrary loop nests for 
better locality. The complexity of this framework is 
$N^2*L + N*L^2*D$.  We believe this framework is efficient 
enough to be used in most commercial compilers. 
We have implemented the framework as a source-to-source 
translator and applied it to successfully transform four 
numerical benchmark kernels: Cholesky, QR, LU factorization 
without pivoting and LU with partial pivoting.
We choose these benchmarks because they are considered very
hard to block automatically. To our knowledge, 
no previous compiler implementation has completely automated 
the blocking of QR and LU with pivoting.  Our implementation 
has successfully blocked all these benchmarks and
achieved performance improvement comparable to those code 
manually blocked by experienced algorithm designers~\cite{LAPACK3}. 
This implies that this framework is in practice comparable in power 
and effectiveness to  most previous general transformation frameworks.