the HELL OF CACHES

the three cache types

Questions:

1. What are some key points I need to understand this section?

2. Which terms do I need to know?

3. What is a Direct Mapped Cache?

4. Okay, then, what's Set-Associative?

5. What's different about a fully associative cache?




1. What are some key points I need to understand this section?

(1) Computer memory is linear. All memory is addressed, stored, accessed, written to, copied, moved, and so on in a linear way. Square bitmaps are stored one scanline after another, rectangular arrays of numbers are stored in memory as a series of linearly appended rows, and so on. Caches work by mapping blocks of slow main RAM into fast cache RAM. This is accomplished by copying the contents into the faster cache.
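To make (1) concrete, here is a minimal C sketch (the dimensions are made up) showing that a "rectangular" bitmap is really one linear run of addresses, one scanline appended after another:

#include <stdio.h>

#define ROWS 4
#define COLS 8

int main(void)
{
    int bitmap[ROWS][COLS];        /* a small "rectangular" bitmap     */
    int *flat = &bitmap[0][0];     /* the same memory, viewed linearly */

    bitmap[2][5] = 42;
    /* Row r, column c lives at linear offset r * COLS + c. */
    printf("%d\n", flat[2 * COLS + 5]);   /* prints 42 */
    return 0;
}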

(2) You usually can't improve anything in computers without giving up a little of something else. If you could, somebody would already have done it and you wouldn't be debating it. For instance, one might ask why direct and set-associative caches (see below) would even need to exist when fully associative caches are so much more flexible. There's a pretty damn good reason (it comes down to how quickly a hit can be detected; see section 5 below), which is why modern CPUs almost exclusively use set-associative and direct mapped caches.

(3) Measures such as "work per cycle" or "instructions per clock" are meaningless for comparing different architectures. It isn't just a matter of cranking the clockspeed up; if it were, they would have done it. (This is relevant to H&P's comparisons of the 200MHz 21064 Alpha AXP [first generation Alpha] to the 70MHz IBM POWER2 [first generation POWER2 chipset].)

(4) All caches are broken into blocks of fixed size. Except as noted, the size of a cache is an integer multiple of its block size. The block size is a design decision. Caches usually contain a power-of-two number of blocks, but this isn't a requirement. Blocks contain some number of words, depending on their size.

(5) Caches are loaded in terms of blocks. A block is the smallest unit that may be copied to or from memory. For small moves (accessing one byte in a block, or writing one byte in a block), copying the whole block wastes some memory bandwidth, but this is a necessary concession to design practicality.
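As a sketch of how this works out in practice, assuming a byte-addressed machine and a 16-byte block size (both invented numbers, not anything H&P mandates), an address splits into a block number and an offset within the block:

#include <stdio.h>

#define BLOCK_SIZE 16u   /* bytes per block: an assumed power-of-two value */

int main(void)
{
    unsigned addr = 0x12345678u;

    unsigned block_number = addr / BLOCK_SIZE;  /* which memory block   */
    unsigned block_offset = addr % BLOCK_SIZE;  /* which byte inside it */

    /* Even a one-byte store dirties / transfers the whole block. */
    printf("address 0x%x -> block %u, offset %u\n",
           addr, block_number, block_offset);
    return 0;
}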




2. Which terms do I need to know?

destination block: the block in the cache or in memory to be written to (e.g., when loading, the destination block is in the cache; when writing a changed cache block back to memory, the destination block is in main memory).

source block: the block in main memory to be copied into the cache, or the block in the cache being written back to main memory.

discard / discarded: a discard is a block that has been flushed or removed from the cache and replaced with a block newly read from memory.

tag: some set of bits attached to a block that define its characteristics (e.g., the address it is currently mapped to, whether it is "dirty" or not, etc.)

thrash: load and unload a cache block repeatedly -- a block is loaded, then some other data that maps to the same block is accessed, so the block is flushed and the new data is loaded, then that is flushed and the original is loaded again, and so on. Thrashing can reduce a cache to the same performance as (or worse than) a system with no cache at all.

"dirty" block: a cache block that has been written to but which has not been copied back to main memory is "dirty." It must be copied back to main memory when that block is discarded.

write through: on a write to memory, memory is written directly (or the write goes into a write buffer, discussed below, but either way the write is issued immediately). If there is a block in the cache for that address, it is updated also.

write back: on a write to memory, if the address being written is mapped by a cache block, that block is updated and marked as dirty. Otherwise the block is either updated in memory directly, or read into the cache, written, and later re-written to memory (see write+allocate and write+noallocate below). If the same block of memory is written repeatedly, a write-back cache can save a lot of memory stalls.
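To make the write-through / write-back distinction concrete, here is a toy single-block C model (all names and sizes are invented; a real cache has many blocks, multi-word blocks, and a replacement policy):

#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 64

static unsigned memory[MEM_WORDS];       /* stand-in for slow main RAM */

struct block {
    bool     valid;
    bool     dirty;
    unsigned addr;                       /* which memory word is cached */
    unsigned data;
};

static struct block cache;               /* a one-word, one-block "cache" */

/* Write-through: the store goes to memory (or a write buffer) immediately;
   the cached copy, if there is one, is just kept in sync. */
static void store_write_through(unsigned addr, unsigned value)
{
    memory[addr] = value;
    if (cache.valid && cache.addr == addr)
        cache.data = value;
}

/* Write-back: the store only touches the cache; memory is updated later,
   when the dirty block is evicted. */
static void store_write_back(unsigned addr, unsigned value)
{
    if (cache.valid && cache.addr == addr) {
        cache.data  = value;
        cache.dirty = true;              /* memory is now stale */
        return;
    }
    if (cache.valid && cache.dirty)      /* evict: write the old block back */
        memory[cache.addr] = cache.data;
    /* Allocate and write.  (A real write-allocate cache with multi-word
       blocks would read the rest of the block from memory first.) */
    cache = (struct block){ true, true, addr, value };
}

int main(void)
{
    store_write_through(3, 111);
    printf("write-through: memory[3] = %u\n", memory[3]);          /* 111 */

    store_write_back(5, 222);
    printf("write-back:    memory[5] = %u (stale until eviction)\n",
           memory[5]);                                              /* 0 */
    return 0;
}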




3. What is a Direct Mapped Cache?

A direct mapped cache of size M blocks maps memory addresses directly into cache blocks. The mapping is arbitrary, but for the context of H&P assume that the mapping method is a modulo operation (the cache block into which the main memory block at address A is placed is given by A % M).
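A minimal C sketch of the placement rule; it just prints the mapping tabulated below:

#include <stdio.h>

int main(void)
{
    const unsigned M = 5;                  /* the 5-block cache used below */
    /* Direct mapped: block address A can live in exactly one cache block,
       namely A % M. */
    for (unsigned a = 0; a <= 7; a++)
        printf("address %u maps to %u\n", a, a % M);
    return 0;
}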

Observation: a direct mapped cache that uses mod to determine the destination block maps memory into the cache _linearly_ (consecutive memory blocks go into consecutive cache blocks). Assume a direct mapped cache of size 5 (in blocks):

	address 0 maps to 0
	address 1 maps to 1
	address 2 maps to 2
	address 3 maps to 3
	address 4 maps to 4
	address 5 maps to 0
	address 6 maps to 1
	address 7 maps to 2

Hmmm. What's interesting about this is that if we want to cache memory that is (typically) going to be accessed linearly, a direct mapped cache is a pretty good choice: any block fetched is likely to be fully used (i.e., all of the words in the block are going to be used), and we can optimize the design to load more than one block at a time (say, two or three) into sequential cache blocks.

What kind of memory is accessed sequentially? INSTRUCTION MEMORY. Except for branches (conditional and unconditional), the stretch of memory occupied by an instruction stream is entirely linear. Even better, it is only accessed in one direction -- you never execute code backwards. A direct mapped cache even favors loops, because the stream of instructions containing the loop must exceed the size of the direct mapped cache before the first block containing the loop code is discarded and replaced.

Imagine that in our example above, each block is four 32-bit words. An instruction in DLX is one 32-bit word. Then a 5-block direct mapped instruction cache could completely hold a loop of 4 * 5 = 20 instructions, and the loop _instructions_ could execute directly from the i-cache without any delays to access memory at all! This is a best-case scenario for a direct mapped cache.

It's not hard to construct a worst-case scenario for a direct mapped cache, though. Imagine we're adding the contents of two arrays together, and one array is at address 1000 and the other one at 15000. What happens if we have a direct mapped data cache?

EVERY BLOCK ACCESSED TO READ THE FIRST ARRAY MAPS INTO THE SAME BLOCK AS EVERY BLOCK ACCESSED TO READ THE SECOND ARRAY!

Just watch what happens:

(remember: these are block addresses, not word addresses, so each block address contains >1 array element)

address 1000  maps to 0
address 15000 maps to 0
address 1001  maps to 1
address 15001 maps to 1
.
.

Now assume that you've got an unrolled loop that reads a bunch of separate elements, adds them together, and then writes the whole bunch back to main memory. If the unrolled loop is naive (for instance, it reads element 1 from each array, then element 2, then element 3, and so on), we have something like the following (the addresses are wrong here):

LW	R1, 1000(R0)	; load element 1, array 1
LW	R2, 15000(R0)	; load element 1, array 2
LW	R3, 1001(R0)	; load element 2, array 1
LW	R4, 15001(R0)	; load element 2, array 2
LW	R5, 1002(R0)	; load element 3, array 1
LW	R6, 15002(R0)	; load element 3, array 2
.
.
.

our memory access pattern is going to end up hitting main memory on EVERY SINGLE LW INSTRUCTION because the same blocks are being read and overwritten again and again. Even a smarter unroll like the following (which takes advantage of the fact that 1 block = many words)

LW	R1, 1000(R0)	; load element 1, array 1
LW	R3, 1001(R0)	; load element 2, array 1
LW	R5, 1002(R0)	; load element 3, array 1
LW	R2, 15000(R0)	; load element 1, array 2
LW	R4, 15001(R0)	; load element 2, array 2
LW	R6, 15002(R0)	; load element 3, array 2
.
.
.

still doesn't help much, though at least you get some benefit from the cache: two of the three loads in each group above can take advantage of it, since consecutive elements from the same array tend to share a block and only the first load of the group has to go to memory.

It gets even worse. Assume an optimized cache fill policy that attempts to load more than one block each time the CPU hits memory (making loads take a little longer but eliminating half of the cache misses if memory is accessed linearly). That might work very well for an Icache, but for this particular case the policy would be even worse than just loading one block, since the extra block would only be overwritten by the loads for the second array before it was ever used. The cache buys you essentially nothing -- your CPU is going to spend almost all of its time waiting on memory.

This kind of example exploits a structural conflict that occurs with direct mapped caches. A direct mapped cache has no "second choice" for where it can place a block from main memory into the cache. It is forced by its definition to put memory block X into cache block Y and no other block. Constructing worst-case examples is only a matter of setting up a situation where two addresses map to the same block. The odds of this for instructions are low, because instruction stream access goes in one direction and is mostly linear, whereas data accesses can be scattered all over memory.
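Here is a toy miss-counting model of the naive interleaved unroll above (block size, cache size, and array placement follow the example, but the code itself is just an illustration, not anything from H&P):

#include <stdio.h>

#define NUM_BLOCKS  5        /* cache size in blocks, as in the example    */
#define BLOCK_WORDS 4        /* words per block, as in the i-cache example */

/* Word addresses chosen so the arrays land on block addresses 1000 and
   15000, matching the text. */
#define ARRAY1 (1000L * BLOCK_WORDS)
#define ARRAY2 (15000L * BLOCK_WORDS)

static long cached[NUM_BLOCKS];   /* block held by each cache block, -1 = empty */

static int access_word(long word_addr)
{
    long block = word_addr / BLOCK_WORDS;
    int  idx   = block % NUM_BLOCKS;
    if (cached[idx] == block)
        return 0;                 /* hit  */
    cached[idx] = block;          /* evict whatever was there and reload */
    return 1;                     /* miss */
}

int main(void)
{
    for (int i = 0; i < NUM_BLOCKS; i++) cached[i] = -1;

    int misses = 0;
    for (long i = 0; i < 100; i++) {          /* naive interleaved unroll */
        misses += access_word(ARRAY1 + i);    /* element i, array 1 */
        misses += access_word(ARRAY2 + i);    /* element i, array 2 */
    }
    /* Every one of the 200 loads misses; with only one array the same loop
       would miss just once per 4-word block. */
    printf("misses: %d / 200\n", misses);
    return 0;
}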

It's also important to note that, from the CPU designer's standpoint, checking whether a particular block from memory is in the cache (i.e., determining whether a miss is about to occur) is trivial -- only one tag needs to be checked.
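In code, the direct mapped hit test is one index computation plus ONE tag comparison (a sketch with an invented cache size):

#include <stdbool.h>
#include <stdio.h>

#define NUM_BLOCKS 256u   /* assumed direct-mapped cache size, in blocks */

struct line {
    bool     valid;
    unsigned tag;         /* which memory block this cache block holds */
};

static struct line cache[NUM_BLOCKS];

static bool is_hit(unsigned block_addr)
{
    unsigned idx = block_addr % NUM_BLOCKS;
    return cache[idx].valid && cache[idx].tag == block_addr;
}

int main(void)
{
    cache[1000u % NUM_BLOCKS] = (struct line){ true, 1000u };
    printf("block 1000: %s\n", is_hit(1000u) ? "hit" : "miss");  /* hit  */
    printf("block 2024: %s\n", is_hit(2024u) ? "hit" : "miss");  /* same index,
                                                   different tag: miss */
    return 0;
}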




4. Okay, then, what's Set-Associative?

A set associative cache of size M blocks is broken into sets, with the same number of blocks allocated to every set. Set associative caches are described as "N-way set associative," where N is the number of blocks per set. An 8-way SA cache with 4 sets has a total of 32 blocks.

An SA cache works similarly to a direct mapped cache on the high level - the set into which the block at memory address A is stored is given by A mod S, where S is the number of sets. That's where the similarity ends -- the destination block is ANY block in the set.

Numerous methods are used to choose the actual destination block from among the blocks in the set (the set itself is determined by the source address). Random replacement is one (take a random block and replace it); sequential replacement is another; others check which block is oldest (a FIFO scheme) or which hasn't been referenced recently (the LRU option). Finally, if any of the blocks are "dirty" it's probably better not to replace them, because that would incur a delay as the block is written back to memory. The actual method used is up to the designers of the chip.
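As an illustration of one of these policies, here is a toy LRU scheme over a single 4-block set, tracked with a simple counter (the sizes and names are assumptions; real hardware typically uses cheaper approximations of LRU):

#include <stdio.h>

#define WAYS 4                     /* blocks per set: an assumed value */

struct way {
    int      valid;
    unsigned tag;                  /* memory block held by this way */
    unsigned last_used;            /* "time" of last reference      */
};

static struct way set[WAYS];       /* one set of a set-associative cache */
static unsigned now;               /* global reference counter           */

/* Return the way to overwrite: an invalid way if one exists, otherwise
   the least recently used one. */
static int choose_victim(void)
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    }
    return victim;
}

static void touch(unsigned block_addr)
{
    now++;
    for (int i = 0; i < WAYS; i++) {            /* already in the set? */
        if (set[i].valid && set[i].tag == block_addr) {
            set[i].last_used = now;             /* refresh its age */
            return;
        }
    }
    int v = choose_victim();                    /* miss: replace the LRU way */
    set[v] = (struct way){ 1, block_addr, now };
}

int main(void)
{
    touch(10); touch(20); touch(30); touch(40); /* fill the set    */
    touch(10);                                  /* re-use block 10 */
    touch(50);                                  /* evicts 20, the LRU block */
    for (int i = 0; i < WAYS; i++)
        printf("way %d holds block %u\n", i, set[i].tag);
    return 0;
}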

Observation: The pathological example given above that renders the direct mapped cache useless is no longer nearly as bad with a set associative cache, because the cache management no longer REQUIRES that a particular cache block (in the first unrolled example above, a block we're going to LW from in the next instruction) be discarded and replaced. Even with random replacement and only two blocks per set, the odds are only 50% that the block we want to stay in the cache will be discarded. That's a lot better than 100%.

You can still construct a pathological case for a set associative cache, but because structural conflicts aren't as common (instead of one possible destination block, there are now several), it's more difficult and less realistic.

Note that a degenerate set-associative cache with 1 block per set and N sets is identical to a direct mapped cache of size N blocks. This should make it clear that the complexity of implementing a set associative cache versus a direct mapped cache is related directly to the hardware that controls block replacement. Checking for a miss now requires checking the tags for all of the blocks in a set.
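For comparison with the single-comparison direct mapped check above, here is a sketch of the N-way hit test (sizes invented); all WAYS tags in the set must be examined, in hardware in parallel, before a hit or miss is known:

#include <stdbool.h>
#include <stdio.h>

#define NUM_SETS 64u         /* assumed number of sets            */
#define WAYS     4           /* blocks per set, i.e. "4-way"      */

struct line {
    bool     valid;
    unsigned tag;
};

static struct line cache[NUM_SETS][WAYS];

static bool is_hit(unsigned block_addr)
{
    unsigned set = block_addr % NUM_SETS;   /* index selects the set...   */
    unsigned tag = block_addr / NUM_SETS;
    for (int w = 0; w < WAYS; w++)          /* ...but the block could be  */
        if (cache[set][w].valid &&          /* in any of its WAYS blocks  */
            cache[set][w].tag == tag)
            return true;
    return false;
}

int main(void)
{
    printf("%s\n", is_hit(12345u) ? "hit" : "miss");   /* empty cache: miss */
    return 0;
}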




5. What's different about a fully associative cache?

A fully associative cache is specified simply as a cache of size N blocks. Any block of memory can be mapped into any cache block. Fully associative caches are thus the most flexible, because they give the replacement policy the most freedom (it has every block in the cache to choose from).

On the other hand, a fully associative cache could be thought of as a set associative cache with a single set and N blocks in that set. As noted above, determining whether a cache miss has occurred (on a read or write) involves checking the tags for all of the blocks in the set. Since this is the entire cache in a fully associative cache, every single block tag must be checked before the CPU knows whether a miss has occurred (or not).

This makes checking a FA cache _quickly_ more difficult. In practice the tag checks can be done in parallel, but for large caches that hardware gets complex, which means a lower clockspeed. At that point you might be better off cranking the clock up, using a set associative or direct mapped cache, enduring a few more misses, and enjoying a lot more clocks when you don't miss.

A worst case scenario for fully associative caches is any program which uses more data than can fit in the cache and exhibits very low locality. This is actually a worst case for ALL caches, since it exploits their limited size. A naive large matrix inversion will exhibit performance so poor that it will resemble the performance of a machine without any cache AT ALL.
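As one illustrative low-locality pattern (a different example from the matrix inversion above, chosen just for this sketch): walking a large row-major matrix down its columns means successive accesses are N elements apart, so nearly every access touches a different block and blocks are evicted before they are ever reused, no matter how the cache is organized:

#include <stdio.h>
#include <stdlib.h>

#define N 4096          /* 4096 x 4096 doubles = 128 MB, far bigger than any cache */

int main(void)
{
    double *m = calloc((size_t)N * N, sizeof *m);
    if (!m) return 1;

    double sum = 0.0;
    /* Column-major walk over a row-major array: successive accesses are
       N * sizeof(double) bytes apart, so there is almost no reuse of any
       block while it sits in the cache. */
    for (long col = 0; col < N; col++)
        for (long row = 0; row < N; row++)
            sum += m[row * N + col];

    printf("%f\n", sum);
    free(m);
    return 0;
}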








body text Copyright 1997 Robert Rodgers; graphics images Copyright 1997 Robert Rodgers; HTML code copyright 1997 Elizabeth Rodgers and Robert Rodgers. The tables in section IV were derived from several graphs and tables in Hennessy and Patterson's COMPUTER ARCHITECTURE: A QUANTITATIVE APPROACH. Permission is granted to provide these pages in their original, unaltered form online for educational purposes. They may not be republished or included on a CDROM, disk or other media or with any published work without the express permission of the copyright holders (this includes FAQ books).