the HELL OF CACHES

cache management issues

Questions:

1. Why do caches miss?

2. What is a write miss? What if we're writing to a block not in the cache?

3. Is there a simpler way to handle write misses?




1. Why do caches miss?

Misses fall into three categories.

(1) Compulsory - this is the common sense miss. If nothing in the block containing the address has ever been referenced, the block has never had any reason to be loaded into the cache. It's not in the cache, so you miss. Easy. The first access to any address in a block results in a compulsory miss (unless the block is prefetched -- for prefetching, see IV, below).

(2) Capacity - another common sense miss. If you have more data than can fit in the cache, you're going to miss, because whatever didn't fit in the cache is (obviously) not in the cache when you go to access it.

Capacity misses are often induced by naive programmers who think they know more than the compiler. The same ideas behind loop unrolling at the assembly level (with respect to registers) apply to complex loops. With a complex loop you want all of the code to fit in the L2 cache so that the loop runs as fast as possible (e.g., you're writing the inner loop of a 3D texture mapper for a video game), but you also want to get as much done as you can by merging and unrolling loops. This works until you exceed the size of your cache and start to thrash. Then your performance falls through the floor.

The same kind of thing can occur with a look-up table (LUT). It can be a lot faster to use a LUT for cosine and sine than to call the cos and sin functions, but a large LUT will occupy a lot of the cache, and performance can end up worse than if you had just used sin and cos, since you'll incur memory stalls from filling and refilling the cache to replace the blocks pushed out by the LUT.
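To make the LUT example concrete, here is a minimal C sketch. LUT_BITS is a made-up knob, not a number from any real system: at 8 bits the table is 1 KB of floats and sits comfortably in L1; crank it to 18 bits and it is 1 MB, big enough to shoulder everything else out of a typical L2.

/* Sine via look-up table vs. the library call. LUT_BITS is an assumed,
   illustrative parameter -- tune it and measure on your own machine. */
#include <math.h>
#include <stdio.h>

#define PI       3.14159265358979f
#define LUT_BITS 8                   /* 8 -> 1 KB table; 18 -> 1 MB */
#define LUT_SIZE (1 << LUT_BITS)

static float sine_lut[LUT_SIZE];

static void init_lut(void)
{
    for (int i = 0; i < LUT_SIZE; i++)
        sine_lut[i] = sinf(2.0f * PI * (float)i / LUT_SIZE);
}

/* Map x in [0, 2*pi) to a table index; the bitmask wraps out-of-range
   indices instead of range-checking, as a real inner loop would. */
static float lut_sin(float x)
{
    unsigned idx = (unsigned)(x * (LUT_SIZE / (2.0f * PI))) & (LUT_SIZE - 1);
    return sine_lut[idx];
}

int main(void)
{
    init_lut();
    printf("lut_sin(1.0) = %f   sinf(1.0) = %f\n", lut_sin(1.0f), sinf(1.0f));
    return 0;
}

Whether the table or the library call wins depends on how much of the cache the rest of your inner loop needs, so measure before committing to either.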

(3) Conflict - this is the only cause of cache misses that isn't common sense. Conflict misses arise when the cache's block replacement policy has discarded a block that is now being referenced again, so it must be loaded from main memory as if it had never been cached. The section on cache types (above) gives examples of the structural conflicts that induce worst-case performance for direct-mapped and set-associative caches.
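Here is a hedged C sketch of the effect, assuming a direct-mapped cache of CACHE_BYTES (an invented size; real caches vary). a[0] and b[0] sit exactly CACHE_BYTES apart, so they map to the same cache set and evict each other on every access, even though the rest of the cache sits idle.

/* Two addresses one cache-size apart ping-pong the same set in a
   direct-mapped cache. CACHE_BYTES is an assumption for illustration. */
#include <stdio.h>

#define CACHE_BYTES (64 * 1024)

static volatile char pool[2 * CACHE_BYTES];

int main(void)
{
    volatile char *a = pool;                /* maps to set k      */
    volatile char *b = pool + CACHE_BYTES;  /* maps to set k, too */
    long sum = 0;
    for (long rep = 0; rep < 10000000; rep++) {
        sum += a[0];    /* load a's block, evicting b's */
        sum += b[0];    /* load b's block, evicting a's */
    }
    printf("%ld\n", sum);
    return 0;
}

Padding one of the arrays by a single block changes the mapping and makes the misses vanish, which is part of what makes this failure mode so unnerving to debug.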




2. What is a write miss? What if we're writing to a block not in the cache?

There are two ways to manage writes to memory for a block that is not in the cache. They are

write+allocate: the block is fetched from memory into the cache before being written (read -> write -> store to memory)

write+noallocate: = write around. The block is modified directly in the lower level without first being loaded into the cache (just store to memory)

The two policies exist because a block's contents may not be in the cache when an address in it is written. To support write+noallocate, the CPU must be able to write to an arbitrary address in memory, which means the hardware that moves data into and out of memory must support multiple transfer sizes (blocks, words, and any other size the CPU supports). Write+noallocate requires this kind of flexibility, and the added complexity may carry a clockspeed penalty. That means the simpler approach (write+allocate) may actually be faster, because it only assumes an optimized block transfer between memory and the cache (and since memory copies are extremely common, that is an already-optimized path).
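As a sketch of the difference, here is a toy one-line, write-through cache in C. Every name and size is an illustrative assumption, not a description of any real controller.

#include <stdbool.h>
#include <string.h>

#define BLOCK_SIZE 64

static unsigned char memory[1 << 16];   /* toy "main memory" */

static struct {
    bool          valid;
    unsigned      block;                /* which block the line holds */
    unsigned char data[BLOCK_SIZE];
} line;                                 /* a one-line cache */

static void write_byte(unsigned addr, unsigned char v, bool write_allocate)
{
    unsigned block = addr / BLOCK_SIZE;
    if (line.valid && line.block == block) {    /* write hit */
        line.data[addr % BLOCK_SIZE] = v;
        memory[addr] = v;                       /* write through */
        return;
    }
    if (write_allocate) {
        /* write+allocate: fetch the whole block, then write into it
           (read -> write -> store to memory) */
        memcpy(line.data, &memory[block * BLOCK_SIZE], BLOCK_SIZE);
        line.valid = true;
        line.block = block;
        line.data[addr % BLOCK_SIZE] = v;
        memory[addr] = v;
    } else {
        /* write+noallocate: store straight to memory, cache untouched */
        memory[addr] = v;
    }
}

Note that the write+noallocate path is a single small store while the write+allocate path is one optimized block transfer plus a store; which is cheaper to build in hardware is exactly the trade-off described above.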




3. Is there a simpler way to handle write misses?

Another option when a cache block is being written to memory is to add a buffer to the write logic. This way one or two blocks can be stuffed into the write buffer, which can then initiate the transfer to memory without halting the CPU.

The main problem with write buffers is that they complicate cache management -- suppose the destination of a block being written corresponds to a cache block that has been discarded from the cache. Further suppose that a load instruction then causes a cache miss for an address in that block. Since the block is no longer in the cache, we have a read miss, and the cache control logic will attempt to load the block from main memory. However, even though it's no longer in the cache, the block is still sitting in the write buffer, and a load serviced from memory before the buffered write completes would return stale data. This is a RAW (read-after-write) hazard.

Options:

(1) flush all write buffers before servicing reads

(2) search the write buffers after searching the cache (sketched below)
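Here is a minimal sketch of what (2) amounts to, with invented names and sizes: on a read miss, the controller snoops the write buffer for a pending store to the missing block before going to main memory.

#include <stdbool.h>

#define BLOCK_SIZE 64
#define WB_ENTRIES 2                    /* "one or two blocks", per above */

static struct {
    bool          valid;
    unsigned      block;                /* block awaiting write to memory */
    unsigned char data[BLOCK_SIZE];
} write_buffer[WB_ENTRIES];

/* Returns the buffered data if a pending write matches the missing
   block, else 0 (NULL); the caller falls back to main memory on 0.
   This lookup is what resolves the RAW hazard. */
static const unsigned char *wb_lookup(unsigned block)
{
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].block == block)
            return write_buffer[i].data;
    return 0;
}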

(1) is sub-optimal. For one thing, if you're going to flush every time you read, the only time a write buffer benefits you is when reads and writes are not back-to-back -- which is precisely the situation where a write buffer helps least. Worse, it makes it even more difficult to implement coherent caches in a shared-memory multiprocessor (e.g., a 4-CPU Alpha workstation, or a dual-processor Pentium machine running Windows NT, and so on), since the cache coherency logic has to implement either (1) or (2) for the buffers as well. If the CPUs implement (1), then a read by ANY CPU in the system (e.g., any of the four in a four-CPU machine) forces ALL of the CPUs in the system to flush their buffers. If the CPUs implement (2), the cache synchronization logic is just that little bit more complex and...

(everyone together now)

This added complexity might hurt your clockspeed. It may be better just to stick with the simple solution and use the extra cycles it probably buys you.








Body text Copyright 1997 Robert Rodgers. Graphics images Copyright 1997 Robert Rodgers. HTML code Copyright 1997 Elizabeth Rodgers and Robert Rodgers. The tables in section IV were derived from several graphs and tables in Hennessy and Patterson's COMPUTER ARCHITECTURE: A QUANTITATIVE APPROACH. Permission is granted to provide these pages in their original, unaltered form online for educational purposes. They may not be republished or included on a CD-ROM, disk or other media or with any published work without the express permission of the copyright holders (this includes FAQ books).