Introduction to Caches

Introduction

In the simplest model of a computer, there is a CPU, and there is physical memory (RAM) and there is a bus to connect the two. Over the years, computers have become more sophisticated, and implementation of memory systems has also become more sophisticated.

In particular, RISC machines became popular. Even though Intel processors dominate the market, they use a lot of RISC ideas. Recall that RISC machine have the following features:

Many general-purpose registers (32 or more)
All instructions have same number of bits, i.e. equal width instructions (32 bits is typical)
One clock cycle per instruction (if possible)
Fewer kinds of instructions than CISC
Faster clock speeds than CISC
Load-store architecture (all instructions, except load-store, use registers as operands. Load-store are only instructions to access memory)

Basically, in RISC, some complex instructions were eliminated in favor of running several simpler instructions that did the same task. Initially, RISC used as few instructions as they could get away with. Some academic ISAs had only about a dozen instructions.

Over time, people realized this was not a reasonable way to develop an ISA. The decision about which instructions to keep and which instructions to get rid of were based on benchmark programs. If simulations showed that including an instruction in the ISA would improve performance (speed) in many typical programs (i.e., benchmarks), then the instruction was kept in the ISA. Otherwise, it was discarded.

As a result, a typical RISC program might contain many more instructions than its equivalent CISC program, since it may take more RISC instructions to run the equivalent CISC code. (Admittedly, it was expected that this would happen infrequently).

How can RISC programs possibly be faster? How can running more RISC instructions be faster than running fewer CISC instructions? Several ways. First, if you design a CPU that has a small set of instructions, then it's easier to design and optimize the hardware. You can then run the CPU at a faster clock rate.

The CPU may run more instructions, but each instruction will run quicker because of the higher clock rate. The hope is that the slowdown from executing more instructions is more than offset by running each instruction more quickly.

Second, some CISC instructions run in many clock cycles. For example, it may take many clock cycles to run an integer divide. CISC instructions may not always run in one clock cycle, while most RISC instructions attempt to run in one clock cycle (it doesn't always succeed either, but it succeeds more often than for CISC instructions).

Third, if RISC is to be successful, then memory has to be cheap and quick. A bigger program uses up more memory, and you need to be able to access instructions that must faster. It's not surprising that RISC became a viable way to design a CPU when memory costs started to decline. When memory was expensive, the key consideration was to keep total program lengths as short as could be managed, even if it meant the instructions took longer to execute on the CPU.

Even cheap memory is not enough. Memory (RAM) is often the bottleneck when executing code. This is an important fact to remember. You may have, say, 2 GHz machines, but typical RAM can not retrieve instructions from memory at that speed.

A solution is to use small, but fast memories. For example, accessing registers is very quick, while accessing RAM is comparatively slow. This suggests that if we want speed, we need more registers, and hope that most data stays in the registers, and that we rarely need to access RAM.

This is not a reasonable assumption. Registers simply don't hold enough memory.

However, if we can use small, fast memory---this is going to be called cache, then this memory is faster than RAM, but slower than registers, and we can keep the most commonly used data (or instructions) in the cache. There's going to more data in cache (typically 512 K or so) than in registers (typically, 128 bytes).

There's a basic principle about memory. If you want lots of memory, it will be cheap (i.e., it doesn't cost much) and it's slow (to access data). If you want very fast memory, it will be small (i.e., hold little data) and expensive (per byte of memory). Historically, there used to be two levels in the memory hierarchy: registers (very fast, but very few of them) and RAM (slow, but lots of it).

A Range of Speed and Cost

Normally, one thinks of memory in three levels: registers, RAM (physical memory) and disk. This forms part of the memory hierarchy.

In the last 20 years or so (since about 1990), cache has become increasingly important. It lies between registers and physical memory in terms of size and speed. From slowest to fastest: disk, RAM, cache, registers. Each one is faster, yet smaller.

On the left are registers. They give you the fastest access time. Currently, access times for a register is about

Intuitive Example of Caching

Consider the following analogy of caching. You are planning to write a very long paper on the history of mythology of many different cultures. You start with ancient Babylonian myths, and work your way to Greek, Roman, Norse, Native American, Asian, etc. myths. You've divided your paper so that each chapter covers one culture's mythology. Thus, at any one time, you are studying one mythology.

You have a computer on your desk, and get your books online. You can download only so many books because your computer is limited in memory. The advantage of downloading the books onto your own computer is so that you can access the material quickly. Downloading the books is quite slow, so you hope not to do it very often.

If it weren't for the fact that you collect related books and write on those books for each chapter (say, 10 books on Greek mythology, then 10 books on Norse mythology), you might be forever downloading books. Even though the majority of books remain at the central library site, the books you need are on your computer, which is quick to access.

What happens if you have already downloaded 10 books (assuming they all take the same amount of space), and you want an 11th book? You will need to delete one of your books off your computer, to make room for the next book.

Which one should you pick? You could pick the one that you haven't read in the longest time. That policy is called "least recently used". Or you could see which book you've had on your computer the longest time. That policy is called "first in first out". Or perhaps the one you've looked at least. That's called "least frequently used".

In any case, you use some policy to decide which book to get rid of to make space for the new book.

An Improved Analogy

While the previous analogy works pretty well, it's not very accurate. The real problem with the analogy is that it doesn't accurately portray the penalty involved when the data is NOT found in the cache. Let's modify the analogy to make it more accurate.

Let's assume your computer can still store 10 books. There are two sites you can access books. However, there's the "Virtual Central Library" (VCL), which has any book that you might want. There's also a "Local Library" (LL).

Whenever you want to access a book, you will look in the Local Library (LL) first. If it's at the LL, you will copy the book to your own computer. If it's not there, the LL will contact the Virtual Central Library (VCL). The LL will copy the book to its own library (possibly removing a book from the LL to make space), and then you will copy the book from the LL to your own computer.

The LL is much smaller than the VCL, but can store more books than you can on your computer.

Thus, the LL serves as a "middle man". If the book you want is in the LL, you access it much faster than if you have to download it from the Virtual Central Library (VCL). For example, let's say it takes 1 second to access the LL, and 100 seconds to access the VCL.

However, when you are unable to find the book in the LL, you spend 100 seconds getting a copy from the VCL to the LL, and one more second to get it from the LL to your local computer. Thus, it's actually slower to use the LL (not by much) if the book you're looking for is not there. This is where we get a more accurate representation for cache.

Cache works very similarly. Basically, you want some data (or an instruction) at some address. This data either appears in the cache (which is like the LL) or it isn't. If it's not there, you need to access the data from RAM (which is like the VCL). This data is then copied to the cache, and then from the cache to the registers. Thus, the registers act like your local computer.

In fact, cache has become so useful, most modern CPUS often have two levels of cache. They are called L1 and L2 caches for level 1 and level 2 cache.

Here's how to visualize an L2 cache. Imagine a "State Library" (SL), which is bigger than the LL, but smaller than the VCL.

Every book in the LL is in the State Library (SL). However, the SL has more books than the LL. Every book the SL is also in the VCL. Thus, LL is a subset of SL, which is a subset of VCL.

To access a book, you go to the LL. If it's not there, you check the SL. If it's not there, you go to the VCL. That data gets copied to the SL, then back to the LL, and finally to your computer.

In principle, you can carry this idea of caching many, many levels.

Each level is a subset of the next level. The higher the level, the more memory, and the slower it gets. For example, we could label the levels of memory as L1, L2, L3, etc. where all the data in level L_i is a subset of that in level L_i+1.

When you look for data at a particular address, you search for it at the smallest level, and if you can't find it there, you look at the next level. Once you find the data you are looking for, you copy it down to the lower levels, possibly getting rid of a cache line in the process.

Definition A cache line is the "unit" of data you transfer to a cache. Typically, this number is a power of 2 (say, at least, 16 bytes).

Other Forms of Caching

The concept of caching occurs everywhere in the computer world. Anytime you have to access data from a location that is slow, you may want to cache a copy at a location which allows you to access the data faster.

For example, suppose you visit a particular webpage at a website. Initially, that webpage may be slow to download. But, when you visit another webpage and come back, it seems much faster "download" the old webpage. Why is that?

A local copy of the webpage is kept on your computer. Thus, your computer's hard drive (or possibly memory) is being used as a cache for that webpage. This can sometimes create problem, especially if the content of the webpage is always being updated (sports scores for a live games, for example).

These days, hard disks also have memory that serves as cache. Recall that hard disks are used to store files. Recall that hard disks are also slow. So, one idea is to use RAM as a cache for the disk. As files are accessed, place it in a special RAM just for files. When you want to edit a file, you check in the file RAM first, and if it's not there, then you check on the disk. This is done automatically for you, via the operating system.

As you can see, the principle of caching is simple. Use faster memory to store frequently accessed data.

Properties Needed for Effective Caching

We're going to use the same idea for the memory hierarchy. Storing a small subset of the data from RAM into cache will allow the CPU to access certain data very quickly, as long as it resides in the cache.

The memory hierarchy wouldn't be very effective if two facts about programs weren't true. Programs exhibit spatial locality and temporal locality.

spatial locality Spatial locality says that if you access memory address x, you are like to access memory address x plus or minus Delta where Delta is a small number. That is, you are likely to access addresses very near x in the near future.
Do programs exhibit spatial locality? Yes, they do. For example, when you accessing data from arrays, you are accessing memory locations that are contiguous. Also, instructions usually run sequentially (with the occasionaly branch or jump). Since instructions are contiguous in memory, they exhibit spatial locality. When you run one instruction, you're likely to run the next one too. That's the nature of running programs.
temporal locality Temporal locality says that if you access memory address x at time t, you are like to access memory address x at time t +/- Delta. That is, you are likely to access the same address sometime soon.
This happens when you run code in a loop. You access the same instructions over and over. If you process an array in a loop (say, bubble sorting), then you access array elments over and over again too.

How Locality Helps

Why is it important to have these properties? In particular, why is temporal locality important? Let's consider our library example. Suppose you wish to access random books from the VCL.

Each time you download a book, you would look in the local library. Assuming it's random, the probabilty that the book is in the LL would be low (since it stores only a small subset of the VCL). So you would have to go to the Virtual Central Library to find the book. Since accessing the VCL is very slow, you pay a fairly large cost (in time) for accessing books.

Furthermore, you have to copy that book from the VCL to the LL too. We copy the book to the LL, so it can be cached, and accessed again in the near future. However, since you're accessing books randomly, the books in the LL won't be used in the near future, and you will always have to access the VCL.

In effect, the LL is useless to you.

You get benefit from using the Local Library if you're accessing the books found in the LL all the time. When you search for the same books over and over, they appear in the LL, and access time is much smaller to access the LL than the VCL.

To apply the analogy to hardware, if you want to to access data and it's found in cache, you get savings in access time, since accessing data (or instructions) found in the cache is much quicker than accessing main memory, all the time.

On the other hand, if you don't access data in the cache very often, then the cache is not useful, and may in fact, cause a small delay than not having the cache at all.

Taking Advantage of Spatial and Temporal Locality

How do we apply the principles of spatial and temporal locality to cache design? Temporal locality is simple. You simply place data in the cache, because it's like to be accessed in the near future. Since cache access is quicker than physical memory (RAM) access, placing data you just used into the cache should make it much quicker to access in the near future.

What about spatial locality? This says that if you access address X, you are like to access addresses near X too.

Spatial locality suggests the following strategy when accessing data. Load the data from many contiguous addresses near X from RAM into the cache. That is, copy a block of data from memory to cache. This block should be larger than 4 bytes which is the typical amount you would access from RAM to registers.

This idea wouldn't be so great if it weren't quicker to access the entire block at once, as opposed to accessing the bytes one at a time. This makes some sense if you think about it. Suppose you were ordering books. If ordering 10 books at once took just as much time as ordering each book individually, then there's no need to order 10 books at a time. Just order them whenever you want them. (There's price----i.e., you might save money if you order 10 books at once, but there's no real analogy to price for hardware).

To take advantage of temporal locality, place data that's accessed into the cache, which has quicker access time than physical memory (RAM).
To take advantage of spatial locality, copy a block of data from RAM that includes address X, the desired address. This block is called the cache line. This block should contain more than four bytes (typically 2^k bytes where k > 2).

Accessing a Block of Data

How do we access a block of data? One way to do it is to access all data from address X - Delta to X + Delta. That is, pick addresses before and after X.

However, this turns out to be inconvenient. It's easier to access the data in the following way.

Suppose we want to copy a data block of 32 bytes from memory (instead of just 1 or 4 bytes). We need 5 bits to specify each of the 32 bits. Thus, 00000 is byte 0, and 11111 is byte 31.

Suppose you want to access address A_31-0. Instead of accessing only that single address, generate the following 32 addresses:

A_31-5 00000
A_31-5 00001
A_31-5 00010
A_31-5 00011
....
....
A_31-5 11111

That is, take the upper 27 bits of the address. Concatenate that with all possible 5 bit bitstrings (there are 32 such bitstrings).

This creates 32 consecutive addresses. Copy the data in those addresses from RAM to the cache. That data is called the cache line.

To see if you understand what's going on, how would you create the addresses if you wish to copy 64 bytes in a cache line.

Instead of using the upper 27 bits, you now use the upper 26 bits. You then create all possible 6-bit bitstrings (there are 64 of those) and concatenate them to the 26 bits. This creates 64 addresses in memory. Copy the 64 bytes from main memory to the cache.

We will give the upper 27 bits a name. We will call it the tag. Thus, when you have 32 bytes, you know exactly where they come from in memory, if you know the tag.

When we copy the 32 bytes, we place them in the cache in the same order they appear in memory. We can imagine that data being stored in an array with indexes 00000 up to 11111. (There are 32 elements in this array, which should not be so surprising, since that's how much we copied).

If we want to access address A_31-5 00011, then we look at index 00011 of the "array", to get the data.

00011 is considered the offset into the array.

Given the tag and the offset, we know what address the data at a given offset is. For example, if we access the data at 00011, then this data is the same as the one at the address given by the tag concatenated with 00011, i.e., address A_31-5 00011.

Slots

Let's get into more specific details about the cache. In particular, we'll describe an individual slot.

A slot consists of the following:

V the valid bit, indicating whether the slot holds valid data. If V = 1, then the data is valid. If V = 0, the data is not valid. Initially, it's invalid. Once data is placed into the slot, it's valid.
D the dirty bit. This bit only has meaning if V = 1. This indicates that the data in the slot (to be discussed momentarily) has been modified (written to) or not. If D = 1, the data has been modified since being in the cache. If D = 0, then the data is the same as it was when it first entered the cache.
Tag The tag represents the upper bits of the address. The size of the tag is 32 - lg N where N is the number of bytes in the data part of the slot.
Cache Line This is the actual data itself. There are N bytes, where N is a power of 2. We will also call this the data block.

The total amount of memory used in a slot that stores 32 bytes in a cache line can be easily computed. There's 2 bits (one for V and D), plus the tag (27 bits), plus the data in the cache line (32 x 1 byte = 32 bytes = 256 bits). That's a total of 285 bits.

Here's a diagram of one slot.

There's variation to the slot, which we'll discuss when we talk about direct mapped cache, and set associative cache.

Cache Hits and Cache Misses

When you look for data at a given address, and find it stored in cache, then you have a cache hit. If you don't find the data, then it's a cache miss.

You want to maximize cache hits, because then you have much improved performance (speed).

What Happens in a Cache Miss

As you run your program, it needs data at an address. This can either be the address of actual data being load or stored, or it can be the address of an instruction.

You look for the address to see if it's one of the addresses in cache. If the address (and its contents) are NOT in the cache, then you must access it from main memory. This means that you must copy the data from memory to the cache.

Since the cache is a small subset of main memory, there may be data that you need to remove from the cache (just as you had to remove books from your computer, if you needed other books to replace it).

Instruction and Data Cache

Instructions and data have different access patterns, and access different regions of memory. Thus, having the same cache for both instructions and data may not always work out.

Thus, it's rather common to have two caches: an instruction cache that only stores instructions, and a data cache that only stores data.

You see some indication of this in the CPU design where there's "instruction memory" and "data memory".

Write Back vs. Write Through

Suppose you need to modify data at some address located in the cache. How do you do this?

There are two ways to do this:

write-back This says that you update main memory only when the cache line is evicted. Otherwise, only update the cache.
This approach creates an inconsistency between cache and memory. Cache has one view of memory (which is accurate), while main memory does not have the latest updates.
Normally, this is not a problem. After all, there's only one CPU, so who else would need to access memory? However, if you had more than one CPU, each with their own cache, this problem of memory consistency becomes a much bigger issue. A CPU accessing memory may not have the latest update, because another CPU has modified a memory location, but the update is only in the cache.
Write-back is convenient because you don't have to access main memory, which can be slow.
write-through This says that you update main memory at the same time you update cache. Does that sound like a bad idea? The key is to avoid waiting for the write to memory to complete. Essentially, once the write to cache is done, the program can proceed, while the write to memory is done "in the background".
The disadvantage occurs when there are many writes to memory. This can cause a backlog waiting for the writes to occur.
The advantage of write-through is that you don't have to worry about memory consistency, since cache and memory should have the same values (even though it may take memory a little longer to get the correct value).

Categories of Cache Misses

We can subdivide cache misses into one of three categories:

a compulsory miss (or cold miss) occurs when there's little to no data in the cache. Initially, you're going to have a cache miss, no matter how big the cache is, and no matter how many bytes the cache line contains. Once a program has been running some, the cache becomes more fully utilized, and such misses don't occur.
a conflict miss usually occurs in direct-mapped caches. Two cache lines may map to the same cache slot, even as there may be empty slots, that could store both cache lines. Such a conflict would force an unnecessary eviction of a cache line, but this is the price you pay for direct-mapped cache. Set-associative caches have this problem too, to a lesser degree. Fully associative caches avoid this kind of problem.
a capacity miss can be one of two sorts. One is a miss that would not have occurred if the cache size were larger, i.e., more slots. The other occurs with the number of bytes in a cache line. Suppose you are processing a large int array. If the cache line has 16 bytes, you will get a cache miss after processing 4 elements of the array. Had the cache line been larger, you wouldn't have had this miss.

These misses also occur in virtual memory.

Slot Replacement Policy

Suppose you have a cache miss, and all the slots in the cache are being used. You will now have to pick a slot to get rid of. You are evicting a cache line.

Which slot should you pick? There are many policies. Any will do. Here's a list:

LRU Least Recently Used. Pick the slot that hasn't been used in the longest time.
LFU Least Frequently Used. Pick the slot that has been used the least often.
FIFO First In First Out. Pick the slot that has been in the cache the longest (which is NOT the same as LRU).

These policies often require additional hardware to indicate time of use or frequency of use. There have also been other fancier schemes to decide how to get rid of a slot.

When you choose a slot to be evicted, you look at the dirty bit (we assume a write-back policy). If D = 1, i.e., the slot is dirty, then copy the cache line back into memory. Otherwise, you can overwrite the cache line with the new data, tag, etc.

When the new cache line enters,

V is set to 1, (it's valid)
D is set to 0, (no longer dirty)
Tag bits are copied in
Cache line is copied in from main memory

Assembly Programmers?

When you are programming in assembly, you aren't even aware of the cache. Much of the actual interactions with cache are handled by the hardware and the operating system.

This is convenient, because the cache scheme can be changed without having to modify any programs. Often, there's a question as to how much the actual implementation (called the ISP, the instruction set processor) should be visible to the ISA programmer. The more that's visible, the more that you restrict the implementation, and the fewer chances for optimization.

Thus, you only want to reveal just enough of the real hardware for programmers to get the job done, and no more. In effect, the ISA is a kind of specification for the actual CPU, which can be implemented in one of many ways.

Word Alignment

One motivation for having words at word-aligned addresses is that all four bytes will always fall within a cache line. A cache line is, for lack of a better phrase, super word-aligned.

Super word-aligned means that if you have 2^k consecutive bytes, the first byte must reside at an address divisible by 2^k. Thus, if the cache line has 16 bytes, it must be at an address divisible by 16, which means it's automatially word-aligned (since word-aligned means divisible by 4).

You'll never have half a word in one cache line, and the other half in another cache line (provided the cache line contains at least 4 bytes) as long as the word is word-aligned.

To convince yourself of this, try to have a word that is word-aligned, and see if you can create a cache line where only part of the word is within it.

If you can do it, more than likely, you have misunderstood how cache lines work.

Summary

Cache is a mechanism used to store frequently accessed data in fast memory. Cache is a subset of the data found in RAM, which means any data you find in cache, you can also find in RAM.

When data at an address is needed, the CPU first searches in the cache. If it finds it there, then a load or store is performed to or from a register.

If a cache miss occurs, then a cache line needs to be copied from main memory to a slot in a cache, which may require the eviction of a slot.

Ideally, you want the percentage of cache hits to be high, to give the illusion that all the useful data is in the cache, and being accessed at the speed of the cache, as opposed to the speed of main memory. How effective this is depends on how much spatial and temporal locality the particular program being run has.

Web Accessibility