computer organization
c m s c 311  
f a l l   2 0 0 2  


The Memory Hierarchy

The following is a ordering of "memory" (i.e., places to store bits) from fastest to slowest:

The rule of thumb for memory is this: As memory gets quicker and quicker, the cost is more expensive (per byte), and the total number of bytes decreases.

For example, registers are the quickest memory, but typically you have very little of it (for example, MIPS has 32 integer registers). There's typically more RAM (these days, even 1 G of RAM is not too uncommon), but it's much slower than accessing information stored in registers, but it is cheaper per byte than registers.

Finally, there's disk (20 G or more is not uncommon), which is slower than RAM, and there's a lot of it, and it's cheapest of all the memory (per byte).

Thus, if you want faster memory, it's going to cost you, and you can't get too much of it. This is a combination of both economics and physics.

The Need for Cache

The goal of computer hardware designers is to make hardware go as fast as possible. There are two ways to do this. Either you do a brute force approach: make everything faster (via new technology), or you make a few observations about the kind of operations and the kind of data you are working on, and make hardware faster by being smart about what needs to get faster.

RISC architectures were proposed and developed in the early 1980s. The ideas behind RISC architecture were:

In particular, because each instruction is "simple", the amount of code is expected to increase. Thus, you might have, say, 300 instructions to run instead of 250. As you might guess, running more instructions takes more time than running fewer instructions.

The hope is to execute each of those 300 instructions much faster than the 250 instructions. Some of that comes with making simple instructions, but some of it also comes from placing the instructions in cache.

So, the simple idea is this. Create a fast memory that sits between the physical memory and the registers. This fast memory is a subset of the data that is stored in physical memory. It acts as a middle man.

Here's the analogy. Imagine you are in a large digital library (which represents RAM or physical memory). You sit at a study carrel (a desk with some space for CDs) and have a bunch of CD books. Let's say the carrel can hold up to 20 CDs, each CD being a copy of some "book" in the digital library.

Unfortunately, the technology isn't perfect. When you don't have the CD of the book you want, you must go to the librarian and ask for the book to overwrite the CD you have. So, it's slow if you must go to the librarian to do this.

Thus, when you want to get a book, the first thing you do is to see if you have it in the shelf in the study carrel. If it's there, you access it quickly and begin to read. If it's not there, you have to go the librarian to get the book "burned" on a CD, and that takes time.

This analogy is reasonably close to what happens between cache and RAM, where the study carrel acts as the cache (fast, small subset of the library) and the library itself represents RAM.

Why Cache Works

Why does cache increase the speed of programs? Let's go back to the analogy and see where it might fail.

So there you are in your study carrel, needing to write (or type) a paper on the ancient history of Rome. As you write, you need to read and refer to books. As long as the CDs are sitting in the carrel this is quick.

However, as time passes, you may have to write on the history of China, and because most of your CDs are about the history of Rome, you need to get those new books. Even though they take some time to get the books you needed copied, once you have them, it's pretty speedy.

Now imagine that you don't have a report, and you happen to like to read books rather randomly. One day you read part of one book, then another day, you read another part of another book. You rarely read the same book twice, and you rarely read the entire book.

In this case, having all the CDs in the study carrel doesn't save much time. You're always having to go to the librarian to make new copies.

Yet, for a typical person, this probably doesn't happen very often. You can often leave the books you want in the carrel, and will refer to them quite often. Thus, the study carrel does help you out, because it allows you to get access to frequently needed CDs fast.

It turns out programs also behave in a similar fashion. In particular, programs exhibit the following two properties:

How can caches take advantage of spatial locality? Since you know you are likely to access addresses near X, then fetch the data in a large block near X. Instead of fetching the byte stored at address X, fetch 64 bytes that include address X, which is faster than accessing the 64 bytes one at a time.

How can you take advantage of temporal locality? Place the frequently used data in cache to begin with. Since that data is faster to access than RAM, and since the data is likely to be accessed in the near future, you make the program faster.

Typically, you have two sets of caches: an instruction cache and a data cache. Both have spatial and temporal locality, but don't overlap that much with each other's addresses.

How Cache Works

To begin, let's establish some terminology (some of this is borrowed from a book by Murdocca).

Here's the idea. Suppose a user wants to access address A31-0. This address might be the address of some data in memory, or it might be the address of an instruction to be fetched.

Since the data block contains 2m bytes, we will fetch this many bytes, instead of fetching one (or four) bytes at address A31-0.

For now, let's assume m=8, so each data block consists of 256 bytes.

How do we get those 256 bytes? We need 256 consecutive addresses. The easiest way to accomplish this is follows.

A31-8 is the upper 24 bits of the original address A31-0 that we want to access. We come up with all 28 = 256 ways of writing an 8 bit value. Combine A31-8 with 8 more bits to get a 32 bit address.

Procedure for accessing data

When an address is generated, the first step is to determine if the data exists in the cache (similar to a person checking to see if the CD is in the study carrel). This is done by searching through the cache (using hardware) to see if it's there.

If there, then you have a cache hit. The data is returned to the main CPU. If not, then you have a cache miss (similar to not having the CD in the study carrel, and having to go to the librarian).

This means, you need to fetch the data from RAM (this is done in hardware).

In the process of fetching the data from RAM, you will need to pick a slot to place the data. If all slots are being used, one of the slots must be "evicted", which typically means copying the data back to RAM, and using the slot for the current data being requested.

An entire data block is fetched from RAM, placed in the slot, and the tag updated accordingly.

How this is done depends on the kind of cache you have.

Fully Associative Cache

There are three kinds of cache schemes: fully associative cache, direct mapped cache, and set associative cache.

In a fully associative cache, when we fetch a 256 byte data block, it can go into any slot. When the hardware looks to see if the data is one of the slots, it checks to see if the tag of the slot matches the tag of the address.

If the address is B31-0, then B31-8 is the "tag" portion of the address, and B7-0 is the offset. The number of bits in the offset is the log of the number of bytes in the data block. Since the data block has 256 bytes, the number of bits in the offset is 8 (which is log2256).

The hardware checks each slot to see if it has a tag that equals B31-8 (which means the tag is 24 bits).

It also checks to see if V=1 (the valid bit is true). If V=0, this means the tag is meaningless and data block contains garbage bits.

If there is a slot that matches the tag portion of the addres and is valid (there can be at most one such slot), then the offset B7-0 is used as the index into the data block. Since the offset is 8 bits, you can access one of 256 addresses, which is, of course, the number of bytes in the data block.

If there is no such slot, then the data is fetched from memory. In particular, you fetch bytes from address B31-8 0000 0000 all the way up to B31-8 1111 1111 (that's 256 addresses) and place them in a slot.

Which slot should be picked? It depends. If there is a slot that's not valid, then that one should be picked.

If all slots are valid, there's a variety of schemes to use. You can try to pick a slot where D = 0. This would mean the slot is not dirty (i.e., the data block hasn't been modified).

The advantage of picking such a block is that you don't have to write this back to RAM, which saves time. The disadvantage is that this block may be a block that's frequently accessed.

A better strategy may be to determine which slot hasn't been used in a while. In order to do this, you need to keep more information than the slots currently store. In particular, you need to keep something like an "age" counter. An age counter can be, say, a 2 bit counter.

If the slot hasn't been acccessed in a certain time period, the counter is incremented. If the counter reaches 11, it stays at 11. If the slot is accessed, the counter is reset back to 00.

When picking a slot to evict, pick a slot with the largest age counter.

Such schemes (i.e. using age counters) are more likely to be seen when using virtual memory. Virtual memory uses the disk as additional memory. Think of RAM as a kind of cache for disk. When some data doesn't appear in RAM, you look for it on disk. The main difference between RAM and disk vs. cache and RAM is the speed. Disk is very slow to access compared to RAM.

Direct Mapped Cache

To understand fully-associative caches, it's helpful to understand direct mapped caches. In a fully associative cache, any slot can be used to store a data block.

In a direct-mapped cache, only a single slot can be used to store a data block.

So, how is that slot determined?

Assume each slot has a slot number. Assume we have 64 slots. The slots are number 0 to 63. With 64 slots, you need 6 bits to identify the slot.

We get these 6 bits from the address. So, if the address is B31-0, then B31-14 is the tag, B13-8 is the slot number (6 bits), and B7-0 is the offset as before.

You might wonder why we picked B13-8 as the slot number. Couldn't we have picked any other 6 bits from bits B31-8? You could have, but usually the further right the bits, the more likely that the values are to vary. You want a lot of variance so that different addresses get mapped to different slots.

If this scheme reminds you of hash tables, it's very similar. If you use the high 6 bits (i.e., B31-26), then you are likely to always pick the same slot, which is like hashing to the same bucket. Picking bits low down tends to make the data spread out more across the different slots.

Thus, B13-8 uniquely determines the slot where the data block must go.

This is similar to having a parking lot with 100 parking spots number 00 through 99 and using the middle two digits of your student ID to select a parking spot. Thus, if your middle 2 digits is 23, then you can only park in spot 23.

The drawback of this scheme is clear. What if many students have 23 as the middle 2 digits, yet, very few have any of the other 2 digits? Then, there may be empty parking spaces, yet you're not allowed to park there (this happens frequently with apartment complexes which assign only one parking spot per apartment---you may have several people living at the apartment, each with their own car, yet only one of you can park in the slot, and others have to park far away---even if other people's parking slots are empty).

The main advantage of this scheme is that it's simpler to implement (uses less hardware) than a fully associative cache.

Set Associate Cache

A set-associative cache is somewhere in between the two schemes. Suppose we have 64 slots. We put each slot in a set. Let's assume there are 8 sets, each containing 8 slots.

We break the address up differently.

B31-11 is the tag (23 bits), B10-8 is the set number (3 bits), and B7-0 is the offset as before.

In a set associative scheme, you use B10-8 to identify one of 8 sets (numbered from 000 to 111).

How do you decide which of the 8 slots to pick within the set? At this point, it becomes fully associative within the set. You can pick any of the 8 slots you want.

Again, the parking lot analogy is useful. Assume that there are still 100 parking spots, but you use the last digit of your student ID. There are only 10 possible values using your last digit. However, there are 10 parking spots for each digit (thus 10 assigned to digit 0, 10 assigned to digit 1, etc).

So, now you have a choice of 10 different spots based on the last digit.

As you can see, the set-associative scheme combines ideas from fully associative and direct mapped.