22 Basics of Cache Memory

Dr A. P. Shanthi

The objectives of this module are to discuss about the basics of cache memories. We will discuss about the various mapping policies and also discuss about the read/write policies. Basically, the four primary questions with respect to block placement, block identification, block replacement and write strategy will be answered.

The speed of the main memory is very low in comparison with the speed of modern processors. For good performance, the processor cannot spend much of its time waiting to access instructions and data in main memory. Hence, it is important to devise a scheme that reduces the time needed to access the necessary information. Since the speed of the main memory unit is limited by electronic and packaging constraints, the solution must be sought in a different architectural arrangement. An efficient solution is to use a fast cache memory, which essentially makes the main memory appear to the processor to be faster than it really is. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations. As long as most memory accesses are to cached memory locations, the average latency of memory accesses will be closer to the cache latency than to the latency of main memory.

The effectiveness of the cache mechanism is based on a property of computer programs called locality of reference. Analysis of programs shows that most of their execution time is spent on routines in which many instructions are executed repeatedly. These instructions may constitute a simple loop, nested loops, or a few procedures that repeatedly call each other. The actual detailed pattern of instruction sequencing is not important – the point is that many instructions in localized areas of the program are executed repeatedly during some time period, and the remainder of the program is accessed relatively infrequently. This is referred to as locality of reference. It manifests itself in two ways: temporal and spatial. The first means that a recently executed instruc-tion is likely to be executed again very soon. The spatial aspect means that instructions in close proximity to a recently executed instruction (with respect to the instructions’ addresses) are also likely to be executed soon.

If the active segments of a program can be placed in a fast cache memory, then the total execution time can be reduced significantly. Conceptually, operation of a cache memory is very simple. The memory control circuitry is designed to take advantage of the property of locality of reference. The temporal aspect of the locality of reference suggests that whenever an information item (instruction or data) is first needed, this item should be brought into the cache where it will hopefully remain until it is needed again. The spatial aspect suggests that instead of fetching just one item from the main memory to the cache, it is useful to fetch several items that reside at adjacent addresses as well. We will use the term block to refer to a set of contiguous address locations of some size. Another term that is often used to refer to a cache block is cache line.

The cache memory that is included in the memory hierarchy can be split or unified/dual. A split cache is one where we have a separate data cache and a separate instruction cache. Here, the two caches work in parallel, one transferring data and the other transferring instructions. A dual or unified cache is wherein the data and the instructions are stored in the same cache. A combined cache with a total size equal to the sum of the two split caches will usually have a better hit rate. This higher rate occurs because the combined cache does not rigidly divide the number of entries that may be used by instructions from those that may be used by data. Nonetheless, many processors use a split instruction and data cache to increase cache bandwidth.

When a Read request is received from the processor, the contents of a block of memory words containing the location specified are transferred into the cache. Subsequently, when the program references any of the locations in this block, the desired contents are read directly from the cache. Usually, the cache memory can store a reasonable number of blocks at any given time, but this number is small compared to the total number of blocks in the main memory. The correspondence between the main memory blocks and those in the cache is specified by a mapping function. When the cache is full and a memory word (instruction or data) that is not in the cache is referenced, the cache control hardware must decide which block should be removed to create space for the new block that contains the referenced word. The collection of rules for making this decision constitutes the replacement algorithm.

Therefore, the three main issues to be handled in a cache memory are

· Cache placement – where do you place a block in the cache?

· Cache identification – how do you identify that the requested information is available in the cache or not?

· Cache replacement – which block will be replaced in the cache, making way for an incoming block?

These questions are answered and explained with an example main memory size of 1MB (the main memory address is 20 bits), a cache memory of size 2KB and a block size of 64 bytes. Since the block size is 64 bytes, you can immediately identify that the main memory has 214 blocks and the cache has 25 blocks. That is, the 16K blocks of main memory have to be mapped to the 32 blocks of cache. There are three different mapping policies – direct mapping, fully associative mapping and n-way set associative mapping that are used. They are discussed below.

Direct Mapping: This is the simplest mapping technique. In this technique, block i of the main memory is mapped onto block j modulo (number of blocks in cache) of the cache. In our example, it is block j mod 32. That is, the first 32 blocks of main memory map on to the corresponding 32 blocks of cache, 0 to 0, 1 to 1, … and 31 to 31. And remember that we have only 32 blocks in cache. So, the next 32 blocks of main memory are also mapped onto the same corresponding blocks of cache. So, 32 again maps to block 0 in cache, 33 to block 1 in cache and so on. That is, the main memory blocks are grouped as groups of 32 blocks and each of these groups will map on to the corresponding cache blocks. For example, whenever one of the main memory blocks 0, 32, 64, … is loaded in the cache, it is stored only in cache block 0. So, at any point of time, if some other block is occupying the cache block, that is removed and the other block is stored. For example, if we want to bring in block 64, and block 0 is already available in cache, block 0 is removed and block 64 is brought in. Similarly, blocks 1, 33, 65, … are stored in cache block 1, and so on. You can easily see that 29 blocks of main memory will map onto the same block in cache. Since more than one memory block is mapped onto a given cache block position, contention may arise for that position even when the cache is not full. That is, blocks, which are entitled to occupy the same cache block, may compete for the block. For example, if the processor references instructions from block 0 and 32 alternatively, conflicts will arise, even though the cache is not full. Contention is resolved by allowing the new block to overwrite the currently resident block. Thus, in this case, the replacement algorithm is trivial. There is no other place the block can be accommodated. So it only has to replace the currently resident block.

Placement of a block in the cache is determined from the memory address. The memory address can be divided into three fields, as shown in Figure 26.1. The low-order 6 bits select one of 64 words in a block. When a new block enters the cache, the 5-bit cache block field determines the cache position in which this block must be stored. The high-order 9 bits of the memory address of the block are stored in 9 tag bits associated with its location in the cache. They identify which of the 29 blocks that are eligible to be mapped into this cache position is currently resident in the cache. As a main memory address is generated, first of all check the block field. That will point to the block that you have to check for. Now check the tag field. If they match, the block is available in cache and it is a hit. Otherwise, it is a miss. Then, the block containing the required word must first be read from the main memory and loaded into the cache. Once the block is identified, use the word field to fetch one of the 64 words. Note that the word field does not take part in the mapping.

Tag	Block	Word
9	5	6

Figure 26.1 Direct mapping

Consider an address 78F28 which is 0111 1000 1111 0010 1000. Now to check whether the block is in cache or not, split it into three fields as 011110001 11100 101000. The block field indicates that you have to check block 28. Now check the nine bit tag field. If they match, it is a hit.

The direct-mapping technique is easy to implement. The number of tag entries to be checked is only one and the length of the tag field is also less. The replacement algorithm is very simple. However, it is not very flexible. Even though the cache is not full, you may have to do a lot of thrashing between main memory and cache because of the rigid mapping policy.

Fully Associative Mapping: This is a much more flexible mapping method, in which a main memory block can be placed into any cache block position. This indicates that there is no need for a block field. In this case, 14 tag bits are required to identify a memory block when it is resident in the cache. This is indicated in Figure 5.8. The tag bits of an address received from the processor are compared to the tag bits of each block of the cache to see if the desired block is present. This is called the associative-mapping technique. It gives complete freedom in choosing the cache location in which to place the memory block. Thus, the space in the cache can be used more efficiently. A new block that has to be brought into the cache has to replace (eject) an existing block only if the cache is full. In this case, we need an algorithm to select the block to be replaced. The commonly used algorithms are random, FIFO and LRU. Random replacement does a random choice of the block to be removed. FIFO removes the oldest block, without considering the memory access patterns. So, it is not very effective. On the other hand, the least recently used technique considers the access patterns and removes the block that has not been referenced for the longest period. This is very effective.

Thus, associative mapping is totally flexible. But, the cost of an associative cache is higher than the cost of a direct-mapped cache because of the need to search all the tag patterns to determine whether a given block is in the cache. This should be an associative search as discussed in the previous section. Also, note that the tag length increases. That is, both the number of tags and the tag length increase. The replacement also is complex. Therefore, it is not practically feasible.

Tag	Word
14	6

Figure 26.2 Fully associative mapping

Set Associative Mapping: This is a compromise between the above two techniques. Blocks of the cache are grouped into sets, consisting of n blocks, and the mapping allows a block of the main memory to reside in any block of a specific set. It is also called n-way set associative mapping. Hence, the contention problem of the direct method is eased by having a few choices for block placement. At the same time, the hardware cost is reduced by decreasing the size of the associative search. For our example, the main memory address for the set-associative-mapping technique is shown in Figure 26.3 for a cache with two blocks per set (2–way set associative mapping). There are 16 sets in the cache. In this case, memory blocks 0, 16, 32 … map into cache set 0, and they can occupy either of the two block positions within this set. Having 16 sets means that the 4-bit set field of the address determines which set of the cache might contain the desired block. The 11 bit tag field of the address must then be associatively compared to the tags of the two blocks of the set to check if the desired block is present. This two-way associative search is simple to implement and combines the advantages of both the other techniques. This can be in fact treated as the general case; when n is 1, it becomes direct mapping; when n is the number of blocks in cache, it is associative mapping.

Tag	Set	Word
11	4	6

Figure 26.3 Set associative mapping

One more control bit, called the valid bit, must be provided for each block. This bit indicates whether the block contains valid data. It should not be confused with the modified, or dirty, bit mentioned earlier. The dirty bit, which indicates whether the block has been modified during its cache residency, is needed only in systems that do not use the write-through method. The valid bits are all set to 0 when power is initially applied to the system or when the main memory is loaded with new programs and data from the disk. Transfers from the disk to the main memory are carried out by a DMA mechanism. Normally, they bypass the cache for both cost and performance reasons. The valid bit of a particular cache block is set to 1 the first time this block is loaded from the main memory, Whenever a main memory block is updated by a source that bypasses the cache, a check is made to determine whether the block being loaded is currently in the cache. If it is, its valid bit is cleared to 0. This ensures that stale data will not exist in the cache.

A similar difficulty arises when a DMA transfer is made from the main memory to the disk, and the cache uses the write-back protocol. In this case, the data in the memory might not reflect the changes that may have been made in the cached copy. One solution to this problem is to flush the cache by forcing the dirty data to be written back to the memory before the DMA transfer takes place. The operating system can do this easily, and it does not affect performance greatly, because such disk transfers do not occur often. This need to ensure that two different entities (the processor and DMA subsystems in this case) use the same copies of data is referred to as a cache-coherence problem.

Read / write policies: Last of all, we need to also discuss the read/write policies that are followed. The processor does not need to know explicitly about the existence of the cache. It simply issues Read and Write requests using addresses that refer to locations in the memory. The cache control circuitry determines whether the requested word currently exists in the cache. If it does, the Read or Write operation is performed on the appropriate cache location. In this case, a read or write hit is said to have occurred. In a Read operation, no modifications take place and so the main memory is not affected. For a write hit, the system can proceed in two ways. In the first technique, called the write-through protocol, the cache location and the main memory location are updated simultaneously. The second technique is to update only the cache location and to mark it as updated with an associated flag bit, often called the dirty or modified bit. The main memory location of the word is updated later, when the block containing this marked word is to be removed from the cache to make room for a new block. This technique is known as the write-back, or copy-back protocol. The write-through protocol is simpler, but it results in unnecessary write operations in the main memory when a given cache word is updated several times during its cache residency. Note that the write-back protocol may also result in unnecessary write operations because when a cache block is written back to the memory all words of the block are written back, even if only a single word has been changed while the block was in the cache. This can be avoided if you maintain more number of dirty bits per block. During a write operation, if the addressed word is not in the cache, a write miss occurs. Then, if the write-through protocol is used, the information is written directly into the main memory. In the case of the write-back protocol, the block containing the addressed word is first brought into the cache, and then the desired word in the cache is overwritten with the new information.

When a write miss occurs, we use the write allocate policy or no write allocate policy. That is, if we use the write back policy for write hits, then the block is anyway brought to cache (write allocate) and the dirty bit is set. On the other hand, if it is write through policy that is used, then the block is not allocated to cache and the modifications happen straight away in main memory.

Irrespective of the write strategies used, processors normally use a write buffer to allow the cache to proceed as soon as the data is placed in the buffer rather than wait till the data is actually written into main memory.

To summarize, we have discussed the need for a cache memory. We have examined the various issues related to cache memories, viz., placement policies, replacement policies and read / write policies. Direct mapping is the simplest to implement. In a direct mapped cache, the cache block is available before determining whether it is a hit or a miss, as it is possible to assume a hit and continue and recover later if it is a miss. It also requires only one comparator compared to N comparators for n-way set associative mapping. In the case of set associative mapping, there is an extra MUX delay for the data and the data comes only after determining whether it is hit or a miss. However, the operation can be speeded up by comparing all the tags in the set in parallel and selecting the data based on the tag result. Set associative mapping is more flexible than direct mapping. Full associative mapping is the most flexible, but also the most complicated to implement and is rarely used.

Web Links / Supporting Materials

Computer Organization and Design – The Hardware / Software Interface, David A. Patterson and John L. Hennessy, 4th Edition, Morgan Kaufmann, Elsevier, 2009.
Computer Architecture – A Quantitative Approach , John L. Hennessy and David A.Patterson, 5th Edition, Morgan Kaufmann, Elsevier, 2011.
Computer Organization, Carl Hamacher, Zvonko Vranesic and Safwat Zaky, 5th.Edition, McGraw- Hill Higher Education, 2011.