Quiz Solution


Q1: Why is the formula t_eff = t_cache + (1 - h) t_main only an approximation to the true performance?

A: This is because:

1.      Systems composed of independent random-access memories can satisfy more than one access at a time.

2.      A processor need not generate a memory reference on every machine cycle, particularly if the processor is waiting for input/output.

3.      Memory cycles can take longer than the formula assumes, because a request can arrive while memory is still busy honoring an earlier one, including requests generated by the input/output system.
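
As a quick illustration of the formula itself, it can be evaluated directly; a minimal Python sketch, using hypothetical example values for t_cache, t_main, and h:

    # Hypothetical values: 25 ns cache, 200 ns main memory, 97% hit rate.
    t_cache, t_main, h = 25.0, 200.0, 0.97
    # Every access pays the cache time; the (1 - h) fraction that misses
    # also pays a full main-memory cycle.
    t_eff = t_cache + (1 - h) * t_main
    print(t_eff)  # 31.0 ns -- an estimate only, for the three reasons above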


Q2: What is associative memory? Why does associative memory have a longer cycle than a random-access memory built from identical technology?

A:  An associative memory is a parallel memory with search capability: a word is retrieved by matching its contents against a search key rather than by its address, and all stored words can be searched simultaneously.

The longer cycle of an associative memory is strictly a consequence of the need to propagate signals through a larger number of gates than in a random-access memory of equal size.
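
In hardware, every stored word is compared against the search key at the same time. A minimal Python sketch of the behavior (the sketch runs sequentially; the hardware performs all comparisons simultaneously):

    # Content-addressed lookup: return the positions of every stored tag
    # that matches the search key.
    def associative_search(tags, key):
        return [i for i, tag in enumerate(tags) if tag == key]

    print(associative_search([0x1A, 0x2B, 0x1A, 0x3C], 0x1A))  # [0, 2]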


Q3: What are the advantages and disadvantages of direct mapped cache?

A: The advantages are:

1.      No replacement algorithm necessary.

2.      Simple hardware and low cost.

3.      High speed of operation.

     The disadvantages are:

1.      Performance drops significantly if accesses are made repeatedly to locations that share the same index (see the sketch after this list).

2.      The hit rate is lower than with associative mapping methods. However, as the cache size increases, the difference between the hit rates of direct-mapped and associative caches shrinks and becomes insignificant.
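
A minimal Python sketch of the first disadvantage: in a direct-mapped cache the index bits alone select the line, so two addresses that share an index evict each other on alternating accesses (block size and line count here are hypothetical):

    BLOCK_SIZE = 16    # bytes per line (hypothetical)
    NUM_LINES = 256    # lines in the cache (hypothetical)

    def index_of(addr):
        # The index bits select exactly one line -- no choice to make,
        # hence no replacement algorithm.
        return (addr // BLOCK_SIZE) % NUM_LINES

    a, b = 0x0000, 0x1000   # 0x1000 bytes apart = BLOCK_SIZE * NUM_LINES
    print(index_of(a) == index_of(b))  # True: alternating a/b accesses always miss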


Q4: What are the advantages of separating the cache into a data cache and a code (instruction) cache?

A:  First, the write policy would only have to be applied to the data cache (assuming instructions are not modified).

      Second, separate paths could be provided from the processor to each cache, allowing simultaneous transfers to both the instruction cache and the data cache. This is particularly convenient in a pipelined processor, since different stages of the pipeline access each cache.

     Another advantage is that a designer may choose to have different sizes for the instruction cache and data cache, and have different internal organizations and block sizes for each cache.


Q5: Suppose t_cache = 25 ns, t_main = 200 ns, h = 99%, w = 20%, and the memory data path fully matches the cache block size. What is the average access time?

A:  From    t_a = t_cache + (1 - h) t_trans + w (t_main - t_cache),

      and since the data path matches the block size, t_trans = t_main = 200 ns, we get

      t_a = 25 + 0.01 * 200 + 0.2 * (200 - 25) = 62 ns.

      Here the misses account for 2 ns ( = (1 - h) t_trans ), and the write policy accounts for 35 ns ( = w (t_main - t_cache) ).
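
The same computation as a quick Python check:

    t_cache, t_main, h, w = 25.0, 200.0, 0.99, 0.20
    t_trans = t_main   # data path matches the block size: a miss costs one main-memory cycle
    t_a = t_cache + (1 - h) * t_trans + w * (t_main - t_cache)
    print(t_a)  # 62.0 ns = 25 + 2 + 35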


Q6: How can the average access time formula be extended to cover a second-level cache?

A:   With only a one-level cache, the average access time is:

            t_a = t_cache + (1 - h) t_main

          Extending to a two-level cache:

            t_a = [t_cache1 + (1 - h_1) t_cache2] + (1 - h_2) t_main

           where t_cache1 is the first-level cache access time,

             t_cache2 is the second-level cache access time,

             t_main is the main memory access time,

             h_1 is the first-level cache hit rate,

             h_2 is the combined first/second-level cache hit rate.
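
A quick numeric check of the two-level formula, using hypothetical access times and hit rates:

    t_cache1, t_cache2, t_main = 5.0, 25.0, 200.0   # hypothetical times (ns)
    h1, h2 = 0.90, 0.99   # h2 is the combined first/second-level hit rate
    t_a = t_cache1 + (1 - h1) * t_cache2 + (1 - h2) * t_main
    print(t_a)  # 5 + 2.5 + 2 = 9.5 ns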


Q7: What special purpose do vector processors serve?

A: Vector processors are special-purpose computers that match a range of (scientific) computing tasks. These tasks usually involve large active data sets, often with poor locality, and long run times.


Q8: What are the two problems in vector processors?

A: The two problems are:

1.      Vector length: the lengths of vectors in a program often do not correspond to the length of the vector registers.

2.      Stride: the distance separating elements in memory that will be adjacent in a vector register. A non-unit stride complicates memory access, for example by causing bank conflicts (see the sketch below).
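
For example, walking down one column of a row-major matrix touches elements separated in memory by the row length; a minimal Python sketch with hypothetical dimensions:

    ROWS, COLS = 4, 100   # hypothetical row-major matrix dimensions
    # Element (r, c) lives at linear offset r * COLS + c.
    col = 2
    offsets = [r * COLS + col for r in range(ROWS)]
    print(offsets)   # [2, 102, 202, 302] -- a stride of 100 elements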


Q9: How is the vector length problem handled in a vector processor?

A: For shorter vectors, we can use a vector length register that is applied to each vector operation.

For longer vectors, we can split the long vector into multiple strips: one possibly shorter strip plus strips of the maximum register length. This process is called strip-mining (see the sketch below). The strip-mined loop consists of a sequence of convoys.
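
A minimal Python sketch of strip-mining, with MVL (the maximum vector register length) as a hypothetical machine parameter; the first strip absorbs the odd-sized remainder and every later strip fills a full register:

    MVL = 64   # hypothetical maximum vector register length

    def strip_mine(n):
        """Yield (start, length) strips covering an n-element vector."""
        start = 0
        length = n % MVL or min(n, MVL)   # short first strip unless n divides evenly
        while start < n:
            yield start, length           # set the vector length register to `length`
            start += length
            length = MVL                  # all remaining strips use the full register

    print(list(strip_mine(200)))   # [(0, 8), (8, 64), (72, 64), (136, 64)]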


Presented by: Hongqing Liu, Stacy Weng, and Wei Sun