Farewell my Shared LLC! A Case for Private Die-Stacked DRAM Caches for Servers

Venue: MICRO 2018
Authors: Amna Shahab, Boris Grot, et al.

Introduction.
Contrary to the conventional wisdom that the last-level cache (LLC) in large-scale datacenter servers needs to be shared among cores, this paper makes the case for private LLCs. The authors conduct experiments to analyze the effect of LLC size, access latency, and inter-thread data sharing, and conclude the following:
1. At a fixed latency, increasing the LLC size from 8MB to 64MB yields less than 6% performance improvement, whereas increasing it to 256MB yields a 10%-20% improvement. The jump occurs because the secondary working set starts to fit in the LLC.
2. Compared to an 8MB LLC at baseline latency, a 512MB LLC with a 100% increase in access latency results in a net slowdown. From this they conclude that higher access latency is detrimental even when LLC capacity grows substantially (a toy latency sketch after this list illustrates why).
3. To analyze sensitivity to inter-thread data sharing, they break LLC accesses down into reads, writes that are not read by other cores, and writes that are read by other cores. For CloudSuite workloads, they observe that at most 3%-4% of accesses fall into the third category. Even when the latency of third-category accesses is increased by 4x, overall performance degrades by only 0-8% (10% for a single workload).
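
To see why a bigger-but-slower LLC can lose (point 2 above), here is a minimal AMAT (average memory access time) sketch in Python. All latencies and miss rates are made-up assumptions chosen to illustrate the shape of the tradeoff, not numbers from the paper.

```python
# Toy AMAT model: AMAT = hit_latency + miss_rate * miss_penalty.
# All numbers are illustrative assumptions, not measurements from the paper.

def amat(hit_ns, miss_rate, mem_ns):
    """Average memory access time for a single-level LLC model."""
    return hit_ns + miss_rate * mem_ns

MEM_NS = 50.0  # assumed off-chip DRAM latency

# 8MB LLC at baseline latency: modest capacity, fast hits.
small_fast = amat(hit_ns=10.0, miss_rate=0.30, mem_ns=MEM_NS)  # 25.0 ns

# 512MB LLC at 2x hit latency: fewer misses, but every hit pays more.
big_slow = amat(hit_ns=20.0, miss_rate=0.20, mem_ns=MEM_NS)    # 30.0 ns

print(f"8MB   @ 1x latency: {small_fast:.1f} ns")
print(f"512MB @ 2x latency: {big_slow:.1f} ns (net slowdown)")
```

With these assumed numbers, the capacity-driven miss-rate reduction does not pay back the doubled hit latency, which matches the paper's observation that server workloads are latency-sensitive.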

The Idea.
They observe that the high access latency of traditional DRAM caches has two sources: (1) interconnect delay on both the CPU side and the DRAM side, and (2) the latency of fetching data from capacity-optimized DRAM arrays. To address this, they propose a private die-stacked DRAM cache in which each vault acts as a core's private LLC. In the proposed design, named SILO, the DRAM cache is stacked over the CPU die, thereby eliminating long interconnects. Moving away from the conventional choice of prioritizing capacity over latency, they perform a design space exploration to find a latency-optimal DRAM organization, varying parameters such as tile size (the lowest level of the DRAM array hierarchy), the number of sub-arrays, and the number of banks. A design with more banks and sub-arrays shortens the wordlines and bitlines, which reduces wire delay and hence access latency (a toy wire-delay sketch follows below).
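
As rough intuition for why more banks and sub-arrays lower access latency, here is a toy RC wire-delay sketch. This is not the paper's design-space-exploration methodology; the sizes and constants are arbitrary assumptions, and the model ignores the area cost of extra sense amplifiers and decoders, which is precisely why commodity DRAM favors capacity over latency.

```python
# Toy first-order model: the delay of a distributed RC wire grows roughly
# quadratically with its length (Elmore delay), so partitioning a
# fixed-capacity DRAM array into more, smaller tiles shortens both the
# wordline and the bitline each tile must drive.

def wire_delay(cells_on_line):
    """Elmore-style delay of a distributed RC line, in arbitrary units."""
    return 0.5 * cells_on_line ** 2

def tile_access_delay(wordline_cells, bitline_cells):
    """Approximate per-tile access delay = wordline delay + bitline delay."""
    return wire_delay(wordline_cells) + wire_delay(bitline_cells)

# A fixed-capacity 4096 x 4096 cell array split into an n x n grid of tiles.
for n in (1, 2, 4, 8):
    side = 4096 // n
    print(f"{n * n:3d} tiles -> per-tile wire delay "
          f"{tile_access_delay(side, side):>12,.0f} (arb. units)")
```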

Thoughts.
Unfortunately, I didn't read the second half of the paper meticulously, so I may have missed the explanations for some of their design choices. I'm also very new to the field of DRAM caches, HMCs, etc. I have the following doubts and questions:
1. DRAM caches are a well-known concept. Here they mention that the vault sits on top of the core on the same die. Is that something new? Do conventional DRAM caches lie on a separate die?
2. The main improvement seems to stem from two factors: stacking the LLC directly over the core, which shortens the interconnect, and optimizing the DRAM for latency instead of capacity.
3. What is the reason for making the LLC private? Is it because a non-trivial fraction of data accesses in the baseline are served from remote LLC slices? If so, wouldn't that imply the capacity requirement is high, and wouldn't forgoing a capacity-optimized LLC DRAM exacerbate that issue? Does SILO result in a higher number of LLC misses?