Architectural Support for Large-scale Shared Memory Systems

Date
Sep 12, 2017, 3:00 pm - 4:30 pm
Location
Engineering Quadrangle J401

Speaker

Details

Event Description

Abstract
Modern CPUs, GPUs, and data centers are being built with more and more cores, and many popular workloads will require even more hardware parallelism in the future. Shared memory is a popular parallel programming model with many advantages, but it has historically been difficult to scale to a large number of cores or nodes.
 
This thesis addresses two key challenges of large-scale shared memory systems: scalability and fault tolerance. To tackle them, it first develops a parallel simulator named PriME to simulate shared memory systems at scale. It then introduces Coherence Domain Restriction (CDR), a cache coherence framework that sidesteps traditional scalability challenges and enables systems to scale to thousands of cores within a manycore chip or millions of cores across an entire data center. For fault tolerance, the thesis develops both a software-centric solution with resilient memory operations (REMO) and a hardware-centric solution with a fault-tolerant cache coherence framework (FTCC). In sum, this thesis demonstrates that shared memory systems have the potential to achieve scalability and fault tolerance comparable to those of current cluster-based designs while retaining other benefits such as ease of programming and efficient memory accesses.