UoM administered thesis: Master of Philosophy


Reducing latency and increasing the throughput of data transfers is a core requirement for future systems at scale, and fast delivery of memory to applications is therefore a key target for optimisation. Application demand for memory capacity has always challenged the available technologies, and the resulting limitations have driven the appearance of new memory technologies and system designs. No single solution has fully solved this memory capacity challenge. As argued in this thesis, physical constraints make it impossible to keep expanding local off-chip memory without adopting new approaches. Non-Uniform Memory Access (NUMA) architectures provide more system memory by pooling processors, each with its own memory, to work around the physical constraints on a single processor, but the additional system complexity and cost lead to scalability issues that deter further expansion along this path. Computer clusters were the first configurations to provide a Distributed Shared Memory (DSM) system at linear cost while scaling better than traditional cache-coherent NUMA systems; however, this was achieved through additional software mechanisms that introduce significant latency when accessing the enlarged memory capacity. As this thesis describes, since the first software DSM systems, considerable effort has been invested in simpler and higher-performance solutions, including software libraries, language extensions, high-performance interconnects, and abstractions via system hypervisors, each enabling more efficient allocation and use of memory resources across the nodes of a machine cluster.
Despite such efforts, fundamental problems such as maintaining cache coherence across a system scaled to thousands of nodes remain beyond what any current approach can provide efficiently, so delivering scalable memory capacity still poses a real challenge for system architects. New design concepts and technologies, such as 3D-stacked RAM and the Unimem architecture, are promising and can offer substantial increases in performance and memory capacity, but there is still no generally accepted and effective solution for providing DSM. On a DSM system, efficient and fast data movement across the network is a major performance and scalability factor. This thesis therefore presents a way to adapt bus transactions in a system, through a mechanism that reduces the latency of small-sized data transfers between system nodes. This is accomplished by implementing and evaluating a software function that performs each transfer either through Remote Direct Memory Access (RDMA) accelerators or through native load/stores issued by the processor. Measurements show that processor-native load/stores can provide up to a seven-fold reduction in transfer latency for small transfers, where RDMA setup overhead dominates, while RDMA becomes the superior method beyond a particular transfer-size threshold. By combining the benefits of both mechanisms, it is possible to accelerate data movement for small packets on the system's global data bus, delivering the best performance at zero cost in terms of additional resources. The evaluation system consists of state-of-the-art Xilinx FPGA devices running custom hardware designs created with the tools provided by the FPGA vendor. Given the promising results this thesis illustrates, these findings can be useful for many application domains.
Any parallel application based on stencil computations will benefit, because of the frequent updates of global array elements and the synchronization messages involved; computational fluid dynamics simulators are one example. In such cases RDMA fits poorly because of its inefficiency at small transfer sizes, mainly due to the setup time required. Other domains that can benefit are machine learning, deep learning, and distributed graph analytics, mainly because of the large number of synchronization messages and the frequent element updates they require. Moreover, in a global address space scheme where the address space is partitioned across nodes and multiple processes run on the system, the memory access model must provide memory isolation and guarantee non-interference between the separate virtual address spaces of applications and processes. The current generation of the ARM architecture (ARMv8/AArch64) simply does not provide enough physical or virtual address bits for a large-scale cluster with thousands of nodes. Efforts have been made to lift these limitations with custom hardware support, but more generally, improving this subsystem is crucial for future DSM systems with large memory capacities per node.


Original language: English
Awarding Institution
Award date: 6 Jan 2021