Tools and methodologies for post-silicon technologies

As we improve our understanding of new technologies and materials, we see how new application domains open up. We also get a better grasp of the potential benefit that these technologies can bring for already established application domains. To accelerate progress in these domains, programming and design tools are needed, which is the matter of this research line at the CC chair.

Racetrack Memories

Racetrack memories (RTMs) is an exciting new class of the emerging non-volatile memory (NVM) technologies that unify qualities of the different memory technologies of today. The nano-scale RTMs promise access latency comparable to SRAM, the density of magnetic hard disk drives, and non-volatility that make them energy efficient. Since their conception in 2008, RTMs have evolved significantly with the recent versions eliminating major impediments and demonstrating improvements in both device speed and reliance.

A single cell in RTMs is a magnetic nanoribbon called track. Each track is equipped with one or more magnetic tunnel junction (MTJ) sensors referred to as access ports and stores a series of data bits – up to 500 – in the form of magnetic domains. Tracks in RTMs can be organized vertically (3D) or horizontally (2D) on the surface of a silicon wafer as shown below.

In order to access data from the RTM, the desired bits need to be shifted and aligned to the port positions prior to their access. These shift operations are undesirable for two reasons. (a) They consume energy (b) They not only prolong the RTM access latency but also make it variable. The amount and impact of these shifting operations on the memory subsystem can be mitigated by designing smart compilation tools and memory system designs.

At the chair for compiler construction, we are working on compilation and architectural simulation tools that not only optimize RTMs’ performance but also enable their exploration at various levels in the memory hierarchy. Together with Stuart Parkin, head of the Max Planck Institue Halle, we have developed an architectural simulation tool and proposed several data and instruction placement strategies, some of which have been integrated into an automatic polyhedral compiler tailored for RTMs.

RTSim – The Racetrack Memory Simulator

RTSim is an architectural-level cycle-accurate memory simulation framework. It accurately models the shift operations in RTMs, manages the access ports and the RTM specific memory commands sequence. RTSim is configurable and allows architects to explore the design space of RTMs by varying the design parameters such as the number of tracks, domains and access ports per track. It also implements different access ports management policies which one can choose from while simulating RTMs.

For computer architects and memory researchers, RTSim provides a solid foundation to explore and implement various architectural optimizations for RTMs. For instance, the latency of the shifting operations can be effectively hidden via pre-shifting. Similarly, smart memory controllers can be designed that promote the frequently accessed data objects to domains closer to the access ports and / or reorder memory requests based on the access port positions, aiming at minimizing the total number of shifts.

RTSim is developed on top of a well-know memory simulator NVMain 2.0 which not only enables simulation of other NVMs but also features connections to system simulators such as gem5, empowering full-system simulation. RTSim is opensource and is hosted at GitHub.

References

A. A. Khan, F. Hameed, R. Bläsing, S. Parkin and J. Castrillon, "RTSim: A Cycle-Accurate Simulator for Racetrack Memories", in IEEE Computer Architecture Letters, vol. 18, no. 1, pp. 43-46, 1 Jan.-June 2019.

Compiler Optimizations for RTMs

Most existing methods introduce hardware extensions to abate the shifting overhead in RTMs. The additional hardware in such methods not only consumes the precious chip budget but also induces latency and energy overheads. In addition, since hardware solutions are blind to the memory access behavior of the running application(s), they in some cases might under-perform and result in low energy efficiency.

Software solutions such as compiler-guided data placement optimize the RTM performance and energy consumption by leveraging the knowledge of the application’s memory access pattern. We at the chair for compiler construction have developed a set of data placement techniques for RTMs that maximize the likelihood that consecutive references access the same or nearby memory locations at runtime, thereby minimizing the number of shifts. We have formulated the data placement problem in RTMs as an integer linear program (ILP) and have developed a novel heuristic called ShiftsRedcue that provides a near-optimal solution [1]. When combined with a genetic search, our experimental results showed a reduction in RTM shifts by up to 52.5%. While ShiftsReduce targets a specific RTM architecture, our generalized heuristic in [2] finds comparable solutions in an architecture-independent way.

We have also investigated data layouts for high dimensional tensorial data structures and the non-linear tree data structures in the RTM-based scratchpad memories (SPMs). We examined strategies to improve the performance and energy efficiency for the specific cases of the tensor contraction operations and decision trees. For tensor contraction, compiler optimizations such as the schedule and the data layout transformations, paired with suitable architectural support, were employed to avoid unnecessary shifts in RTMs. The proposed optimizations not only reduced the amount of RTM shifts to the absolute minimum but also enabled guaranteed single-cycle SPM accesses [3, 4]. Our experimental results showed that the proposed optimizations were mandatory to make RTMs outperform SRAMs. For decision trees, in collaboration with TU Dortmund, we exploited the domain knowledge, i.e., the node access probabilities to map temporaly closely accessed nodes to successive locations in RTM [7].

Motivated by the general-purpose heuristics in [1,2] and application-specific solutions in [3, 4, 7], we extended Polly – the polyhedral optimizer in LLVM – and developed an end-to-end automatic compilation framework that generates RTM-efficient code [5]. The RTM compiler takes an input program, analyzes the memory access pattern for potential shift optimizations, and transforms the schedule/layout in a way that reduces (long) shifts in RTM. This is joint work with Tobias Grosser and Torsten Hoefler.

In addition to the data placement optimizations, we, in collaboration with researchers from the Tampere University of Technology, have also proposed shift-reducing instruction memory placement (SHRIMP), an efficient instruction placement strategy that best exploits the sequentiality of both the instruction stream and the RTMs [6]. With negligible memory overhead, our experimental results demonstrated an up-to 40% reduction in the number of RTMs shifts and a best-case average of 23% reduction in total cycle counts.

References

Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019.
Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020.
Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019.
Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020.
Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 39, no. 11, pp. 3968-3980, Oct 2020.
Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, pp. 6pp, New York, NY, USA, Jul 2019.
Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jerónimo Castrillón and Jian-Jia Chen. “BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory.” 2021 58th ACM/IEEE Design Automation Conference (DAC) (2021): 1111-1116.

Near-memory and in-memory Computing

Emerging application domains such as machine learning and computational genomics require processing a mammoth volume of data and hence demand significantly higher off-chip memory bandwidth. In conventional CMOS-based Von-Neumann machines, increasing the off-chip bandwidth is becoming increasingly expensive and is strictly constrained by the chip package and system models. On the contrary, non-Von-Neumann system models like computing-near-memory (CNM) and computing-in-memory computing (CIM) have shown great promise by outperforming the conventional Von-Neumann system models by orders of magnitude, both in terms of latency and energy consumption. The idea consists in bringing computations closer to the data or processing data where it makes more sense.

At the chair for compiler construction, we are exploring CNM and CIM systems based on various memory technologies, for different use cases. We are also developing tools and software methods for the design space exploration and effective utilization of these systems.

Computing Near-Memory

Numerous compute-near-memory (CNM) systems have emerged recently, including those from UPMEM, Samsung, and SK hynix. At the Chair for Compiler Construction, we are developing tools to democratise access to these novel architectures by developing tools that abstract their complexities, making them accessible to non-experts while harnessing their full potential. As a concrete example, we have developed Cinnamon—a hierarchical MLIR-based compilation framework that introduces reusable high-level abstractions for targeting a variety of CIM and CNM systems [1]. Cinnamon progressively lowers high-level application representations into device-compatible code, with demonstrated support for multiple backends, including the real-world UPMEM CNM system. Along the compilation path, it applies novel domain- and device-specific optimizations at various abstraction levels to enhance performance and resource utilization.

In our prior work, we have developed CNM systems for pre-alignment filtering in genome analysis. We have proposed ALPHA [2], a co-designed filtering solution based on conventional DRAM that minimizes the number of memory accesses, improving performance and reducing energy consumption. We have recently proposed FIRM [3], a CNM system based on the emerging racetrack memory (see the figure below). We have demonstrated that with intelligent system design, RTMs outperform DRAM by more than 50% in terms of the total runtime and the overall energy consumption. This is joint work with Sebastien Ollivier and Alex K. Jones from the University of Pittsburgh.

References:

Asif Ali Khan, Hamid Farzaneh, Karl Friedrich Alexander Friebel, Clément Fournier, Lorenzo Chelini, and Jeronimo Castrillon. "CINM (Cinnamon): A compilation infrastructure for heterogeneous compute in-memory and compute near-memory paradigms." ASPLOS 2024.
Fazal Hameed, Asif Ali Khan and Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", in IEEE Transactions on Emerging Topics in Computing, doi: 10.1109/TETC.2021.3093840.
A. A. Khan, F. Hameed, T. Shahroodi, A. K. Jones and J. Castrillon, "Efficient Memory Layout for Pre-Alignment Filtering of Long DNA Reads Using Racetrack Memory", IEEE Computer Architecture Letters, 2024.

Computing In-Memory

Computing in-memory (CIM), unlike the CNM, does not require dedicated CMOS logic for computations. Instead, the computation and storage are performed directly in memory using the memory devices' properties. Memristor crossbars (based on the phase-change memory and resistive RAM) have attracted significant interest due to their ability to efficiently perform matrix-matrix and matrix-vector multiplications — the dominant computational kernels in machine learning (deep neural networks). On the other hand, DRAM and RTMs have demonstrated their dominance in efficiently implementing various logic operations.

For memristors-based CIM systems, we developed the open CIM compiler (OCC) - an MLIR-based compilation framework that transparently detects and offloads computational primitives to memristor blocks in the CIM system [1]. OCC leverages the hierarchical abstractions of the MLIR compiler infrastructure (see below) to perform code matching and device-agnostic and device-specific code transformations at the most appropriate level. This is joint work with the University of Eindhoven, the University of Edinburgh, Inria France, and the University of Oklahoma.

We have also developed a similar compilation flow for nonvolatile memories based CIM accelerators performing bulk-bitwise logic operation [2].

For RTM-based CIM accelerators, we proposed RTM architectures to implement the entire hyperdimensional computing (HDC) use case [3]. Leveraging the physical properties of RTM devices, we implemented key HDC operations such as XOR and population count directly within memory. To enable efficient counting, we introduced a novel mechanism based on RTM nanowires. Recognizing that data shifting is intrinsic to RTM, we also developed mapping strategies that minimize shift operations, thereby reducing latency and energy consumption.

Recently, we introduced an in-DRAM counting mechanism that uses the fundaments majority operations in DRAM to perform in-memory counting [4]. Employing Johnson encoding, our design achieves value-dependent operation latency that substantially mitigates the long carry-chains and their associate latency overheads in existing CIM-based addition methods. We demonstrate the applicability of our in-DRAM counting to multiple operations, including integer-binary and integer-integer matrix multiplication.

We are currently investigating heterogeneous CIM and CNM systems, focusing on the development of software methodologies that automatically identify and map operations within a given kernel to the most suitable hardware target in the system. This involves analyzing the computational characteristics of the workload and leveraging the strengths of each memory technology to maximize performance and energy efficiency through intelligent hardware-software co-design.

References:

Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, and Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2021).
Hamid Farzaneh, Joao Paulo De Lima, Ali Nezhadi Khelejani, Asif Ali Khan, Mahta Mayahinia, Mehdi Tahoori, and Jeronimo Castrillon. "SHERLOCK: Scheduling Efficient and Reliable Bulk Bitwise Operations in NVMs." In Proceedings of the 61st ACM/IEEE Design Automation Conference, pp. 1-6. 2024.
Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, and Alex K. Jones. "Brain-inspired Cognition in Next Generation Racetrack Memories", ACM Transactions on Embedded Computing Systems (TECS) (2022).
João Paulo Cardoso de Lima, Benjamin Franklin Morris III, Asif Ali Khan, Jeronimo Castrillon, and Alex K. Jones. "Count2multiply: Reliable in-memory high-radix counting." arXiv preprint arXiv:2409.10136 (2024).

Go back