Parallel programming of heterogeneous platforms

Dataflow Models of Computation

Modern hardware architectures have evolved significantly in recent years, now featuring many cores of varying types, and this trend is expected to continue. To benefit from these architectures, programs need to execute in parallel. However, parallel programming with primitives such as threads is known to be difficult: races on shared data can be prevented with locks at the risk of introducing deadlocks.

Abstractions are the key to making programming for these architectures easy and scalable for future, more complicated hardware designs. By abstracting away parallel constructs, algorithms can be made concise, and compilers can handle parallel aspects across different architectures.

At the Chair for Compiler Construction, we investigate dataflow models of computation (MoC) as a promising abstraction to tame parallel architectures [1]. A dataflow application is modeled as a graph, where nodes represent computation and edges represent communication. This explicit parallel representation allows the compiler to reason about execution order and exploit various types of parallelism.

Example dataflow graph.

References

Jeronimo Castrillon, Karol Desnos, Andrés Goens, Christian Menard, "Dataflow Models of Computation for Programming Heterogeneous Multicores", Springer Nature Singapore, pp. 1–40, Singapore, Jan 2023.

Kahn Process Networks

Kahn Process Networks (KPNs) are a specific model of computation in which nodes execute concurrently as separate processes, typically non-terminating programs. KPNs have been widely used in embedded systems research due to their attractive properties, such as Turing-complete expressivity, natural exposure of parallelism, and deterministic behavior.

An example of a KPN in a Multi-Processor System-on-Chip

At the Chair for Compiler Construction, KPNs are used in several frameworks and tools, including Mocasin and the dfg-mlir dialect, with a focus on developing methods for efficiently and automatically mapping KPN applications to heterogeneous hardware platforms. Research directions include strategies for understanding and handling dynamic behavior within applications, as well as handling multi-application scenarios with resource constraints.

Adaptive Process Networks

Adaptive Process Networks (APNs) is an extension of Kahn Process Network developed at the Chair for Compiler Construction. APN introduces implicit data-level parallelism and enhances adaptivity, while preserving the deterministic execution semantics of KPNs [1]. The APN model incorporates a new type of node, called parallel regions, into the application topology. These special nodes encapsulate a subnetwork of processes and can be dynamically replicated at runtime.

Parallel region in Adaptive Process Network

To facilitate communication between replicated instances of parallel regions and the rest of the network, APNs also introduce parallel channels, which are responsible for distributing and collecting tokens dynamically. Thanks to a consistent distribution strategy, parallel channels preserve the original deterministic behavior of KPNs. With this extension, APNs become more flexible and better suited for dynamic workloads, heterogeneous processing elements, and varying resource availability.

To support the execution of APNs, the Dynamic Process Manager (DPM) has been developed as a C++ runtime library. DPM has three key functions. First, it provides a C++ API that allows developers to define an APN application. Second, it offers the necessary primitives to construct and manage the runtime topology. Finally, DPM serves as the runtime manager, coordinating application configuration and interacting with the system resource manager HARP to align application configurations with the RM decisions.

References

Robert Khasanov, Andrés Goens, Jeronimo Castrillon, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap", Proceedings of the 9th Workshop and 7th Workshop on Parallel Programming and RunTime Management Techniques for Manycore Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'18), co-located with 13th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), ACM, pp. 20–25, New York, NY, USA, Jan 2018.

Reactors

The Reactor Model is a modern model of computation for building reliable and predictable software for cyber-physical systems [1]. Each component in the model is associated with a notion of logical time, which allows developers to reason precisely about when things happen in a program, not just what happens. This is crucial for real-time and embedded systems, where tasks must meet strict timing constraints.

A runtime scheduler manages the correct execution of reactor programs and ensures determinism even in a distributed setting. In our recent work, we showed that the reactor approach can be integrated with the Adaptive AUTOSAR standard and applied it to an automotive use case [2]. The resulting deterministic implementation of an emergency brake assistant is shown below.

emergency brake assistant (EBA) application using reactors

We maintain the reactor-cpp framework [3], which is a C++ implementation of the reactor model and its runtime scheduler. The DEAR framework builds on top of that and provides mechanisms for integration with Adaptive AUTOSAR. We further participate in the development of Lingua Franca, a polyglot programming language for reactor programs.

References

Marten Lohstroh, Íñigo Íncer Romero, Andrés Goens, Patricia Derler, Jeronimo Castrillon, Edward A. Lee, Alberto Sangiovanni-Vincentelli, "Reactors: A Deterministic Model for Composable Reactive Systems" , Proceedings of the 9th Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) and the Workshop on Embedded and Cyber-Physical Systems Education (WESE 2019), pp. 26pp, Oct 2019.
Christian Menard, Andrés Goens, Marten Lohstroh, Jeronimo Castrillon, "Achieving Determinism in Adaptive AUTOSAR", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), EDA Consortium, Mar 2020.
Christian Menard, "Deterministic Reactive Programming for Cyber-physical Systems", PhD thesis, TU Dresden, 205 pp., Jun 2024.

ConDRust: From Sequential Rust to Deterministic Data-Parallel Programs

ConDRust [1] is a programming model based on a well-defined sequential subset of Rust that excludes traditional parallel constructs such as threads, tasks, as well as synchronization mechanisms like locks and message passing. Programs written in ConDRust are automatically translated into a concurrent subset of safe Rust modeled as KPNs, ensuring deterministic execution.

This translation introduces a dataflow-based runtime representation inspired by Ohua [2]. A key aspect of ConDRust is scalability through data parallelism. For data-parallel regions of static dataflow, ConDRust introduces dynamic nodes to enable dynamic parallelism. By generating futures and preserving the original data order, it maintains determinism and preserves the original semantic of the program.

ConDRust also addresses irregular applications that exhibit amorphous data parallelism, where the parallelism emerges dynamically, for example, through shared state access, and the structure of the computation (e.g., which data can be computed in parallel) is not statically known. To handle such cases, ConDRust applies transformations that isolate shared state updates from parallel loops, making the parallelism explicit and bounded while preserving deterministic semantics.

To check the ConDRust project, please visit us on GitHub: https://github.com/ohua-lang/condrust

References

Felix Suchert, Lisza Zeidler, Jeronimo Castrillon, Sebastian Ertel, "ConDRust: Scalable Deterministic Concurrency from Verifiable Rust Programs", In Proceeding: 37th European Conference on Object-Oriented Programming (ECOOP 2023) (Ali, Karim and Salvaneschi, Guido), Schloss Dagstuhl – Leibniz-Zentrum für Informatik, vol. 263, pp. 33:1–33:39, Dagstuhl, Germany, Jul 2023.
Sebastian Ertel, Christof Fetzer, Pascal Felber, "Ohua: Implicit Dataflow Programming for Concurrent Systems", Proceedings of the Principles and Practices of Programming on The Java Platform, ACM, pp. 51–64, New York, NY, USA, 2015.

dfg-mlir: A Dataflow Dialect for MLIR

The dfg-mlir dialect is an MLIR dialect designed to represent dataflow applications, such as homogeneous synchronous dataflow (HSDF) and more general Kahn Process Networks (KPN). Depending on the model, users can define a node using a custom operation — dfg.operator for HSDF or dfg.process (for KPN). Within a node, the operator or process reads data from input ports, performs computations, and writes results to output ports. Communication between nodes is facilitated through channels, which are defined using the dfg.channel operation and connected to the node ports when the node is instantiated via dfg.instantiate. To enable developers to use higher-level languages to express their program more ergonomically, a frontend with a C-like syntax has been developed, alongside an MLIR dialect called composition-mlir, which allows reasoning about state sharing and parallelism before lowering its output to the dfg-mlir dialect.

The dfg-mlir dialect can be lowered to various hardware platforms. For instance, it can be lowered to OpenMP for execution on CPUs [1]. For hardware generation, it can be lowered to Olympus, which, together with an HLS tool like Bambu or Vitis, enables deployment on CPU-FPGA heterogeneous systems. Furthermore, an extended FPGA backend introduced in [2] allows for more generic execution on FPGAs using the CIRCT project as the backend. To support pipeline optimization and better handle floating-point designs, we also developed a new backend that generates synthesizable C++ code compatible with Vitis HLS.

References

Stephanie Soldavini, Felix Suchert, Serena Curzel, Michele Fiorito, Karl Friedrich Alexander Friebel, Fabrizio Ferrandi, Radim Cmar, Jeronimo Castrillon, Christian Pilato, "Etna: MLIR-Based System-Level Design and Optimization for Transparent Application Execution on CPU-FPGA Nodes", Proceedings of the 32nd IEEE International Symposium On Field-Programmable Custom Computing Machines (FCCM) (extended abstract), 1pp, May 2024.
Jiahong Bi, A lowering for high-level data flows to reconfigurable hardware (2024)

Design Space Exploration with Mocasin

Programming modern hardware architectures becomes increasingly challenging. Not only do we need to manage the challenges of concurrency to exploit parallelism, but we also need to carefully allocate computational tasks across a diverse set of processing elements. The heterogeneous nature of modern hardware dramatically expands the design space, making automated design space exploration (DSE) essential.

At the Chair for Compiler Construction, we developed Mocasin [1], an open-source rapid prototyping tool designed for exploring new DSE strategies for mapping software onto heterogeneous hardware. Mocasin supports multiple dataflow models of computation, particularly KPN, as well as other models like SDF or task graphs, which can be viewed as specializations of KPN.

As shown in the figure, Mocasin provides internal data structures to represent dataflow applications, platforms, mappings, and additional runtime information, such as pre-recorded application traces. A central component of Mocasin is its discrete-event simulation, which estimates application performance and energy consumption on the given platform based on the mapping and traces.

Mocasin includes several pre-implemented mapping algorithms based on heuristics and metaheuristics, and it also features a modular mapper structure with a unified interface for rapid prototyping of new mapping algorithms.

Using Mocasin, we have extended classical DSE methods in several directions. We leverage symmetries [2] to reduce the design space and accelerate exploration. We apply design centering techniques to identify robust mappings, based on a bio-inspired algorithm [3]. Mocasin has also been used to prototype hybrid application mapping strategies suitable for dynamic workloads. Mocasin has also been used to explore configuration-aware mapping strategies for more dynamic models such as Adaptive Process Networks (APNs).

The Mocasin framework is available online.

References

Christian Menard, Andr'es Goens, Gerald Hempel, Robert Khasanov, Julian Robledo, Felix Teweleitt, Jeronimo Castrillon, "Mocasin—Rapid Prototyping of Rapid Prototyping Tools: A Framework for Exploring New Approaches in Mapping Software to Heterogeneous Multi-cores", Proceedings of the 2021 Drone Systems Engineering and Rapid Simulation and Performance Evaluation: Methods and Tools, co-located with 16th International Conference on High-Performance and Embedded Architectures and Compilers (HiPEAC), Association for Computing Machinery, pp. 66–73, New York, NY, USA, Jan 2021.
Andrés Goens, Sergio Siccha, Jeronimo Castrillon, "Symmetry in Software Synthesis", In ACM Transactions on Architecture and Code Optimization (TACO),, ACM, vol. 14, no. 2, pp. 20:1–20:26, New York, NY, USA, Jul 2017.
Gerald Hempel, Andrés Goens, Josefine Asmus, Jeronimo Castrillon, Ivo F. Sbalzarini, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering", Proceedings of the 20th International Workshop on Software and Compilers for Embedded Systems (SCOPES '17), ACM, pp. 21–30, New York, NY, USA, Jun 2017.

Hybrid Application Mapping

At the Chair for Compiler Construction, we actively research hybrid application mapping methods. Hybrid mapping approaches consist of two stages: a design-time stage and a runtime stage. At design time, hybrid mappers perform sophisticated analysis to generate Pareto-optimal mapping candidates, each varying in resource usage and performance-energy characteristics. At runtime, a resource manager analyzes the current workload and selects one of the pre-generated mapping candidates for each application, ensuring that all constraints are satisfied and overall system utilization is maximized.

Building upon this hybrid mapping paradigm, our research introduces novel extensions to improve adaptivity in dynamic workloads. This includes extending the decision model with a temporal dimension in sporadic real-time scenarios and expanding the applicability of hybrid mapping beyond embedded systems to general-purpose computing platforms.

Spatio-Temporal Mapping

Conventional hybrid mapping approaches typically focus on spatial mapping, i.e., merely assigning tasks and processes to processor cores. In our work, we propose an enhanced resource management strategy that generates spatio-temporal mappings. This allows the runtime system not only to allocate resources spatially but also strategically postpone execution [1] or adjust mappings at a later time [2]. By anticipating workload fluctuations and changes in resource availability, spatio-temporal mapping allows the system to accept more jobs, defer execution when beneficial, and dynamically remap tasks to achieve improved energy efficiency and adaptivity.

References

Robert Khasanov, Jeronimo Castrillon, "Energy-efficient Runtime Resource Management for Adaptable Multi-application Mapping", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 909–914, Mar 2020.
Robert Khasanov, Marc Dietrich, Jeronimo Castrillon, "Flexible Spatio-Temporal Energy-Efficient Runtime Management", In Proceeding: 29th Asia and South Pacific Design Automation Conference (ASP-DAC’24), pp. 777–784, Jan 2024.

Hybrid Mapping for General-Purpose Systems

Inspired by the principles of hybrid application mapping, we, together with the Chair of Operating Systems, have developed a novel Linux-based resource management approach targeting general-purpose systems equipped with heterogeneous multi-core processors [1]. The system comprises two core components: a centralised resource manager that makes global allocation decisions, and a lightweight client library dynamically linked to applications to facilitate interaction with the manager.

To overcome the main limitation of conventional hybrid mapping – its reliance on the design-time exploration stage – our approach introduces a runtime monitoring component. This monitor dynamically analyzes newly arriving applications and strategically explores the mapping space online, efficiently generating a set of Pareto-optimal mapping configurations.

Beyond simply selecting the optimal resource allocation, the system also triggers internal application adaptations to the assigned resources through the client library. For example, data-parallel applications, such as those using OpenMP and Intel TBB, can dynamically adjust the parallelization degree based on the number of assigned processor cores. The client library also exposes an internal interface that can be extended by application or library developers to support more specialized, application-aware adaptations. This approach ensures coordinated adaptivity at both the resource management and application levels, enabling energy-efficient execution in dynamic and heterogeneous computing environments.

References

Till Smejkal, Robert Khasanov, Jeronimo Castrillon, Hermann Härtig, "E-Mapper: Energy-Efficient Resource Allocation for Traditional Operating Systems on Heterogeneous Processors", Jun 2024.

Go back