[FPL 2020] A Domain-Specific Architecture for Accelerating Sparse Matrix Vector Multiplication on FPGAs: Existing FPGA-based DSAs for SpMV do not allow customization through plug-and-play of their building blocks. For example, most of these DSAs require a switching network/crossbar architecture as a building block for routing matrix data to banked vector memory blocks. In this paper, we first present an approach in which a custom network is built from simple blocks arranged in a regular fashion to exploit low-level architectural details. We then use this network to replace the expensive crossbars employed in the GEMX SpMV engine and develop an end-to-end tool-flow around a mixed-IP (HLS/RTL) approach. Owing to the modularity of our design, the tool-flow allows us to insert an additional block into the design that guarantees zero stalls in the accumulation stage. On an Alveo U200, our accelerator (attached to one DDR4 channel) achieves up to 4.4 GFLOPS, corresponding to 92% peak bandwidth utilization.
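To make the routing requirement concrete, here is a minimal software sketch (assumed for illustration, not code from the paper or the GEMX engine): each non-zero carries a column index and must be steered to the vector memory bank that owns the matching x entry. The bank mapping by col % NUM_BANKS is an assumed low-order interleaving; all names below are hypothetical.

```cpp
// Illustrative model of the routed SpMV dataflow: the vector x is split
// across NUM_BANKS on-chip memory banks, and each non-zero must reach the
// bank that holds x[col] before the multiply-accumulate can proceed.
// In hardware, the on-chip network/crossbar performs this steering.
#include <cstddef>
#include <vector>

constexpr std::size_t NUM_BANKS = 8;  // assumed bank count

struct NonZero {        // COO-style element of the sparse matrix
    std::size_t row;
    std::size_t col;
    float       val;
};

std::vector<float> spmv(const std::vector<NonZero>& nnz,
                        const std::vector<float>& x,
                        std::size_t num_rows) {
    // Partition x into banks (models the distributed on-chip memories).
    std::vector<std::vector<float>> bank(NUM_BANKS);
    for (std::size_t i = 0; i < x.size(); ++i)
        bank[i % NUM_BANKS].push_back(x[i]);

    std::vector<float> y(num_rows, 0.0f);
    for (const NonZero& e : nnz) {
        // Bank select + local offset: the routing step the network
        // implements in hardware.
        float xv = bank[e.col % NUM_BANKS][e.col / NUM_BANKS];
        y[e.row] += e.val * xv;  // accumulation stage
    }
    return y;
}
```

In hardware, the inner loop's bank lookup becomes many concurrent requests per cycle, which is why the steering logic (crossbar, or the paper's custom network of simple regular blocks) dominates the design.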
[SLIP 2020] Role of on-chip networks in building domain-specific architectures (DSAs) for sparse computations: Designing high-performance and energy-efficient DSAs for SpMV is challenging due to highly irregular and random memory access patterns, poor temporal and spatial locality, and very low data-reuse opportunities. SpMV DSAs exploit distributed on-chip memory blocks to store vector entries, avoiding random off-chip memory accesses. However, a switching network architecture or a crossbar is usually required as a building block for routing the matrix's non-zero elements to the on-chip memory blocks. In this presentation, we will discuss the network architectures and switches employed in existing SpMV DSAs, our SpMV DSA based on a 2D-mesh network, design choices for the FPGA implementation of the DSA, and scalability aspects. For our use case in particular, we will highlight the importance and challenges of achieving energy-efficient data movement using scalable on-chip network architectures.
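As a rough illustration of how a 2D-mesh alternative to a crossbar steers traffic, the sketch below shows XY dimension-order routing, one deterministic policy such a mesh could use to carry a non-zero toward the tile whose memory bank owns its vector entry. The coordinates, port names, and the policy itself are assumptions for illustration, not details confirmed by the talk.

```cpp
// Illustrative XY dimension-order routing on a 2D mesh (assumed policy).
// Each router decides locally which output port moves a flit one hop
// closer to its destination: route fully in X first, then in Y, which is
// deadlock-free on a mesh with this deterministic ordering.
#include <cstdint>

enum class Port : std::uint8_t { Local, East, West, North, South };

struct Coord { int x; int y; };  // hypothetical tile coordinates

Port route_xy(Coord here, Coord dst) {
    if (dst.x > here.x) return Port::East;
    if (dst.x < here.x) return Port::West;
    if (dst.y > here.y) return Port::North;
    if (dst.y < here.y) return Port::South;
    return Port::Local;  // arrived: deliver to this tile's memory bank
}
```

Compared with a monolithic crossbar, each mesh router only needs a handful of short local links, which is what makes this style of network attractive for scalable, energy-efficient data movement on FPGAs.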