Abstract

Solving diagonally dominant tridiagonal linear systems is a common problem in scientific high-performance computing (HPC). Furthermore, it is becoming more commonplace for HPC platforms to utilise a heterogeneous combination of computing devices. Whilst it is desirable to design faster implementations of parallel linear system solvers, power consumption concerns are increasing in priority. This work presents the oclspkt routine. The oclspkt routine is a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can use FPGAs, GPUs, and CPUs to concurrently accelerate the solving of diagonally dominant tridiagonal linear systems. The routine is designed to solve tridiagonal systems of any size and can dynamically allocate optimised workloads to each accelerator in a heterogeneous environment depending on the accelerator’s compute performance. The truncated SPIKE FPGA solver is developed first for optimising OpenCL device kernel performance, global memory bandwidth, and interleaved host to device memory transactions. The FPGA OpenCL kernel code is then refactored and optimised to best exploit the underlying architecture of the CPU and GPU. An optimised TDMA OpenCL kernel is also developed to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel since no FPGA tridiagonal solver capable of solving large tridiagonal systems was available at the time of development. The individual GPU, CPU, and FPGA solvers of the oclspkt routine are 110%, 150%, and 170% faster, respectively, than comparable device-optimised third-party solvers and applicable baselines. Assessing heterogeneous combinations of compute devices, the GPU + FPGA combination is found to have the best compute performance and the FPGA-only configuration is found to have the best overall estimated energy efficiency.

1. Introduction

Given the ubiquity of tridiagonal linear system problems in engineering, economic, and scientific fields, it is no surprise that significant research has been undertaken to address the need for larger models and higher resolution simulations. Demand for solvers for massive linear systems that are faster and more memory efficient is ever increasing. First proposed in 1978 by Sameh and Kuck [1] and later refined in 2006 [2], the SPIKE algorithm is becoming an increasingly popular method for solving banded linear system problems [3–7]. The SPIKE algorithm has been shown to be an effective method for decomposing massive matrices whilst remaining numerically stable and demanding little memory overhead [8]. The SPIKE algorithm has been implemented with good results to solve banded linear systems using CPUs and GPUs and in CPU + GPU heterogeneous environments, often using vendor-specific programming paradigms [6].

A scalable SPIKE implementation targeting CPUs and GPUs in a clustered HPC environment to solve massive diagonally dominant linear systems has previously been demonstrated with good computation and communication efficiency [5]. Whilst it is desirable to design faster implementations of parallel linear system solvers, it is necessary also to have regard for power consumption, since this is a primary barrier to exascale computing when using traditional general purpose CPU and GPU hardware [9, 10].

FPGA accelerator cards require an order of magnitude less power compared to HPC grade CPUs and GPUs. Previous efforts in developing FPGA-based routines to solve tridiagonal systems have been limited to solving small systems with the serial Thomas algorithm [11–13]. We have previously investigated the feasibility of FPGA implementations of parallel algorithms, including parallel cyclic reduction and SPIKE [14], for solving small tridiagonal linear systems. This previous work utilised OpenCL to produce portable implementations targeting FPGAs and GPUs. The current work again utilises OpenCL since this programming framework allows developers to target a wide range of compute devices, including FPGAs, CPUs, and GPUs, with a unified language.

An OpenCL application consists of C-based kernel code intended to execute on a compute device and C/C++ host code that calls OpenCL APIs to set up the environment and orchestrate memory transfers and kernel execution. In OpenCL’s programming model, a device’s compute resources are divided at the smallest level into processing elements (PEs), and depending on the device architecture, one or more PEs are grouped into one or many compute units (CUs) [15]. Similarly, the threads of device kernel code are called work items (WIs) and are grouped into work groups (WGs). WIs and WGs are mapped to the PE and CU hardware, respectively.

OpenCL’s memory model abstracts the types of memory that a device has available. These are defined by OpenCL as global, local, and private memory. Global memory is generally high-capacity off-chip memory that can be accessed by all PEs across the device. Local memory is on-chip memory with higher bandwidth and lower capacity than global memory and is only accessible to PEs within the same CU. Finally, private memory refers to on-chip register memory and is only accessible within a particular PE.

The motivation for this paper is to evaluate the feasibility of utilising FPGAs, along with GPUs and CPUs, concurrently in a heterogeneous computing environment in order to accelerate the solving of diagonally dominant tridiagonal linear systems. In addition, we aimed to develop a solution that maintains portability whilst providing an optimised code base for each target device architecture and is capable of solving large systems. As such, we present the oclspkt routine, a heterogeneous OpenCL implementation of the truncated SPIKE algorithm that can dynamically load balance work allocated to FPGAs, GPUs, and CPUs, concurrently or in isolation, in order to solve tridiagonal linear systems of any size. We evaluate the oclspkt routine in terms of computational characteristics, numerical accuracy, and estimated energy consumption.

This paper is structured as follows: Section 2 provides an introduction to diagonally dominant tridiagonal linear systems and the truncated SPIKE algorithm. Section 3 describes the implementation of the oclspkt-FPGA OpenCL host and kernel code and the optimisation process. This is followed by the porting and optimisation of the oclspkt-FPGA kernel and host code to the GPU and CPU devices as oclspkt-GPU and oclspkt-CPU. Section 3 concludes with discussion of the integration of the three solvers to produce the heterogeneous oclspkt solver. In Section 4, the individual solvers are compared to optimised third-party tridiagonal linear systems solvers. The three solvers are further compared in terms of estimated energy efficiency, performance, and numerical accuracy in addition to an evaluation of different heterogeneous combinations of the oclspkt. Finally, in Section 5, we draw conclusions from the results and discuss the implications for future work.

2. Background

2.1. Tridiagonal Linear Systems

A coefficient band matrix A with a bandwidth of 3 in the linear system AX = F is considered tridiagonal, having nonzero entries only on its lower, main, and upper diagonals, as shown in equation (1).

For nonsingular diagonally dominant systems, where the degree of diagonal dominance defined in equation (2) is at least 1, a special form of nonpivoting Gaussian elimination called the Thomas algorithm [16] can perform LU decomposition in O(n) operations. The Thomas algorithm provides good performance when solving small tridiagonal linear systems; however, since this algorithm is intrinsically serial, it fails to scale well in highly parallel computing environments. More advanced, inherently parallel methods must be applied if the problem requires solving large systems. Many parallel algorithms exist for solving tridiagonal and block tridiagonal linear systems and are implemented in well-established numerical libraries [17–19].

2.2. The SPIKE Algorithm

The SPIKE algorithm [2] is a polyalgorithm that uses domain decomposition to partition a banded matrix into mutually independent subsystems which can be solved concurrently. Consider the tridiagonal linear system AX = F, where A is n-by-n in size with only a single right-hand side vector F. We can partition the system into p partitions of m elements, where p = n/m, to give main diagonal partitions A_j, off-diagonal coupling partitions B_j and C_j, and right-hand side partitions F_j, as shown in equation (3).

The coefficient matrix partitions are factorised so that A = DS, where D is the main diagonal block matrix and S is the SPIKE matrix, as shown in equation (4), with spikes V_j = A_j^{-1} B_j for j < p and W_j = A_j^{-1} C_j for j > 1. By first solving DG = F, the solution can be retrieved by solving SX = G. As SX = G is the same size as the original system, solving for X can be simplified by first extracting a reduced system of the boundary elements between partitions, as shown in equation (5), where t and b denote the top- and bottommost elements of each partition.

The reduced system is a sparse banded matrix whose size is proportional to the number of partitions p and has a bandwidth of 2. Polizzi and Sameh [2] proposed strategies to handle solving the reduced system. The truncated SPIKE algorithm states that, for a diagonally dominant system whose degree of diagonal dominance (equation (2)) is at least 1, the top elements of the V spikes and the bottom elements of the W spikes in the reduced system can be set to zero [2]. This truncated reduced system takes the form of independent 2-by-2 systems, one per partition boundary, as shown in equation (6), which can be solved easily using direct methods.
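
To make the structure concrete, a plausible form of one such independent system, for the boundary between partitions j and j + 1 and written in notation we assume here rather than take from equation (6), is

\[
\begin{bmatrix} 1 & v_j^{(b)} \\ w_{j+1}^{(t)} & 1 \end{bmatrix}
\begin{bmatrix} x_j^{(b)} \\ x_{j+1}^{(t)} \end{bmatrix}
=
\begin{bmatrix} g_j^{(b)} \\ g_{j+1}^{(t)} \end{bmatrix},
\]

where v_j^{(b)} is the bottom element of the spike V_j, w_{j+1}^{(t)} is the top element of the spike W_{j+1}, and g denotes the corresponding elements of the solution of DG = F.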

With the boundary elements computed, the remaining values of X can be found with perfect parallelism using equation (7).

Mikkelsen and Manguoglu [20] conducted a detailed error analysis of the truncated SPIKE algorithm and showed that a reasonable approximation of the upper bound of the infinity norm of the error is dependent on the degree of diagonal dominance, the partition size, and the bandwidth of the matrix, as given by equation (8).

3. Implementation

The general SPIKE algorithm consists of four steps: (1) partitioning the system, (2) factorising the partitions, (3) extracting and solving the reduced system, and (4) recovering the overall solution.

For diagonally dominant tridiagonal linear systems, the truncated SPIKE algorithm may be employed. This requires only the bottom SPIKE element of one partition and the top SPIKE element of the next in order to resolve the boundary unknown elements between them [2]. This decouples the partition from the rest of the matrix and can be achieved by performing only the forward-sweep steps of LU factorisation and the forward-sweep steps of UL factorisation. Both forward sweeps are computed for each partition k, for k = 1, …, p.

The factorised right-hand side boundary elements and SPIKE boundary elements of adjacent partitions are used to form and solve the reduced system using equation (6) to produce the boundary unknowns. This algorithmic step is referred to here as the reduced-system solve.
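
As a point of reference, a minimal C sketch of this reduced-system solve for a single partition boundary is shown below. The variable names are ours, and the production kernels operate on whole blocks of boundaries rather than one at a time.

/* Solve one truncated reduced 2-by-2 system between partitions k and k+1:
 *   [ 1    v_b ] [ x_b ]   [ g_b ]
 *   [ w_t  1   ] [ x_t ] = [ g_t ]
 * v_b, g_b come from the LU forward sweep of partition k; w_t, g_t come
 * from the UL forward sweep of partition k+1. */
static void reduced_solve_2x2(float v_b, float g_b, float w_t, float g_t,
                              float *x_b, float *x_t)
{
    /* For a diagonally dominant system |v_b|, |w_t| < 1, so det != 0. */
    float det = 1.0f - v_b * w_t;
    *x_b = (g_b - v_b * g_t) / det;
    *x_t = (g_t - w_t * g_b) / det;
}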

The remaining elements of the solution can then be recovered with equation (7) via the back-sweep step of LU factorisation on the bottom half of partition k and the back-sweep step of UL factorisation on the top half of partition k + 1. We use the Thomas algorithm to compute the forward- and back-sweep factorisation steps, giving the overall complexity of our truncated SPIKE algorithm as O(n). A high-level overview of the anatomy and execution flow of our oclspkt routine is shown in Figure 1. The oclspkt solver expects the size of the system n, the RHS vector F, and the tridiagonal matrix split into vectors of its lower diagonal L, main diagonal D, and upper diagonal U as inputs. The solution vector X is returned.
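
For illustration, a plausible host-side signature for such a routine is sketched below in C; the function name and parameter order are ours, not the actual oclspkt interface.

#include <stddef.h>

/* Hypothetical interface for solving A x = f, where the tridiagonal A is
 * given by its lower (L), main (D), and upper (U) diagonals of length n.
 * f is the right-hand side; x receives the solution.
 * Returns 0 on success, non-zero on failure. */
int oclspkt_solve(size_t n,
                  const float *L, const float *D, const float *U,
                  const float *f, float *x);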

In the following subsections, we describe the truncated SPIKE algorithm implementation for the FPGA (oclspkt-FPGA) using OpenCL and the development considerations to obtain optimised performance. As a part of this process, we design and implement an optimised TDMA OpenCL kernel to act as a serial baseline performance comparison for the parallel truncated SPIKE kernel. Since the optimised TDMA implementation is constrained by the available global memory bandwidth, we are able to make genuine comparisons of FPGA hardware utilisation and computational complexity for these two kernels.

We then discuss the process of porting and optimising the oclspkt code for the CPU and GPU. Finally, we describe integrating the three oclspkt (FPGA, GPU, and CPU) implementations as a heterogeneous solver. The specific hardware we target for these implementations is the Bittware A10PL4 FPGA, the NVIDIA M4000 GPU, and the Intel Xeon E5-1650 CPU.

3.1. FPGA Implementation

In order to take advantage of the FPGA’s innate pipelined parallelism, we implement both the TDMA and truncated SPIKE algorithm as single work item kernels. A single work item kernel has no calls to the OpenCL API for the local or global relative position in a range of work items. This allows the Intel FPGA compiler to pipeline as much of the kernel code as possible, whilst not having to address WI execution synchronisation and access to shared memory resources. This reduces the OpenCL FPGA resource consumption overhead, allowing more of the logic fabric to be used for computation.
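
The contrast is easiest to see in a toy example; the kernels below are illustrative only and unrelated to the solver kernels. An NDRange kernel derives its work from get_global_id(), whereas a single work-item kernel loops over the data itself and is therefore a candidate for deep pipelining by the Intel FPGA compiler.

// NDRange style: one work item per element (typical for GPUs).
__kernel void scale_ndrange(__global float *x, const float a)
{
    int i = get_global_id(0);
    x[i] *= a;
}

// Single work-item style: no work-item ID queries, so the compiler
// pipelines the loop instead of relying on thread-level parallelism.
__kernel void scale_task(__global float *x, const int n, const float a)
{
    for (int i = 0; i < n; ++i)
        x[i] *= a;
}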

3.1.1. TDMA Kernel Code

We found no suitable FPGA implementation of a tridiagonal linear system solver able to solve large systems. In order to provide a suitable performance baseline for the more complex SPIKE algorithm, we implemented the Thomas algorithm (TDMA) with OpenCL. The TDMA implementation calculates the forward-sweep and back-substitution for one block of the input system at a time, effectively treating the FPGA’s on-chip BRAM as a cache for the current working data. The block size m is set as high as possible and is only limited by the available resources on the FPGA. An OpenCL representation of the kernel implementation is shown in Figure 2.

The forward-sweep section loads m elements of the input vectors L, D, U, and F from off-chip DDR4 RAM to on-chip BRAM. With this input, the upper triangular coefficients and the modified RHS are calculated, overwriting the initial values of U and F. U and F are then written back to DDR4 RAM due to the BRAM limitation on the FPGA. The forward-sweep section iterates over the n/m blocks of the system in order. The back-substitution section loads m elements of the modified U and F vectors and writes m elements of the solution vector X after recovering the solution via back-substitution, iterating over the n/m blocks in reverse order.
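
For orientation, an unblocked single work-item sketch of the Thomas algorithm is given below; the real kernel additionally tiles the sweeps into blocks of m rows held in BRAM and streams the intermediate factors through global memory. The buffer names and the use of a separate scratch vector c are our assumptions.

__kernel void tdma_sketch(__global const float *restrict l,
                          __global const float *restrict d,
                          __global const float *restrict u,
                          __global float *restrict f,   /* in: RHS, out: solution */
                          __global float *restrict c,   /* scratch: modified upper diagonal */
                          const int n)
{
    /* Forward sweep: reduce the system to upper bidiagonal form. */
    c[0] = u[0] / d[0];
    f[0] = f[0] / d[0];
    for (int i = 1; i < n; ++i) {
        float w = d[i] - l[i] * c[i - 1];
        c[i] = u[i] / w;
        f[i] = (f[i] - l[i] * f[i - 1]) / w;
    }

    /* Back-substitution: recover the solution in place. */
    for (int i = n - 2; i >= 0; --i)
        f[i] -= c[i] * f[i + 1];
}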

3.1.2. Truncated SPIKE Kernel Code

An OpenCL algorithm representation of the truncated SPIKE kernel spktrunc is shown in Figure 3. The FPGA oclspkt implementation executes p iterations of its main loop, loading one block of the given linear system per iteration, where the block size is m, and solving one partition per iteration of the main loop.

A partition of size m is loaded from global memory and partitioned as per equation (3). The LU and UL forward sweeps are executed concurrently, each computing and storing half of the upper and lower triangular systems together with the corresponding SPIKE elements. Next, the bottom SPIKE and modified right-hand side elements carried over from the previous iteration and the top SPIKE and modified right-hand side elements of the current partition are used as inputs to the reduced-system solve, which computes the boundary unknowns between the two partitions.

Finally, the remaining elements of the previous and current partitions are recovered with the LU and UL back-sweeps, and the boundary SPIKE and modified right-hand side elements of the current partition are stored for the next iteration of the main loop. The first iteration of the FPGA solver is initialised with an upper triangular identity matrix.

This results in a streaming linear system solver in which loading a block of partitions at the start of the pipeline produces a block of the solution vector with a one-partition (m-element) offset.

3.1.3. Host Code

On the host side, in order to interleave the PCIe memory transfers to the device with the execution of the solver kernel, we create in-order command queues for writing to, executing on, and reading from the FPGA.

We create two copies of the read-only input memory objects and a write-only memory object for the solution. The spktrunc kernel and the FPGA’s DMA controller for the PCIe bus share the total bandwidth of the DDR4 RAM bank. To maximise FPGA global memory bandwidth, the device memory objects are explicitly assigned to specific RAM bank locations on the FPGA card in such a way that both the PCIe-to-device-RAM and the device-RAM-to-FPGA bandwidth are optimised.

The execution kernel is enqueued as a 1-by-1-by-1 dimension task with the device memory objects and the number of partitions p as arguments, where p, the number of partitions to solve, is given by p = ⌈n/m⌉. The execution kernel is scheduled and synchronised with the write and read operations on the device memory objects using OpenCL event objects.
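
The scheduling pattern can be sketched as follows; error handling, event release, and the full kernel argument list are omitted, and the helper is illustrative rather than the actual oclspkt host code.

#include <CL/cl.h>

/* Enqueue one chunk: write its inputs, run the single work-item kernel,
 * and read the results back, using three in-order queues so that the
 * transfer of one chunk overlaps with the execution of another. */
static void enqueue_chunk(cl_command_queue write_q, cl_command_queue exec_q,
                          cl_command_queue read_q, cl_kernel spktrunc,
                          cl_mem d_in, cl_mem d_out,
                          const float *h_in, float *h_out,
                          size_t bytes_in, size_t bytes_out, cl_int p)
{
    cl_event wrote, solved;

    /* Stage the input chunk in device RAM (non-blocking). */
    clEnqueueWriteBuffer(write_q, d_in, CL_FALSE, 0, bytes_in, h_in,
                         0, NULL, &wrote);

    /* Launch the kernel as a task once the write has completed. */
    clSetKernelArg(spktrunc, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(spktrunc, 1, sizeof(cl_mem), &d_out);
    clSetKernelArg(spktrunc, 2, sizeof(cl_int), &p);
    clEnqueueTask(exec_q, spktrunc, 1, &wrote, &solved);

    /* Read the solution back once the kernel has finished. */
    clEnqueueReadBuffer(read_q, d_out, CL_FALSE, 0, bytes_out, h_out,
                        1, &solved, NULL);
}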

The kernel code is dependent on the partition size m, so memory buffers for the input and output vectors are created as 1-by-(p·m) vectors, that is, the system size rounded up to a whole number of partitions. The input matrix, consisting of the lower, main, and upper diagonal vectors of A, and the single right-hand side vector are stored in row-major order.

The memory objects are padded with an identity matrix and zeros in order to accommodate linear systems where m is not a factor of n, giving an overall memory requirement proportional to p·m rather than n.
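
A small sketch of the corresponding sizing arithmetic (helper name ours):

#include <stddef.h>

/* Padded buffer length for a system of n rows and partition size m:
 * p = ceil(n / m) partitions, and each diagonal/RHS/solution buffer
 * holds p * m elements, with the tail padded as described above. */
static size_t padded_elems(size_t n, size_t m)
{
    size_t p = (n + m - 1) / m;   /* number of partitions */
    return p * m;                 /* padded vector length  */
}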

As the kernel is implemented as a single work item, this allows for single-strided memory access patterns when the FPGA loads partitions from global memory for processing. This means that it is not necessary to implement a preexecution data marshalling stage as is often required for SIMD or SIMD-like processors.

3.1.4. Kernel Complexity and Hardware Utilisation

The FLOP requirements for our TDMA and truncated SPIKE OpenCL kernels are presented in Table 1. The TDMA kernel has significantly fewer FLOPs compared to the truncated SPIKE kernel. This is expected as the TDMA kernel only computes the LU factorisation and back-substitution, compared to the more computationally complex truncated SPIKE polyalgorithm described previously. However, since the TDMA kernel requires the upper triangular matrix of the entire system to be stored to global memory as an intermediate step and then subsequently reread, the TDMA kernel requires double the number of FPGA to off-chip memory transactions in comparison with the truncated SPIKE kernel.

The FLOP and memory transaction requirements shown in Table 1 are reflected in the FPGA kernel hardware utilisation presented in Table 2. OpenCL requires a static partition of the available FPGA hardware resources in order to facilitate host to FPGA memory transfers and OpenCL kernel execution. Per Table 2, this static partition is significant, consuming at least 10% of measured resource types. The total resource utilisation for each kernel is given by the addition of OpenCL static resource utilisation and the kernel-specific resource utilisation.

The more computationally complex truncated SPIKE kernel requires more lookup tables (ALUT), flip-flops (FF), and digital signal processor (DSP) tiles than the TDMA kernel. The TDMA kernel however requires more block RAM (BRAM) tiles due to implementing a greater number of load store units to cater for the extra global memory transactions. Furthermore, both kernels are constrained by the available amount of BRAM on the FPGA, with the BRAM utilisation by far the highest resource utilisation for both kernels.

3.1.5. FPGA OpenCL Optimisation Considerations

Our implementation of the truncated SPIKE algorithm is global memory bandwidth constrained. It requires large blocks of floating point data to be accessible at each stage of the algorithm. By far the largest bottleneck to computational throughput is ensuring coalesced aligned global memory transactions.

When loading matrix partitions from global to local or private memory, a major optimisation consideration is the available bandwidth on the global memory bus. The available bandwidth per memory transaction is 512 bits, and the load-store units implemented by the Intel FPGA compiler are sized in powers of two bits, constrained by the available resources on the FPGA. Therefore, to ensure maximum global memory bandwidth under these constraints, we set the partition size m to 32 for single precision floating point data for both kernels, resulting in 1024-bit memory transactions.

The value of m is hard-coded and known at compile time allowing for unrolling of the nested loops at the expense of increased hardware utilisation. Unrolling a loop effectively tells the compiler to generate a hardware instance for each iteration of the loop, meaning that if there are no loop-carried dependencies, the entire loop is executed in parallel. However, for loops with carried dependencies such as LU/UL factorisation, each iteration cannot execute in parallel. Nonetheless, this is still many times faster than sequential loop execution despite the increase in latency that is dependent on the loop size.

Loop unrolling is our primary computational optimisation step, thereby allowing enough compute bandwidth for our kernel to act as a streaming linear system solver. Note that in our optimisation process, we either fully unroll loops or not at all. It is possible to partially unroll loops for a performance boost when hardware utilisation limitations do not permit a full unroll. Partially unrolling a loop can however be inefficient since the hardware utilisation does not scale proportionally with the unroll factor due to the hardware overhead required to control the loop execution.
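
As an illustration of the compile-time unrolling described above, the toy single work-item kernel below (unrelated to the solver kernels) fully unrolls a fixed-trip-count loop under the Intel FPGA OpenCL compiler:

#define M 32   /* partition size fixed at compile time */

__kernel void unroll_example(__global const float *restrict in,
                             __global float *restrict out)
{
    float tmp[M];

    /* Fully unrolled: one hardware instance per iteration; with no
     * loop-carried dependency, all M iterations execute in parallel. */
    #pragma unroll
    for (int i = 0; i < M; ++i)
        tmp[i] = 2.0f * in[i];

    #pragma unroll
    for (int i = 0; i < M; ++i)
        out[i] = tmp[i];
}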

3.2. Porting the Truncated SPIKE Kernel to CPU and GPU

In order to investigate the full potential of a heterogeneous computing implementation of the truncated SPIKE algorithm for solving tridiagonal linear systems of any size, we exploited the portability of OpenCL. We modified the host and kernel code used for the FPGA implementation to target CPU and GPU hardware. To achieve this, it was necessary to modify the host- and kernel-side memory objects and data access patterns, remap the truncated SPIKE algorithm to different kernel objects, and modify the work group sizes and their mapping to compute units with respect to the CPU and GPU hardware architectures.

3.2.1. Partitioning and Memory Mapping

The memory requirements for the CPU and GPU implementations are dependent on the partitioning scheme for each device. The memory required to solve a partition of size m, multiplied by a multiple of the GPU’s preferred work group size, must be less than the available local memory. This constrains the size m and the number p of partitions, since it is preferable to maximise the occupancy of the SIMD lanes whilst ensuring that sufficient local memory is available.

Unlike the GPU, all OpenCL memory objects on the CPU are automatically cached into local memory by hardware [21]. However, considering that the CPU has a lower compute unit count, we maximise the partition size m to minimise the number of partitions, thereby minimising the operation count required to recover the reduced system. The resulting values of p and the padded system size for each device, in terms of m and the work group size, are shown in Table 3.

For our implementation of the truncated SPIKE algorithm for the CPU and GPU, the host and kernel memory requirements are five 1-by-(p·m) vectors of the partitioned system (the three diagonals, the RHS, and the solution) and four smaller vectors of the reduced system. By storing the reduced system values in a separate global memory space, we remove the potential for bank conflicts in memory transactions that may occur if the reduced system vectors are stored in place in the partitioned system.

The reduced system memory objects are padded with zeros to accommodate the top- and bottommost partitions, removing the need for excess control code in the kernel to manage these edge partitions. Further, to ensure data locality for coalesced memory transactions on both the CPU and GPU, the input matrix is transformed in a preexecution data marshalling step. The data marshalling transforms the input vectors so that data for adjacent work items are sequential instead of strided. This allows the data to be automatically cached and vector-processed across work items on the CPU and enables full-bandwidth global memory transactions on the GPU.
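
A sketch of such a marshalling step, with the layout assumptions made explicit: element i of partition j moves from partition-major position j·m + i to interleaved position i·p + j, so that consecutive work items touch consecutive addresses.

#include <stddef.h>

/* Reorder one vector from partition-major (j*m + i) to interleaved
 * (i*p + j) layout so that adjacent work items read adjacent addresses. */
static void marshal_interleave(const float *src, float *dst,
                               size_t p, size_t m)
{
    for (size_t j = 0; j < p; ++j)        /* partition index        */
        for (size_t i = 0; i < m; ++i)    /* element within a block */
            dst[i * p + j] = src[j * m + i];
}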

3.2.2. Remapping the Kernel

In contrast to the FPGA implementation of the truncated SPIKE algorithm, for the CPU and GPU implementations, we split the code into two separate kernels for the CPU and three separate kernels for the GPU. This allows for better work group scheduling and dispatching for multiple compute unit architectures as we enqueue the kernels for execution as arrays of work items known as an NDRange in OpenCL’s parlance.

In remapping the kernel to the CPU, the underlying architecture provides relatively few processing elements per compute unit and a fast clock speed. As such, in order to make best use of this architecture, the number of partitions of the truncated SPIKE algorithm should be minimised, thereby ensuring allocation of work groups of the maximum possible size to each compute unit to ensure maximum occupancy. Figure 4 shows an OpenCL representation of the CPU implementation, its execution order, and data path.

For the CPU implementation, we use the partitioning scheme proposed by Mendiratta [22]. We compute the LU and UL forward-sweep factorisation in the spktfaccpu kernel, where we apply UL factorisation to elements 0 to q and LU factorisation to elements q to m, where q is the smallest sweep length required to purify the resulting factorisations of error as per equation (8). This reduces the overall operation count and is only possible when m ≥ 2q. The spktfaccpu kernel is enqueued as a p-by-1-by-1 NDRange, in a single in-order command queue.

The reduced system and the recovery of the overall solution are handled by a second kernel, spktrec. The spktrec kernel is enqueued as per spktfaccpu in a single in-order command queue. The reduced system is solved, and the boundary unknown elements are recovered and used to compute the UL back-sweep of elements 0 to q and the LU back-sweep of elements m down to q.
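
The corresponding launch pattern on the host can be sketched as below; kernel arguments are assumed to have been set already, and error handling is omitted.

#include <CL/cl.h>

/* Illustrative CPU-path launch: p work items in one dimension, one
 * partition per work item, work-group size left to the runtime (NULL).
 * The in-order queue guarantees spktrec runs after spktfaccpu. */
static void launch_cpu_kernels(cl_command_queue q,
                               cl_kernel spktfaccpu, cl_kernel spktrec,
                               size_t p)
{
    size_t global[1] = { p };

    clEnqueueNDRangeKernel(q, spktfaccpu, 1, NULL, global, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q, spktrec,    1, NULL, global, NULL, 0, NULL, NULL);
}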

In contrast to the CPU, the GPU has many processing elements per compute unit and a relatively low clock speed. In order to optimise performance, it was important to maximise the number of partitions of the SPIKE algorithm, by reducing the partition size and thereby ensuring maximum occupancy of the processing elements. Figure 5 shows an OpenCL representation of the GPU implementation, its execution order, and data path.

For the GPU, partitioning the system and the LU and UL factorisation are handled by the first kernel, spkfactgpu. Unlike the CPU, the GPU computes the UL and LU factorisations over the entire block size m. Only the top half of the UL and the bottom half of the LU results are then stored in global memory in order to reduce global memory transactions and overall global memory space requirements. The reduced system and the recovery of the overall solution are handled by two kernels, spktreclu and spktrecul, which load only the bottom half and the top half of each partition, respectively, to compute the back-sweep portions of the LU and UL factorisations. The three kernels are again enqueued as p-by-1-by-1 NDRanges in in-order command queues for writing to, executing on, and reading from the GPU. As with the FPGA, this effectively interleaves the PCIe data transfer with kernel execution.

3.3. The Heterogeneous Solver

We further extend our truncated SPIKE implementation to utilise all computation resources available on a platform, as shown in Figure 1.

This heterogeneous solver first checks for available devices on the host using OpenCL APIs and then queries whether device profiling data exist for the found devices. If profiling data are not available for all devices, each device is allocated an even portion of the input system and profiling data are collected on the next execution of the solver. Otherwise, each device is allocated a portion of the input system proportional to its previously recorded throughput relative to the total recorded throughput of all devices. Throughput in this case includes data transit time across the PCIe bus, data marshalling, and kernel compute time.
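
A minimal sketch of this proportional split, with names and the even-split fallback assumed from the description above:

#include <stddef.h>

/* Allocate n rows across num_dev devices in proportion to their recorded
 * throughput (rows per second, including transfer and marshalling time).
 * Falls back to an even split when no profiling data exist yet. */
static void split_rows(const double *throughput, size_t num_dev,
                       size_t n, size_t *rows_out)
{
    double total = 0.0;
    for (size_t d = 0; d < num_dev; ++d)
        total += throughput[d];

    size_t assigned = 0;
    for (size_t d = 0; d + 1 < num_dev; ++d) {
        double share = (total > 0.0) ? throughput[d] / total : 1.0 / num_dev;
        rows_out[d] = (size_t)(n * share);
        assigned += rows_out[d];
    }
    rows_out[num_dev - 1] = n - assigned;   /* remainder to the last device */
}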

The heterogeneous solver then asynchronously dispatches chunks of the input data to the devices, executes the device solvers, and recovers the solution. The interchunk boundary solutions recovered from the devices are cleansed of error by executing a “top” level of the truncated SPIKE algorithm on the chunk partitions.

4. Evaluation

In the following subsections, we evaluate the oclspkt routine in terms of compute performance, numerical stability, and estimated power efficiency. The results presented use single precision floating point; all matrices are random and nonsingular, with a fixed degree of diagonal dominance on the main diagonal.

All results presented in this paper have been executed on a Dell T5000 desktop PC with an Intel Xeon CPU E5-1620 v4, 64 GB of RAM, a Bittware A10PL4 FPGA, and an NVIDIA M4000 GPU; full specifications are listed in Table 4.

4.1. Compute Performance

To evaluate the compute performance of the oclspkt, we first consider only the kernel execution time for our target devices in isolation, assuming predistributed memory. In Figure 6, we show the time to solve a system of fixed size N, specifically identifying the solution and data marshalling components of the overall execution time. N is chosen to showcase the best possible performance of each compute device alone, without introducing PCIe memory transactions. In this experiment, the GPU kernel takes on average 78.4 ms to solve the tridiagonal system, while the FPGA and CPU are 2.6× and 4.8× slower at 200 ms and 376 ms, respectively. Furthermore, when also considering the data marshalling overheads required by the GPU and CPU kernels, the GPU is still the quickest at 152 ms, with the FPGA and CPU now 1.3× and 6.1× slower.

The compute performance figures are not surprising when we consider that the performance of oclspkt is bound by the available memory bandwidth. For large matrices, the oclspkt-GPU and oclspkt-CPU solvers require more global memory transactions than the oclspkt-FPGA solver. We can estimate the performance of the compute devices using the required memory transactions of the individual solvers and the total available memory bandwidth of the devices. The performance estimate is calculated as MT/B, where MT is the number of required memory transactions and B is the maximum available memory bandwidth. Normalised to the GPU, the estimated relative performance is 1 for the GPU and 2.17× slower for the FPGA, with the CPU slower still. These values closely correspond to the measured relative performance of the kernel solve time in Figure 6.

In Figure 7, we compare these results to other diagonally dominant tridiagonal solver algorithms: our TDMA FPGA kernel, a CUDA GPU implementation, dgtsv [6], the Intel MKL sdtsvb routine [23], and a sequential CPU implementation of the TDMA. For each of the three target devices, our oclspkt implementation outperforms the comparison routines for solving a tridiagonal system of the same size. The oclspkt (FPGA) is 1.7× faster than the TDMA (FPGA) kernel, the oclspkt (GPU) implementation is 1.1× faster than dgtsv, and our oclspkt (CPU) is 1.5× and 3.5× faster than the sdtsvb and TDMA CPU solvers, respectively. Note that for each of these results, we include any data marshalling overhead but exclude host-to-device PCIe transfer time.

A comparison of the compute performance of single and heterogeneous combinations of devices executing the oclspkt routine is shown in Figure 8. We normalise the performance metric to rows solved per second (RS s−1) to provide a fair algorithmic comparison across different device hardware architectures. Furthermore, when evaluating the heterogeneous solver performance of oclspkt, we use a holistic system approach, which includes the host-to-device PCIe data transfer times for the FPGA and GPU devices and all data marshalling overheads. As shown in Table 5, the GPU + FPGA device combination has the best average maximum performance. The GPU + FPGA combination performs 1.38× better than the next best configuration, the GPU-only, and 2.48× better than the worst performing configuration, the CPU-only implementation.

Intuitively, we would expect the performance metrics of the heterogeneous combinations of devices to be close to the summation of the individual device performance metrics. In fact, our results show that only the GPU + FPGA heterogeneous performance comes close to the summation of the GPU-only and FPGA-only performance, at 88% of the theoretical total. The CPU + FPGA, CPU + GPU, and CPU + GPU + FPGA average maximum performance reach only 65%, 51%, and 55%, respectively.

For the PCIe-attached devices, the GPU and FPGA performance metrics are determined by the available PCIe bus bandwidth. We have provisioned primary and secondary memory spaces in the GPU and FPGA global memory. This allows the oclspkt routine to execute the OpenCL kernel on the primary memory space whilst writing the next input chunk to the secondary memory space, ensuring that the PCIe bus and the available memory bandwidth are used efficiently. That is to say, the OpenCL kernel execution time (Figure 6) is completely interleaved with the host-to-device memory transfers. As such, the M4000 GPU card with 16 PCIe Gen 3.0 lanes available will outperform the A10PL4 FPGA card with 8 PCIe Gen 3.0 lanes regardless of the kernel compute performance. Similarly, the CPU performance is determined by the available host RAM bandwidth.

Using the de facto industry standard benchmark for measuring sustained memory bandwidth, STREAM [24, 25], our desktop machine, specified in Table 4, has a maximum measured memory bandwidth of 42 GBs−1. Profiling our CPU implementation of oclspkt using Intel VTune Amplifier shows very efficient use of the available memory bandwidth, with a sustained average of 36 GBs−1 and a peak of 39 GBs−1. This saturation of host memory bandwidth by the CPU solver creates a processing bottleneck and negatively affects the PCIe data transfer to the FPGA and GPU. This, coupled with the heterogeneous partitioning scheme described in Subsection 3.3, favours increasing the chunk size of the input system allocated for CPU computation on each successive invocation of the oclspkt routine, subsequently decreasing the performance of the GPU and FPGA devices.

One change that may increase the performance of the device combinations is to base the workload partitioning scheme only on data marshalling and kernel execution times, excluding the PCIe data transfer times. On successive calls of the routine, the host would then be tuned to send more of the workload to the GPU and FPGA and minimise the workload allocated to the CPU.

4.2. Numerical Accuracy

In Figure 9, we show the numerical accuracy of the oclspkt, compared to the TDMA CPU implementation, in terms of the infinity norm of the difference between the known and calculated results, varied by the diagonal dominance of the input matrix. The TDMA approaches the machine epsilon value for single precision floating point numbers when the diagonal dominance of the input system is 2, whereas the oclspkt for the CPU, GPU, and FPGA requires a diagonal dominance of 2.8 to achieve a similar accuracy. Equation (8) shows that an approximation of the upper bound of the infinity norm error is dependent on the SPIKE partition size, the bandwidth, and the degree of diagonal dominance. As the GPU and FPGA partition sizes and the small CPU sweep length are equal, the numerical accuracy of all implementations is expected to be very similar.

4.3. Estimated Energy Consumption and Efficiency

To determine the estimated energy consumption in Joules for each device, we used the manufacturer’s rated thermal design power (TDP) and multiplied it by the kernel execution time for the data marshalling and solve steps of the oclspkt. TDP represents the average power in watts used by a processor when the device is fully utilised. Whilst this is not a precise measurement of the power used to solve the workload, it nevertheless provides a relative interdevice benchmark. The TDP of the M4000 GPU is 120 W, the Xeon E5-1650 is 140 W, and the A10PL4 FPGA is 33 W.
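
Expressed as a formula (symbols ours), the estimate is simply

\[ E_{\text{device}} \approx P_{\text{TDP}} \times (t_{\text{marshal}} + t_{\text{solve}}), \]

so, for example, a workload that keeps the 33 W FPGA busy for one second is costed at roughly 33 J.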

For the system size shown in Figure 10, the FPGA implementation uses 2.8× less energy than the GPU implementation and 20× less than the CPU implementation. Further, in Figure 11, we can see the estimated energy efficiency of each hardware configuration of oclspkt in rows solved per Joule. Across the range of the experiment, each solver shows consistent results, with the FPGA-only solver estimated to be the most energy efficient.

The FPGA-only solver is estimated to be on average 1.8× more energy efficient than the next best-performing configuration, the GPU + FPGA, and 20.0× more energy efficient than the poorest performing CPU-only configuration. This is not surprising, since the TDP of the FPGA is an order of magnitude smaller than that of the other devices. As with the heterogeneous results in Subsection 4.1, the addition of the CPU solver significantly constrains the available bandwidth to host memory, slowing down the PCIe data transfer rates. In turn, this pushes more work to the CPU solver and slows down the overall compute, and since the CPU has the highest TDP, this exacerbates the poor energy efficiency.

5. Conclusion

In this paper, we presented a numerically stable heterogeneous OpenCL implementation of the truncated SPIKE algorithm targeting FPGAs, GPUs, CPUs, and combinations of these devices. Our experimental case has demonstrated the feasibility of utilising FPGAs, along with GPUs and CPUs concurrently in a heterogeneous computing environment in order to accelerate the solving of a diagonally dominant tridiagonal linear system. When comparing our CPU, GPU, and FPGA implementation of oclspkt to a suitable baseline implementation and third-party solvers specifically designed and optimised for these devices, the compute performance of our implementation showed 150%, 110%, and 170% improvement, respectively.

Profiling the heterogeneous combinations of oclspkt showed that targeting the GPU + FPGA devices gives the best compute performance and targeting the FPGA only gives the best estimated energy efficiency. Adding our highly optimised CPU implementation to a heterogeneous device combination with PCIe-attached devices significantly reduced the expected performance of the overall system. In our experimental case, with a compute environment that has CPUs, GPUs, and FPGAs, it is advantageous to relegate the CPU to a purely task-orchestration role instead of computation.

Under our experimental conditions, all device compute performance results are memory bandwidth constrained. Whilst the GPU kernel compute performance is several times faster than the FPGA kernel, the FPGA test hardware has several times less available memory bandwidth. As high-bandwidth-data transfer technology is introduced to the new generations of FPGA accelerator boards, this performance gap between devices is expected to close. Given the significantly lower power requirements, incorporation of FPGAs has the potential to reduce some of the power consumption barriers currently faced by HPC environments as we move towards exascale computing.

A natural progression of this work would be to extend the oclspkt routine to be able to solve nondiagonally dominant and block tridiagonal linear systems. Further, it would be advantageous to extend the heterogeneous partitioning routine to be able to tune the solver to maximise energy efficiency where desired. A part of this extension would involve a more detailed power analysis and direct in-line monitoring of the host power usage.

An extension of this work may also seek to account for memory bottlenecks detected on successive invocations of the solver further enhancing performance in heterogeneous applications.

Data Availability

The source code and data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This project utilised the High-Performance Computing (HPC) Facility at the Queensland University of Technology (QUT). The facility is administered by QUT’s eResearch Department. Special thanks go to the eResearch Department for their support, particularly for providing access to specialist FPGA and GPU resources.