High-performance floating-point processing has traditionally been associated with high-end CPUs. However, in recent years, GPUs have emerged as powerful platforms for floating-point computations, expanding beyond their original role in graphics to become general-purpose GPUs (GPGPUs). A new development is the use of FPGAs for floating-point processing in demanding applications. This article focuses on FPGAs and their floating-point performance, design flow, and the use of OpenCL, a programming language that is pushing the boundaries of high-performance computing.
The GFLOP/s metric for various processing platforms continues to improve, and now the term TFLOP/s is commonly used. However, peak GFLOP/s values—often cited as TFLOP/s—may not fully reflect actual device performance. They represent only the theoretical number of floating-point additions or multiplications per second. Analysis shows that FPGA-based single-precision floating-point processing can exceed 1 TFLOP/s.
A simpler, but still widely used, algorithm is the fast Fourier transform (FFT). A 4096-point FFT implemented in single-precision floating point can process four complex samples per clock cycle. Each FFT core delivers more than 80 GFLOP/s, and a large FPGA has enough resources for seven such cores.
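As a rough sanity check on the per-core figure, throughput can be estimated from the textbook operation count of a radix-2 FFT, about 5·N·log₂N floating-point operations, together with the rate at which a streaming core consumes samples. The sketch below assumes a 350 MHz clock purely for illustration; it is not a measured or vendor-quoted value.

```c
#include <math.h>
#include <stdio.h>

/* Back-of-the-envelope estimate of streaming-FFT throughput.
 * Uses the textbook radix-2 operation count (~5*N*log2(N) FLOPs per
 * N-point FFT); the clock frequency is an assumed placeholder. */
int main(void)
{
    const double n       = 4096.0;   /* FFT length (from the article)          */
    const double spc     = 4.0;      /* complex samples per clock (from above) */
    const double fclk_hz = 350e6;    /* ASSUMED FPGA clock frequency           */

    double flops_per_fft  = 5.0 * n * log2(n);   /* ~245,760 FLOPs */
    double cycles_per_fft = n / spc;             /* 1,024 cycles   */
    double gflops = flops_per_fft / cycles_per_fft * fclk_hz / 1e9;

    printf("Estimated per-core throughput: %.0f GFLOP/s\n", gflops);  /* ~84 */
    return 0;
}
```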
However, as shown in Figure 1, the FFT algorithm reaches nearly 400 GFLOP/s on this FPGA through a push-button OpenCL compilation that requires no FPGA expertise. With optimizations such as LogicLock regions and design space exploration (DSE), the seven-core design approaches the Fmax of a single-core design, achieving 500 GFLOP/s and exceeding 10 GFLOP/s per watt.
This makes the FPGA far more power-efficient than a CPU or GPU on this workload. GPUs are not efficient at these relatively short FFT lengths, so no GPU benchmarks were run; when the FFT length grows to several hundred thousand points, GPU efficiency improves dramatically and GPUs become effective accelerators for CPUs.
In summary, achievable GFLOP/s usually reach only a fraction of the theoretical peak, so it is better to compare platforms on actual algorithms, which reflect real-world application behavior far more accurately. The more complex the algorithm, the more representative the benchmark.
Rather than relying solely on vendor-provided peak GFLOP/s metrics, a more sophisticated and representative third-party assessment is recommended. For high-performance computing, Cholesky decomposition is an ideal algorithm. It is widely used in linear algebra to solve systems of equations and perform matrix inversion. It is computationally intensive and requires floating-point representation for accurate results. Its computational demand scales with N³, where N is the matrix dimension, making it highly resource-intensive.
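For reference, a minimal scalar Cholesky–Crout factorization is sketched below. It is the textbook algorithm in plain C, not the BDTI or vendor implementation, but it makes the O(N³) multiply-accumulate workload and the dot-product structure of the inner loops easy to see.

```c
#include <math.h>

/* Textbook Cholesky-Crout factorization of a symmetric positive-definite
 * n x n matrix A (row-major): A = L * L^T, with the lower-triangular
 * factor L written over the lower triangle of A in place.
 * Returns 0 on success, -1 if A is not positive definite.
 * The triple loop makes the O(N^3) operation count explicit. */
int cholesky(float *a, int n)
{
    for (int j = 0; j < n; j++) {
        /* diagonal element: sqrt(A[j][j] - sum of squares of row j of L) */
        float d = a[j * n + j];
        for (int k = 0; k < j; k++)
            d -= a[j * n + k] * a[j * n + k];
        if (d <= 0.0f)
            return -1;
        a[j * n + j] = sqrtf(d);

        /* elements below the diagonal in column j */
        for (int i = j + 1; i < n; i++) {
            float s = a[i * n + j];
            for (int k = 0; k < j; k++)      /* dot product of two L rows */
                s -= a[i * n + k] * a[j * n + k];
            a[i * n + j] = s / a[j * n + j];
        }
    }
    return 0;
}
```

The inner loops over k are exactly the long dot products that vectorized floating-point datapaths, discussed later, are built to accelerate.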
Table 1 presents benchmark results for an Nvidia GPU rated at a peak of 1.35 TFLOP/s, using several different libraries, and for the Xilinx Virtex-6 XC6VSX475T, a device optimized for DSP processing and similar in density to the Altera FPGAs used in the Cholesky benchmarks.
LAPACK and MAGMA are commercial libraries, while the GPU GFLOP/s column uses an OpenCL implementation developed at the University of Tennessee; the latter is clearly the more optimized for smaller matrix sizes.
The mid-range Altera Stratix V FPGA (460 kLE) was also benchmarked, using a single-precision floating-point Cholesky algorithm. As shown in Table 2, its performance exceeds the Xilinx results.
Note that the matrix sizes differ: the University of Tennessee results start at a [512 x 512] matrix, while the BDTI benchmarks go up to [360 x 360]. The reason is that GPUs are so inefficient on small matrices that offloading them from the CPU is not worthwhile, whereas FPGAs remain efficient even on smaller matrices.
The BDTI benchmark is reported per Cholesky kernel, and each kernel allows selection of the matrix size, vector size, and number of channels. Larger matrices need longer vectors and therefore achieve higher per-core performance: a [360 x 360] matrix reaches 91 GFLOP/s, a [60 x 60] matrix allows two cores for a total of 78 GFLOP/s, and a [30 x 30] matrix supports three cores, also totaling 78 GFLOP/s.
FPGAs seem better suited for smaller data sizes. One reason is that the computational load grows with N³, while data I/O increases with N². As data grows, the GPU’s I/O bottleneck becomes less of an issue. Another consideration is throughput. As matrix size increases, the throughput per matrix decreases, sometimes falling below application requirements. To address this, large matrices are often decomposed into smaller sub-matrices.
For FFTs, the computational load increases with N log₂ N, while data I/O increases with N. For larger datasets, GPUs are more efficient, while FPGAs excel at shorter FFT lengths, typically in the thousands of points. GPUs are better suited for FFTs with hundreds of thousands of points.
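These scaling arguments can be made concrete with a quick back-of-the-envelope calculation. The sketch below uses the standard textbook operation counts (roughly N³/3 FLOPs for Cholesky and 5·N·log₂N for an FFT), not measured figures, to show how compute per unit of data moved grows with problem size.

```c
#include <math.h>
#include <stdio.h>

/* Compare how compute grows relative to data moved for the two algorithms:
 * Cholesky needs ~N^3/3 FLOPs over ~N^2 matrix elements (ratio grows ~N),
 * an FFT needs ~5*N*log2(N) FLOPs over ~N samples (ratio grows ~log2(N)). */
int main(void)
{
    const double chol_sizes[] = { 30, 60, 360, 512 };
    const double fft_sizes[]  = { 4096, 65536, 262144 };

    for (int i = 0; i < 4; i++) {
        double n = chol_sizes[i];
        printf("Cholesky N = %-6.0f  FLOPs per element ~ %6.0f\n",
               n, (n * n * n / 3.0) / (n * n));
    }
    for (int i = 0; i < 3; i++) {
        double n = fft_sizes[i];
        printf("FFT      N = %-6.0f  FLOPs per sample  ~ %6.0f\n",
               n, 5.0 * n * log2(n) / n);
    }
    return 0;
}
```

The compute-to-I/O ratio grows linearly with N for Cholesky but only logarithmically for the FFT, which is why GPUs need very large problems before the data-movement bottleneck stops dominating.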
GPU and FPGA design methodologies also differ. GPUs are programmed in CUDA or OpenCL, while FPGAs are typically designed in HDLs such as Verilog or VHDL. Although newer HDL versions include floating-point type definitions, they are poorly suited to floating-point design; SystemVerilog, for example, uses shortreal for single precision and real for double precision.
Integrating floating-point datapaths into FPGAs using these traditional methods is inefficient; the low performance of the Xilinx FPGA on the Cholesky algorithm confirms this. Altera takes a different approach, including MathWorks-based design entry through the DSP Builder Advanced Blockset. This tool supports both fixed- and floating-point numbers and vectorization, and maps floating-point circuits efficiently onto the FPGA's fixed-point architecture.
OpenCL compilation for FPGAs allows code written for AMD or Nvidia GPUs to be run on an FPGA without going through a typical FPGA development flow. This offers several advantages over GPUs. First, GPU I/O is limited: data must generally pass through the host CPU over PCI Express, adding latency. FPGAs provide flexible, high-bandwidth I/O that supports streaming data directly from ADC/DAC converters or from Gigabit Ethernet and Serial RapidIO (SRIO) interfaces.
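To illustrate what reusing GPU code means in practice, the snippet below is an ordinary OpenCL C kernel of the kind written for GPUs; an FPGA OpenCL compiler turns the same source into a pipelined hardware datapath rather than into GPU threads. It is a generic, hypothetical example, not code taken from the benchmarks above.

```c
/* Generic OpenCL C kernel: single-precision AXPY, y = alpha * x + y.
 * The same source can target a GPU or, through an FPGA OpenCL compiler,
 * be turned into a deeply pipelined datapath in the FPGA fabric. */
__kernel void saxpy(const float alpha,
                    __global const float *restrict x,
                    __global float *restrict y)
{
    size_t i = get_global_id(0);   /* one work-item per vector element */
    y[i] = alpha * x[i] + y[i];
}
```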
FPGA processing latency is also much lower than that of a GPU. GPUs need many thousands of threads in flight to run efficiently, because individual threads stall on memory reads; FPGAs instead use a coarse-grained parallel architecture built from multiple optimized parallel datapaths, each producing one result per clock cycle. Although there are far fewer datapaths than GPU cores, each has much higher throughput, which yields the low latency that many applications demand.
Another advantage is low power consumption, resulting in much better GFLOP/s per watt. BDTI measurements show FPGAs achieve 5–6 GFLOP/s per watt for complex floating-point algorithms, compared to GPUs’ 0.25 GFLOP/s per watt. This highlights FPGAs' energy efficiency.
Both the OpenCL and DSP Builder flows rely on "fused datapath" technology, which reduces the number of barrel-shifter circuits and makes large-scale, high-performance floating-point designs practical on FPGAs.
To reduce how often barrel shifts are needed, the synthesis process uses larger mantissa widths, which removes the need for normalization and denormalization at every stage. Larger multipliers, such as 27 x 27 and 36 x 36, support these extended mantissas for single- and double-precision calculations, and FPGA logic is already optimized for the large fixed-point adders, including their built-in carry chains.
In many linear algebra algorithms, vector dot products are the dominant operation. A length-64 single-precision dot product requires 64 floating-point multipliers feeding an adder tree of 63 floating-point adders, which would normally involve a barrel shift at every stage. Instead, the multiplier outputs can be denormalized to a common exponent and summed in a fixed-point adder tree, avoiding the intermediate normalization steps entirely.
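A simplified software model of the idea is sketched below: the products are aligned to one shared exponent, accumulated in a wide fixed-point register, and normalized only once at the end. The bit widths here are illustrative assumptions; a real datapath chooses them to bound rounding error.

```c
#include <math.h>
#include <stdint.h>

/* Software sketch of the fused-datapath idea for a dot product of up to
 * 64 single-precision terms:
 *   1. multiply pairwise in floating point,
 *   2. align (denormalize) every product to the largest exponent,
 *   3. accumulate in a wide fixed-point register,
 *   4. normalize a single time at the end.
 * The 40-bit fraction scaling is illustrative only. */
float fused_dot(const float *a, const float *b, int n)   /* assumes n <= 64 */
{
    float prod[64];
    int   max_exp = 0;
    int   any     = 0;

    for (int i = 0; i < n; i++) {
        prod[i] = a[i] * b[i];
        if (prod[i] != 0.0f) {
            int e;
            frexpf(prod[i], &e);          /* binary exponent of the product */
            if (!any || e > max_exp) { max_exp = e; any = 1; }
        }
    }
    if (!any)
        return 0.0f;                       /* all products were zero */

    int64_t acc = 0;                       /* wide fixed-point accumulator */
    for (int i = 0; i < n; i++)            /* shift onto the common exponent */
        acc += (int64_t)llrintf(ldexpf(prod[i], 40 - max_exp));

    /* single normalization back to floating point */
    return (float)ldexp((double)acc, max_exp - 40);
}
```

In hardware, the shifts in step 2 are the barrel shifters being minimized, while the accumulation in step 3 maps onto the FPGA's carry-chain-based fixed-point adders.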
This method actually produces more accurate results than traditional IEEE 754 floating point, as shown in Table 3, and the BDTI benchmarks confirmed similar results.
Using the Cholesky decomposition algorithm, Table 3 shows that the double-precision implementation is roughly one billion times more accurate than the single-precision one, and that the fused-datapath method maintains this accuracy. The errors are calculated using the Frobenius norm.
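The error metric itself is straightforward: rebuild the matrix from the computed factor and take the Frobenius norm of the difference, usually relative to the norm of the original matrix. A minimal sketch is shown below, assuming row-major storage and a lower-triangular factor L; this is the standard definition, not BDTI's exact test harness.

```c
#include <math.h>

/* Relative reconstruction error ||A - L*L^T||_F / ||A||_F for a
 * lower-triangular Cholesky factor L (both n x n, row-major). */
double cholesky_error(const float *a, const float *l, int n)
{
    double err = 0.0, ref = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double rebuilt = 0.0;
            int kmax = (i < j) ? i : j;          /* L is lower triangular */
            for (int k = 0; k <= kmax; k++)
                rebuilt += (double)l[i * n + k] * (double)l[j * n + k];
            double d = (double)a[i * n + j] - rebuilt;
            err += d * d;
            ref += (double)a[i * n + j] * (double)a[i * n + j];
        }
    }
    return sqrt(err) / sqrt(ref);
}
```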
Designs built today in DSP Builder or OpenCL can be retargeted to next-generation FPGAs, which are expected to reach up to 100 peak GFLOP/s per watt through a combination of architectural and process-technology innovations.
In conclusion, high-performance computing now has new platform choices. FPGAs offer low latency and high GFLOP/s for specific floating-point algorithms, and excellent GFLOP/s per watt in most applications. This advantage will grow with next-gen FPGAs.
Altera’s OpenCL compiler offers GPU programmers a seamless way to evaluate this new architecture. It is compliant with the OpenCL 1.2 specification, comes with comprehensive library support, and takes care of timing closure, external DDR memory management, and the PCIe host interface on the developer's behalf.
For non-GPU developers, Altera’s DSP Builder Advanced Blockset enables high-Fmax fixed- and floating-point DSP designs within the MathWorks environment. FPGA developers have long used this tool to achieve performance comparable to hand-coded HDL from experienced developers.