Performance

The pyvkfft-benchmark script can be used to run simple or systematic tests, and can also compare the results with cuFFT and clFFT.

Example results for batched 2D, single precision FFT with array dimensions of batch x N x N using a V100:

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_V100-Linux.png

Notes regarding this plot:

  • the computed throughput is theoretical: it assumes that each transform axis, for the (FFT, iFFT) pair, requires exactly one read and one write. This is not true in practice, which explains the drop after N=1024 for cuFFT and (to a lesser extent) vkFFT.

  • the batch size is adapted for each N so that the transform takes long enough; in practice the transformed array is around 600MB. Transforms on small arrays with small batch sizes could give lower performance, or better performance when fully cached.
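The two notes above can be turned into a small back-of-the-envelope calculation. The sketch below is not the pyvkfft-benchmark code; the function names are hypothetical, and the unit (GiB, i.e. 1024**3 bytes) and the exact 600 MiB target are assumptions made for illustration:

```python
def theoretical_throughput(shape, itemsize, ndim, dt, nb_transforms=2):
    """Theoretical throughput in GiB/s (assumed unit) for one (FFT, iFFT) pair.

    Assumes exactly one read and one write of the whole array per
    transformed axis and per transform, as stated in the notes above.
    shape: full array shape, e.g. (batch, N, N)
    itemsize: bytes per element (8 for complex64)
    ndim: number of transformed axes (2 for a batched 2D FFT)
    dt: measured time in seconds for the nb_transforms transforms
    """
    nbytes = itemsize
    for n in shape:
        nbytes *= n
    moved = nbytes * ndim * 2 * nb_transforms  # 2 = one read + one write
    return moved / dt / 1024 ** 3


def batch_for_target_size(n, itemsize=8, target_bytes=600 * 1024 ** 2):
    """Batch size so that a (batch, n, n) array is close to target_bytes."""
    return max(1, target_bytes // (n * n * itemsize))


# Example: batch size for N=1024, complex64
batch = batch_for_target_size(1024)  # → 75
```

Because the formula counts only the idealised memory traffic, a transform that needs extra passes over the array appears as a lower throughput, which is the drop seen for large N.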

The general results are:

  • vkFFT throughput is similar to cuFFT up to N=1024. For N>1024 vkFFT is much more efficient than cuFFT due to the smaller number of reads and writes per FFT axis (apart from isolated radix-2 3 sizes)

  • the OpenCL and CUDA backends of vkFFT perform similarly, though there are ranges where CUDA performs better, due to different caching. [Note that if the card is also used for display, the difference can increase, e.g. for NVIDIA cards OpenCL performance is more affected by display use than the CUDA backend]

  • clFFT (via gpyfft) generally performs much worse than the other transforms, though this was only tested using NVIDIA cards. (Note that the clFFT/gpyfft benchmark tries all FFT axis permutations to find the fastest combination)

Another example on an A40 card (only with radix<=13 transforms):

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-2DFFT-NVIDIA-Tesla_A40-Linux-radix13.png

On this card cuFFT is significantly faster, even though the radix-11 and radix-13 transforms supported by vkFFT give better results overall.

Performance tuning

Starting with VkFFT 1.3.0 and pyvkfft 2023.2, it is possible to tweak low-level parameters, including coalesced memory or warp size, batch grouping, number of threads, etc.

Optimising those is difficult, so only do it out of curiosity or when trying to get some extra performance. Generally, VkFFT defaults work quite well. Using the simple FFT API, you can activate auto-tuning by passing tuning=True to the transform functions (fftn, rfftn, etc.). Only do this for iterative processes which really require fine-tuning!

Here is an example of the benchmark run on a V100 GPU while tuning the coalescedMemory parameter (default value=32):

https://raw.githubusercontent.com/vincefn/pyvkfft/master/doc/benchmark-V100-cuda-2D-coalmem.png

As you can see, the optimal value varies with the 2D array size: below n=1536, coalescedMemory=64 gives the best results; 32 (default) is best between 1536 and 2048; and above that there is little difference between the values chosen.

The same test on an A40 shows little difference. On an Apple M1 Pro, it is the aimThreads parameter which benefits from tuning: changing it from 128 (default) to 64 yields up to 50% faster transforms. YMMV!
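Since the best value depends on both the GPU and the array size, a manual sweep is a simple way to check a parameter on your own hardware. The sketch below is a generic helper, not part of pyvkfft: pick_best and the measure callable are hypothetical names, and measure(value) is assumed to run the transform with the given parameter value, synchronise the GPU, and return the elapsed time in seconds.

```python
from statistics import median


def pick_best(candidates, measure, repeats=5):
    """Return (best_value, timings) for a one-parameter sweep.

    candidates: parameter values to try, e.g. [16, 32, 64]
    measure: hypothetical callable, measure(value) -> elapsed seconds;
        it must wait for the GPU to finish before stopping its timer
    repeats: number of measurements per value (the median is kept)
    """
    timings = {}
    for value in candidates:
        timings[value] = median(measure(value) for _ in range(repeats))
    best = min(timings, key=timings.get)
    return best, timings


# Synthetic placeholder timings (NOT measurements), only to show the call:
placeholder = {16: 3.0, 32: 2.0, 64: 1.0}
best, timings = pick_best(placeholder, lambda v: placeholder[v], repeats=3)
```

Using the median rather than the minimum or mean makes the choice less sensitive to occasional slow runs, for example when the card is also driving a display.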