Features


  • CUDA (using PyCUDA or CuPy) and OpenCL (using PyOpenCL) backends

  • complex (C2C) transforms

  • R2C/C2R, now fully supporting odd sizes for the fast axis with inplace transforms

  • Discrete Cosine or Sine transforms (DCT/DST) of type 1, 2, 3 and 4

  • out-of-place or inplace

  • single and double precision for all transforms (double precision requires device support)

  • Allows up to 8 FFT dimensions by default (can be increased by using VKFFT_MAX_FFT_DIMENSIONS when installing).

  • arrays can have more dimensions than the FFT (batch transforms).

  • Options are available to tune (manually or automatically) the performance for specific GPUs.

  • arbitrary array size, using the Bluestein algorithm for prime numbers > 13 (note that in this case the performance can be significantly lower, by up to ~4x, depending on the transform size; see the example performance plot below). Rader's FFT algorithm is now also used for primes from 17 up to the maximum shared memory length (~10000, see VkFFT's documentation for details)

  • transform along a given list of axes, e.g. using a 4-dimensional array and supplying axes=(-3,-1). For R2C transforms, the fast axis must be transformed.
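As an illustration of what transforming along axes=(-3,-1) of a 4-dimensional array means, here is a minimal sketch that uses NumPy on the CPU as a stand-in (an assumption made so the example runs without a GPU; it does not call pyvkfft). The array shape is arbitrary. The two untransformed axes behave as batch dimensions, and for the R2C case the fast (last) axis is the one that gets halved.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (3, 8, 5, 16)  # arbitrary 4D shape; axes -4 and -2 act as batch axes
a = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# C2C transform along axes -3 and -1 only
f = np.fft.fftn(a, axes=(-3, -1))

# The same result is obtained from two successive 1D transforms
f_seq = np.fft.fft(np.fft.fft(a, axis=-1), axis=-3)

# R2C: the fast (last) axis is transformed and stores n//2 + 1 complex values
r = rng.standard_normal(shape)
fr = np.fft.rfftn(r, axes=(-3, -1))

print(f.shape, fr.shape)  # (3, 8, 5, 16) (3, 8, 5, 9)
```

Each slice taken at fixed batch indices, such as `f[1, :, 2, :]`, is simply the 2D FFT of the corresponding input slice.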

  • normalisation=0 (the array L2 norm is multiplied by the array size on each transform) and 1 (the backward transform divides the L2 norm by the array size, so FFT*iFFT restores the original array)
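The practical difference between the two normalisation settings shows up in a forward + backward round trip. The sketch below reproduces both conventions with NumPy's `norm` keyword (a CPU stand-in chosen for illustration; the mapping to pyvkfft's 0 and 1 values is an assumption based on the description above): without any scaling the round trip returns the input multiplied by the array size, while scaling the backward transform restores the input.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6)) + 1j * rng.standard_normal((4, 6))
n = x.size

# Analogue of normalisation=0: no scaling in either direction
y0 = np.fft.ifftn(np.fft.fftn(x, norm="backward"), norm="forward")

# Analogue of normalisation=1: the backward transform divides by the array size
y1 = np.fft.ifftn(np.fft.fftn(x))

print(np.allclose(y0, n * x), np.allclose(y1, x))  # True True
```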

  • Support for C (default) and F-ordered arrays, for C2C and R2C transforms

  • unit tests for all transforms: see test sub-directory. Note that these take a long time to finish due to the extensive number of sub-tests.

  • Note that the out-of-place C2R transform currently destroys the complex array for FFT dimensions >=2

  • tested on macOS (10.13.6/x86, 12.6/M1), Linux (Debian/Ubuntu, x86-64 and power9), and Windows 10 (Anaconda python 3.8 with Visual Studio 2019 and the CUDA toolkit 11.2)

  • GPUs tested: mostly nVidia cards, but also some AMD cards and macOS with M1 GPUs.

  • inplace transforms do not require an extra buffer or work area (unlike cuFFT), unless the x size is larger than 8192, or the y and z FFT sizes are larger than 2048. In that case a buffer of a size equal to the array is necessary. Thanks to these lower memory requirements, larger FFT transforms are possible than with cuFFT (even for R2C!). For example, you can compute the 3D FFT of a 1600**3 complex64 array with 32GB of memory.
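The memory figure in the example above can be checked with a short calculation. The helper names below (`array_bytes`, `needs_extra_buffer`) are hypothetical and only encode the thresholds quoted in the text, reading "y and z" as "either y or z" (an assumption); they do not query VkFFT itself.

```python
import math
import numpy as np

def array_bytes(shape, dtype):
    """Size in bytes of an array with the given shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize

def needs_extra_buffer(shape):
    """Thresholds quoted in the text for a 3D (z, y, x) shape. Illustrative only."""
    nz, ny, nx = shape
    return nx > 8192 or ny > 2048 or nz > 2048

shape = (1600, 1600, 1600)
nbytes = array_bytes(shape, np.complex64)  # 8 bytes per complex64 element
total = nbytes * (2 if needs_extra_buffer(shape) else 1)

print(f"{total / 2**30:.1f} GiB")  # 30.5 GiB
```

Under these assumptions the inplace transform needs only the array itself, about 30.5 GiB, whereas an additional work area of the same size would roughly double that.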

  • transforms can be done either by creating a VkFFTApp (a.k.a. the fft 'plan') with the selected backend (pyvkfft.cuda for pycuda/cupy or pyvkfft.opencl for pyopencl), or by using the pyvkfft.fft interface, whose fftn, ifftn, rfftn and irfftn functions automatically detect the type of GPU array and cache the corresponding VkFFTApp (see the example notebook pyvkfft-fft.ipynb).
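A minimal sketch of the pyvkfft.fft interface with a PyOpenCL array could look like the following. The wrapper function `fft2_gpu_or_cpu` and its NumPy fallback are hypothetical additions so that the snippet also runs on a machine without pyvkfft, PyOpenCL or a usable device; only the `fftn` call on a GPU array comes from the interface described above.

```python
import numpy as np

def fft2_gpu_or_cpu(a):
    """Return (FFT of a, name of the backend that was used)."""
    try:
        import pyopencl as cl
        import pyopencl.array as cla
        from pyvkfft.fft import fftn

        ctx = cl.create_some_context(interactive=False)
        queue = cl.CommandQueue(ctx)
        d = cla.to_device(queue, a)  # the GPU array type selects the backend
        # Out-of-place transform to a new GPU array, copied back to the host
        return fftn(d).get(), "pyvkfft"
    except Exception:
        # Fallback assumed for this sketch only (not part of pyvkfft)
        return np.fft.fftn(a), "numpy"

rng = np.random.default_rng(0)
a = (rng.standard_normal((64, 64))
     + 1j * rng.standard_normal((64, 64))).astype(np.complex64)
out, backend = fft2_gpu_or_cpu(a)
print(backend, out.shape)
```

Calling `fftn` again with an array of the same shape and type should reuse the cached VkFFTApp rather than create a new plan, as stated above.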

  • the pyvkfft-test command-line script can be used to test specific transforms against expected accuracy values, for all types of transforms.

  • pyvkfft results are evaluated before any release with a comprehensive test suite, comparing transform results for all types of transforms: single and double precision, 1D, 2D and 3D, inplace and out-of-place, different norms, radix and Bluestein, etc. The pyvkfft-test-suite script can be used to run the full suite, which takes more than two days on an A40 GPU using up to 16 parallel processes, with about 1.5 million unit tests. Here are the test results for pyvkfft 2024.1: