I wrote this post in response to another post about radix sort on the CPU. This is a radix sort for the GPU. It reaches about 900 Mkeys/s, sorting 8 million elements in 9.3 ms on an RTX 2070. It is written in C++ with the Vulkan API and GLSL, and relies on bitfield operations and warp (subgroup) tricks. To understand the shader code you need a very good knowledge of bitfield manipulation and GPU subgroup operations. The NVIDIA path also requires knowledge of the subgroup partition extension.
GitHub source code: https://github.com/world8th/RadX
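To give a feel for the bitfield and subgroup tricks mentioned above, here is a minimal CPU-side sketch in C++. It is not taken from the RadX shaders. It only emulates the general idea that I assume such shaders rely on: pull a digit out of each key with a bitfield extract (GLSL has `bitfieldExtract` for this), find which lanes of a subgroup hold the same digit (on NVIDIA, `subgroupPartitionNV` from `GL_NV_shader_subgroup_partitioned` returns such a mask), and use a popcount over the lower lanes to get each key's stable rank inside its bucket. The subgroup width of 32 and all function names here are assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed subgroup (warp) width; 32 matches NVIDIA hardware.
constexpr std::size_t kSubgroupSize = 32;

// CPU stand-in for an unsigned bitfield extract: returns `bits` bits of
// `key` starting at bit `offset`.
inline std::uint32_t extract_digit(std::uint32_t key, unsigned offset, unsigned bits) {
    assert(offset < 32u && bits >= 1u && bits <= 32u);
    const std::uint32_t mask = (bits >= 32u) ? 0xFFFFFFFFu : ((1u << bits) - 1u);
    return (key >> offset) & mask;
}

// Portable population count (number of set bits).
inline unsigned popcount32(std::uint32_t v) {
    unsigned n = 0;
    while (v != 0u) {
        v &= v - 1u;  // clear the lowest set bit
        ++n;
    }
    return n;
}

// For every lane, build a bitmask of all lanes whose key has the same digit.
// This emulates what a subgroup partition / match operation would return.
inline std::array<std::uint32_t, kSubgroupSize>
partition_masks(const std::array<std::uint32_t, kSubgroupSize>& keys,
                unsigned offset, unsigned bits) {
    std::array<std::uint32_t, kSubgroupSize> masks{};
    for (std::size_t lane = 0; lane < kSubgroupSize; ++lane) {
        const std::uint32_t digit = extract_digit(keys[lane], offset, bits);
        for (std::size_t other = 0; other < kSubgroupSize; ++other) {
            if (extract_digit(keys[other], offset, bits) == digit) {
                masks[lane] |= (1u << other);
            }
        }
    }
    return masks;
}

// Rank of `lane` inside its partition: how many lower-numbered lanes share
// its digit. Counting only lower lanes keeps the ordering stable.
inline unsigned rank_in_partition(std::uint32_t mask, std::size_t lane) {
    assert(lane < kSubgroupSize);
    const std::uint32_t lower_lanes = (1u << lane) - 1u;
    return popcount32(mask & lower_lanes);
}
```

On a real GPU the inner loop of `partition_masks` disappears: every lane gets its mask from a single subgroup instruction, which is why a wide digit becomes affordable on hardware that supports it.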
Classification: Radix Sort (LSD)
Stable: yes
Parallel: yes, vector supported
Bit width (digit size per pass): 8-bit (Turing), 2-bit (other GPUs), can be changed
Device type: GPU
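For readers who want a reference point for the properties listed above (LSD, stable, configurable digit width), here is a small single-threaded C++ sketch of the same class of algorithm. It is a plain CPU version written for illustration, not the GPU implementation from the repository, and the function name `lsd_radix_sort` and its `bits_per_pass` parameter are my own. Each pass builds a histogram of the current digit, turns it into starting offsets with an exclusive prefix sum, and scatters the keys in their existing order, which is what makes each pass stable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sorts 32-bit keys in ascending order, processing `bits_per_pass` bits per
// pass from the least significant digit upward.
inline void lsd_radix_sort(std::vector<std::uint32_t>& keys, unsigned bits_per_pass = 8) {
    assert(bits_per_pass >= 1u && bits_per_pass <= 16u);
    const std::size_t buckets = std::size_t{1} << bits_per_pass;
    const std::uint32_t mask = static_cast<std::uint32_t>(buckets - 1);

    std::vector<std::uint32_t> scratch(keys.size());
    std::vector<std::size_t> offsets(buckets);

    for (unsigned shift = 0; shift < 32u; shift += bits_per_pass) {
        // 1. Histogram: count how many keys fall into each bucket.
        std::fill(offsets.begin(), offsets.end(), std::size_t{0});
        for (std::uint32_t k : keys) {
            ++offsets[(k >> shift) & mask];
        }

        // 2. Exclusive prefix sum: turn counts into bucket start positions.
        std::size_t running = 0;
        for (std::size_t& slot : offsets) {
            const std::size_t count = slot;
            slot = running;
            running += count;
        }

        // 3. Stable scatter: keys with equal digits keep their relative order.
        for (std::uint32_t k : keys) {
            scratch[offsets[(k >> shift) & mask]++] = k;
        }
        keys.swap(scratch);
    }
}
```

With 8 bits per pass a 32-bit key needs 4 passes over the data, and with 2 bits per pass it needs 16. That is the trade-off behind the digit width setting: fewer passes versus more buckets to count per pass.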
Also, what would it take to get the fastest GPU radix sort ever?