Over a few weeks I have been experimenting with radix sort in this gamedev.ru thread: https://gamedev.ru/flame/forum/?id=227644 .
I decided to repost the findings here, because:
1. Hopefully they will be helpful to someone.
2. I would like to get more feedback.
3. There seems to be lack of general discussion regarding implementation details of radix sort.
The primary artifact of the experiments is the following radix sort library:
https://rextester.com/HLQELD48880
This contains both the library and test harness.
Fair warning: it is undertested at the moment, and there might be stupid bugs (having people too look at it, both from standpoint of correctness, and performance is one of the reasons I'm posting this).
Here's also experimental testbed, which the above library grew from:
https://rextester.com/HEZLFE12709
Note: confusingly, it uses _msb and _lsb suffixes, even though algorithms operate on digits, not bits.
Now, to discuss some points of radix sort.
If you are not familiar with radix sort, wikipedia provides reasonable introduction: https://en.wikipedia.org/wiki/Radix_sort , so I'm not repeating that here.
Generally, radix sort comes in 2 flavors: LSD (sorting least significant digits first) and MSD (sorting most significant digits first). Both can be implemented as out-of-place stable sort. MSD can also be implemented as in-place (not stable) sort.
Here are some observations, in no particular order:
1. People often seem to be concerned with sorting a plain array of numbers. I find that a rather rare use-case: generally, you want to know, which objects these numbers correspond to. So I think struct{uint32_t key;uint32_t payload;} the most likely use-case. Here payload is some sort of handle (say, index in an array; it also can be a pointer, but that is bulkier on x64) to your actual "fat" object (not much reason for lugging those around).
2. I often encountered statement, that MSD is preferable for large arrays, for 2 reasons: it has better cache locality, and it can descend into fallback search for small sizes. That does not quite match the experiments: for struct{uint32_t key;uint32_t payload;} in my tests LSD is consistently faster for n>2000. For larger (>=64b) keys MSD does start to win. For larger payload with small key MSD sometimes wins.
3. I've seen fallback sort for MSD usually implemented as an insertion sort. It raises the problem, of when to switch to it: typical (for other alorithms, say introsort) threshold of ~16 is too small, and running radix sort on it incurs extra overhead; the sensible threshold for radix sort, ~64, makes insertion sort rather slow. I've implemented a merge sort (with threshold of 128/256), which further falls back to insertion sort for n<=18.
4. Prefetching helps. A lot. I'm talking like x3 speedup on large inputs. And, BTW, that's prefetching writing.
5. Input data matters. While radix sort (especially LSD) doesn't seem to care, whether the array of random keys is sorted or not, the keys being a sequence of 0..(n-1) slows down execution x3 (somewhat less, if shuffled). The current best guess is that it is due to a lot of power-of-two sized buckets causing cache aliasing. Not sure, how to fix it.
6. Optimization options (in this implementation) do not seem to matter much, beyond basic (as in -O1).
7. Replacing insertion sort with sorting network doesn't seem to help (on its own sorting network does slightly beat insertion sort).
Very broadly, on struct{uint32_t key;uint32_t payload;} for n>4000, (out-of-place) radix sort tends to be about 5 times faster, than std::sort.
Here are some charts (who doesn't like charts?):
Please note, that these correspond to somewhat outdated versions.
Most of the timings are from rextester.com above. I realize, that benchmarking on an online compiler is not the best idea, but these tests are transparent: everyone sees the same results, along with the knobs set to made them.
If you can find the time, I would appreciate results from your machine (preferably on larger range of sizes; rextester.com only allows ~5 seconds runtime). Ideally, you could also provide your CPU and RAM information (2 tabs from https://www.cpuid.com/softwares/cpu-z.html would do wonderfully).
Especially welcome are results from non-x64 machines.
I invite both general discussion regarding radix sort, and comments on this specific implementation.