🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Radix sort discussion

Started by
4 comments, last by elviras9t 5 years, 3 months ago

Over a few weeks I have been experimenting with radix sort in this gamedev.ru thread: https://gamedev.ru/flame/forum/?id=227644 .

I decided to repost the findings here, because:
1. Hopefully they will be helpful to someone.
2. I would like to get more feedback.
3. There seems to be lack of general discussion regarding implementation details of radix sort.

The primary artifact of the experiments is the following radix sort library:

https://rextester.com/HLQELD48880
This contains both the library and test harness.
Fair warning: it is undertested at the moment, and there might be stupid bugs (having people too look at it, both from standpoint of correctness, and performance is one of the reasons I'm posting this).

Here's also experimental testbed, which the above library grew from:
https://rextester.com/HEZLFE12709
Note: confusingly, it uses _msb and _lsb suffixes, even though algorithms operate on digits, not bits.

Now, to discuss some points of radix sort.
If you are not familiar with radix sort, wikipedia provides reasonable introduction: https://en.wikipedia.org/wiki/Radix_sort , so I'm not repeating that here.
Generally, radix sort comes in 2 flavors: LSD (sorting least significant digits first) and MSD (sorting most significant digits first). Both can be implemented as out-of-place stable sort. MSD can also be implemented as in-place (not stable) sort.

Here are some observations, in no particular order:
1. People often seem to be concerned with sorting a plain array of numbers. I find that a rather rare use-case: generally, you want to know, which objects these numbers correspond to. So I think struct{uint32_t key;uint32_t payload;} the most likely use-case. Here payload is some sort of handle (say, index in an array; it also can be a pointer, but that is bulkier on x64) to your actual "fat" object (not much reason for lugging those around).
2. I often encountered statement, that MSD is preferable for large arrays, for 2 reasons: it has better cache locality, and it can descend into fallback search for small sizes. That does not quite match the experiments: for struct{uint32_t key;uint32_t payload;} in my tests LSD is consistently faster for n>2000. For larger (>=64b) keys MSD does start to win. For larger payload with small key MSD sometimes wins.
3. I've seen fallback sort for MSD usually implemented as an insertion sort. It raises the problem, of when to switch to it: typical (for other alorithms, say introsort) threshold of ~16 is too small, and running radix sort on it incurs extra overhead; the sensible threshold for radix sort, ~64, makes insertion sort rather slow. I've implemented a merge sort (with threshold of 128/256), which further falls back to insertion sort for n<=18.
4. Prefetching helps. A lot. I'm talking like x3 speedup on large inputs. And, BTW, that's prefetching writing.
5. Input data matters. While radix sort (especially LSD) doesn't seem to care, whether the array of random keys is sorted or not, the keys being a sequence of 0..(n-1) slows down execution x3 (somewhat less, if shuffled). The current best guess is that it is due to a lot of power-of-two sized buckets causing cache aliasing. Not sure, how to fix it.
6. Optimization options (in this implementation) do not seem to matter much, beyond basic (as in -O1).
7. Replacing insertion sort with sorting network doesn't seem to help (on its own sorting network does slightly beat insertion sort).

Very broadly, on struct{uint32_t key;uint32_t payload;} for n>4000, (out-of-place) radix sort tends to be about 5 times faster, than std::sort.

Here are some charts (who doesn't like charts?):

Spoiler

plot1.png

plot2.png

plot3.png

plot4.png

plot5ld.png

plot6.png
plot7.png
plot8.png
plot9.png

 

Please note, that these correspond to somewhat outdated versions.

Most of the timings are from rextester.com above. I realize, that benchmarking on an online compiler is not the best idea, but these tests are transparent: everyone sees the same results, along with the knobs set to made them.
If you can find the time, I would appreciate results from your machine (preferably on larger range of sizes; rextester.com only allows ~5 seconds runtime). Ideally, you could also provide your CPU and RAM information (2 tabs from https://www.cpuid.com/softwares/cpu-z.html would do wonderfully).
Especially welcome are results from non-x64 machines.

I invite both general discussion regarding radix sort, and comments on this specific implementation.
 

Advertisement

Visual Studio does not compile testbed:


if (n & 1) ++c[p[-data_size] << 1]; // data_size is unsigned; compiler throws error over this

Here are my results.

CPU-Z:

Spoiler

5sehOIQ.png

Compiled ax x86:

Spoiler


System: 32bit
Estimated CPU frequency: 3.691 GHz
Algorithm                    |      16|      22|      31|      43|      60|      84|     118|     166|     234|     330|     466|     659|     932|    1318|    1864|    2636|    3728|    5272|    7456|   10544|   14912|   21089|   29825|   42181|   59655|   84369|  119321|  168753|  238664|  337539|  477376|  675146|  954849| 1350429| 1909892| 2701132
std_sort                     |  14.0  |  15.2  |  17.5  |  15.1  |  22.6  |  40.4  |  22.4  |  26.5  |  38.4  |  64.9  |  58.4  |  89.4  |  88.0  | 106.3  | 119.6  | 127.5  | 137.1  | 145.2  | 152.8  | 157.9  | 163.6  | 171.5  | 176.6  | 184.8  | 185.6  | 218.7  | 216.5  | 218.7  | 216.5  | 218.7  | 232.0  | 235.1  | 243.5  | 248.7  | 255.1  | 262.4
std_stable_sort              |  14.6  |  15.4  |  17.9  |  21.7  |  20.9  |  21.4  |  42.2  |  54.0  |  31.1  |  45.1  |  57.9  |  65.4  |  86.6  |  88.1  | 101.8  | 108.0  | 114.0  | 120.5  | 126.1  | 131.0  | 138.5  | 144.1  | 150.0  | 155.5  | 185.6  | 175.0  | 185.6  | 175.0  | 185.6  | 185.9  | 201.0  | 202.3  | 212.6  | 224.1  | 230.0  | 233.7
merge_sort_ins_full<18>      |  13.5  |  16.6  |  18.0  |  32.5  |  22.4  |  26.6  |  32.7  |  43.1  |  51.3  |  69.0  |  82.1  | 102.2  | 106.6  | 124.5  | 128.1  | 142.5  | 145.9  | 159.1  | 161.0  | 173.9  | 177.9  | 191.9  | 191.8  | 204.7  | 185.6  | 218.7  | 216.5  | 218.7  | 232.0  | 251.5  | 247.4  | 256.9  | 262.9  | 276.0  | 278.3  | 291.1
merge_sort_ins_new<18>       |  16.4  |  16.2  |  17.3  |  27.3  |  21.9  |  26.5  |  28.7  |  69.4  |  76.7  |  62.8  |  77.7  |  95.4  | 104.9  | 121.8  | 126.1  | 143.2  | 144.2  | 158.3  | 159.7  | 172.8  | 174.1  | 187.1  | 188.3  | 201.6  | 185.6  | 218.7  | 216.5  | 240.6  | 232.0  | 240.6  | 239.7  | 256.9  | 262.9  | 273.3  | 274.4  | 288.3
radix_sort_msb<4>            |  14.3  |  16.6  |  25.2  |  21.3  |  22.3  |  28.9  |  58.0  |  34.7  |  44.9  |  25.3  |  30.7  |  43.4  |  51.7  |  70.1  |  76.0  |  92.2  |  91.6  |  54.8  |  60.1  |  70.5  |  74.3  |  85.6  |  88.2  | 100.4  | 123.7  |  87.5  |  61.9  |  65.6  |  77.3  |  98.4  | 100.5  | 109.3  | 108.2  |  71.1  |  75.4  |  84.7
radix_sort_msb<5>            |  14.6  |  16.5  |  17.5  |  24.7  |  38.0  |  26.6  |  28.3  |  39.0  |  72.4  |  33.0  |  25.9  |  33.6  |  44.0  |  59.0  |  64.3  |  77.2  |  80.8  |  93.8  |  92.5  |  44.3  |  48.7  |  56.0  |  60.9  |  70.8  |  61.9  |  87.5  |  92.8  |  87.5  |  92.8  |  43.7  |  54.1  |  65.6  |  65.7  |  79.3  |  83.1  |  94.3
radix_sort_msb<8, 128>       |  14.3  |  16.5  |  17.5  |  21.1  |  40.3  |  48.0  |  28.0  |  17.6  |  21.6  |  14.9  |  15.3  |  26.8  |  29.1  |  27.3  |  31.3  |  34.7  |  39.3  |  45.7  |  51.3  |  61.1  |  65.5  |  77.0  |  74.5  |  46.5  |   0.0  |  43.7  |  30.9  |  43.7  |  46.4  |  43.7  |  46.4  |  43.7  |  46.4  |  57.4  |  61.8  |  69.7
radix_sort_msb<11, 256>      |  15.0  |  16.9  |  17.8  |  26.8  |  23.3  |  26.9  |  27.7  |  51.7  |  76.0  |  55.4  |  40.9  |  39.1  |  42.0  |  41.8  |  37.6  |  32.6  |  35.2  |  36.8  |  37.7  |  38.0  |  39.6  |  40.3  |  43.8  |  50.9  |  61.9  |  43.7  |  92.8  |  87.5  |  92.8  |  98.4  | 100.5  |  76.5  |  73.4  |  65.6  |  61.8  |  57.4
radix_sort_msb<11>           |  14.9  |  18.6  |  17.8  |  21.3  |  22.9  |  26.4  |  29.0  |  35.8  |  44.7  |  55.9  |  43.0  |  38.7  |  41.1  |  42.1  |  37.8  |  32.7  |  35.3  |  36.6  |  37.6  |  38.4  |  39.5  |  40.0  |  43.9  |  51.3  |  61.9  |  87.5  |  61.9  |  65.6  |  77.3  |  87.5  | 100.5  |  82.0  |  73.4  |  65.6  |  61.8  |  56.0
radix_sort_lsb<4>            |  71.6  |  68.6  |  71.5  |  66.1  |  63.0  |  60.7  |  60.5  |  59.0  |  57.7  |  56.9  |  56.5  |  56.5  |  56.8  |  56.3  |  55.9  |  56.0  |  56.1  |  56.5  |  56.9  |  56.1  |  56.5  |  56.7  |  57.7  |  56.9  |  61.9  |  43.7  |  61.9  |  43.7  |  61.9  |  54.7  |  61.9  |  65.6  |  61.8  |  62.9  |  61.8  |  61.5
radix_sort_lsb<5>            |  70.5  |  64.7  |  67.5  |  61.7  |  56.7  |  54.8  |  53.8  |  51.5  |  50.3  |  49.9  |  48.9  |  49.1  |  48.9  |  48.5  |  48.4  |  48.2  |  48.3  |  48.7  |  48.6  |  48.4  |  48.7  |  48.8  |  48.8  |  48.9  |  61.9  |  43.7  |  61.9  |  43.7  |  46.4  |  54.7  |  46.4  |  54.7  |  54.1  |  54.7  |  54.1  |  54.7
radix_sort_lsb<6>            |  78.4  |  67.7  |  59.7  |  52.5  |  48.1  |  45.3  |  45.1  |  42.8  |  40.8  |  39.7  |  39.7  |  39.1  |  38.5  |  38.2  |  38.1  |  38.5  |  38.3  |  38.4  |  38.5  |  38.8  |  39.3  |  39.5  |  39.5  |  39.3  |  61.9  |  43.7  |  61.9  |  43.7  |  46.4  |  43.7  |  46.4  |  43.7  |  46.4  |  46.5  |  46.4  |  46.5
radix_sort_lsb<7>            |  90.3  |  75.9  |  62.3  |  53.3  |  47.0  |  42.5  |  40.2  |  38.4  |  36.3  |  34.9  |  34.3  |  33.7  |  33.3  |  32.8  |  32.7  |  32.7  |  32.6  |  32.6  |  32.5  |  32.4  |  33.1  |  32.9  |  33.0  |  33.9  |  61.9  |  43.7  |  30.9  |  43.7  |  30.9  |  32.8  |  38.7  |  38.3  |  38.7  |  38.3  |  38.7  |  38.3
radix_sort_lsb<8>            | 136.4  | 102.2  |  80.6  |  65.4  |  53.5  |  46.1  |  40.6  |  36.0  |  32.9  |  31.2  |  29.5  |  28.3  |  27.4  |  26.9  |  26.8  |  26.5  |  26.2  |  26.2  |  26.2  |  26.8  |  26.8  |  26.9  |  27.6  |  28.5  |   0.0  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  23.2  |  32.8  |  30.9  |  32.8  |  30.9  |  31.4
radix_sort_lsb_pre<8>        | 117.8  |  97.1  |  79.2  |  65.2  |  54.2  |  47.0  |  42.1  |  38.2  |  35.1  |  33.3  |  31.8  |  31.0  |  30.4  |  29.9  |  29.6  |  29.8  |  29.5  |  29.4  |  29.5  |  29.2  |  29.7  |  29.6  |  29.5  |  29.7  |  61.9  |  43.7  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  27.3  |  34.8  |  32.8  |  30.9  |  32.8
radix_sort_lsb_loop<8>       | 243.6  | 187.3  | 143.0  | 111.4  |  89.4  |  73.3  |  62.9  |  54.2  |  47.9  |  43.7  |  40.5  |  38.4  |  36.7  |  35.7  |  35.1  |  34.8  |  34.2  |  34.2  |  34.7  |  34.0  |  34.8  |  34.7  |  34.6  |  35.7  |   0.0  |   0.0  |  30.9  |  21.9  |  46.4  |  43.7  |  38.7  |  38.3  |  42.5  |  41.0  |  40.6  |  39.6
radix_sort_lsb8_c            | 145.8#~| 113.5#~|  91.5#~|  73.3#~|  61.8#~|  54.9#~|  49.5#~|  44.7#~|  41.4#~|  38.6#~|  38.2#~|  37.1#~|  36.1#~|  35.7#~|  35.2#~|  34.8#~|  34.5#~|  34.6#~|  34.5#~|  35.6#~|  35.9#~|  35.8#~|  35.6#~|  35.8#~|   0.0#~|  43.7#~|  30.9#~|  43.7#~|  30.9#~|  32.8#~|  46.4#~|  43.7#~|  46.4#~|  43.7#~|  52.2#~|  51.9#~
radix_sort_lsb<9>            | 187.5  | 143.4  | 111.2  |  87.5  |  70.3  |  58.0  |  49.4  |  42.7  |  37.6  |  34.5  |  32.2  |  30.6  |  29.3  |  28.2  |  28.1  |  27.5  |  27.2  |  27.2  |  27.0  |  27.3  |  28.5  |  27.7  |  28.6  |  27.9  |   0.0  |   0.0  |  30.9  |  21.9  |  15.5  |  21.9  |  30.9  |  32.8  |  30.9  |  32.8  |  32.9  |  34.2
radix_sort_lsb<10>           | 327.8  | 247.9  | 182.9  | 138.9  | 107.2  |  84.0  |  67.7  |  55.7  |  47.3  |  41.1  |  36.9  |  34.0  |  31.5  |  30.0  |  29.5  |  28.6  |  28.6  |  28.7  |  28.3  |  27.9  |  28.7  |  29.1  |  29.0  |  29.1  |   0.0  |   0.0  |  30.9  |  43.7  |  30.9  |  32.8  |  30.9  |  38.3  |  30.9  |  35.5  |  34.8  |  35.5
radix_sort_lsb<11>           | 500.8  | 373.7  | 271.0  | 200.8  | 149.8  | 113.5  |  86.3  |  67.2  |  54.0  |  44.3  |  37.1  |  32.3  |  28.6  |  26.2  |  24.4  |  23.3  |  22.7  |  22.3  |  22.2  |  22.5  |  23.0  |  23.4  |  23.6  |  24.4  |  61.9  |  43.7  |  30.9  |  21.9  |  15.5  |  21.9  |  30.9  |  27.3  |  27.1  |  30.1  |  32.9  |  32.8
radix_sort_msb_inplace<8>    |  14.3  |  16.2  |  17.9  |  21.3  |  43.5  |  51.5  |  29.5  |  41.9  |  54.6  |  28.6  |  22.8  |  25.1  |  30.4  |  37.7  |  38.7  |  40.8  |  44.2  |  52.6  |  58.5  |  68.6  |  72.4  |  85.3  |  86.6  | 100.8  | 123.7  |  87.5  |  61.9  |  65.6  |  61.9  |  65.6  |  61.9  |  60.1  |  65.7  |  71.1  |  77.3  |  87.5
radix_sort_msb_inplace<8, 256>|  14.1  |  16.7  |  27.2  |  21.1  |  22.2  |  27.4  |  63.1  |  41.4  |  51.9  |  23.1  |  23.5  |  25.4  |  29.6  |  35.6  |  39.0  |  41.0  |  44.2  |  52.4  |  59.1  |  68.6  |  72.2  |  83.8  |  87.1  | 101.0  |  61.9  |  43.7  |  61.9  |  65.6  |  61.9  |  65.6  |  61.9  |  60.1  |  61.8  |  73.8  |  75.4  |  86.1
radix_sort_heuristic         |  16.0  |  16.6  |  17.7  |  22.1  |  22.2  |  26.5  |  28.0  |  29.2  |  15.1  |  14.9  |  24.0  |  26.8  |  22.5  |  27.7  |  26.8  |  27.0  |  26.3  |  26.2  |  26.2  |  26.2  |  27.6  |  26.7  |  26.9  |  27.2  |   0.0  |  43.7  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  27.3  |  30.9  |  30.1  |  32.9  |  32.8
radix_sort_heuristic_inplace |  14.8  |  17.5  |  26.2  |  21.2  |  22.4  |  27.7  |  62.3  |  26.0  |  22.7  |  26.8  |  30.4  |  25.3  |  30.1  |  34.8  |  39.0  |  40.6  |  43.9  |  44.5  |  46.3  |  47.3  |  47.9  |  48.4  |  53.9  |  64.9  |  61.9  |  43.7  |  61.9  |  43.7  |  61.9  |  54.7  |  61.9  |  54.7  |  65.7  |  73.8  |  77.3  |  73.8
radix_sort_heuristic_msb     |  16.9  |  16.4  |  18.0  |  21.2  |  22.4  |  27.0  |  27.7  |  17.7  |  15.4  |  15.3  |  17.4  |  26.2  |  23.1  |  27.9  |  31.2  |  35.2  |  39.6  |  36.9  |  38.3  |  39.3  |  39.8  |  40.0  |  43.6  |  51.5  |  61.9  |  43.7  |  30.9  |  21.9  |  30.9  |  43.7  |  46.4  |  43.7  |  46.4  |  57.4  |  59.9  |  56.0
radix_sort_lsb_separate<8>   |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |  61.9  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  32.8  |  30.9  |  32.8  |  32.9  |  32.8
NOTE: # means not sorted.
      ~ means not stable.

 

Compiled as x64:

Spoiler


System: 64bit
Estimated CPU frequency: 3.691 GHz
Algorithm                    |      16|      22|      31|      43|      60|      84|     118|     166|     234|     330|     466|     659|     932|    1318|    1864|    2636|    3728|    5272|    7456|   10544|   14912|   21089|   29825|   42181|   59655|   84369|  119321|  168753|  238664|  337539|  477376|  675146|  954849| 1350429| 1909892| 2701132
std_sort                     |   7.9  |  12.4  |  11.2  |  10.6  |  14.2  |  15.7  |  36.9  |  20.3  |  60.2  |  31.8  |  79.4  |  67.3  |  83.2  |  97.1  | 111.0  | 118.3  | 125.6  | 131.8  | 139.7  | 144.5  | 148.5  | 157.8  | 161.6  | 169.2  | 123.8  | 175.0  | 185.6  | 175.0  | 201.1  | 207.8  | 208.8  | 213.2  | 220.4  | 226.9  | 233.9  | 240.5
std_stable_sort              |   8.8  |   9.7  |  19.7  |  13.2  |  26.1  |  19.7  |  21.2  |  48.2  |  20.5  |  65.9  |  42.6  |  76.2  |  85.6  |  86.3  |  95.4  | 102.2  | 109.7  | 116.0  | 122.6  | 126.0  | 133.9  | 139.8  | 146.2  | 151.3  | 123.8  | 131.3  | 185.6  | 175.0  | 170.1  | 185.9  | 201.1  | 202.3  | 208.8  | 216.0  | 224.2  | 229.6
merge_sort_ins_full<18>      |  10.1  |  12.7  |  12.5  |  19.0  |  35.2  |  15.8  |  48.0  |  31.9  |  38.1  |  53.1  |  78.2  |  81.9  |  90.1  | 100.9  | 104.9  | 116.3  | 118.4  | 128.1  | 129.6  | 140.0  | 144.0  | 154.2  | 155.4  | 165.1  | 123.8  | 175.0  | 185.6  | 196.9  | 185.6  | 207.8  | 201.1  | 213.2  | 212.6  | 224.2  | 224.2  | 235.1
merge_sort_ins_new<18>       |  11.1  |  11.0  |  12.4  |  16.8  |  14.6  |  22.9  |  24.4  |  57.7  |  29.4  |  69.9  |  60.2  |  88.2  |  92.7  |  98.2  | 104.7  | 116.3  | 118.7  | 128.6  | 130.1  | 139.4  | 141.6  | 151.9  | 154.4  | 163.1  | 123.8  | 175.0  | 154.7  | 175.0  | 185.6  | 196.9  | 201.1  | 213.2  | 212.6  | 221.4  | 222.3  | 233.7
radix_sort_msb<4>            |  11.8  |  12.4  |  12.5  |  19.4  |  14.7  |  15.8  |  46.7  |  25.3  |  62.9  |  16.4  |  38.4  |  32.6  |  50.8  |  54.9  |  63.8  |  74.9  |  74.6  |  44.4  |  49.1  |  57.0  |  60.0  |  71.0  |  71.4  |  81.2  | 123.8  |  87.5  |  61.9  |  43.8  |  61.9  |  76.6  |  77.3  |  87.5  |  88.9  |  57.4  |  59.9  |  69.7
radix_sort_msb<5>            |  12.3  |  10.7  |  19.5  |  12.8  |  15.2  |  30.1  |  17.0  |  57.0  |  31.7  |  26.0  |  17.2  |  35.3  |  29.8  |  48.3  |  49.7  |  62.7  |  67.0  |  76.2  |  74.5  |  34.0  |  38.7  |  45.7  |  49.3  |  56.9  |  61.9  |  43.8  |  61.9  |  65.6  |  77.3  |  43.7  |  46.4  |  54.7  |  58.0  |  62.9  |  67.6  |  76.5
radix_sort_msb<8, 128>       |  11.6  |  11.1  |  12.7  |  12.9  |  14.7  |  38.3  |  17.9  |  23.7  |  11.9  |  19.4  |  12.8  |  14.3  |  23.8  |  21.7  |  26.9  |  27.2  |  33.6  |  39.1  |  43.6  |  51.4  |  55.4  |  65.5  |  63.4  |  37.9  |  61.9  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  38.7  |  38.3  |  42.5  |  49.2  |  52.2  |  60.1
radix_sort_msb<11, 256>      |  10.6  |  12.7  |  12.6  |  12.8  |  32.1  |  15.8  |  47.6  |  26.3  |  62.0  |  43.2  |  32.0  |  27.1  |  34.8  |  23.3  |  28.2  |  23.6  |  27.2  |  27.2  |  28.5  |  28.8  |  30.9  |  32.3  |  36.4  |  42.9  |   0.0  |  43.8  |  61.9  |  65.6  |  77.3  |  76.6  |  85.1  |  65.6  |  58.0  |  54.7  |  50.3  |  46.5
radix_sort_msb<11>           |  12.6  |  12.5  |  12.6  |  12.8  |  33.0  |  16.0  |  47.3  |  24.3  |  62.4  |  43.2  |  42.5  |  38.3  |  24.1  |  31.2  |  22.5  |  26.9  |  25.5  |  27.4  |  28.0  |  28.7  |  30.9  |  32.1  |  36.7  |  43.2  |  61.9  |  43.8  |  61.9  |  65.6  |  61.9  |  76.6  |  77.3  |  60.1  |  58.0  |  51.9  |  48.3  |  46.5
radix_sort_lsb<4>            |  63.9  |  50.1  |  51.5  |  50.5  |  46.9  |  46.5  |  43.1  |  39.4  |  40.4  |  38.2  |  37.3  |  36.6  |  35.8  |  35.9  |  37.8  |  38.8  |  38.1  |  38.3  |  38.0  |  38.1  |  38.4  |  38.7  |  38.5  |  38.4  |  61.9  |  43.8  |  61.9  |  43.8  |  30.9  |  43.7  |  38.7  |  43.7  |  46.4  |  46.5  |  46.4  |  46.5
radix_sort_lsb<5>            |  69.3  |  61.6  |  55.9  |  48.6  |  48.5  |  45.6  |  38.7  |  35.7  |  34.5  |  33.7  |  31.5  |  32.7  |  31.5  |  32.0  |  33.5  |  33.1  |  33.5  |  34.0  |  33.4  |  33.6  |  33.7  |  33.5  |  33.8  |  33.7  |  61.9  |   0.0  |  30.9  |  21.9  |  30.9  |  32.8  |  38.7  |  43.7  |  42.5  |  43.7  |  42.5  |  42.4
radix_sort_lsb<6>            |  83.0  |  70.3  |  55.5  |  53.2  |  41.4  |  39.1  |  34.3  |  33.3  |  29.3  |  28.6  |  28.4  |  27.6  |  26.6  |  26.5  |  27.6  |  28.0  |  27.5  |  27.8  |  27.3  |  28.2  |  28.1  |  28.4  |  28.6  |  28.7  |   0.0  |  43.8  |  30.9  |  21.9  |  30.9  |  32.8  |  30.9  |  32.8  |  34.8  |  38.3  |  38.7  |  38.3
radix_sort_lsb<7>            | 108.8  |  90.4  |  67.4  |  57.6  |  47.2  |  38.0  |  32.6  |  30.2  |  27.6  |  25.6  |  24.3  |  23.4  |  22.5  |  22.2  |  21.7  |  21.8  |  22.4  |  23.4  |  23.2  |  22.6  |  23.5  |  23.3  |  23.8  |  23.6  |   0.0  |  43.8  |   0.0  |  21.9  |  30.9  |  21.9  |  30.9  |  27.3  |  34.8  |  32.8  |  32.9  |  34.2
radix_sort_lsb<8>            | 195.3  | 150.8  | 108.1  |  82.5  |  66.0  |  49.9  |  41.0  |  33.2  |  28.5  |  24.7  |  22.1  |  20.1  |  18.6  |  17.8  |  17.2  |  17.4  |  17.3  |  18.0  |  17.5  |  17.4  |  18.2  |  18.1  |  18.2  |  18.5  |  61.9  |  43.8  |   0.0  |   0.0  |  15.5  |  21.9  |  23.2  |  27.3  |  27.1  |  24.6  |  27.1  |  27.3
radix_sort_lsb_pre<8>        |  96.8  |  72.7  |  55.9  |  46.6  |  36.5  |  30.9  |  26.4  |  22.9  |  20.6  |  18.6  |  17.4  |  16.6  |  16.0  |  15.6  |  15.5  |  15.7  |  16.8  |  16.6  |  16.8  |  16.9  |  16.9  |  17.2  |  17.3  |  17.1  |   0.0  |   0.0  |   0.0  |  21.9  |  15.5  |  21.9  |  23.2  |  21.9  |  23.2  |  24.6  |  25.1  |  24.6
radix_sort_lsb_loop<8>       | 228.1  | 169.5  | 125.9  |  96.1  |  73.6  |  57.7  |  46.3  |  38.6  |  32.8  |  28.2  |  25.1  |  23.0  |  21.4  |  20.5  |  19.9  |  19.6  |  19.5  |  19.9  |  19.9  |  19.4  |  20.7  |  20.1  |  20.0  |  20.5  |   0.0  |   0.0  |  30.9  |  21.9  |  15.5  |  21.9  |  23.2  |  27.3  |  27.1  |  27.3  |  27.1  |  27.3
radix_sort_lsb8_c            | 198.5#~| 154.0#~| 113.2#~|  90.3#~|  70.9#~|  56.4#~|  48.9#~|  42.4#~|  37.3#~|  33.2#~|  31.4#~|  30.2#~|  28.7#~|  28.9#~|  28.1#~|  26.8#~|  26.7#~|  26.7#~|  26.6#~|  26.9#~|  27.4#~|  27.4#~|  27.9#~|  27.6#~|  61.9#~|  43.8#~|  30.9#~|  21.9#~|  30.9#~|  32.8#~|  30.9#~|  38.3#~|  38.7#~|  35.5#~|  48.3#~|  46.5#~
radix_sort_lsb<9>            | 294.8  | 218.6  | 160.1  | 119.9  |  90.5  |  69.6  |  54.9  |  43.4  |  36.0  |  29.9  |  25.9  |  23.4  |  21.2  |  19.9  |  19.2  |  19.0  |  19.4  |  19.9  |  19.9  |  20.6  |  20.6  |  21.0  |  20.8  |  20.8  |  61.9  |  43.8  |   0.0  |  21.9  |  30.9  |  21.9  |  23.2  |  21.9  |  23.2  |  27.3  |  27.1  |  27.3
radix_sort_lsb<10>           | 562.1  | 413.9  | 297.7  | 219.5  | 170.1  | 128.1  |  95.6  |  72.4  |  55.9  |  44.7  |  36.9  |  32.1  |  27.6  |  24.7  |  22.8  |  22.1  |  22.2  |  22.4  |  23.6  |  23.1  |  24.1  |  24.9  |  24.4  |  24.6  |  61.9  |  43.8  |  30.9  |  21.9  |  15.5  |  21.9  |  30.9  |  32.8  |  30.9  |  30.1  |  30.9  |  30.1
radix_sort_lsb<11>           | 928.5  | 680.1  | 486.9  | 354.9  | 258.4  | 189.6  | 139.4  | 103.7  |  77.5  |  59.0  |  45.9  |  36.8  |  30.1  |  24.6  |  21.7  |  19.7  |  19.2  |  19.6  |  19.7  |  19.4  |  20.8  |  20.7  |  21.1  |  21.5  |   0.0  |   0.0  |  30.9  |  21.9  |  15.5  |  21.9  |  23.2  |  21.9  |  30.9  |  30.1  |  30.9  |  30.1
radix_sort_msb_inplace<8>    |  12.4  |  10.7  |  19.5  |  12.9  |  14.7  |  31.2  |  17.5  |  57.3  |  33.6  |  29.7  |  20.8  |  25.5  |  22.5  |  31.8  |  28.0  |  35.4  |  38.7  |  44.8  |  49.5  |  57.7  |  61.1  |  70.1  |  73.5  |  83.9  |  61.9  |  43.8  |  61.9  |  43.8  |  46.4  |  54.7  |  46.4  |  54.7  |  58.0  |  62.9  |  67.6  |  75.2
radix_sort_msb_inplace<8, 256>|  10.8  |  10.7  |  20.8  |  19.4  |  14.8  |  16.4  |  46.4  |  26.9  |  62.5  |  20.6  |  28.9  |  21.8  |  30.6  |  23.9  |  33.4  |  33.4  |  39.4  |  45.0  |  49.7  |  57.4  |  61.5  |  70.3  |  72.7  |  84.0  | 123.8  |  43.8  |  61.9  |  65.6  |  46.4  |  54.7  |  54.1  |  54.7  |  58.0  |  62.9  |  65.7  |  75.2
radix_sort_heuristic         |  11.9  |  11.4  |  12.5  |  21.0  |  14.7  |  15.8  |  46.6  |  27.6  |  20.2  |  10.9  |  20.1  |  14.0  |  23.5  |  21.7  |  17.3  |  17.4  |  17.3  |  18.0  |  17.5  |  17.7  |  17.9  |  18.7  |  18.2  |  18.5  |   0.0  |   0.0  |   0.0  |  21.9  |  30.9  |  21.9  |  23.2  |  21.9  |  27.1  |  24.6  |  27.1  |  26.0
radix_sort_heuristic_inplace |  10.8  |  12.5  |  14.0  |  14.0  |  15.3  |  38.5  |  18.4  |  33.1  |  21.7  |  27.8  |  21.8  |  28.9  |  20.1  |  24.2  |  31.2  |  36.9  |  40.0  |  39.5  |  39.5  |  39.8  |  41.0  |  42.6  |  47.5  |  58.2  |  61.9  |  43.8  |  61.9  |  65.6  |  46.4  |  54.7  |  54.1  |  54.7  |  58.0  |  65.6  |  67.6  |  75.2
radix_sort_heuristic_msb     |  14.1  |  12.7  |  20.1  |  13.8  |  31.0  |  16.2  |  40.5  |  20.5  |  22.3  |  20.1  |  11.1  |  21.8  |  17.7  |  25.2  |  24.5  |  29.0  |  33.8  |  27.7  |  28.1  |  28.8  |  30.6  |  32.4  |  36.4  |  43.0  |  61.9  |  43.8  |  30.9  |  43.8  |  30.9  |  32.8  |  38.7  |  32.8  |  42.5  |  46.5  |  52.2  |  46.5
radix_sort_lsb_separate<8>   |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |   0.0  |  87.5  |  61.9  |  43.8  |  30.9  |  21.9  |  30.9  |  21.9  |  38.7  |  32.8  |  34.8  |  38.3  |  36.7  |  35.5
NOTE: # means not sorted.
      ~ means not stable.

 

 

1 hour ago, Zaoshi Kaba said:

Visual Studio does not compile testbed:

 

Yes, radix_sort_lsb8_c was borked in more ways than one. Hopefully, fixed:

https://rextester.com/LEEVG59515

The results are about what I would expect (someone already posted data for i7 8700K in the original thread).

So, building all histograms in one go (radix_sort_lsb_pre<8>) does seem to consistently help on x64 in your data, but not on x86 (and the difference is minor, anyway).

And the win over std::sort is even more marked (its also amusing to see std::stable_sort slightly beating std::sort).

Thanks!

Also, I looked at it, and apparently, actual write prefetch (PREFETCHW / _MM_HINT_ET1 / __builtin_prefetch(p,1,3);) hurts a lot, compared to normal _MM_HINT_T0. Go, figure.

Updated main library:

https://rextester.com/VMNUE25458

1 hour ago, Zaoshi Kaba said:

Here are my results.

BTW, radix_sort_msb_inplace<8> and radix_sort_msb_inplace<8,256> are exactly the same in that version, so there shouldn't be a difference (beyound measurement noise).

GPU version of radix sorting (with native NVIDIA Turing support, compute). 

 

This topic is closed to new replies.

Advertisement