15 hours ago, lawnjelly said:
Obviously it is important to do this switch at a slightly higher level than the atomic functions. i.e. Use SSE 4.2, do something 1000x, rather than do 1000x switching on each.
Absolutely right. Even though branches on the CPU are not as disastrous as on the GPU, I try to avoid them as much as possible. Every time I benchmark code, avoiding branches or moving them to a level that minimizes their execution yields the highest speedup.
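To illustrate moving the branch up a level, here is a minimal sketch. It assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available. The kernel names (`scale_scalar`, `scale_sse42`) and `scale_batches` are made up for this example, and both kernels have the same placeholder body. The point is only that the CPU check runs once per call to `scale_batches` and not once per element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical kernels. In a real project each one would live in its own
// translation unit, compiled with the matching -m flags.
static void scale_scalar(float* data, std::size_t n, float factor) {
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}

static void scale_sse42(float* data, std::size_t n, float factor) {
    // Placeholder body: an SSE 4.2 build of the same loop would go here.
    for (std::size_t i = 0; i < n; ++i) data[i] *= factor;
}

using ScaleFn = void (*)(float*, std::size_t, float);

// Query the CPU and return the kernel to use. Falls back to the scalar
// kernel on compilers or targets without __builtin_cpu_supports.
static ScaleFn select_scale_kernel() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("sse4.2")) return scale_sse42;
#endif
    return scale_scalar;
}

// The instruction-set decision is made once here, outside the hot loops.
void scale_batches(std::vector<std::vector<float>>& batches, float factor) {
    const ScaleFn kernel = select_scale_kernel();
    assert(kernel != nullptr);
    for (auto& batch : batches) {
        kernel(batch.data(), batch.size(), factor);
    }
}
```

The kernel could also be selected once at startup and stored, so that even the per-call check disappears.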
15 hours ago, lawnjelly said:
some people use SIMD accelerated vector etc classes throughout their code rather than just at bottlenecks. You may also be able to tell the compiler to try to autovectorize and do specific SIMD optimizations throughout, which again might warrant a different approach (maybe different builds / choosing different DLL at startup etc).
I wrote my own linear algebra library based on a flexible SSE/AVX implementation. It taught me a lot about vectorization, so I use it whenever I can. The auto-vectorization of GCC and Clang yields quite good results, but only in cases where a handwritten implementation would be easy anyway. For example, I recently wrote an SSE-accelerated Gaussian elimination algorithm for dense matrices, which I think is close to maximum efficiency for the matrix size I am aiming for. I compared it to a non-SSE version of the code and to a popular linear algebra library. At first I was shocked that my handwritten SSE version gained less than 10% over the simple implementation. Then I realized that this was only the case when the matrix size was a multiple of the register size. For other sizes, the auto-vectorized code compared rather poorly.
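To show what the "multiple of the register size" issue looks like in code, here is a rough sketch of the row update at the core of Gaussian elimination, with a 4-wide SSE main loop and a scalar tail for the leftover elements. This is not the implementation from my library. The function name `subtract_scaled_row` is made up for the example, and it assumes single-precision floats and unaligned loads.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Computes target[i] -= factor * pivot[i] for i in [0, n).
void subtract_scaled_row(float* target, const float* pivot,
                         float factor, std::size_t n) {
    assert(n == 0 || (target != nullptr && pivot != nullptr));
    std::size_t i = 0;
#if defined(__SSE__)
    // Main loop: four floats per iteration, one 128-bit register each.
    const __m128 f = _mm_set1_ps(factor);
    for (; i + 4 <= n; i += 4) {
        __m128 t = _mm_loadu_ps(target + i);
        const __m128 p = _mm_loadu_ps(pivot + i);
        t = _mm_sub_ps(t, _mm_mul_ps(f, p));
        _mm_storeu_ps(target + i, t);
    }
#endif
    // Scalar tail: handles the n % 4 leftover elements, or the whole row
    // when SSE is not available at compile time.
    for (; i < n; ++i) {
        target[i] -= factor * pivot[i];
    }
}
```

When n is a multiple of 4 the tail loop never runs. For other sizes, how the leftover elements are handled is one place where handwritten and auto-vectorized code can differ.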
However, back to the topic:
I also rely on automatic vectorization, because you never know what else the compiler can optimize, so I guess choosing between differently compiled versions of a dynamic library at run-time would be the way to go for me. The reason is that I can't tell the compiler which auto-vectorization level it has to apply for each code section. I am still not sure how to select a dynamic library at run-time, though; I have never done this before.
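One way this could look on Linux and other POSIX systems is sketched below, using `dlopen` and `dlsym` together with GCC/Clang's `__builtin_cpu_supports`. Everything named here is hypothetical: the library file names, the exported symbol `solve_dense`, and its signature. The idea is to build the same source several times with different -m flags and pick one of the resulting libraries at startup.

```cpp
#include <cassert>
#include <dlfcn.h>
#include <string>

// Hypothetical file names, one per build of the same source.
std::string pick_library_name() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) return "libmymath_avx2.so";
    if (__builtin_cpu_supports("sse4.2")) return "libmymath_sse42.so";
#endif
    return "libmymath_generic.so";
}

// Hypothetical exported function. It should be declared extern "C" in the
// library so that its symbol name is not mangled.
using SolveFn = int (*)(float* matrix, int rows, int cols);

// Returns nullptr if the library or the symbol cannot be loaded.
SolveFn load_solver() {
    const std::string name = pick_library_name();
    assert(!name.empty());
    void* handle = dlopen(name.c_str(), RTLD_NOW);
    if (handle == nullptr) {
        return nullptr;  // dlerror() describes what went wrong
    }
    // The handle is kept open on purpose for the lifetime of the process.
    return reinterpret_cast<SolveFn>(dlsym(handle, "solve_dense"));
}
```

On Windows the corresponding calls are `LoadLibrary` and `GetProcAddress`. Depending on the glibc version, linking with -ldl may be required.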
On the other hand, there is still the option of compiling different executables.
Greetings