I have made some improvements to the CPU code and am seeing a speedup of 30% to 40%, depending on the case.

I have deployed the new executable for 64-bit Linux and am now working on getting it to build for 64-bit Windows. I hope to have that ready sometime tomorrow.

The next step is to apply the same tricks to the GPU versions. GPUs can be very finicky, so it may take several weeks before I have something worthwhile.