CUDA version 6.06 and OpenCL version 6.08 were released for Windows today.

The CUDA version fixes (I hope) a "device not ready" error reported on some fast GPUs. I could not reproduce the error myself, but the code now waits for the events to synchronize, which should eliminate it.

The OpenCL version, 6.08, includes the fix for the bug where the GPU and CPU results did not always match. It also uses several of the optimizations Sosirus has provided and adds a new "lut_size" configuration option, which defaults to 12 (2^12 = 4096 items). The previous version used a lookup table with 2^20 entries, which did not fit into the GPU's cache and made the application memory bound rather than processor bound. With the new version you should see higher GPU utilization, and performance should depend less on memory speed than in previous versions.

Linux and OS X versions with the same fixes will follow soon. As usual, let me know if you have any issues with the new versions.
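
For anyone wondering how "lut_size" relates to the table size: the value is the exponent, so the table holds 2^lut_size items. Here is a rough sketch of the arithmetic in Python. The function names are made up for illustration, and the 8 bytes per item is only an assumption to give a sense of scale; the actual item size in the application may differ.

```python
def lut_entries(lut_size: int) -> int:
    """Number of lookup-table items for a given lut_size exponent (2^lut_size)."""
    return 1 << lut_size


def lut_bytes(lut_size: int, bytes_per_entry: int = 8) -> int:
    """Approximate table footprint in bytes.

    bytes_per_entry=8 is an assumed item size, not taken from the application.
    """
    return lut_entries(lut_size) * bytes_per_entry


if __name__ == "__main__":
    # Compare the new default (12) with the old table size (20).
    for exponent in (12, 20):
        entries = lut_entries(exponent)
        kib = lut_bytes(exponent) // 1024
        print(f"lut_size={exponent}: {entries} items, about {kib} KiB at 8 bytes each")
```

Under that assumed item size, the default of 12 works out to about 32 KiB, while the old 2^20 table would be about 8 MiB, which is why the smaller table has a much better chance of staying in the GPU's cache.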
