I noticed a slowdown in the GPU app versions on the 15x271 data set and found the cause: a large fraction of the discriminants were exceeding the hard-coded precision. When this happens, the GPU hands the work back to the CPU. As a result, the GPU spent more time idle while it waited for the CPU to finish, and as a side effect the WU also used almost an entire CPU core.

I increased the hard-coded precision and tested the change on the troublesome data sets as well as on the newest 16x271 data set. The issue seems to be fixed, but please report any unexpected behavior.
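
For readers curious about the mechanism, here is a minimal sketch of the idea. It is not the app's actual code: the function names, the 64-bit and 128-bit limits, and the sample values are all assumptions made for illustration. It only shows how a fixed precision limit splits work between a fast path and a fallback path, and how raising the limit shrinks the fallback share.

```python
# Hypothetical illustration only. The names and bit limits below are
# assumptions, not taken from the real GPU application.

ASSUMED_OLD_LIMIT_BITS = 64    # stand-in for the original hard-coded precision
ASSUMED_NEW_LIMIT_BITS = 128   # stand-in for the increased precision


def fits_fast_path(value: int, max_bits: int) -> bool:
    """Return True if |value| fits in a fixed-width field of max_bits bits."""
    return abs(value).bit_length() <= max_bits


def split_work(discriminants, max_bits):
    """Partition values into (fast_path, fallback) lists.

    Values that exceed the fixed precision go to the fallback list,
    which in this sketch stands in for the work returned to the CPU.
    """
    fast, fallback = [], []
    for d in discriminants:
        if fits_fast_path(d, max_bits):
            fast.append(d)
        else:
            fallback.append(d)
    return fast, fallback


def fallback_fraction(discriminants, max_bits) -> float:
    """Fraction of values that would need the slow fallback path."""
    if not discriminants:
        return 0.0
    _, fallback = split_work(discriminants, max_bits)
    return len(fallback) / len(discriminants)


if __name__ == "__main__":
    # Made-up sample values, chosen so that half exceed the smaller limit.
    sample = [2**10, 2**70, -(2**80), 5]
    print(fallback_fraction(sample, ASSUMED_OLD_LIMIT_BITS))
    print(fallback_fraction(sample, ASSUMED_NEW_LIMIT_BITS))
```

The larger the fallback fraction, the more time the fast path spends waiting, which matches the idle GPU and busy CPU core described above.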
