Over the last couple of days we have been having overload problems on the server. To help reduce the server load, I have temporarily disabled the batch_status and server_status cron jobs. I will generate those manually about once a day. Sorry for the inconvenience. I also reduced "max_wus_in_progress" from 24 to 6, so you may see fewer WUs in your job queues.

For those who are interested in the cause of the overload, it was a combination of things:
1. One of the drives in our RAID is failing, causing HD performance issues. We are working to get this fixed.
2. There were 10k+ WUs that all timed out at about the same time. The server then generated new results for all of them, causing a backlog. This is the reason for reducing max_wus_in_progress.
3. The transitioner had a DB timeout which caused it to crash, increasing the backlog further.
4. Not understanding why WUs were not transitioning, I stupidly ran the "transition_all" admin script since it said this would "unstick" jobs. Big mistake - this script changed the transition time of all 600k WUs in the DB, forcing the transitioner to now reprocess every WU. That only made the backlog worse.
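
For anyone wondering why lowering max_wus_in_progress helps with item 2, here is a rough back-of-the-envelope sketch. The host count is made up, and the sketch treats the setting as a simple per-host cap, which is a simplification of how the scheduler may actually apply it. The point is only that the cap bounds how many WUs can be outstanding, and therefore how many can time out together.

```python
def worst_case_outstanding(hosts: int, max_wus_in_progress: int) -> int:
    """Upper bound on WUs in progress at once, assuming a simple per-host cap.

    Every outstanding WU is one that could time out, so this is also the
    worst-case number of simultaneous timeouts the server has to handle.
    """
    return hosts * max_wus_in_progress


hosts = 500  # hypothetical number of hosts holding work, not a real figure

before = worst_case_outstanding(hosts, 24)  # old limit
after = worst_case_outstanding(hosts, 6)    # new limit

print(f"old limit: up to {before} WUs could time out together")
print(f"new limit: up to {after} WUs could time out together")
```

Under these assumptions, going from 24 to 6 cuts the worst-case burst to a quarter of its previous size, whatever the real number of hosts is.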

The good news is the server is almost caught up with the backlog. I will keep you posted.
