We saw very similar issues on a CentOS server with 40 cores (32 virtualized) after moving from a physical server to a virtual one (I think it had 128 GB of RAM). We never had the problem on the physical server. We checked the same things as noted here, but never found a bug. We strongly suspected NUMA zone reclaim, but could never prove it. In our case all CPUs in the guest were at 100%, and all of it was kernel time. Sometimes an episode would last a few seconds or minutes. Sometimes we would go days without a problem, and then performance would completely tank.
If you figure out what is going on, I would like to know (especially if your server is virtualized).
Deron