On Wed, Aug 24, 2016 at 11:54 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 08/23/2016 06:18 PM, Heikki Linnakangas wrote:
On 08/22/2016 08:38 PM, Andres Freund wrote:
On 2016-08-22 20:32:42 +0300, Heikki Linnakangas wrote:
I remember seeing ProcArrayLock contention very visibly earlier, but I can't hit that now. I suspect you'd still see contention on bigger hardware, though; my laptop has only 4 cores. I'll have to find a real server for the next round of testing.
Yea, I think that's true. I can just about see ProcArrayLock contention on my more powerful laptop; to see it get really bad, you need bigger hardware / higher concurrency.
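(For anyone trying to reproduce this: a rough sketch of one way to watch for the contention on 9.6, by sampling the new wait-event columns in pg_stat_activity while the benchmark runs. The database name and sampling interval are placeholders, not from the thread.)

    while true; do
        psql -X -d bench -c "
            SELECT wait_event_type, wait_event, count(*)
            FROM pg_stat_activity
            WHERE wait_event IS NOT NULL
            GROUP BY 1, 2
            ORDER BY 3 DESC;"
        sleep 1
    done

A pile of LWLockNamed / ProcArrayLock rows at high client counts would be the contention in question.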
As soon as I sent my previous post, Vladimir Borodin kindly offered access to a 32-core server for performance testing. Thanks Vladimir!
I installed Greg Smith's pgbench-tools kit on that server, and ran some tests. I'm seeing some benefit on the "pgbench -N" workload, but only after modifying the test script to use "-M prepared", and using Unix domain sockets instead of TCP to connect. Apparently those two things add enough per-transaction overhead to mask the small difference otherwise.
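For reference, the invocation was along these lines (a sketch only; the scale factor, client counts, duration, and database name are placeholders, and pgbench-tools varies the client count per run):

    # one-time initialization; -s 100 is a placeholder scale factor
    pgbench -i -s 100 bench
    # skip-updates workload with prepared statements; leaving out -h
    # makes libpq connect over the local Unix domain socket
    pgbench -N -M prepared -c 32 -j 32 -T 300 bench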
Attached is a graph with the results. Full results are available at https://hlinnaka.iki.fi/temp/csn-4-results/. In short, the patch improved throughput, measured in TPS, with roughly 32 or more clients. The biggest difference was with 44 clients, which saw about a 5% improvement.
So, not phenomenal, but it's something. I suspect that with more cores, the difference would become more clear.
As if on cue, Alexander Korotkov just offered access to a 72-core system :-). Thanks! I'll run the same tests on that.
And here are the results on the 72-core machine (thanks again, Alexander!). The test setup was the same as on the 32-core machine, except that I ran it with more clients since the system has more CPU cores. In summary, in the best case, the patch increases throughput by about 10%. That peak is with 64 clients. Interestingly, as the number of clients increases further, the gain evaporates, and the CSN version actually performs worse than unpatched master. I don't know why that is. One theory is that by eliminating one bottleneck, we're now hitting another bottleneck which doesn't degrade as gracefully under contention.
Did you try to identify this second bottleneck with perf or something?
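(For the archives, a sketch of how one might do that: sample system-wide call graphs with perf while the benchmark sits at the client count where the regression shows, then compare profiles between master and the patched build. The 30-second window is arbitrary.)

    # sample all CPUs with call graphs for ~30 seconds mid-benchmark
    perf record -a -g -- sleep 30
    # summarize where the time went, by symbol
    perf report --sort symbol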
It would be nice to also run pgbench -S, and to check something like 10% writes / 90% reads (which is a quite typical workload in real life, I believe).
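(A sketch of how that mix could be driven, assuming 9.6's weighted built-in scripts; the @9/@1 weights are just one way to approximate 90% reads / 10% writes, and the client counts, duration, and database name are placeholders:)

    # read-only baseline
    pgbench -S -M prepared -c 64 -j 64 -T 300 bench
    # ~90% reads / ~10% writes via weighted built-in scripts
    pgbench -b select-only@9 -b simple-update@1 -M prepared -c 64 -j 64 -T 300 bench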