GetSnapshotData round two (for me) - Mailing list pgsql-hackers

From Daniel Wood
Subject GetSnapshotData round two (for me)
Date
Msg-id 848729934.500501.1537853440759@connect.xfinity.com
Responses Re: GetSnapshotData round two (for me) (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers

I was about to suggest creating a single shared snapshot instead of having multiple backends compute what is essentially the same snapshot.  Luckily, before posting, I discovered "Avoiding repeated snapshot computation" from Pavan and "POC: Cache data in GetSnapshotData()" from Andres.


Andres, could I get a short summary of the biggest drawback that may have prevented this from being released?  Before I saw these threads I had done my own implementation and seen some promising results (25% on 48 cores).  I do need to run some mixed RO and RW workloads to see how the invalidations of the shared copy at EOT time affect the results.  There are some differences in my implementation.  I chose, perhaps incorrectly, to busy-spin other backends trying to get a snapshot while the first one in builds the shared copy.  My thinking is to not increase the latency of acquiring a snapshot.  The improvement doesn't come from getting off the CPU by waiting; it comes from not having every CPU that is acquiring a snapshot read PGXACT cache lines that are constantly being dirtied.  One backend can do the heavy lifting and the others can jump on the shared copy as soon as it is created.  A minimal sketch of what I mean is below.
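Roughly, the shape of my version is the following.  This is hand-waved pseudocode of the idea, not the actual patch: SharedSnapshotCache, BuildSharedSnapshot(), and CopySnapshotFromShared() are stand-ins for the real ProcArray scan and the local copy, not real symbols.

#include "postgres.h"
#include "port/atomics.h"
#include "storage/s_lock.h"

#define SNAP_INVALID   0        /* must be rebuilt */
#define SNAP_BUILDING  1        /* one backend is scanning PGXACT */
#define SNAP_VALID     2        /* safe to copy */

typedef struct SharedSnapshotCache
{
    pg_atomic_uint32 state;
    TransactionId    xmin;
    TransactionId    xmax;
    uint32           xcnt;
    TransactionId    xip[FLEXIBLE_ARRAY_MEMBER];
} SharedSnapshotCache;

/* hypothetical helpers standing in for the real work */
extern void BuildSharedSnapshot(SharedSnapshotCache *cache);
extern void CopySnapshotFromShared(SharedSnapshotCache *cache);

static void
GetCachedSnapshot(SharedSnapshotCache *cache)
{
    for (;;)
    {
        uint32      state = pg_atomic_read_u32(&cache->state);

        if (state == SNAP_VALID)
            break;              /* shared copy is ready to use */

        if (state == SNAP_INVALID &&
            pg_atomic_compare_exchange_u32(&cache->state,
                                           &state, SNAP_BUILDING))
        {
            /* We won the race: do the one expensive PGXACT scan. */
            BuildSharedSnapshot(cache);
            pg_atomic_write_u32(&cache->state, SNAP_VALID);
            break;
        }

        /*
         * Another backend is building: busy-spin rather than sleep.
         * The win isn't getting off the CPU; it's that only one
         * backend reads the constantly-dirtied PGXACT lines.
         */
        {
            SpinDelayStatus delay;

            init_local_spin_delay(&delay);
            while (pg_atomic_read_u32(&cache->state) == SNAP_BUILDING)
                perform_spin_delay(&delay);
            finish_spin_delay(&delay);
        }
    }

    CopySnapshotFromShared(cache);
}

/* At EOT a committing writer invalidates the shared copy, forcing the
 * next snapshot taker to rebuild it: */
static void
InvalidateSharedSnapshot(SharedSnapshotCache *cache)
{
    pg_atomic_write_u32(&cache->state, SNAP_INVALID);
}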


And something else quite weird: as I was evolving a standard setup for benchmark runs and getting baselines, I was getting horrible numbers sometimes (680K) and other times over 1 million QPS.  I was starting to think I had a bad machine.  What I found was that even though I was always running a fixed 192 clients, I had set max_connections to 600 on some runs and 1000 on others.  Here is what I see running select-only, scale-1000 pgbench with 192 clients on a 48-core box (2 sockets), using different values for max_connections (the commands are sketched after the table):


max_connections      tps
            200  1092043
            250  1149490
            300   732080
            350   719611
            400   681170
            450   687527
            500   859978
            550   927161
            600  1092283
            650  1154916
            700  1237271
            750  1195968
            800  1162221
            850  1140626
            900   749519
            950   648398
           1000   653460
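For anyone wanting to reproduce this, the runs were of this general shape.  The duration, thread count, and database name here are placeholders of mine rather than the exact settings behind the numbers above, and the server was restarted with a new max_connections value before each run:

    pgbench -i -s 1000 bench
    pgbench -S -c 192 -j 48 -T 300 bench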


This is on the stock 12.x codeline.  The only thought I've had so far is that the 192 PGXACT entries in use are being scattered across the full set of max_connections slots instead of being physically contiguous in the first 192 slots, which would cause more cache lines to be scanned.  That doesn't make a lot of sense, though, given that the numbers go back up again from 500, peaking at 700.  Also, this is after a fresh restart, so the procs in the freelist shouldn't have been scrambled yet in terms of ordering.
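To put rough numbers on that theory (treating these as my assumptions: sizeof(PGXACT) is 12 bytes, so about 5 entries fit in a 64-byte cache line):

    packed:     ceil(192 * 12 / 64)        =  36 cache lines scanned
    scattered:  up to one line per entry   = 192 cache lines scanned

So a fully scattered layout could mean roughly 5x the cache-line traffic for the same 192-entry scan, which is at least the right order of magnitude for the swings in the table.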


NOTE: I believe you'll only see this huge difference on a dual-socket machine.  It'd probably only take 30 minutes or so on a big machine to confirm, with a few short runs at different values for max_connections.  I'll be debugging this soon, but I've been postponing it while experimenting with my shared snapshot code.




