GetSnapshotData round two (for me) - Mailing list pgsql-hackers

From Daniel Wood
Subject GetSnapshotData round two (for me)
Date
Msg-id 848729934.500501.1537853440759@connect.xfinity.com
Responses Re: GetSnapshotData round two (for me) (Dilip Kumar <dilipbalaut@gmail.com>)
List pgsql-hackers

I was about to suggest creating a single shared snapshot instead of having multiple backends compute what is essentially the same snapshot.  Luckily, before posting, I discovered "Avoiding repeated snapshot computation" from Pavan and "POC: Cache data in GetSnapshotData()" from Andres.


Andres, could I get a short summary of the biggest drawback that may have prevented this from being released?  Before I saw these threads I had done my own implementation and seen some promising results (25% on 48 cores).  I do need to run some mixed RO and RW workloads to see how the invalidations of the shared copy at EOT time affect the results.  There are some differences in my implementation.  I chose, perhaps incorrectly, to busy-spin other backends trying to get a snapshot while the first one in builds the shared copy.  My thinking is to not increase the latency of acquiring a snapshot.  The improvement doesn't come from getting off the CPU by waiting; it comes from not having every CPU that is acquiring a snapshot read PGXACT cache lines that are constantly being dirtied.  One backend can do the heavy lifting and the others can jump on the shared copy as soon as it is created.  A minimal sketch of what I mean is below.
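Roughly, the shape of my version is the following.  This is hand-waved pseudocode of the idea, not the actual patch: SharedSnapshotCache, BuildSharedSnapshot(), and CopySnapshotFromShared() are stand-ins for the real ProcArray scan and the local copy, not real symbols.

#include "postgres.h"
#include "port/atomics.h"
#include "storage/s_lock.h"

#define SNAP_INVALID   0        /* must be rebuilt */
#define SNAP_BUILDING  1        /* one backend is scanning PGXACT */
#define SNAP_VALID     2        /* safe to copy */

typedef struct SharedSnapshotCache
{
    pg_atomic_uint32 state;
    TransactionId    xmin;
    TransactionId    xmax;
    uint32           xcnt;
    TransactionId    xip[FLEXIBLE_ARRAY_MEMBER];
} SharedSnapshotCache;

/* hypothetical helpers standing in for the real work */
extern void BuildSharedSnapshot(SharedSnapshotCache *cache);
extern void CopySnapshotFromShared(SharedSnapshotCache *cache);

static void
GetCachedSnapshot(SharedSnapshotCache *cache)
{
    for (;;)
    {
        uint32      state = pg_atomic_read_u32(&cache->state);

        if (state == SNAP_VALID)
            break;              /* shared copy is ready to use */

        if (state == SNAP_INVALID &&
            pg_atomic_compare_exchange_u32(&cache->state,
                                           &state, SNAP_BUILDING))
        {
            /* We won the race: do the one expensive PGXACT scan. */
            BuildSharedSnapshot(cache);
            pg_atomic_write_u32(&cache->state, SNAP_VALID);
            break;
        }

        /*
         * Another backend is building: busy-spin rather than sleep.
         * The win isn't getting off the CPU; it's that only one
         * backend reads the constantly-dirtied PGXACT lines.
         */
        {
            SpinDelayStatus delay;

            init_local_spin_delay(&delay);
            while (pg_atomic_read_u32(&cache->state) == SNAP_BUILDING)
                perform_spin_delay(&delay);
            finish_spin_delay(&delay);
        }
    }

    CopySnapshotFromShared(cache);
}

/* At EOT a committing writer invalidates the shared copy, forcing the
 * next snapshot taker to rebuild it: */
static void
InvalidateSharedSnapshot(SharedSnapshotCache *cache)
{
    pg_atomic_write_u32(&cache->state, SNAP_INVALID);
}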


And something else quite weird: as I was evolving a standard setup for benchmark runs and getting baselines, I was getting horrible numbers sometimes (680K) and other times over 1 million QPS.  I was starting to think I had a bad machine.  What I found was that even though I was always running a fixed 192 clients, I had set max_connections to 600 on some runs and 1000 on others.  Here is what I see running select-only, scale-1000 pgbench with 192 clients on a 48-core box (2 sockets), using different values for max_connections (the commands are sketched after the table):


max_connections      tps
            200  1092043
            250  1149490
            300   732080
            350   719611
            400   681170
            450   687527
            500   859978
            550   927161
            600  1092283
            650  1154916
            700  1237271
            750  1195968
            800  1162221
            850  1140626
            900   749519
            950   648398
           1000   653460
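For anyone wanting to reproduce this, the runs were of this general shape.  The duration, thread count, and database name here are placeholders of mine rather than the exact settings behind the numbers above, and the server was restarted with a new max_connections value before each run:

    pgbench -i -s 1000 bench
    pgbench -S -c 192 -j 48 -T 300 bench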


This is on the stock 12.x codeline.  The only thought I've had so far is that the 192 PGXACT entries in use are being scattered across the full set of max_connections slots instead of being physically contiguous in the first 192 slots, which would cause more cache lines to be scanned.  That doesn't make a lot of sense, though, given that the numbers go back up again from 500, peaking at 700.  Also, this is after a fresh restart, so the procs in the freelist shouldn't have been scrambled yet in terms of ordering.
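To put rough numbers on that theory (treating these as my assumptions: sizeof(PGXACT) is 12 bytes, so about 5 entries fit in a 64-byte cache line):

    packed:     ceil(192 * 12 / 64)        =  36 cache lines scanned
    scattered:  up to one line per entry   = 192 cache lines scanned

So a fully scattered layout could mean roughly 5x the cache-line traffic for the same 192-entry scan, which is at least the right order of magnitude for the swings in the table.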


NOTE: I believe you'll only see this huge difference on a dual-socket machine.  It'd probably only take 30 minutes or so on a big machine to confirm, with a few short runs at different values for max_connections.  I'll be debugging this soon, but I've been postponing it while experimenting with my shared snapshot code.




