If you are running benchmarks, or are a customer currently impacted by GetSnapshotData() contention on high-end multisocket systems, be wary of Skylake-S.
Performance differences of nearly 2X can be seen on select-only pgbench due to nothing other than unlucky choices of max_connections. Setup: scale 1000, 192 local clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system, running pgbench -S.
Results from 5 runs, varying max_connections from 400 to 405:

max_conn      TPS
     400   677639
     401  1146776
     402  1122140
     403   765664
     404   671455
     405  1190277
...
perf top shows GetSnapshotData() at about 21% with the good numbers and at 48% with the bad numbers.
This problem is not seen on a 2-socket, 32-core Haswell system. Being a one-man show, I lack some of the diagnostic tools to drill down further. My suspicion is that Intel's lowering of the L2 associativity from 8-way (Haswell) to 4-way (Skylake-S) may be the cause. The other possibility is that at higher core counts the shared 16-way inclusive L3 cache becomes insufficient; perhaps that is why Intel moved to an exclusive L3 cache on Skylake-SP.
If this is indeed just disadvantageous placement of structures/arrays in memory, then you might also find that, after an upgrade, a previously good choice of max_connections becomes a bad one as things move around.
NOTE: the line
    int pgprocno = pgprocnos[index];
is where the big increase in time occurs in GetSnapshotData(). The pgprocnos array is largely read-only once all connections are established, easily fits in L1, and is not adjacent to anything else that would cause invalidations.
NOTE2: It is unclear why PG needs to support over 64K sessions. At roughly 10 MB per backend (at the low end), the empty backends alone would consume 640 GB of memory! Changing pgprocnos from int to short gives me the following results:
max_conn      TPS
     400   780119
     401  1129286
     402  1263093
     403   887021
     404   679891
     405  1218118
While this change is significant on large Skylake systems, it is likely only a trivial improvement on other systems or workloads.