Thread: Skylake-S warning
If you are running benchmarks, or are a customer currently impacted by GetSnapshotData() on high-end multi-socket systems, be wary of Skylake-S.
Performance differences of nearly 2X can be seen on select-only pgbench due to nothing but unlucky choices for max_connections. Scale 1000, 192 local clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system, running pgbench -S.
Results from 5 runs varying max_connections from 400 to 405:
max_conn      TPS
     400   677639
     401  1146776
     402  1122140
     403   765664
     404   671455
     405  1190277
...
perf top shows GetSnapshotData() at about 21% with the good numbers and 48% with the bad numbers.
This problem is not seen on a 2-socket, 32-core Haswell system. Being a one-man show, I lack some of the diagnostic tools to drill down further. My suspicion is that Intel lowering the L2 associativity from 8 (Haswell) to 4 (Skylake-S) may be the cause. The other possibility is that at higher core counts the shared 16-way inclusive associative L3 cache becomes insufficient. Perhaps that is why Intel has moved to an exclusive L3 cache on Skylake-SP.
If this is indeed just disadvantageous placement of structures/arrays in memory, then you might also find that after upgrading, a previously good choice for max_connections becomes a bad one if things move around.
NOTE: the big increase in time within GetSnapshotData() occurs at:
    int pgprocno = pgprocnos[index];
This array is largely read-only once all connections are established, easily fits in the L1 cache, and is not adjacent to anything else that would cause invalidations.
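For readers unfamiliar with this code path, the access pattern in question can be sketched as follows. This is a simplified, standalone mock: FakePGXACT, count_running, and the field layout are illustrative stand-ins, not PostgreSQL's actual definitions. The point is the double indirection: the snapshot scan reads a slot number from the dense pgprocnos array, then dereferences a per-backend struct through it.

```c
#include <assert.h>

/* Illustrative stand-in for PGXACT: xid == 0 plays the role of
 * InvalidTransactionId in this sketch. */
typedef struct { int xid; int xmin; } FakePGXACT;

static int count_running(const int *pgprocnos, int numProcs,
                         const FakePGXACT *allPgXact)
{
    int running = 0;
    for (int index = 0; index < numProcs; index++)
    {
        int pgprocno = pgprocnos[index];   /* the hot load from the report */
        if (allPgXact[pgprocno].xid != 0)  /* then the dependent load */
            running++;
    }
    return running;
}
```

Because pgprocnos itself is small and nearly read-only, the scan over it should be cheap; the report suggests its placement relative to other hot shared structures is what makes or breaks it.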
NOTE2: It is unclear why PG needs to support over 64K sessions. At about 10MB per backend (at the low end), the empty backends alone would consume 640GB of memory! Changing pgprocnos from int to short gives me the following results.
max_conn      TPS
     400   780119
     401  1129286
     402  1263093
     403   887021
     404   679891
     405  1218118
While this change is significant on large Skylake systems, it is likely only a trivial improvement on other systems or workloads.
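Some back-of-envelope arithmetic shows why halving the entry width matters at these connection counts. Assumptions (not from the original post): 64-byte cache lines and a cache-line-aligned array; the helper name is invented for illustration.

```c
#include <assert.h>

/* Number of 64-byte cache lines a dense array of nconns entries spans,
 * assuming the array starts on a cache-line boundary. */
static int cache_lines(int nconns, int entry_size)
{
    int bytes = nconns * entry_size;
    return (bytes + 63) / 64;   /* round up to whole lines */
}
```

With 404 connections, 4-byte entries make pgprocnos span 26 cache lines, while 2-byte entries span 13, halving the lines the snapshot scan must keep warm and fight over.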
Hi,

On 2018-10-03 14:29:39 -0700, Daniel Wood wrote:
> If running benchmarks or you are a customer which is currently
> impacted by GetSnapshotData() on high end multisocket systems be wary
> of Skylake-S.
>
> Performance differences of nearly 2X can be seen on select only
> pgbench due to nothing else but unlucky choices for max_connections.
> Scale 1000, 192 local clients on a 2 socket 48 core Skylake-S(Xeon
> Platinum 8175M @ 2.50-GHz) system. pgbench -S

FWIW, I've seen performance differences of that magnitude going back to
at least Nehalem. But not on every larger system, interestingly.

> If this is indeed just disadvantageous placement of structures/arrays
> in memory then you might also find that after upgrading a previous
> good choice for max_connections becomes a bad choice if things move
> around.

In the thread around
https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6a2r@alap3.anarazel.de
I'd found doing more aggressive padding helped a lot. Unfortunately I
didn't pursue this further :(

> NOTE2: It is unclear why PG needs to support over 64K sessions. At
> about 10MB per backend(at the low end) the empty backends alone would
> consume 640GB's of memory! Changing pgprocnos from int to short gives
> me the following results.

I've argued this before. After that we reduced MAX_BACKENDS to 0x3FFFF -
I personally think we could go to 16 bits without it being a practical
problem.

I really suspect we're going to have to change the layout of PGXACT data
in a way that makes a contiguous scan possible. That'll probably require
some ugliness, because a naive "use first free slot" scheme obviously is
sensitive to there being holes. But I suspect that a scheme where
backends would occasionally try to move themselves to an earlier slot
wouldn't be too hard to implement.

Greetings,

Andres Freund
> On October 3, 2018 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
> In the thread around
> https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6a2r@alap3.anarazel.de
> I'd found doing more aggressive padding helped a lot. Unfortunately I
> didn't pursue this further :(

Interesting. Looks like you saw the same thing I see now.

> I really suspect we're going to have to change the layout of PGXACT data
> in a way that makes a contiguous scan possible. That'll probably require
> some uglyness because a naive "use first free slot" scheme obviously is
> sensitive to there being holes. But I suspect that a scheme were
> backends would occasionally try to move themselves to an earlier slot
> wouldn't be too hard to implement.

I've also thought of this. But someone could create a thousand
connections and then disconnect all but the first and last, creating a
huge hole.

One thought would be to have a bit array of used entries and to bias
toward using free lower-indexed entries. Only two cache lines of 1024
bits would need to be scanned to handle 1024 connections. ProcArrayAdd()
would just do an atomic xchg to set the used bit.

I was even thinking of decomposing PGXACT into separate arrays of XIDs,
XMINs, etc. Then have a bitmap of those entries where the XID is valid.
When you set/clear your XID you'd also set/clear this bit. With the
select-only workload the XIDs are all "Invalid", thus scanning this bit
array of zeroed 64-bit entries would be quite efficient. The vacuum and
logical decoding flags could be done directly as a bitmap.

Having an array of 1-byte subtxn counts creates another opportunity for
improvement. Scanning for a non-zero count in environments which use few
subtxns is efficient. If subtxns are used a lot, then the repeated
PGXACT cache line invalidations when setting nxids are offset by the
fact that you can grab 8 of them in a single fetch.

- Dan
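Dan's bit-array idea might look roughly like this as a standalone sketch. Assumptions not in the original: C11 atomics with a CAS loop rather than the literal atomic xchg he mentions, the GCC/Clang __builtin_ctzll builtin, and the invented names NSLOTS and claim_lowest_slot; none of this is PostgreSQL code.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

/* 1024 slots fit in sixteen 64-bit words, i.e. two 64-byte cache
 * lines, so the whole in-use map can be scanned with a few loads. */
#define NSLOTS 1024
static _Atomic uint64_t used[NSLOTS / 64];

static int claim_lowest_slot(void)
{
    for (int w = 0; w < NSLOTS / 64; w++)
    {
        uint64_t cur = atomic_load(&used[w]);
        while (~cur != 0)                      /* some bit free in this word */
        {
            int bit = __builtin_ctzll(~cur);   /* lowest clear bit */
            if (atomic_compare_exchange_strong(&used[w], &cur,
                                               cur | (UINT64_C(1) << bit)))
                return w * 64 + bit;           /* claimed slot number */
            /* cur was refreshed by the failed CAS; retry this word */
        }
    }
    return -1;                                 /* no free slot */
}
```

Biasing toward the lowest free slot keeps the active slots dense, which is exactly what a contiguous PGXACT scan wants.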
One other thought. Could we update pgxact->xmin less often? What would be the impact of this lower bound being lower than it would normally be with the existing scheme? Yes, it needs to be moved forward "occasionally".

FYI, be careful with padding PGXACT's to a full cache line. With 1024 clients you'd completely blow out the L1 cache. The benefits with fewer than 200 clients are dubious. One problem with multiple (5) pgxacts in a single cache line is that you may get a miss fetching xmin and then a second miss fetching xid, because an invalidation occurs between the 2 fetches due to updating any of the other pgxacts on the same cache line. I've found some improvement fetching both xmin and xid with a single 64-bit fetch. This prevents the invalidation between the two fetches. Similarly, updating them with a single 64-bit write helps. Yes, this is uber tweaky.
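The "single 64-bit fetch" trick can be sketched as below. Assumptions not in the original: the union layout, field names, and helper are invented, and the unpacking order assumes a little-endian machine; PostgreSQL's real PGXACT layout differs. The idea is simply that if the two 32-bit fields share one aligned 8-byte slot, a single load observes both from the same instant, so no invalidation can land between the reads.

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative pair of 32-bit fields packed into one 64-bit slot.
 * On a little-endian machine, xid occupies the low 32 bits. */
typedef union
{
    struct { uint32_t xid; uint32_t xmin; } f;
    uint64_t both;
} XidXmin;

static void read_pair(const XidXmin *p, uint32_t *xid, uint32_t *xmin)
{
    uint64_t v = p->both;          /* single 64-bit fetch of both fields */
    *xid  = (uint32_t) v;          /* low half  (little-endian assumption) */
    *xmin = (uint32_t) (v >> 32);  /* high half */
}
```

A store through p->both likewise publishes both fields with one write, which is the update-side half of the same trick.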
On 2018-10-03 17:01:44 -0700, Daniel Wood wrote:
> > I really suspect we're going to have to change the layout of PGXACT data
> > in a way that makes a contiguous scan possible. That'll probably require
> > some uglyness because a naive "use first free slot" scheme obviously is
> > sensitive to there being holes. But I suspect that a scheme were
> > backends would occasionally try to move themselves to an earlier slot
> > wouldn't be too hard to implement.
>
> I've also thought of this. But someone could create a thousand connections
> and then disconnect all but the first and last creating a huge hole.

Yea, that's what I was suggesting to address with "a scheme where
backends would occasionally try to move themselves to an earlier slot".
I was basically thinking that on occasions where a backend holds
ProcArrayLock exclusively, it could just compare the current number of
connections and its slot, and move itself earlier if it's more than,
say, 10% after the #num-connections slot. Or something like that.

The advantage of that approach would be that the size of the change is
probably fairly manageable.

Greetings,

Andres Freund
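A rough sketch of the compaction heuristic Andres describes, with the 10% threshold he suggests. Everything here is invented for illustration (the predicate name, the bool-array freelist); the real version would run under an exclusive ProcArrayLock and touch actual PGPROC bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>

/* A backend relocates only when its slot is more than ~10% past the
 * current connection count, so compaction work stays occasional. */
static bool should_relocate(int my_slot, int num_connections)
{
    return my_slot > num_connections + num_connections / 10;
}

/* Find the lowest free slot below my_slot, or keep my_slot if none. */
static int find_lower_free(const bool *slot_used, int my_slot)
{
    for (int i = 0; i < my_slot; i++)
        if (!slot_used[i])
            return i;
    return my_slot;
}
```

Because relocation only ever happens while holding the lock exclusively, no snapshot scan can observe the backend in two slots at once.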
On 10/3/18 11:29 PM, Daniel Wood wrote:
> If running benchmarks or you are a customer which is currently impacted by
> GetSnapshotData() on high end multisocket systems be wary of Skylake-S.
>
> Performance differences of nearly 2X can be seen on select only pgbench due to
> nothing else but unlucky choices for max_connections. Scale 1000, 192 local
> clients on a 2 socket 48 core Skylake-S(Xeon Platinum 8175M @ 2.50-GHz) system.
> pgbench -S

Hi,

Could it be related to:
https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
?
On Thu, Oct 4, 2018 at 9:50 AM Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
> On 10/3/18 11:29 PM, Daniel Wood wrote:
> > If running benchmarks or you are a customer which is currently impacted by
> > GetSnapshotData() on high end multisocket systems be wary of Skylake-S.
> [...]
> Could it be related to:
> https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
> ?

Unlikely. I understood from Daniel's email that profiling shows a
different hot spot. In the cited .NET issue the problem was mostly due
to issuing PAUSE in a loop without attempting to grab the lock. In
PostgreSQL it's called only once per retry attempt.

Regards,
Ants Aasma
--
PostgreSQL Senior Consultant
www.cybertec-postgresql.com
Austria (HQ), Wiener Neustadt | Switzerland, Zürich | Estonia, Tallinn | Uruguay, Montevideo
Facebook: www.fb.com/cybertec.postgresql
Twitter: www.twitter.com/PostgresSupport
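The distinction Ants draws can be illustrated with a toy lock: the problematic .NET pattern issued PAUSE repeatedly without re-attempting the acquisition, whereas a spin loop like PostgreSQL's interleaves one delay hint per failed attempt. This is a deliberately minimal sketch with an invented instrumentation counter, not s_lock.c.

```c
#include <stdatomic.h>
#include <assert.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static int delay_hints;   /* instrumentation for this sketch only */

/* Stand-in for PAUSE / SPIN_DELAY; just counts how often it runs. */
static void cpu_delay(void) { delay_hints++; }

static void acquire(void)
{
    /* One delay hint per failed test-and-set, then try the lock again -
     * the delay never runs in a loop detached from acquisition attempts. */
    while (atomic_flag_test_and_set(&lock))
        cpu_delay();
}

static void release(void) { atomic_flag_clear(&lock); }
```

In single-threaded use the lock is always free, so the delay path never executes at all.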
Hi,

On 2018-10-03 17:28:59 -0700, Daniel Wood wrote:
> FYI, be careful with padding PGXACT's to a full cache line.

I'm not actually thinking of doing that, but just of rounding it up so
we don't have PGXACTs spanning cachelines. It's currently 12 bytes, so
we end up with one spanning bytes 60-72, then one from 120-132, etc. In
my quick testing - on a larger Westmere machine, so quite possibly
outdated - that reduces the variance quite a bit. I think we should make
sure it's 16 bytes. What I'd previously suggested is that we use the
additional space for the database oid or such; that'd allow for some
future improvements.

I think we also should:

- make sure the pgprocnos array is cacheline aligned. It's not updated
  as frequently, but I still saw some minor benefits. Adding a
  pg_attribute_aligned(64) should be sufficient.

- remove the volatiles from GetSnapshotData(). As we've, for quite a
  while now, made sure both lwlocks and spinlocks are proper barriers,
  they're not needed.

- reduce the largely redundant flag tests. With the previous change
  done, the compiler should be able to do so, but there's no reason not
  to start from somewhere sane. I'm kinda wondering about backpatching
  this part.

To me this seems like a list of fairly straightforward changes that we
should do soon.

One thing I observed when re-familiarizing myself with this is that
currently the assignment of pgprocnos is a bit weird - we actually don't
start with the lowest pgprocno/PGPROC, but instead start with the
*highest* one. That's because InitProcGlobal() populates the free lists
in a 'highest first' manner:

    /*
     * Newly created PGPROCs for normal backends, autovacuum and bgworkers
     * must be queued up on the appropriate free list.  Because there can
     * only ever be a small, fixed number of auxiliary processes, no free
     * list is used in that case; InitAuxiliaryProcess() instead uses a
     * linear search.  PGPROCs for prepared transactions are added to a
     * free list by TwoPhaseShmemInit().
     */
    if (i < MaxConnections)
    {
        /* PGPROC for normal backend, add to freeProcs list */
        procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
        ProcGlobal->freeProcs = &procs[i];
        procs[i].procgloballist = &ProcGlobal->freeProcs;
    }

So we constantly push the current element to the start of the list. I,
and apparently the CPUs I have access to, think that's a poor idea. It
means there are less predictable access patterns at

    volatile PGXACT *pgxact = &allPgXact[pgprocno];

in GetSnapshotData(), because instead of starting with the first
backend, we start with something that changes on a regular basis. And
it's not likely to be on a cacheline boundary. I think we should change
that so procno 0 is the first on ->freeProcs. Obviously it's not
guaranteed that usage remains that way, but there's very little reason,
except some minor coding advantage, to keep it like it is.

Greetings,

Andres Freund
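The freelist-order point can be illustrated with a toy singly linked list. FakeProc and build_freelist are invented stand-ins, not the real PGPROC/SHM_QUEUE machinery: pushing slots 0..n-1 onto the head leaves the *highest* slot on top, so the first backend to connect gets the last slot; iterating in reverse index order puts slot 0 on top instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct FakeProc { struct FakeProc *next; int slot; } FakeProc;

static FakeProc *build_freelist(FakeProc *procs, int n, bool lowest_first)
{
    FakeProc *head = NULL;
    for (int k = 0; k < n; k++)
    {
        /* reverse iteration order leaves slot 0 at the list head */
        int i = lowest_first ? n - 1 - k : k;
        procs[i].slot = i;
        procs[i].next = head;   /* push onto head, as InitProcGlobal does */
        head = &procs[i];
    }
    return head;
}
```

With slot 0 handed out first, the hot PGXACT entries cluster at the start of the array, which is friendlier to prefetching and cacheline alignment during the snapshot scan.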
Hi,

On 2018-10-05 10:29:55 -0700, Andres Freund wrote:
> - remove the volatiles from GetSnapshotData(). As we've, for quite a
>   while now, made sure both lwlocks and spinlocks are proper barriers
>   they're not needed.

Attached is a patch that removes all the volatiles from procarray.c and
varsup.c. I'd started to just remove them from GetSnapshotData(), but
that doesn't seem particularly consistent.

I actually think there's a bit of a correctness problem with the
previous state - the logic in GetNewTransactionId() relies on volatile
guaranteeing memory ordering, which it doesn't do:

    /*
     * Use volatile pointer to prevent code rearrangement; other backends
     * could be examining my subxids info concurrently, and we don't want
     * them to see an invalid intermediate state, such as incrementing
     * nxids before filling the array entry.  Note we are assuming that
     * TransactionId and int fetch/store are atomic.
     */
    volatile PGPROC *myproc = MyProc;
    volatile PGXACT *mypgxact = MyPgXact;

    if (!isSubXact)
        mypgxact->xid = xid;
    else
    {
        int nxids = mypgxact->nxids;

        if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
        {
            myproc->subxids.xids[nxids] = xid;
            mypgxact->nxids = nxids + 1;
        }
    }

I've replaced that with a write barrier / read barrier. I strongly
suspect this isn't a problem on the write side in practice (due to the
dependent read), but the read side looks much less clear to me. I think
explicitly using barriers is much saner these days.

Comments?

> - reduce the largely redundant flag tests. With the previous change done
>   the compiler should be able to do so, but there's no reason to not
>   start from somewhere sane. I'm kinda wondering about backpatching
>   this part.

Done.

Greetings,

Andres Freund
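The barrier-based publication pattern the patch moves to can be sketched standalone. Names and types here are illustrative, and C11 fences stand in for PostgreSQL's pg_write_barrier()/pg_read_barrier(): the writer must fill the array entry before advancing nxids, and the reader must order its load of nxids before reading the entries it covers.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <assert.h>

static uint32_t subxids[64];
static atomic_int nxids;

static void publish_subxid(uint32_t xid)
{
    int n = atomic_load_explicit(&nxids, memory_order_relaxed);
    subxids[n] = xid;                            /* fill the entry first... */
    atomic_thread_fence(memory_order_release);   /* ...then publish the count */
    atomic_store_explicit(&nxids, n + 1, memory_order_relaxed);
}

static int snapshot_subxids(uint32_t *out)
{
    int n = atomic_load_explicit(&nxids, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);   /* pairs with the release fence */
    for (int i = 0; i < n; i++)
        out[i] = subxids[i];                     /* entries up to n are valid */
    return n;
}
```

With volatile alone, the compiler is constrained but the CPU is not; the release/acquire pairing is what actually forbids a reader from seeing the incremented count before the filled entry.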
On Sat, Nov 10, 2018 at 6:01 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-10-05 10:29:55 -0700, Andres Freund wrote:
> > - remove the volatiles from GetSnapshotData(). As we've, for quite a
> > while now, made sure both lwlocks and spinlocks are proper barriers
> > they're not needed.
>
> Attached is a patch that removes all the volatiles from procarray.c and
> varsup.c. I'd started to just remove them from GetSnapshotData(), but
> that doesn't seem particularly consistent.
>
> [...]
>
> I've replaced that with a write barrier / read barrier. I strongly
> suspect this isn't a problem on the write side in practice (due to the
> dependent read), but the read side looks much less clear to me. I think
> explicitly using barriers is much saner these days.
>
> Comments?

+1

I said the same over here:

https://www.postgresql.org/message-id/flat/CAEepm%3D1nff0x%3D7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
On 2018-11-11 11:29:54 +1300, Thomas Munro wrote:
> On Sat, Nov 10, 2018 at 6:01 PM Andres Freund <andres@anarazel.de> wrote:
> > I've replaced that with a write barrier / read barrier. I strongly
> > suspect this isn't a problem on the write side in practice (due to the
> > dependent read), but the read side looks much less clear to me. I think
> > explicitly using barriers is much saner these days.
> >
> > Comments?
>
> +1
>
> I said the same over here:
> https://www.postgresql.org/message-id/flat/CAEepm%3D1nff0x%3D7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g%40mail.gmail.com

Hah! Sorry for missing that back then.

I think your patch from back then misses a few things that mine did. But
I also think mine missed the fact that XidCacheRemove is problematic - I
was only thinking of the *reads* of MyPgXact->nxids in
XidCacheRemoveRunningXids(), but you correctly observe that the write is
the problematic case (the reads are OK, because it's the same process as
GetNewTransactionId()).

Greetings,

Andres Freund