Thread: Skylake-S warning

Skylake-S warning

From
Daniel Wood
Date:

If you are running benchmarks, or are a customer currently impacted by GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.


Performance differences of nearly 2X can be seen on select-only pgbench due to nothing but unlucky choices for max_connections.  Scale 1000, 192 local clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system.  pgbench -S


Results from 5 runs varying max_connections from 400 to 405:


max_conn      TPS
400        677639
401       1146776
402       1122140
403        765664
404        671455
405       1190277
...


perf top shows GetSnapshotData() at about 21% with the good numbers and at about 48% with the bad ones.


This problem is not seen on a 2-socket, 32-core Haswell system.  Being a one-man show, I lack some of the diagnostic tools to drill down further.  My suspicion is that Intel lowering the L2 associativity from 8-way (Haswell) to 4-way (Skylake-S) may be the cause.  The other possibility is that at higher core counts the shared 16-way inclusive L3 cache becomes insufficient; perhaps that is why Intel moved to an exclusive L3 cache on Skylake-SP.


If this is indeed just disadvantageous placement of structures/arrays in memory, then you might also find that after upgrading, a previously good choice for max_connections becomes a bad one if things move around.


NOTE: the line "int pgprocno = pgprocnos[index];" is where the big increase in time occurs in GetSnapshotData().  The pgprocnos array itself is largely read-only once all connections are established, easily fits in L1, and is not next to anything else causing invalidations.
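For anyone following along, the hot loop in GetSnapshotData() (paraphrased here from procarray.c of the releases under discussion; the per-backend checks are elided) looks roughly like this:

    numProcs = arrayP->numProcs;
    for (index = 0; index < numProcs; index++)
    {
        /* the dense pgprocnos[] array says which slots are live ... */
        int         pgprocno = arrayP->pgprocnos[index];

        /* ... and is used to index into the sparse allPgXact[] array */
        volatile PGXACT *pgxact = &allPgXact[pgprocno];

        /* per-backend vacuumFlags / xmin / xid checks elided */
    }

So the array itself is tiny and cache friendly; the values it yields decide which PGXACT cache lines get touched, and that is where the placement luck comes in.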


NOTE2: It is unclear why PG needs to support over 64K sessions.  At about 10MB per backend (at the low end), the empty backends alone would consume 640GB of memory!  Changing pgprocnos from int to short gives me the following results.


max_conn      TPS
400        780119
401       1129286
402       1263093
403        887021
404        679891
405       1218118


While this change is significant on large Skylake systems, it is likely just a trivial improvement on other systems or workloads.
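For reference, the int-to-short change above amounts to roughly the following tweak to ProcArrayStruct in procarray.c (only a sketch: the locals and the ProcArrayAdd()/ProcArrayRemove() code touching pgprocnos need the same treatment, and it obviously presumes backend numbers fit in 16 bits):

    typedef struct ProcArrayStruct
    {
        int         numProcs;       /* number of valid procs entries */
        int         maxProcs;       /* allocated size of procs array */

        /* ... other fields unchanged ... */

        /* indexes into allPgXact[]; was: int pgprocnos[FLEXIBLE_ARRAY_MEMBER] */
        int16       pgprocnos[FLEXIBLE_ARRAY_MEMBER];
    } ProcArrayStruct;

That halves the bytes GetSnapshotData() has to walk through before it even reaches the PGXACT entries.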

Re: Skylake-S warning

From
Andres Freund
Date:
Hi,


On 2018-10-03 14:29:39 -0700, Daniel Wood wrote:
> If you are running benchmarks, or are a customer currently impacted
> by GetSnapshotData() on high-end multisocket systems, be wary of
> Skylake-S.

> Performance differences of nearly 2X can be seen on select-only
> pgbench due to nothing but unlucky choices for max_connections.
> Scale 1000, 192 local clients on a 2-socket, 48-core Skylake-S (Xeon
> Platinum 8175M @ 2.50 GHz) system.  pgbench -S

FWIW, I've seen performance differences of that magnitude going back to
at least Nehalem. But not on every larger system, interestingly.


> If this is indeed just disadvantageous placement of structures/arrays
> in memory, then you might also find that after upgrading, a
> previously good choice for max_connections becomes a bad one if
> things move around.

In the thread around https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6a2r@alap3.anarazel.de
I'd found doing more aggressive padding helped a lot.  Unfortunately I
didn't pursue this further :(


> NOTE2: It is unclear why PG needs to support over 64K sessions.  At
> about 10MB per backend (at the low end), the empty backends alone
> would consume 640GB of memory!  Changing pgprocnos from int to short
> gives me the following results.

I've argued this before. After that we reduced MAX_BACKENDS to 0x3FFFF -
I personally think we could go to 16 bits without it being a practical
problem.

I really suspect we're going to have to change the layout of PGXACT data
in a way that makes a contiguous scan possible. That'll probably require
some ugliness, because a naive "use first free slot" scheme obviously is
sensitive to there being holes.  But I suspect that a scheme where
backends would occasionally try to move themselves to an earlier slot
wouldn't be too hard to implement.

Greetings,

Andres Freund


Re: Skylake-S warning

From
Daniel Wood
Date:
> On October 3, 2018 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:

> In the thread around https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6a2r@alap3.anarazel.de
> I'd found doing more aggressive padding helped a lot.  Unfortunately I
> didn't pursue this further :(

Interesting.  Looks like you saw the same thing I see now.

> I really suspect we're going to have to change the layout of PGXACT data
> in a way that makes a contiguous scan possible. That'll probably require
> some ugliness, because a naive "use first free slot" scheme obviously is
> sensitive to there being holes.  But I suspect that a scheme where
> backends would occasionally try to move themselves to an earlier slot
> wouldn't be too hard to implement.

I've also thought of this.  But someone could create a thousand connections and
then disconnect all but the first and last, creating a huge hole.  One thought
would be to have a bit array of used entries and to bias toward using free,
lower-indexed entries.  Only two cache lines' worth of bits (1024 bits) would
need to be scanned to handle 1024 connections.  ProcArrayAdd() would just do an
atomic xchg to set the used bit.
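A minimal sketch of that idea, with made-up names (PROC_SLOTS, claim_lowest_free_slot) rather than the real ProcArray code, using PG's 64-bit atomics plus a GCC builtin for the bit scan:

    #include "postgres.h"
    #include "port/atomics.h"

    #define PROC_SLOTS  1024                /* e.g. max_connections */
    #define SLOT_WORDS  (PROC_SLOTS / 64)

    /* in reality this would live in shared memory next to the proc array */
    static pg_atomic_uint64 used_bitmap[SLOT_WORDS];

    /* claim the lowest-numbered free slot; returns -1 if all are in use */
    static int
    claim_lowest_free_slot(void)
    {
        int         w;

        for (w = 0; w < SLOT_WORDS; w++)
        {
            uint64      used = pg_atomic_read_u64(&used_bitmap[w]);

            while (used != PG_UINT64_MAX)
            {
                int         bit = __builtin_ctzll(~used);   /* lowest clear bit */
                uint64      mask = UINT64CONST(1) << bit;

                /* the "atomic xchg": set the bit, check it was clear before */
                used = pg_atomic_fetch_or_u64(&used_bitmap[w], mask);
                if ((used & mask) == 0)
                    return w * 64 + bit;    /* we own this slot */
                /* lost the race for that bit; 'used' is fresh, try the next */
            }
        }
        return -1;
    }

ProcArrayRemove() would clear the bit again with an atomic fetch-and of the inverted mask.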

I was even thinking of decomposing PGXACT into separate arrays of XIDs, XMINs,
etc., and then having a bitmap of the entries whose XID is valid.  When you
set/clear your XID you'd also set/clear this bit.  With the select-only workload
the XIDs are all invalid, so scanning this bit array of zeroed 64-bit words
would be quite efficient.  The vacuum and logical decoding flags could be done
directly as bitmaps.  Having an array of 1-byte subtransaction counts creates
another opportunity for improvement: scanning for a non-zero byte in
environments which use few subtransactions is efficient, and if subtransactions
are used a lot, the repeated PGXACT cache line invalidations when setting nxids
are offset by the fact that you can grab 8 counts in a single fetch.
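Roughly the layout I have in mind, as a hypothetical sketch (none of these names exist today):

    /* one flat, densely packed array per hot field, indexed by pgprocno */
    typedef struct DenseXactArrays
    {
        TransactionId *xids;        /* top-level XID, or InvalidTransactionId */
        TransactionId *xmins;       /* per-backend snapshot xmin */
        uint8         *nsubxids;    /* cached subtransaction counts, 1 byte each */

        /* flags become bitmaps, one bit per backend */
        uint64        *in_vacuum;
        uint64        *in_logical_decoding;
        uint64        *xid_valid;   /* bit i set => xids[i] holds a real XID */
    } DenseXactArrays;

Whether each flag really needs its own bitmap is a detail; the point is that the fields GetSnapshotData() reads become dense and, on this workload, mostly zero.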

- Dan


Re: Skylake-S warning

From
Daniel Wood
Date:
One other thought.  Could we update pgxact->xmin less often?  What would be the
impact of this lower bound being lower than it would normally be with the
existing scheme?  Yes, it would need to be moved forward "occasionally".


FYI, be careful with padding PGXACTs to a full cache line.  With 1024 clients
you'd completely blow out the L1 cache.  The benefit with fewer than 200 clients
is dubious.  One problem with multiple (5) PGXACTs in a single cache line is
that you may get a miss fetching xmin and then a second miss fetching xid,
because an invalidation occurs between the two fetches when any of the other
PGXACTs on the same cache line is updated.  I've found some improvement fetching
both xmin and xid with a single 64-bit fetch; this prevents the invalidation
between the two fetches.  Similarly, updating them with a single 64-bit write
helps.
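A sketch of the combined fetch, assuming xid and xmin stay adjacent 32-bit fields at the start of PGXACT and that the (possibly misaligned) 8-byte load is acceptable on the target, as it is on x86:

    typedef union
    {
        struct
        {
            TransactionId xid;
            TransactionId xmin;
        }           pair;
        uint64      packed;
    } XidXminFetch;

    XidXminFetch f;

    /* one 8-byte load instead of two 4-byte loads that an invalidation from a
     * neighbouring PGXACT update can land between */
    f.packed = *(volatile uint64 *) &pgxact->xid;
    xid = f.pair.xid;
    xmin = f.pair.xmin;

The store side is the mirror image: fill in the union and write f.packed in one go.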
 

Yes, this is uber tweaky.


Re: Skylake-S warning

From
Andres Freund
Date:
On 2018-10-03 17:01:44 -0700, Daniel Wood wrote:
> 
> > On October 3, 2018 at 3:55 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > In the thread around https://www.postgresql.org/message-id/20160411214029.ce3fw6zxim5k6a2r@alap3.anarazel.de
> > I'd found doing more aggressive padding helped a lot.  Unfortunately I
> > didn't pursue this further :(
> 
> Interesting.  Looks like you saw the same thing I see now.
> 
> > I really suspect we're going to have to change the layout of PGXACT data
> > in a way that makes a contiguous scan possible. That'll probably require
> > some ugliness, because a naive "use first free slot" scheme obviously is
> > sensitive to there being holes.  But I suspect that a scheme where
> > backends would occasionally try to move themselves to an earlier slot
> > wouldn't be too hard to implement.
> 
> I've also thought of this.  But someone could create a thousand connections and
> then disconnect all but the first and last creating a huge hole.

Yea, that's what I was suggesting to address with "a scheme where
backends would occasionally try to move themselves to an earlier
slot". I was basically thinking that, on occasions where a backend holds
ProcArrayLock exclusively, it could just compare the current number of
connections to its slot and move itself earlier if it's more than, say,
10% after the #num-connections slot.  Or something.

The advantage of that approach would be that the size of the change is
probably fairly manageable.
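Something like the following, where find_lowest_free_pgprocno() and move_my_procarray_slot() are purely hypothetical helpers - this is only meant to show the threshold test, not a workable patch:

    /* run at a point where we already hold ProcArrayLock exclusively */
    if (MyProc->pgprocno > procArray->numProcs + procArray->numProcs / 10)
    {
        int         target = find_lowest_free_pgprocno();

        if (target >= 0 && target < MyProc->pgprocno)
            move_my_procarray_slot(target);
    }

The helpers are where the real work (and the ugliness) hides, of course.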

Greetings,

Andres Freund


Re: Skylake-S warning

From
Adrien Nayrat
Date:
On 10/3/18 11:29 PM, Daniel Wood wrote:
> If you are running benchmarks, or are a customer currently impacted by
> GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.
>
>
> Performance differences of nearly 2X can be seen on select-only pgbench due to
> nothing but unlucky choices for max_connections.  Scale 1000, 192 local
> clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system.
> pgbench -S
>

Hi,

Could it be related to:
https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
?





Re: Skylake-S warning

From
Ants Aasma
Date:
On Thu, Oct 4, 2018 at 9:50 AM Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
>
> On 10/3/18 11:29 PM, Daniel Wood wrote:
> > If you are running benchmarks, or are a customer currently impacted by
> > GetSnapshotData() on high-end multisocket systems, be wary of Skylake-S.
> >
> >
> > Performance differences of nearly 2X can be seen on select-only pgbench due to
> > nothing but unlucky choices for max_connections.  Scale 1000, 192 local
> > clients on a 2-socket, 48-core Skylake-S (Xeon Platinum 8175M @ 2.50 GHz) system.
> > pgbench -S
>
> Could it be related to :
> https://www.postgresql.org/message-id/D2B9F2A20670C84685EF7D183F2949E2373E66%40gigant.nidsa.net
> ?


Unlikely. I understood from Daniel's email that profiling shows a
different hot-spot. In the cited .NET issue the problem was mostly due
to issuing PAUSE in a loop without attempting to grab the lock. In
PostgreSQL it's called only once per retry attempt.

Regards,
Ants Aasma
--
PostgreSQL Senior Consultant
www.cybertec-postgresql.com

Austria (HQ), Wiener Neustadt  |  Switzerland, Zürich  |  Estonia,
Tallinn  |  Uruguay, Montevideo
Facebook: www.fb.com/cybertec.postgresql
Twitter: www.twitter.com/PostgresSupport


Re: Skylake-S warning

From
Andres Freund
Date:
Hi,

On 2018-10-03 17:28:59 -0700, Daniel Wood wrote:
> FYI, be careful with padding PGXACT's to a full cache line.

I'm not actually thinking of doing that, but just of rounding it up so
we don't have PGXACTs spanning cachelines. It's currently 12 bytes, so
we end up with one entry spanning bytes 60-72, another spanning 120-132,
etc.  In my quick testing - on a larger Westmere machine, so quite
possibly outdated - rounding it up reduces the variance quite a bit.  I
think we should make sure it's 16 bytes. What I'd previously suggested
is that we use the additional space for the database oid or such; that'd
allow for some future improvements.
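Concretely, something like this - a sketch against the PGXACT definition in proc.h, with databaseId standing in for the "database oid or such" use of the slack rather than plain padding:

    /* 12 bytes today; filling it out to 16 stops entries from straddling
     * cachelines */
    typedef struct PGXACT
    {
        TransactionId xid;          /* current top-level XID, or invalid */
        TransactionId xmin;         /* snapshot xmin of this backend */

        uint8       vacuumFlags;    /* PROC_* vacuum-related flags */
        bool        overflowed;     /* subxid cache overflowed? */
        bool        delayChkpt;     /* delaying checkpoint start? */
        uint8       nxids;          /* # of cached subxids */

        Oid         databaseId;     /* uses the slack, rounds struct to 16 bytes */
    } PGXACT;

With 16-byte entries a 64-byte line holds exactly four PGXACTs and none of them spans two lines.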


I think we also should:
- make sure the pgprocnos array is cacheline aligned. It's not updated
  as frequently, but I still saw some minor benefits. Adding a
  pg_attribute_aligned(64) should be sufficient.
- remove the volatiles from GetSnapshotData(). As we've, for quite a
  while now, made sure both lwlocks and spinlocks are proper barriers
  they're not needed.
- reduce the largely redundant flag tests. With the previous change done
  the compiler should be able to do so, but there's no reason to not
  start from somewhere sane.  I'm kinda wondering about backpatching
  this part.

To me this list seems like fairly straightforward changes that we
should do soon.


One thing I observed when re-familiarizing myself with this is that
currently the assignment of pgprocnos is a bit weird - we actually don't
start with the lowest pgprocno/PGPROC, but instead start with the
*highest* one.  That's because InitProcGlobal() populates the free lists
in a 'highest first' manner:

        /*
         * Newly created PGPROCs for normal backends, autovacuum and bgworkers
         * must be queued up on the appropriate free list.  Because there can
         * only ever be a small, fixed number of auxiliary processes, no free
         * list is used in that case; InitAuxiliaryProcess() instead uses a
         * linear search.   PGPROCs for prepared transactions are added to a
         * free list by TwoPhaseShmemInit().
         */
        if (i < MaxConnections)
        {
            /* PGPROC for normal backend, add to freeProcs list */
            procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
            ProcGlobal->freeProcs = &procs[i];
            procs[i].procgloballist = &ProcGlobal->freeProcs;
        }

So we constantly push the current element to the start of the list.

I, and apparently the CPUs I have access to, think that's a poor
idea. It means there are less predictable access patterns at volatile
PGXACT *pgxact = &allPgXact[pgprocno]; in GetSnapshotData(), because
instead of starting with the first backend, we start with something that
changes on a regular basis. And it's not likely to be on a cacheline
boundary.  I think we should change that so pgprocno 0 is the first on
->freeProcs.  Obviously that doesn't guarantee that usage remains that
way, but there's very little reason, beyond a minor coding advantage,
to keep it like it is.
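One way to get there, sketched against the InitProcGlobal() loop quoted above (TotalProcs being the existing loop bound; nothing else the loop initializes cares about the iteration order):

    /* walk the PGPROC array backwards, so that after all the head insertions
     * below procs[0] ends up first on freeProcs and low pgprocnos get handed
     * out first */
    for (i = TotalProcs - 1; i >= 0; i--)
    {
        if (i < MaxConnections)
        {
            /* PGPROC for normal backend, add to freeProcs list */
            procs[i].links.next = (SHM_QUEUE *) ProcGlobal->freeProcs;
            ProcGlobal->freeProcs = &procs[i];
            procs[i].procgloballist = &ProcGlobal->freeProcs;
        }
        /* ... autovacuum / bgworker free lists handled the same way ... */
    }

Alternatively, keep the forward iteration and append to the tail of each list instead of pushing onto the head.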

Greetings,

Andres Freund


Re: Skylake-S warning

From
Andres Freund
Date:
Hi,

On 2018-10-05 10:29:55 -0700, Andres Freund wrote:
> - remove the volatiles from GetSnapshotData(). As we've, for quite a
>   while now, made sure both lwlocks and spinlocks are proper barriers
>   they're not needed.

Attached is a patch that removes all the volatiles from procarray.c and
varsup.c. I'd started to just remove them from GetSnapshotData(), but
that doesn't seem particularly consistent.

I actually think there's a bit of a correctness problem with the
previous state - the logic in GetNewTransactionId() relies on volatile
guaranteeing memory ordering, which it doesn't do:

         * Use volatile pointer to prevent code rearrangement; other backends
         * could be examining my subxids info concurrently, and we don't want
         * them to see an invalid intermediate state, such as incrementing
         * nxids before filling the array entry.  Note we are assuming that
         * TransactionId and int fetch/store are atomic.
         */
        volatile PGPROC *myproc = MyProc;
        volatile PGXACT *mypgxact = MyPgXact;

        if (!isSubXact)
            mypgxact->xid = xid;
        else
        {
            int            nxids = mypgxact->nxids;

            if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
            {
                myproc->subxids.xids[nxids] = xid;
                mypgxact->nxids = nxids + 1;
            }

I've replaced that with a write barrier / read barrier.  I strongly
suspect this isn't a problem on the write side in practice (due to the
dependent read), but the read side looks much less clear to me.  I think
explicitly using barriers is much saner these days.
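In other words, the write side ends up looking roughly like this (a sketch of the idea, not the attached patch itself):

    if (!isSubXact)
        MyPgXact->xid = xid;
    else
    {
        int         nxids = MyPgXact->nxids;

        if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
        {
            MyProc->subxids.xids[nxids] = xid;

            /* make the array entry visible before the new count */
            pg_write_barrier();
            MyPgXact->nxids = nxids + 1;
        }
    }

with readers doing a pg_read_barrier() between fetching nxids and walking subxids.xids[].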

Comments?


> - reduce the largely redundant flag tests. With the previous change done
>   the compiler should be able to do so, but there's no reason to not
>   start from somewhere sane.  I'm kinda wondering about backpatching
>   this part.

Done.

Greetings,

Andres Freund


Re: Skylake-S warning

From
Thomas Munro
Date:
On Sat, Nov 10, 2018 at 6:01 PM Andres Freund <andres@anarazel.de> wrote:
> On 2018-10-05 10:29:55 -0700, Andres Freund wrote:
> > - remove the volatiles from GetSnapshotData(). As we've, for quite a
> >   while now, made sure both lwlocks and spinlocks are proper barriers
> >   they're not needed.
>
> Attached is a patch that removes all the volatiles from procarray.c and
> varsup.c. I'd started to just remove them from GetSnapshotData(), but
> that doesn't seem particularly consistent.
>
> I actually think there's a bit of a correctness problem with the
> previous state - the logic in GetNewTransactionId() relies on volatile
> guaranteeing memory ordering, which it doesn't do:
>
>                  * Use volatile pointer to prevent code rearrangement; other backends
>                  * could be examining my subxids info concurrently, and we don't want
>                  * them to see an invalid intermediate state, such as incrementing
>                  * nxids before filling the array entry.  Note we are assuming that
>                  * TransactionId and int fetch/store are atomic.
>                  */
>                 volatile PGPROC *myproc = MyProc;
>                 volatile PGXACT *mypgxact = MyPgXact;
>
>                 if (!isSubXact)
>                         mypgxact->xid = xid;
>                 else
>                 {
>                         int                     nxids = mypgxact->nxids;
>
>                         if (nxids < PGPROC_MAX_CACHED_SUBXIDS)
>                         {
>                                 myproc->subxids.xids[nxids] = xid;
>                                 mypgxact->nxids = nxids + 1;
>                         }
>
> I've replaced that with a write barrier / read barrier.  I strongly
> suspect this isn't a problem on the write side in practice (due to the
> dependent read), but the read side looks much less clear to me.  I think
> explicitly using barriers is much saner these days.
>
> Comments?

+1

I said the same over here:

https://www.postgresql.org/message-id/flat/CAEepm%3D1nff0x%3D7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Skylake-S warning

From
Andres Freund
Date:
On 2018-11-11 11:29:54 +1300, Thomas Munro wrote:
> On Sat, Nov 10, 2018 at 6:01 PM Andres Freund <andres@anarazel.de> wrote:
> > I've replaced that with a write barrier / read barrier.  I strongly
> > suspect this isn't a problem on the write side in practice (due to the
> > dependent read), but the read side looks much less clear to me.  I think
> > explicitly using barriers is much saner these days.
> >
> > Comments?
> 
> +1
> 
> I said the same over here:
> 
> https://www.postgresql.org/message-id/flat/CAEepm%3D1nff0x%3D7i3YQO16jLA2qw-F9O39YmUew4oq-xcBQBs0g%40mail.gmail.com

Hah! Sorry for missing that back then.  I think your patch from back
then misses a few things that mine covered. But I also think mine missed
the fact that XidCacheRemoveRunningXids() is problematic - I was only
thinking of the *reads* of MyPgXact->nxids in XidCacheRemoveRunningXids(),
but you correctly observe that the write is the problematic case (the
reads are OK, because it's the same process as GetNewTransactionId()).

Greetings,

Andres Freund