Re: spin_delay() for ARM - Mailing list pgsql-hackers

From Pavel Stehule
Subject Re: spin_delay() for ARM
Msg-id CAFj8pRDYc+t4oDBa01ErU3oSa2sMUSki3A2sqALz9rjo50034w@mail.gmail.com
In response to Re: spin_delay() for ARM  (Amit Khandekar <amitdkhan.pg@gmail.com>)
Responses Re: spin_delay() for ARM  (Ants Aasma <ants@cybertec.at>)
List pgsql-hackers


On Thu, 16 Apr 2020 at 9:18, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks.  As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> >     I think optimizing spinlocks for machines with only a few CPUs is
> >     probably pointless.  Based on what I've seen so far, spinlock
> >     contention even at 16 CPUs is negligible pretty much no matter what
> >     you do.  Whether your implementation is fast or slow isn't going to
> >     matter, because even an inefficient implementation will account for
> >     only a negligible percentage of the total CPU time - much less than 1%
> >     - as opposed to a 64-core machine, where it's not that hard to find
> >     cases where spin-waits consume the *majority* of available CPU time
> >     (recall previous discussion of lseek).
>
> Yeah, will check if I can find some machines with a large number of cores.

I got hold of a 32-CPU VM (actually a 16-core machine, but with
hyperthreading it presents 32 CPUs).
It was an Intel Xeon, 3 GHz CPU, with 15 GB of available memory.
Hypervisor: KVM. Single NUMA node.
PG parameters changed: shared_buffers = 8GB; max_connections = 1000

I compared pgbench results on HEAD versus with the PAUSE removed, like this:
 perform_spin_delay(SpinDelayStatus *status)
 {
-       /* CPU-specific delay each time through the loop */
-       SPIN_DELAY();
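
(For context, a rough sketch of what SPIN_DELAY() boils down to, not the
exact patch: on x86_64 HEAD it is essentially a PAUSE, spelled "rep; nop",
while the ARM64 proposals discussed in this thread use a YIELD- or
ISB-style wait hint. The #if structure below is illustrative only.)

#if defined(__x86_64__)
static __inline__ void
spin_delay(void)
{
    /* equivalent to the PAUSE instruction */
    __asm__ __volatile__(" rep; nop \n");
}
#elif defined(__aarch64__)
/* sketch of the ARM-side idea: a cheap wait hint inside the spin loop */
static __inline__ void
spin_delay(void)
{
    __asm__ __volatile__(" isb \n");
}
#endif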

I ran with an increasing number of parallel clients:
pgbench -S -c $num -j $num -T 60 -M prepared
but couldn't find any significant change in the TPS numbers with or
without PAUSE:

Clients     HEAD     Without_PAUSE
8         244446       247264
16        399939       399549
24        454189       453244
32       1097592      1098844
40       1090424      1087984
48       1068645      1075173
64       1035035      1039973
96        976578       970699

Maybe it will indeed show some difference only at around 64 cores, or
perhaps a bare-metal machine would help; but as of now I haven't got
hold of such a machine. Anyway, I thought I might as well archive the
results I have.

Not relevant to the PAUSE stuff, but note that when the number of
parallel clients goes from 24 to 32 (which equals the machine's CPU
count), the TPS shoots from 454189 to 1097592, which is more than
double with only about a 30% increase in parallel sessions. I was not
expecting that much of a speed-up, because in the contended scenario
the pgbench processes are already taking around 20% of the total CPU
time of the pgbench run. Maybe later on I will get a chance to run a
customized pgbench script that calls a server function which keeps
running an index scan on pgbench_accounts, so as to keep the pgbench
clients almost idle.

As far as I know, pgbench cannot be used for testing spinlock problems.

Maybe you can see this issue when you a) use a much higher number of
clients - hundreds or thousands - and b) decrease shared memory
(shared_buffers), so there is pressure on the related spin locks.
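
Or, as was suggested upthread, test the spin loop directly instead of
going through pgbench. A minimal standalone sketch (not PostgreSQL code;
the thread count, iteration count, and spin_hint() helper are just
illustrative) where N threads hammer a single test-and-set lock, so the
PAUSE/ISB hint can be toggled and compared:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NITER 1000000

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static inline void
spin_hint(void)
{
#if defined(__x86_64__)
    __asm__ __volatile__(" rep; nop \n");   /* PAUSE */
#elif defined(__aarch64__)
    __asm__ __volatile__(" isb \n");        /* ARM-style wait hint */
#endif                                      /* else: spin with no hint */
}

static void *
worker(void *arg)
{
    (void) arg;
    for (int i = 0; i < NITER; i++)
    {
        /* spin until we grab the lock, issuing the hint while we wait */
        while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
            spin_hint();
        counter++;              /* trivial critical section */
        atomic_flag_clear_explicit(&lock, memory_order_release);
    }
    return NULL;
}

int
main(int argc, char **argv)
{
    int         nthreads = (argc > 1) ? atoi(argv[1]) : 8;
    pthread_t  *threads = malloc(nthreads * sizeof(pthread_t));

    for (int i = 0; i < nthreads; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);

    /* time this run and compare the hint vs. no-hint builds */
    printf("counter = %ld (expected %ld)\n",
           counter, (long) nthreads * (long) NITER);
    free(threads);
    return 0;
}

Build with something like "cc -O2 -pthread spintest.c -o spintest" and
run it with a thread count above and below the core count; the
difference between the hint and no-hint builds should show up much more
directly than in pgbench TPS.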

Regards

Pavel


Thanks
-Amit Khandekar

