Thread: Re: use a non-locking initial test in TAS_SPIN on AArch64

Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Nathan Bossart
On Tue, Oct 22, 2024 at 02:54:57PM -0500, Nathan Bossart wrote:
> My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
> showed an enormous amount of s_lock() time going to the
> __sync_lock_test_and_set() call in the AArch64 implementation of tas().
> Upon closer inspection, I noticed that we don't implement a custom
> TAS_SPIN() for this architecture, so I quickly hacked together the attached
> patch and ran a couple of benchmarks that stressed the spinlock code.  I
> found no discussion about TAS_SPIN() on ARM in the archives, but I did
> notice that the initial AArch64 support was added [0] before x86_64 started
> using a non-locking test [1].
> 
> These benchmarks are for a c8g.24xlarge running a select-only pgbench with
> 256 clients and pg_stat_statements.track_planning enabled.
> 
> without the patch:
>
> [...] 
> 
>     tps = 74135.100891 (without initial connection time)
> 
> with the patch:
>
> [...] 
> 
>     tps = 549462.785554 (without initial connection time)
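
For anyone following along without opening the attachment: the idea is simply
to give AArch64 the same treatment x86_64 already gets, i.e. have the spin
loop re-test the lock with a plain load and only retry the atomic swap once
the lock looks free.  Roughly (the attached patch may differ in detail):

/*
 * Sketch of the idea for the AArch64 section of s_lock.h; details and
 * surrounding guards elided.  tas() keeps using __sync_lock_test_and_set(),
 * but while spinning we first peek at the lock with a plain read so that
 * waiters stop bouncing the cache line around with failed atomic operations.
 */
typedef int slock_t;

static __inline__ int
tas(volatile slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

#define TAS(lock)       tas(lock)

/* only retry the atomic swap once a plain read says the lock looks free */
#define TAS_SPIN(lock)  (*(lock) ? 1 : TAS(lock))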

Are there any objections to proceeding with this change?  So far, it's been
tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to
show any effect).  If anyone has access to a larger ARM machine, additional
testing would be greatly appreciated.  I think it would be unfortunate if
this slipped to v19.

-- 
nathan



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Andres Freund
Hi,

On 2025-01-08 12:12:19 -0600, Nathan Bossart wrote:
> On Tue, Oct 22, 2024 at 02:54:57PM -0500, Nathan Bossart wrote:
> > My colleague Salvatore Dipietro (CC'd) sent me a couple of profiles that
> > showed an enormous amount of s_lock() time going to the
> > __sync_lock_test_and_set() call in the AArch64 implementation of tas().
> > Upon closer inspection, I noticed that we don't implement a custom
> > TAS_SPIN() for this architecture, so I quickly hacked together the attached
> > patch and ran a couple of benchmarks that stressed the spinlock code.  I
> > found no discussion about TAS_SPIN() on ARM in the archives, but I did
> > notice that the initial AArch64 support was added [0] before x86_64 started
> > using a non-locking test [1].
> >
> > These benchmarks are for a c8g.24xlarge running a select-only pgbench with
> > 256 clients and pg_stat_statements.track_planning enabled.
> >
> > without the patch:
> >
> > [...]
> >
> >     tps = 74135.100891 (without initial connection time)
> >
> > with the patch:
> >
> > [...]
> >
> >     tps = 549462.785554 (without initial connection time)
>
> Are there any objections to proceeding with this change?  So far, it's been
> tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to
> show any effect).  If anyone has access to a larger ARM machine, additional
> testing would be greatly appreciated.  I think it would be unfortunate if
> this slipped to v19.

Seems reasonable. It's not surprising that spinning with an atomic operation
doesn't scale very well once you have more than a handful of cores.

Of course the whole way locking works in pg_stat_statements is a scalability
disaster, but that's a bigger project to fix.

Out of curiosity, have you measured whether this has a positive effect without
pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use
perform_spin_delay().

Greetings,

Andres



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> Are there any objections to proceeding with this change?  So far, it's been
> tested on a c8g.24xlarge and an Apple M3 (which seems to be too small to
> show any effect).  If anyone has access to a larger ARM machine, additional
> testing would be greatly appreciated.  I think it would be unfortunate if
> this slipped to v19.

I just acquired an M4 Pro, which may also be too small to show any
effect, but perhaps running the test there would at least give us
more confidence that there's not a bad effect.  Which test case(s)
would you recommend trying?

            regards, tom lane



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Nathan Bossart
On Wed, Jan 08, 2025 at 03:23:45PM -0500, Tom Lane wrote:
> I just acquired an M4 Pro, which may also be too small to show any
> effect, but perhaps running the test there would at least give us
> more confidence that there's not a bad effect.  Which test case(s)
> would you recommend trying?

Thanks!  A select-only pgbench with many clients (I used 256 upthread) and
pg_stat_statements.track_planning enabled seems to be a pretty easy way to
stress spinlocks.  The same test without track_planning enabled might also
be interesting.  I'm looking for a way to stress LWLocks, too, and I will
share here when I either find an existing test or write something of my
own.

-- 
nathan



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Nathan Bossart
On Wed, Jan 08, 2025 at 03:07:24PM -0500, Andres Freund wrote:
> Out of curiosity, have you measured whether this has a positive effect without
> pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use
> perform_spin_delay().

AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
used by LWLocks.  But I did retry my test from upthread without
pg_stat_statements and was surprised to find a reproducible 4-6%
regression.  I'm not seeing any obvious differences in perf, but I do see
that the thread for adding TAS_SPIN() for PPC mentions a regression at
lower contention levels [0].  Perhaps the non-locked test is failing often
enough to hurt performance in this case...  Whatever it is, it'll be mighty
frustrating to miss out on a >7x gain because of a 4% regression.

[0] https://postgr.es/m/26496.1325625436%40sss.pgh.pa.us

-- 
nathan



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Andres Freund
Hi,

On 2025-01-08 16:01:19 -0600, Nathan Bossart wrote:
> On Wed, Jan 08, 2025 at 03:07:24PM -0500, Andres Freund wrote:
> > Out of curiosity, have you measured whether this has a positive effect without
> > pg_stat_statements? I think it'll e.g. also affect lwlocks, as they also use
> > perform_spin_delay().
> 
> AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
> used by LWLocks.

Brainfart on my part, sorry. I was thinking of SPIN_DELAY() for a moment...


> But I did retry my test from upthread without pg_stat_statements and was
> surprised to find a reproducible 4-6% regression.

Uh, huh. I assume this was readonly pgbench with 256 clients just as you had
tested upthread? I don't think there's any hot spinlock meaningfully involved
in that workload?  A r/w workload is a different story, but upthread you
mentioned select-only.

Do you see any spinlock in profiles?



> I'm not seeing any obvious differences in perf, but I do see that the thread
> for adding TAS_SPIN() for PPC mentions a regression at lower contention
> levels [0].  Perhaps the non-locked test is failing often enough to hurt
> performance in this case...  Whatever it is, it'll be mighty frustrating to
> miss out on a >7x gain because of a 4% regression.

I don't think the explanation can be that simple - even with TAS_SPIN defined,
we do try to acquire the lock once without using TAS_SPIN:

#if !defined(S_LOCK)
#define S_LOCK(lock) \
    (TAS(lock) ? s_lock((lock), __FILE__, __LINE__, __func__) : 0)
#endif     /* S_LOCK */

Only s_lock() then uses TAS_SPIN(lock).
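
I.e. the contended path is roughly this (simplified from s_lock.c, going from
memory), so the non-locking test only changes what the retries do, after the
first atomic attempt has already failed:

int
s_lock(volatile slock_t *lock, const char *file, int line, const char *func)
{
    SpinDelayStatus delayStatus;

    init_spin_delay(&delayStatus, file, line, func);

    /* only the retries inside this loop go through TAS_SPIN() */
    while (TAS_SPIN(lock))
        perform_spin_delay(&delayStatus);

    finish_spin_delay(&delayStatus);

    return delayStatus.delays;
}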


I wonder if you're hitting an extreme case of binary-layout related effects?
I've never seen them at this magnitude though.  I'd suggest using either lld
or mold as linker and comparing the numbers for a few
-Wl,--shuffle-sections=$seed seed values.

Greetings,

Andres Freund



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Nathan Bossart
On Wed, Jan 08, 2025 at 05:25:24PM -0500, Andres Freund wrote:
> On 2025-01-08 16:01:19 -0600, Nathan Bossart wrote:
>> But I did retry my test from upthread without pg_stat_statements and was
>> surprised to find a reproducible 4-6% regression.
> 
> Uh, huh. I assume this was readonly pgbench with 256 clients just as you had
> tested upthread? I don't think there's any hot spinlock meaningfully involved
> in that workload?  A r/w workload is a different story, but upthread you
> mentioned select-only.
> 
> Do you see any spinlock in profiles?

Yes, this was using 256 clients.  Looking closer, I don't see anything
spinlock related anywhere near the top of perf.

>> I'm not seeing any obvious differences in perf, but I do see that the thread
>> for adding TAS_SPIN() for PPC mentions a regression at lower contention
>> levels [0].  Perhaps the non-locked test is failing often enough to hurt
>> performance in this case...  Whatever it is, it'll be mighty frustrating to
>> miss out on a >7x gain because of a 4% regression.
> 
> I don't think the explanation can be that simple - even with TAS_SPIN defined,
> we do try to acquire the lock once without using TAS_SPIN:
> 
> #if !defined(S_LOCK)
> #define S_LOCK(lock) \
>     (TAS(lock) ? s_lock((lock), __FILE__, __LINE__, __func__) : 0)
> #endif     /* S_LOCK */
> 
> Only s_lock() then uses TAS_SPIN(lock).

Ah, right.  FWIW I tried setting a cap on the number of times we do a
non-locked test, and the results still showed the regression, which seems
to match your intuition here.
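
(Sketch of what I mean by a cap; the exact form and constant I tried aren't
important, and this isn't the actual diff.  It assumes the TAS()/slock_t
definitions from s_lock.h:)

/* fall back to the real atomic swap after this many non-locking re-reads */
#define NONLOCKED_TEST_CAP 100      /* illustrative value only */

static inline int
tas_spin_capped(volatile slock_t *lock, int *nreads)
{
    if (*nreads < NONLOCKED_TEST_CAP && *(lock))
    {
        (*nreads)++;
        return 1;               /* looks taken; skip the atomic operation */
    }
    return TAS(lock);           /* attempt the atomic swap */
}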

> I wonder if you're hitting an extreme case of binary-layout related effects?
> I've never seen them at this magnitude though.  I'd suggest using either lld
> or mold as linker and comparing the numbers for a few
> -Wl,--shuffle-sections=$seed seed values.

Will do.

-- 
nathan



Re: use a non-locking initial test in TAS_SPIN on AArch64

From: Tom Lane
Nathan Bossart <nathandbossart@gmail.com> writes:
> AFAICT TAS_SPIN() is only used for s_lock(), which doesn't appear to be
> used by LWLocks.  But I did retry my test from upthread without
> pg_stat_statements and was surprised to find a reproducible 4-6%
> regression.

On what hardware?

I just spent an hour beating on my M4 Pro (the 14-core variant)
and could not detect any outside-the-noise effect of this patch,
with or without pg_stat_statements loaded.  There does seem to be
a small fraction-of-a-percent-ish benefit.  But the run-to-run
variation with 60-second "pgbench -S" tests is a couple of percent,
so I can't say that that's real.

I do feel pretty sure that the patch doesn't hurt on this
class of hardware.

            regards, tom lane