Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA - Mailing list pgsql-hackers

From Sokolov Yura
Subject Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
Date
Msg-id cace5dc9bfec2ae206db56968fe3575e@postgrespro.ru
In response to Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA  (Andres Freund <andres@anarazel.de>)
Responses Re: [HACKERS] Fix performance degradation of contended LWLock on NUMA
List pgsql-hackers
Hello,

On 2017-10-19 19:46, Andres Freund wrote:
> On 2017-10-19 14:36:56 +0300, Sokolov Yura wrote:
>> > > +    init_local_spin_delay(&delayStatus);
>> >
>> > The way you moved this around has the disadvantage that we now do this -
>> > a number of writes - even in the very common case where the lwlock can
>> > be acquired directly.
>> 
>> Excuse me, I don't quite understand.
>> Are you complaining about init_local_spin_delay being placed here?
> 
> Yes.

I could place it before perform_spin_delay under `if (!spin_inited)` if
you think it is absolutely necessary.
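
Something like the following sketch (just to show the placement I mean,
not the exact patch code; the loop around it is simplified):

    /*
     * Sketch only, not the exact patch code: initialize the spin-delay
     * state lazily, so the common path that acquires the lock on the
     * first attempt never writes delayStatus.
     */
    SpinDelayStatus delayStatus;
    bool        spin_inited = false;

    for (;;)
    {
        /* LWLockAttemptLock returns false when the lock was acquired */
        if (!LWLockAttemptLock(lock, mode))
            break;

        if (!spin_inited)
        {
            init_local_spin_delay(&delayStatus);
            spin_inited = true;
        }
        perform_spin_delay(&delayStatus);

        /* ... eventually queue onto the wait list instead of spinning ... */
    }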

> 
> 
>> Placing it anywhere else will complicate the code.
> 
> 
>> Or are you complaining about setting `mask` and `add`?
> 
> That seems right.
> 
> 
>> In both cases, I think the simpler version should be accepted first.
>> It acts as the definition of the algorithm, and it already gives a
>> measurable improvement.
> 
> Well, in scalability. I'm less sure about uncontended performance.
> 
> 
> 
>> > > +     * We intentionally do not call finish_spin_delay here, because the loop
>> > > +     * above usually finished by queuing into the wait list on contention, and
>> > > +     * doesn't reach spins_per_delay thereby doesn't sleep inside of
>> > > +     * perform_spin_delay. Also, different LWLocks has very different
>> > > +     * contention pattern, and it is wrong to update spin-lock statistic based
>> > > +     * on LWLock contention.
>> > > +     */
>> >
>> > Huh? This seems entirely unconvincing. Without adjusting this here we'll
>> > just spin the same way every iteration. Especially for the case where
>> > somebody else holds LW_FLAG_LOCKED that's quite bad.
>> 
>> LWLocks are very different. Some of them are always short-term
>> (BufferLock), others are always locked for a long time.
> 
> That seems not particularly relevant. The same is true for spinlocks.
> The relevant question isn't how long the lwlock is held, it's how long
> LW_FLAG_LOCKED is held - and that should only depend on contention
> (i.e. bus speed, amount of times put into sleep while holding lock,
> etc), not on how long the lock is held.
> 
>> I've tried to place this delay into lock itself (it has 2 free bytes),
>> but this attempt performed worse.
> 
> That seems unsurprising - there's a *lot* of locks, and we'd have to
> tune all of them. Additionally there's a bunch of platforms where we do
> *not* have free bytes (consider the embedding in BufferTag).
> 
> 
>> Now I understand that delays should be stored in an array indexed by
>> tranche. But I have no time to test this idea. And I doubt it will give
>> radically better results (i.e. > 5%), so I think it is better to accept
>> the patch in this form, and then experiment with per-tranche delay.
> 
> I don't think tranches have any decent predictive power here.

Look at the patch "Make acquiring LWLock to look more like spinlock":
for the first `skip_wait_list` iterations there is no attempt to queue
onto the wait list, and that gives most of the improvement. (Without it
there is simply no degradation at a high number of clients, but a small
decline at low client counts, because the current algorithm is also a
little "spinny": it attempts to acquire the lock again after queuing
itself onto the wait list.)
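
Roughly, the policy is as follows (a simplified sketch rather than the
actual patch code; the return values and loop structure are
illustrative, while LWLockAttemptLock, perform_spin_delay and
LWLockQueueSelf are the existing lwlock.c / s_lock.h helpers):

    int     attempt;

    for (attempt = 0;; attempt++)
    {
        if (!LWLockAttemptLock(lock, mode))
            return false;           /* acquired without queuing */

        if (attempt < skip_wait_list)
        {
            /* do not queue yet: back off a little and retry the CAS */
            perform_spin_delay(&delayStatus);
            continue;
        }

        /* contention persisted long enough: queue self onto the wait list */
        LWLockQueueSelf(lock, mode);
        return true;                /* caller sleeps on the semaphore */
    }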

`skip_wait_list` depends very much on the tranche. A per-tranche
`skip_wait_list` should be calculated from the ability to acquire the
lock without any sleep, i.e. without either queuing onto the wait list
or falling into `pg_usleep` inside `perform_spin_delay`. I suppose it is
obvious that `spins_per_delay` has to be proportional to
`skip_wait_list` (it should at least be greater than `skip_wait_list`).
Therefore `spins_per_delay` should also be runtime-dependent on the
lock's tranche.

Am I wrong?

Without that per-tranche auto-tuning, I think it is better not to touch
`spins_per_delay` inside `LWLockAttemptLockOrQueue`.
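
If we later experiment with that, I imagine the per-tranche state could
be as simple as this (purely hypothetical, not part of the current
patch):

    /*
     * Purely hypothetical per-tranche tuning state (not in the patch):
     * both knobs adapted at runtime per tranche instead of the single
     * process-wide spins_per_delay.
     */
    typedef struct LWLockTrancheDelay
    {
        int     skip_wait_list;     /* attempts before queuing onto the wait list */
        int     spins_per_delay;    /* kept proportional to skip_wait_list */
    } LWLockTrancheDelay;

    /* indexed by tranche id; sized here only for the built-in tranches */
    static LWLockTrancheDelay tranche_delay[LWTRANCHE_FIRST_USER_DEFINED];

The lookup at the top of the acquire path would then be just
`tranche_delay[lock->tranche]`.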

With regards,
-- 
Sokolov Yura aka funny_falcon
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
