Re: Reduce ProcArrayLock contention - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Reduce ProcArrayLock contention
Date
Msg-id CA+TgmobiSW7Ox7zVHHmT8__th+GL4WJM9guzjzk4bAm=8Bt3Hw@mail.gmail.com
In response to Re: Reduce ProcArrayLock contention  (Andres Freund <andres@anarazel.de>)
Responses Re: Reduce ProcArrayLock contention  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Wed, Aug 19, 2015 at 11:39 AM, Andres Freund <andres@anarazel.de> wrote:
> There's a bunch of issues with those two blocks afaics:
>
> 1) The first block (in one backend) might read INVALID_PGPROCNO before
>    ever locking the semaphore if a second backend quickly enough writes
>    INVALID_PGPROCNO. That way the semaphore will be permanently out of
>    "balance".

Yes, this is a clear bug.  I think the fix is to do one unconditional
PGSemaphoreLock(&proc->sem) just prior to the loop.
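
To make that concrete, here's a minimal sketch of what I have in mind
(names follow the patch; I'm ignoring, for brevity, any bookkeeping
that might be needed for wakeups absorbed on behalf of other waits):

/* Consume the wakeup the leader is guaranteed to send us, even if it
 * already wrote INVALID_PGPROCNO before we got here. */
PGSemaphoreLock(&proc->sem);

/* Then absorb any further wakeups until our XID is actually cleared. */
while (pg_atomic_read_u32(&proc->nextClearXidElem) != INVALID_PGPROCNO)
    PGSemaphoreLock(&proc->sem);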

> 2) There's no memory barriers around dealing with nextClearXidElem in
>    the first block. Afaics there needs to be a read barrier before
>    returning, otherwise it's e.g. not guaranteed that the woken up
>    backend sees its own xid set to InvalidTransactionId.

I can't believe it actually works like that.  Surely a semaphore
operation is a full barrier.  Otherwise this could fail:

P1: a = 0;
P1: PGSemaphoreLock(&P1);
P2: a = 1;
P2: PGSemaphoreUnlock(&P1);
P1: Assert(a == 1);

Do you really think that fails anywhere?  I'd find that shocking.  And
if it's true, then let's put the barrier in PGSemaphore(L|Unl)ock,
because everybody who uses semaphores for anything is going to have
the same problem.  This is exactly what semaphores are for.  I find it
extremely hard to believe that user-level code that uses OS-provided
synchronization primitives has to insert its own barriers also.  I
think we can assume that if I do something and wake you up, you'll see
everything I did before waking you.

(This would of course be needed if we didn't add the PGSemaphoreLock
contemplated by the previous point, because then there'd be no
OS-level synchronization primitive in some cases.  But since we have
to add that anyway, I think this is a non-issue.)
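
(Purely for illustration: if some platform's semaphore operations
really turned out not to be full barriers, I'd want to fix that
centrally with something like the hypothetical wrappers below, rather
than sprinkling barriers at call sites.  The *Barriered names are made
up for this sketch.)

static void
PGSemaphoreUnlockBarriered(PGSemaphore sema)
{
    pg_memory_barrier();        /* publish our prior writes... */
    PGSemaphoreUnlock(sema);    /* ...before the sleeper can wake */
}

static void
PGSemaphoreLockBarriered(PGSemaphore sema)
{
    PGSemaphoreLock(sema);      /* wait for the wakeup... */
    pg_memory_barrier();        /* ...then observe the waker's writes */
}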

> 3) If a follower returns before the leader has actually finished waking
>    that respective backend up we can get into trouble:
>
>    Consider what happens if such a follower enqueues in another
>    transaction. It is not, as far as I could find out, guaranteed on all
>    types of cpus that a third backend can actually see nextClearXidElem
>    as INVALID_PGPROCNO. That'd likely require SMT/HT cores and multiple
>    sockets. If the write to nextClearXidElem is entered into the local
>    store buffer (leader #1) a hyper-threaded process (leader #2) can
>    possibly see it (store forwarding) while another backend doesn't
>    yet.

Let's call the process that does two commits P, and its PGPROC
structure proc.  For the first commit, L1 is the leader; for the
second, L2 is the leader.  The intended sequence of operations on
nextClearXidElem is:

(1) P sets proc->nextClearXidElem
(2) L1 clears proc->nextClearXidElem
(3) P sets proc->nextClearXidElem
(4) L2 clears proc->nextClearXidElem

P uses an atomic compare-and-exchange operation on
procglobal->nextClearXidElem after step (1), and L1 can't attempt step
(2) until it uses an atomic compare-and-exchange operation on
procglobal->nextClearXidElem -- that's how it finds out the identity
of P.  Since a compare-and-exchange operation is a full barrier, those
two compare-and-exchange operations form a barrier pair, and (2) must
happen after (1).
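
(Sketched in code, P's side of that looks roughly like this -- names
as in the patch, where both the PROC_HDR and PGPROC fields are called
nextClearXidElem.  Note that pg_atomic_compare_exchange_u32 rewrites
nextidx with the current head on failure, so the loop simply redoes
step (1) with the new value:)

uint32      nextidx;

/* Step (1) plus the CAS that publishes it: link ourselves in as the
 * new head of the global list. */
nextidx = pg_atomic_read_u32(&procglobal->nextClearXidElem);
do
{
    pg_atomic_write_u32(&proc->nextClearXidElem, nextidx);
} while (!pg_atomic_compare_exchange_u32(&procglobal->nextClearXidElem,
                                         &nextidx,
                                         (uint32) proc->pgprocno));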

Synchronization between (2) and (3) is based on the assumption that P
must do PGSemaphoreLock and L1 must do PGSemaphoreUnlock.  I assume,
as noted above, that those are barriers, that they form a pair, and
thus (3) must happen after (2).

(4) must happen after (3) for the same reasons (2) must happen after (1).

So I don't see the problem.  The "third backend", which I guess is L2,
doesn't need to observe proc->nextClearXidElem until P has reset it;
and procglobal->nextClearXidElem is only manipulated with atomics, so
access to that location had better be totally ordered.
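
(The leader's side, sketched the same way; allProcs here stands in for
the ProcGlobal proc array.  The property that matters for the ordering
argument above is that each follower's nextClearXidElem is reset
before its semaphore is unlocked:)

/* Detach the entire list of waiters atomically, then wake each one. */
uint32      wakeidx = pg_atomic_exchange_u32(&procglobal->nextClearXidElem,
                                             INVALID_PGPROCNO);

while (wakeidx != INVALID_PGPROCNO)
{
    PGPROC     *follower = &allProcs[wakeidx];

    wakeidx = pg_atomic_read_u32(&follower->nextClearXidElem);
    pg_atomic_write_u32(&follower->nextClearXidElem, INVALID_PGPROCNO);

    if (follower != MyProc)
        PGSemaphoreUnlock(&follower->sem);      /* step (2): wake P */
}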

> Going through the patch:
>
> +/*
> + * ProcArrayGroupClearXid -- group XID clearing
> + *
> + * When we cannot immediately acquire ProcArrayLock in exclusive mode at
> + * commit time, add ourselves to a list of processes that need their XIDs
> + * cleared.  The first process to add itself to the list will acquire
> + * ProcArrayLock in exclusive mode and perform ProcArrayEndTransactionInternal
> + * on behalf of all group members.  This avoids a great deal of context
> + * switching when many processes are trying to commit at once, since the lock
> + * only needs to be handed from the last share-locker to one process waiting
> + * for the exclusive lock, rather than to each one in turn.
> + */
> +static void
> +ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
> +{
>
> This comment, in my opinion, is rather misleading. If you have a
> workload that primarily is slowed down due to transaction commits, this
> patch doesn't actually change the number of context switches very
> much. Previously all backends enqueued in the lwlock and got woken up
> one-by-one. Save for backends 'jumping' the queue while the lock has
> been released but the woken-up backend isn't yet running, there'll be
> exactly as many context switches as today.
>
> The difference is that only one backend has to actually acquire the
> lock. So what has changed is the number of times, and the total
> duration, the lock is actually held in exclusive mode.

I don't mind if you want to improve the phrasing of the comment.  I
think my thinking was warped by versions of the patch that used
LWLockAcquireOrWait.  In those versions, every time somebody
committed, all of the waiting backends woke up and fought over who got
to be leader.  That was a lot less good than this approach.  So the
comment *should* be pointing out that we have avoided that danger, but
it should not be making it sound like that's the advantage vs. not
having the mechanism at all.

> +       /* Support for group XID clearing. */
> +       volatile pg_atomic_uint32       nextClearXidElem;
>
> +       /* First pgproc waiting for group XID clear */
> +       volatile pg_atomic_uint32 nextClearXidElem;
>
> Superfluous volatiles.

Hmm, I thought I needed that to avoid compiler warnings, but I guess
not.  And I see that lwlock.h doesn't mention volatile.  So, yeah, we
can remove that.

> I don't think it's a good idea to use the variable name in PROC_HDR and
> PGPROC, it's confusing.

It made sense to me, but I don't care that much if you want to change it.

> How hard did you try checking whether this causes regressions? This
> increases the number of atomics in the commit path a fair bit. I doubt
> it's really bad, but it seems like a good idea to benchmark something
> like a single full-throttle writer and a large number of readers.

Hmm, that's an interesting point.  My analysis was that it really only
increased the atomics in the cases where we otherwise would have gone
to sleep, and I figured that the extra atomics were a small price to
pay for not sleeping.  One writer many readers is a good case to test,
though, because the new code will get used a lot but we'll never
actually manage a group commit.  So that is worth testing.

It's worth pointing out, though, that your lwlock.c atomics changes have
the same effect.  Before, if there were any shared lockers on an
LWLock and an exclusive-locker came along, the exclusive locker would
take and release the spinlock and go to sleep.  Now, it will use CAS
to try to acquire the lock, then acquire and release the spinlock to
add itself to the wait queue, then use CAS to attempt the lock again,
and then go to sleep - unless of course the second lock acquisition
succeeds, in which case it will first acquire and release the spinlock
yet again to take itself back out of the list of waiters.
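
(In code, the sequence I'm describing looks roughly like this, using
lwlock.c's internal helpers; the real logic in LWLockAcquire is a loop
with considerably more detail, but this is the shape of it.
LWLockAttemptLock returns true when the CAS failed and we must wait:)

if (!LWLockAttemptLock(lock, LW_EXCLUSIVE))     /* CAS attempt #1 */
    return;                                     /* acquired, no sleep */

LWLockQueueSelf(lock, LW_EXCLUSIVE);            /* spinlock to enqueue */

if (!LWLockAttemptLock(lock, LW_EXCLUSIVE))     /* CAS attempt #2 */
{
    LWLockDequeueSelf(lock);                    /* spinlock again, to
                                                 * leave the wait queue */
    return;
}

PGSemaphoreLock(&MyProc->sem);                  /* sleep until woken */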

The changes we made to the clock sweep are more of the same.  Instead
of taking an lwlock, sweeping the clock hand across many buffers, and
releasing the lwlock, we now perform an atomic operation for every
buffer.  That's obviously "worse" in terms of the total number of
atomic operations, and if you construct a case where every backend
sweeps 10 or 20 buffers yet none of them would have contended with
each other, it might lose.  But it's actually much better and more
scalable in the real world.
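
(Again as a sketch, in the style of ClockSweepTick() in freelist.c:
one atomic fetch-add per buffer examined, with no lock held while
other backends sweep concurrently.  Wraparound of the shared counter
itself is handled separately in the real code:)

static inline uint32
clock_sweep_tick(void)
{
    uint32      victim;

    /* Claim the next clock position with a single atomic operation. */
    victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1);

    return victim % NBuffers;
}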

I think this is a pattern we'll see with atomics over and over again:
more atomic operations overall in order to reduce the length of time
for which particular resources are unavailable to other processes.
Obviously, we need to be careful about that, and we need to be
particularly careful not to regress the single-client performance,
which is one thing that I liked about this patch - if ProcArrayLock
can be taken in exclusive mode immediately, there's no difference vs.
older versions.  So I think in general this is going to shift pretty
smoothly from the old behavior in low client-count situations to the
group behavior in high client-count situations, but it does seem
possible that if you have, say, 1 writer and 100 readers you could lose.
I don't think you'll lose much, but let's test that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


