From: Amit Kapila
Subject: Re: Reduce ProcArrayLock contention
Msg-id: CAA4eK1KHHmLM+Ti8KYruwNojAVkVgHKf_Fqc1NZ8vxT8-U31Zg@mail.gmail.com
In response to: Re: Reduce ProcArrayLock contention (Andres Freund <andres@anarazel.de>)

On Wed, Aug 19, 2015 at 9:09 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> > On Wed, Aug 5, 2015 at 10:59 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > OK, committed.
>
> I spent some time today reviewing the committed patch. So far my only
> major complaint is that the comments insufficiently document the
> approach taken:
> Stuff like avoiding ABA-type problems by clearing the list entirely and
> it being impossible that entries end up on the list too early absolutely
> needs to be documented explicitly.
>

I think we can add more comments to explain this behaviour if it is
not clear from the current code and comments.
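
For instance, a comment along the following lines could be added near
ProcArrayGroupClearXid() (the wording below is only a sketch, not final
text):

        /*
         * Note on correctness (sketch):
         *
         * ABA-type problems with the lock-free list are avoided because the
         * leader always detaches and processes the list in its entirety,
         * rather than popping individual entries.  An entry also cannot end
         * up on the list "too early": a backend only re-enqueues itself
         * after the leader has cleared its XID and woken it up.
         */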

> I think I found a few minor correctness issues. Mostly around the fact
> that we, so far, tried to use semaphores in a way that copes with
> unrelated unlocks "arriving". I actually think I removed all of the
> locations that caused that to happen, but I don't think we should rely
> on that fact. Looking at the following two pieces of code:
>         /* If the list was not empty, the leader will clear our XID. */
>         if (nextidx != INVALID_PGPROCNO)
>         {
>                 /* Sleep until the leader clears our XID. */
>                 while (pg_atomic_read_u32(&proc->nextClearXidElem) != INVALID_PGPROCNO)
>                 {
>                         extraWaits++;
>                         PGSemaphoreLock(&proc->sem);
>                 }
>
>                 /* Fix semaphore count for any absorbed wakeups */
>                 while (extraWaits-- > 0)
>                         PGSemaphoreUnlock(&proc->sem);
>                 return;
>         }
> ...
>         /*
>          * Now that we've released the lock, go back and wake everybody up.  We
>          * don't do this under the lock so as to keep lock hold times to a
>          * minimum.  The system calls we need to perform to wake other processes
>          * up are probably much slower than the simple memory writes we did while
>          * holding the lock.
>          */
>         while (wakeidx != INVALID_PGPROCNO)
>         {
>                 PGPROC  *proc = &allProcs[wakeidx];
>
>                 wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
>                 pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);
>
>                 if (proc != MyProc)
>                         PGSemaphoreUnlock(&proc->sem);
>         }
>
> There's a bunch of issues with those two blocks afaics:
>
> 1) The first block (in one backend) might read INVALID_PGPROCNO before
>    ever locking the semaphore if a second backend quickly enough writes
>    INVALID_PGPROCNO. That way the semaphore will be permanently out of
>    "balance".
>

I think you are right, and here we need to use something like what you
have suggested below.  Originally the code was similar to what you have
written, but it used a separate (new) variable to achieve what you
achieve with lwWaiting, and the code was refactored into its current
form to avoid introducing that new variable.  I think we should make
this change (I can write a patch) unless Robert feels otherwise.
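
For illustration, the earlier shape referred to above would have looked
roughly like the following, with a hypothetical boolean flag (called
procArrayGroupMember here purely for this sketch) playing the role that
lwWaiting plays in the protocol you sketch further down:

        /* Sketch only; procArrayGroupMember is a hypothetical per-PGPROC flag. */
        if (nextidx != INVALID_PGPROCNO)
        {
                for (;;)
                {
                        /* Absorb at least the leader's wakeup before checking the flag. */
                        PGSemaphoreLock(&proc->sem);
                        if (!proc->procArrayGroupMember)
                                break;
                        extraWaits++;
                }

                /* Fix semaphore count for any absorbed wakeups */
                while (extraWaits-- > 0)
                        PGSemaphoreUnlock(&proc->sem);
                return;
        }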

> 2) There's no memory barriers around dealing with nextClearXidElem in
>    the first block. Afaics there needs to be a read barrier before
>    returning, otherwise it's e.g. not guaranteed that the woken up
>    backend sees its own xid set to InvalidTransactionId.
>
> 3) If a follower returns before the leader has actually finished waking
>    the respective backend up, we can get into trouble:
>

That can certainly lead to a problem, and I think the cause and the
fix are the same as for the first point.

>    Consider what happens if such a follower enqueues in another
>    transaction. It is not, as far as I could find out, guaranteed on all
>    types of cpus that a third backend can actually see nextClearXidElem
>    as INVALID_PGPROCNO. That'd likely require SMT/HT cores and multiple
>    sockets. If the write to nextClearXidElem is entered into the local
>    store buffer (leader #1) a hyper-threaded process (leader #2) can
>    possibly see it (store forwarding) while another backend doesn't
>    yet.
>
>    I think this is very unlikely to be an actual problem due to
>    independent barriers until the backend is enqueued again, but I don't
>    want to rely on that without documenting it. It seems safer to replace
>    +            wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
>    +            pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);
>    with a pg_atomic_exchange_u32().
>

I didn't follow this point.  If we ensure that a follower can never
return before the leader wakes it up, why would updating
nextClearXidElem as above be a problem?
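
Concretely, the replacement you suggest would make the wake-up loop look
roughly like this (sketch only; pg_atomic_exchange_u32() returns the
previous value of the variable, so the link fetch and the reset happen
in a single atomic step):

        while (wakeidx != INVALID_PGPROCNO)
        {
                PGPROC  *proc = &allProcs[wakeidx];

                /* fetch next link and reset our entry atomically */
                wakeidx = pg_atomic_exchange_u32(&proc->nextClearXidElem,
                                                 INVALID_PGPROCNO);

                if (proc != MyProc)
                        PGSemaphoreUnlock(&proc->sem);
        }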

>
> I think to fix these ProcArrayGroupClearXid() should use a protocol
> similar to lwlock.c. E.g. make the two blocks something like:
>         while (wakeidx != INVALID_PGPROCNO)
>         {
>                 PGPROC  *proc = &allProcs[wakeidx];
>
>                 wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
>                 pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);
>
>                 /* ensure that all previous writes are visible before follower continues */
>                 pg_write_barrier();
>
>                 proc->lwWaiting = false;
>
>                 if (proc != MyProc)
>                         PGSemaphoreUnlock(&proc->sem);
>         }
> and
>         if (nextidx != INVALID_PGPROCNO)
>         {
>                 Assert(!MyProc->lwWaiting);
>
>                 for (;;)
>                 {
>                         /* acts as a read barrier */
>                         PGSemaphoreLock(&MyProc->sem);
>                         if (!MyProc->lwWaiting)
>                                 break;
>                         extraWaits++;
>                 }
>
>                 Assert(pg_atomic_read_u32(&proc->nextClearXidElem) == INVALID_PGPROCNO);
>
>                 /* Fix semaphore count for any absorbed wakeups */
>                 while (extraWaits-- > 0)
>                         PGSemaphoreUnlock(&proc->sem);
>                 return;
>         }
>
> Going through the patch:
>
> +/*
> + * ProcArrayGroupClearXid -- group XID clearing
> + *
> + * When we cannot immediately acquire ProcArrayLock in exclusive mode at
> + * commit time, add ourselves to a list of processes that need their XIDs
> + * cleared.  The first process to add itself to the list will acquire
> + * ProcArrayLock in exclusive mode and perform ProcArrayEndTransactionInternal
> + * on behalf of all group members.  This avoids a great deal of context
> + * switching when many processes are trying to commit at once, since the lock
> + * only needs to be handed from the last share-locker to one process waiting
> + * for the exclusive lock, rather than to each one in turn.
> + */
> +static void
> +ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
> +{
>
> This comment, in my opinion, is rather misleading. If you have a
> workload that is primarily slowed down by transaction commits, this
> patch doesn't actually change the number of context switches very
> much. Previously all backends enqueued in the lwlock and got woken up
> one-by-one. Except for backends 'jumping' the queue while the lock has
> been released but the woken-up backend isn't yet running, there'll be
> exactly as many context switches as today.
>

I think this is debatable: in the previous mechanism every backend
cleared its transaction id one-by-one (which is nothing but context
switching), and that led to contention not only with read lockers but
with write lockers as well.

> The difference is that only one backend has to actually acquire the
> lock. So what has changed is the number of times, and the total
> duration, the lock is actually held in exclusive mode.
>
> +       /* Support for group XID clearing. */
> +       volatile pg_atomic_uint32       nextClearXidElem;
>
> +       /* First pgproc waiting for group XID clear */
> +       volatile pg_atomic_uint32 nextClearXidElem;
>
> Superfluous volatiles.
>
> I don't think it's a good idea to use the same variable name in PROC_HDR
> and PGPROC, it's confusing.
>

What do you mean by this?  Are you not happy with the variable name?
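
If the concern is that the same identifier names two different things
(the list head in PROC_HDR and the per-process link in PGPROC), one
purely illustrative way to disambiguate, dropping the superfluous
volatiles at the same time, would be:

        /* In PROC_HDR: head of the list of procs waiting for group XID clearing */
        pg_atomic_uint32        firstClearXidElem;      /* illustrative name */

        /* In PGPROC: link to the next proc on that list */
        pg_atomic_uint32        nextClearXidElem;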

>
>
> How hard did you try checking whether this causes regressions? This
> increases the number of atomics in the commit path a fair bit. I doubt
> it's really bad, but it seems like a good idea to benchmark something
> like a single full-throttle writer and a large number of readers.
>

I think the case you want to stress is the one where the patch has no
benefit (like the single-writer case) and only adds instructions due to
the atomic ops.  For lower client counts like 2, 4, and 8, it will
hardly get any benefit and will execute somewhat more instructions, but
I don't see any noticeable difference in the numbers.  Still, it is
worth trying what you are suggesting.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
