Thread: Excessive CPU usage in StandbyReleaseLocks()

Excessive CPU usage in StandbyReleaseLocks()

From
Thomas Munro
Date:
Hello hackers,

Andres Freund diagnosed a case of $SUBJECT in a customer's 9.6 system.
I've written a minimal reproducer and a prototype patch to address the
root cause.

The problem is that StandbyReleaseLocks() does a linear search of all
known AccessExclusiveLocks when a transaction ends.  Luckily, since
v10 (commit 9b013dc2) that is skipped for transactions that haven't
taken any AELs and aren't using 2PC, but that doesn't help all users.

It's fine if the AEL list is short, but if you do something that takes
a lot of AELs such as restoring a database with many tables or
truncating a lot of partitions while other transactions are in flight
then we start doing O(txrate * nlocks * nsubxacts) work and that can
hurt.
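
For reference, the current function is a single flat list that gets
scanned in full for every commit/abort record.  Simplified sketch
(DEBUG logging elided):

    static List *RecoveryLockList;  /* flat list of xl_standby_lock * */

    static void
    StandbyReleaseLocks(TransactionId xid)
    {
        ListCell   *cell,
                   *prev,
                   *next;

        /* Walk the whole list; release entries matching xid. */
        prev = NULL;
        for (cell = list_head(RecoveryLockList); cell; cell = next)
        {
            xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);

            next = lnext(cell);
            if (!TransactionIdIsValid(xid) || lock->xid == xid)
            {
                LOCKTAG     locktag;

                SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
                (void) LockRelease(&locktag, AccessExclusiveLock, true);
                RecoveryLockList = list_delete_cell(RecoveryLockList,
                                                    cell, prev);
                pfree(lock);
            }
            else
                prev = cell;
        }
    }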

The reproducer script I've attached creates one long-lived transaction
that acquires 6,000 AELs and takes a nap, while 48 connections run
trivial 2PC transactions (I was also able to reproduce the effect
without 2PC by creating a throw-away temporary table in every
transaction, but it was unreliable due to contention slowing
everything down).  For me, the standby's startup process becomes 100%
pegged, replay_lag begins to climb and perf says something like:

+   97.88%    96.96%  postgres  postgres            [.] StandbyReleaseLocks

The attached patch splits the AEL list into one list per xid and
sticks them in a hash table.  That makes perf say something like:

+    0.60%     0.00%  postgres  postgres            [.] StandbyReleaseLocks
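
The per-xid entry in that hash table is minimal; sketched:

    /* One entry per xid holding AccessExclusiveLocks. */
    typedef struct RecoveryLockListsEntry
    {
        TransactionId xid;      /* hash key */
        List         *locks;    /* list of xl_standby_lock pointers */
    } RecoveryLockListsEntry;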

This seems like something we'd want to back-patch because the problem
affects all branches (the older releases more severely because they
lack the above-mentioned optimisation).

Thoughts?

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Excessive CPU usage in StandbyReleaseLocks()

From
Andres Freund
Date:
Hi,

On 2018-06-19 17:43:42 +1200, Thomas Munro wrote:
> The problem is that StandbyReleaseLocks() does a linear search of all
> known AccessExclusiveLocks when a transaction ends.  Luckily, since
> v10 (commit 9b013dc2) that is skipped for transactions that haven't
> taken any AELs and aren't using 2PC, but that doesn't help all users.

> It's fine if the AEL list is short, but if you do something that takes
> a lot of AELs such as restoring a database with many tables or
> truncating a lot of partitions while other transactions are in flight
> then we start doing O(txrate * nlocks * nsubxacts) work and that can
> hurt.
> 
> The reproducer script I've attached creates one long-lived transaction
> that acquires 6,000 AELs and takes a nap, while 48 connections run
> trivial 2PC transactions (I was also able to reproduce the effect
> without 2PC by creating a throw-away temporary table in every
> transaction, but it was unreliable due to contention slowing
> everything down).  For me, the standby's startup process becomes 100%
> pegged, replay_lag begins to climb and perf says something like:
> 
> +   97.88%    96.96%  postgres  postgres            [.] StandbyReleaseLocks
> 
> The attached patch splits the AEL list into one list per xid and
> sticks them in a hash table.  That makes perf say something like:

> +    0.60%     0.00%  postgres  postgres            [.] StandbyReleaseLocks

Neato. Results aside, that seems like a better-suited datastructure.

>  /*
>   * InitRecoveryTransactionEnvironment
> @@ -63,6 +73,19 @@ void
>  InitRecoveryTransactionEnvironment(void)
>  {
>      VirtualTransactionId vxid;
> +    HASHCTL            hash_ctl;
> +
> +    /*
> +     * Initialize the hash table for tracking the list of locks held by each
> +     * transaction.
> +     */
> +    memset(&hash_ctl, 0, sizeof(hash_ctl));
> +    hash_ctl.keysize = sizeof(TransactionId);
> +    hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
> +    RecoveryLockLists = hash_create("RecoveryLockLists",
> +                                    4096,
> +                                    &hash_ctl,

Isn't that initial size needlessly big?


>  void
>  StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
>  {
> +    RecoveryLockListsEntry *entry;
>      xl_standby_lock *newlock;
>      LOCKTAG        locktag;
> +    bool        found;
>  
>      /* Already processed? */
>      if (!TransactionIdIsValid(xid) ||
> @@ -616,11 +641,19 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
>      /* dbOid is InvalidOid when we are locking a shared relation. */
>      Assert(OidIsValid(relOid));
>  
> +    /* Create a new list for this xid, if we don't have one already. */
> +    entry = hash_search(RecoveryLockLists, &xid, HASH_ENTER, &found);
> +    if (!found)
> +    {
> +        entry->xid = xid;
> +        entry->locks = NIL;
> +    }
> +
>      newlock = palloc(sizeof(xl_standby_lock));
>      newlock->xid = xid;
>      newlock->dbOid = dbOid;
>      newlock->relOid = relOid;
> -    RecoveryLockList = lappend(RecoveryLockList, newlock);
> +    entry->locks = lappend(entry->locks, newlock);

Gotta be careful with lappend et al - it's really easy to leak memory
without explicit context usage. Have you checked that we don't here?
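
If the current context isn't guaranteed to outlive the locks, the
usual pattern would be something like this (sketch):

    MemoryContext oldcxt;

    /* Pin the lock struct and its list cell to a long-lived context. */
    oldcxt = MemoryContextSwitchTo(TopMemoryContext);
    newlock = palloc(sizeof(xl_standby_lock));
    newlock->xid = xid;
    newlock->dbOid = dbOid;
    newlock->relOid = relOid;
    entry->locks = lappend(entry->locks, newlock);
    MemoryContextSwitchTo(oldcxt);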

>      SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
>  
> @@ -628,46 +661,45 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
>  }
>  
>  static void
> -StandbyReleaseLocks(TransactionId xid)
> +StandbyReleaseLockList(List *locks)
>  {
> -    ListCell   *cell,
> -               *prev,
> -               *next;
> -
> -    /*
> -     * Release all matching locks and remove them from list
> -     */
> -    prev = NULL;
> -    for (cell = list_head(RecoveryLockList); cell; cell = next)
> +    while (locks)
>      {
> -        xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
> +        xl_standby_lock *lock = (xl_standby_lock *) linitial(locks);
> +        LOCKTAG        locktag;
> +        elog(trace_recovery(DEBUG4),
> +             "releasing recovery lock: xid %u db %u rel %u",
> +             lock->xid, lock->dbOid, lock->relOid);
> +        SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
> +        if (!LockRelease(&locktag, AccessExclusiveLock, true))
> +            elog(LOG,
> +                 "RecoveryLockLists contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
> +                 lock->xid, lock->dbOid, lock->relOid);

This should be a PANIC imo.


Greetings,

Andres Freund


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Robert Haas
Date:
On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
> This should be a PANIC imo.

-1.  As a developer, I would prefer PANIC.  But as an end-user, I
would much rather have replay continue (with possible problems) than
have it stopped cold in its tracks with absolutely nothing that I as
the administrator can do to fix it.  We should be building this
product for end users.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Andres Freund
Date:
Hi,

On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
> > This should be a PANIC imo.
> 
> -1.  As a developer, I would prefer PANIC.  But as an end-user, I
> would much rather have replay continue (with possible problems) than
> have it stopped cold in its tracks with absolutely nothing that I as
> the administrator can do to fix it.  We should be building this
> product for end users.

Except that that just means the end-user will have an undebuggable
problem on their hands, which will affect them as well.

And they could just restart with hot_standby=off, and restart again. Or
even just restart without the GUC change, because that will rebuild the
locking state from a later state / start becoming ready for query at a
later stage.

Greetings,

Andres Freund


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Robert Haas
Date:
On Tue, Jun 19, 2018 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
>> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
>> > This should be a PANIC imo.
>>
>> -1.  As a developer, I would prefer PANIC.  But as an end-user, I
>> would much rather have replay continue (with possible problems) than
>> have it stopped cold in its tracks with absolutely nothing that I as
>> the administrator can do to fix it.  We should be building this
>> product for end users.
>
> Except that that just means the end-user will have an undebuggable
> problem on their hands, which will affect them as well.

I agree, but my guess is that a PANIC will affect them more.

> And they could just restart with hot_standby=off, and restart again. Or
> even just restart without the GUC change, because that will rebuild the
> locking state from a later state / start becoming ready for query at a
> later stage.

True, but that can still be a significant operational problem.

I don't expect you to agree with my vote, but I stand by it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Andres Freund
Date:
On 2018-06-19 13:05:48 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
> >> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > This should be a PANIC imo.
> >>
> >> -1.  As a developer, I would prefer PANIC.  But as an end-user, I
> >> would much rather have replay continue (with possible problems) than
> >> have it stopped cold in its tracks with absolutely nothing that I as
> >> the administrator can do to fix it.  We should be building this
> >> product for end users.
> >
> > Except that that just means the end-user will have an undebuggable
> > problem on their hands, which will affect them as well.
> 
> I agree, but my guess is that a PANIC will affect them more.

Sure, in the short term. I think our disagreement is based on a
difference in viewpoint: I think users are served better if a problem is
surfaced as close to its origin as possible, because that makes quick
diagnosis and a quick fix possible - even if that means sometimes
erroring out in cases where we could have just limped on without causing
noticeable problems.  You think that trying to limp along as long as
possible is good, because that removes immediate pain from the user (and
thereby keeps the user from redirecting the pain to you).


> I don't expect you to agree with my vote, but I stand by it.

Yea, I didn't expect (but hoped!) to change your mind... ;)

Greetings,

Andres Freund


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2018-06-19 13:05:48 -0400, Robert Haas wrote:
>> I don't expect you to agree with my vote, but I stand by it.

> Yea, I didn't expect (but hoped!) to change your mind... ;)

FWIW, I agree with Robert --- a PANIC here will certainly create
problems, and it's much less clear what problems it might prevent.

However, the trouble with LOG is that we would be unlikely to
notice such a message in testing, so that a test case triggering
an issue of this type would not get recognized as such.

Hence, I have a modest proposal: use elog(LOG) followed by Assert(false),
so that debug builds report such problems loudly but production builds
do not.  We've done that before, see e.g. load_relcache_init_file around
relcache.c:5718.
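
I.e., in the case at hand, roughly:

    elog(LOG,
         "RecoveryLockLists contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
         lock->xid, lock->dbOid, lock->relOid);
    Assert(false);      /* loud in assert builds, compiled out in production */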

            regards, tom lane


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Thomas Munro
Date:
Adding Simon and David R.

On Wed, Jun 20, 2018 at 5:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hence, I have a modest proposal: use elog(LOG) followed by Assert(false),
> so that debug builds report such problems loudly but production builds
> do not.  We've done that before, see e.g. load_relcache_init_file around
> relcache.c:5718.

Done.  (This elog(LOG) isn't new BTW, I merely moved it.)

On Tue, Jun 19, 2018 at 6:30 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-19 17:43:42 +1200, Thomas Munro wrote:
>> +     RecoveryLockLists = hash_create("RecoveryLockLists",
>> +                                     4096,
>> +                                     &hash_ctl,
>
> Isn't that initial size needlessly big?

Changed to 64.

>> -     RecoveryLockList = lappend(RecoveryLockList, newlock);
>> +     entry->locks = lappend(entry->locks, newlock);
>
> Gotta be careful with lappend et al - it's really easy to leak memory
> without explicit context usage. Have you checked that we don't here?

The list and contents are explicitly freed when xids end and the locks
are released (because the xid committed/aborted, or is older than the
known oldest running transaction, or we release all locks on promotion).  The
existing code seems to assume that TopMemoryContext is current (or
some context that'll last all the way until our promotion), which
seems OK to me.  TopMemoryContext is inherited from PostgresMain() and
any redo handler that changes it without restoring it would be buggy,
and errors are FATAL here.
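
Concretely, the per-xid teardown in the patch is roughly this
(logging elided):

    RecoveryLockListsEntry *entry;

    /* Find this xid's entry, release and free its locks, drop the entry. */
    entry = hash_search(RecoveryLockLists, &xid, HASH_FIND, NULL);
    if (entry != NULL)
    {
        while (entry->locks)
        {
            xl_standby_lock *lock = (xl_standby_lock *) linitial(entry->locks);
            LOCKTAG     locktag;

            SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
            (void) LockRelease(&locktag, AccessExclusiveLock, true);
            pfree(lock);
            entry->locks = list_delete_first(entry->locks);
        }
        hash_search(RecoveryLockLists, &xid, HASH_REMOVE, NULL);
    }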

I had missed one small thing though: I should call hash_destroy() in
ShutdownRecoveryTransactionEnvironment() to free the hash table itself
on promotion.  Fixed.

Why does StandbyReleaseLocks() handle an invalid xid by removing all
locks (if (!TransactionIdIsValid(xid) || lock->xid == xid)) in master?

V2 attached.

-- 
Thomas Munro
http://www.enterprisedb.com

Attachment

Re: Excessive CPU usage in StandbyReleaseLocks()

From
David Rowley
Date:
On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> The problem is that StandbyReleaseLocks() does a linear search of all
> known AccessExclusiveLocks when a transaction ends.  Luckily, since
> v10 (commit 9b013dc2) that is skipped for transactions that haven't
> taken any AELs and aren't using 2PC, but that doesn't help all users.

Good to see this getting fixed.  My original patch [1] to fix this was
more along the lines of yours, only I partitioned the List into an array
indexed by the xid mod the array size.  I had done this as I thought it
would be faster than a hash table and would likely see the locks
spread evenly over the table.  IIRC Andres complained and said I
should use a hashtable, which I see you've done.
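
For clarity, the array scheme was something like this hypothetical
sketch (the partition count is illustrative, not necessarily the value
my patch actually used):

    /* Fixed array of lock lists, selected by xid mod the array size. */
    #define STANDBY_LOCK_PARTITIONS 128     /* illustrative */

    static List *StandbyLockLists[STANDBY_LOCK_PARTITIONS];

    static inline List **
    StandbyLockListForXid(TransactionId xid)
    {
        return &StandbyLockLists[xid % STANDBY_LOCK_PARTITIONS];
    }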

> +   97.88%    96.96%  postgres  postgres            [.] StandbyReleaseLocks
>
> The attached patch splits the AEL list into one list per xid and
> sticks them in a hash table.  That makes perf say something like:
>
> +    0.60%     0.00%  postgres  postgres            [.] StandbyReleaseLocks

Yeah, I also saw similar with my patch.

> This seems like something we'd want to back-patch because the problem
> affects all branches (the older releases more severely because they
> lack the above-mentioned optimisation).
>
> Thoughts?

I do know this is causing quite a bit of trouble out in the wilds. I'd
be keen to not have to bear the brunt of any more of those troubles,
but it'll ultimately depend on how risky a patch looks.

[1] https://www.postgresql.org/message-id/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Thomas Munro
Date:
Hi David,

On Thu, Jun 21, 2018 at 12:25 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> The problem is that StandbyReleaseLocks() does a linear search of all
>> known AccessExclusiveLocks when a transaction ends.  Luckily, since
>> v10 (commit 9b013dc2) that is skipped for transactions that haven't
>> taken any AELs and aren't using 2PC, but that doesn't help all users.
>
> Good to see this getting fixed.  My original patch [1] to fix this was
> more along the lines of yours, only I partitioned the List into an array
> indexed by the xid mod the array size.  I had done this as I thought it
> would be faster than a hash table and would likely see the locks
> spread evenly over the table.  IIRC Andres complained and said I
> should use a hashtable, which I see you've done.

Oh!  I hadn't seen that.  I studied 9b013dc2 but it didn't have a link
to your discussion so I didn't know that your original proposal was
completely different to the commit and essentially the same as what
I'm proposing.  I think this is a more fundamental fix to the problem
so I don't understand why it went into the weeds.  I checked if there
was anything I could salvage from your patch but I think HTAB is the
better way to go here.  I will update the commit message to credit you
for the original analysis and point back to your thread.

Looking at your thread, I see that you were also confused about this:

> Although I'm not all
> that sure in which case the function will actually be called with an
> invalid xid.

Anyone?  In my patch I keep that behaviour, but I think it'd be great
to get a comment there to explain why the case is needed.

>> Thoughts?
>
> I do know this is causing quite a bit of trouble out in the wilds. I'd
> be keen to not have to bear the brunt of any more of those troubles,
> but it'll ultimately depend on how risky a patch looks.

Cool.  I'm thinking of pushing this early next week after some more
testing, unless there are objections.  If you have any feedback or
would like to try to poke holes in it, I'd much appreciate it.

-- 
Thomas Munro
http://www.enterprisedb.com


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Andres Freund
Date:
Hi,

On 2018-06-21 00:25:03 +1200, David Rowley wrote:
> On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > The problem is that StandbyReleaseLocks() does a linear search of all
> > known AccessExclusiveLocks when a transaction ends.  Luckily, since
> > v10 (commit 9b013dc2) that is skipped for transactions that haven't
> > taken any AELs and aren't using 2PC, but that doesn't help all users.
> 
> Good to see this getting fixed.  My original patch [1] to fix this was
> more along the lines of yours

From that discussion I don't really understand why that wasn't pursued
further. The revision committed clearly was just continuing to use the
wrong datastructure, and had obvious issues with complexity, just in a
somewhat narrower situation?


> only I partitioned the List into an array indexed by the xid mod the
> array size.  I had done this as I thought it would be faster than a hash
> table and would likely see the locks spread evenly over the table.
> IIRC Andres complained and said I should use a hashtable, which I see
> you've done.

It's hard to believe the hashtable is a meaningful bottleneck here. The
primary also uses a hashtable, except it's partitioned & shared, and
thus protected by locks. Also much larger. So it's hard to believe that
we'd need a custom-built datastructure to speed up replay...

Greetings,

Andres Freund


Re: Excessive CPU usage in StandbyReleaseLocks()

From
Thomas Munro
Date:
On Thu, Jun 21, 2018 at 10:29 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-21 00:25:03 +1200, David Rowley wrote:
>> On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> > The problem is that StandbyReleaseLocks() does a linear search of all
>> > known AccessExclusiveLocks when a transaction ends.  Luckily, since
>> > v10 (commit 9b013dc2) that is skipped for transactions that haven't
>> > taken any AELs and aren't using 2PC, but that doesn't help all users.
>>
>> Good to see this getting fixed.  My original patch [1] to fix this was
>> more along the lines of yours
>
> From that discussion I don't really understand why that wasn't pursued
> further.

I have now pushed this, and back-patched it all the way.  There are
three variants:

* master: as previously posted
* 9.5-10: StandbyReleaseOldLocks() works slightly differently
* 9.3-9.4: minor dynahash API change

The difference in StandbyReleaseOldLocks() is due to recent commit
15378c1a, which I gather was due to be back-patched to REL_10_STABLE
at some stage (so I guess the thing to do will be to copy master's new
StandbyReleaseOldLocks() to 10).  CC: Simon.

I torture tested this for a while on each branch, but of course any
post-commit review or testing would be very welcome.