Thread: Excessive CPU usage in StandbyReleaseLocks()
Hello hackers,

Andres Freund diagnosed a case of $SUBJECT in a customer's 9.6 system. I've written a minimal reproducer and a prototype patch to address the root cause.

The problem is that StandbyReleaseLocks() does a linear search of all known AccessExclusiveLocks when a transaction ends. Luckily, since v10 (commit 9b013dc2) that is skipped for transactions that haven't taken any AELs and aren't using 2PC, but that doesn't help all users. It's fine if the AEL list is short, but if you do something that takes a lot of AELs, such as restoring a database with many tables or truncating a lot of partitions while other transactions are in flight, then we start doing O(txrate * nlocks * nsubxacts) work and that can hurt.

The reproducer script I've attached creates one long-lived transaction that acquires 6,000 AELs and takes a nap, while 48 connections run trivial 2PC transactions (I was also able to reproduce the effect without 2PC by creating a throw-away temporary table in every transaction, but it was unreliable due to contention slowing everything down). For me, the standby's startup process becomes 100% pegged, replay_lag begins to climb, and perf says something like:

+   97.88%    96.96%  postgres  postgres  [.] StandbyReleaseLocks

The attached patch splits the AEL list into one list per xid and sticks them in a hash table. That makes perf say something like:

+    0.60%     0.00%  postgres  postgres  [.] StandbyReleaseLocks

This seems like something we'd want to back-patch because the problem affects all branches (the older releases more severely because they lack the above-mentioned optimisation).

Thoughts?

--
Thomas Munro
http://www.enterprisedb.com
Attachment
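To make the shape of the proposal concrete, here is a minimal sketch of the per-xid structure being described. The struct, table, and field names follow the patch excerpts quoted in the review below; StandbyReleaseLocksForXid is a simplified, hypothetical stand-in for the real release path, not code from the patch.

    typedef struct RecoveryLockListsEntry
    {
        TransactionId xid;      /* hash key: transaction that holds the AELs */
        List         *locks;    /* list of xl_standby_lock entries for this xid */
    } RecoveryLockListsEntry;

    static HTAB *RecoveryLockLists;

    /*
     * Hypothetical helper: release everything a single transaction holds.
     * One O(1) hash lookup replaces a scan over every known AEL.
     */
    static void
    StandbyReleaseLocksForXid(TransactionId xid)
    {
        RecoveryLockListsEntry *entry;

        entry = (RecoveryLockListsEntry *)
            hash_search(RecoveryLockLists, &xid, HASH_FIND, NULL);
        if (entry != NULL)
        {
            /* Per the quoted patch, this releases each lock and frees the list. */
            StandbyReleaseLockList(entry->locks);
            hash_search(RecoveryLockLists, &xid, HASH_REMOVE, NULL);
        }
    }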
Hi,

On 2018-06-19 17:43:42 +1200, Thomas Munro wrote:
> The problem is that StandbyReleaseLocks() does a linear search of all known AccessExclusiveLocks when a transaction ends. Luckily, since v10 (commit 9b013dc2) that is skipped for transactions that haven't taken any AELs and aren't using 2PC, but that doesn't help all users. It's fine if the AEL list is short, but if you do something that takes a lot of AELs such as restoring a database with many tables or truncating a lot of partitions while other transactions are in flight then we start doing O(txrate * nlocks * nsubxacts) work and that can hurt.
>
> The reproducer script I've attached creates one long-lived transaction that acquires 6,000 AELs and takes a nap, while 48 connections run trivial 2PC transactions (I was also able to reproduce the effect without 2PC by creating a throw-away temporary table in every transaction, but it was unreliable due to contention slowing everything down). For me, the standby's startup process becomes 100% pegged, replay_lag begins to climb and perf says something like:
>
> +   97.88%    96.96%  postgres  postgres  [.] StandbyReleaseLocks
>
> The attached patch splits the AEL list into one list per xid and sticks them in a hash table. That makes perf say something like:
>
> +    0.60%     0.00%  postgres  postgres  [.] StandbyReleaseLocks

Neato. Results aside, that seems like a better suited datastructure.

> /*
>  * InitRecoveryTransactionEnvironment
> @@ -63,6 +73,19 @@ void
>  InitRecoveryTransactionEnvironment(void)
>  {
>      VirtualTransactionId vxid;
> +    HASHCTL     hash_ctl;
> +
> +    /*
> +     * Initialize the hash table for tracking the list of locks held by each
> +     * transaction.
> +     */
> +    memset(&hash_ctl, 0, sizeof(hash_ctl));
> +    hash_ctl.keysize = sizeof(TransactionId);
> +    hash_ctl.entrysize = sizeof(RecoveryLockListsEntry);
> +    RecoveryLockLists = hash_create("RecoveryLockLists",
> +                                    4096,
> +                                    &hash_ctl,

Isn't that initial size needlessly big?

>  void
>  StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
>  {
> +    RecoveryLockListsEntry *entry;
>      xl_standby_lock *newlock;
>      LOCKTAG     locktag;
> +    bool        found;
>
>      /* Already processed? */
>      if (!TransactionIdIsValid(xid) ||
> @@ -616,11 +641,19 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
>      /* dbOid is InvalidOid when we are locking a shared relation. */
>      Assert(OidIsValid(relOid));
>
> +    /* Create a new list for this xid, if we don't have one already. */
> +    entry = hash_search(RecoveryLockLists, &xid, HASH_ENTER, &found);
> +    if (!found)
> +    {
> +        entry->xid = xid;
> +        entry->locks = NIL;
> +    }
> +
>      newlock = palloc(sizeof(xl_standby_lock));
>      newlock->xid = xid;
>      newlock->dbOid = dbOid;
>      newlock->relOid = relOid;
> -    RecoveryLockList = lappend(RecoveryLockList, newlock);
> +    entry->locks = lappend(entry->locks, newlock);

Gotta be careful with lappend et al - it's really easy to leak memory without explicit context usage. Have you checked that we don't here?
>      SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
>
> @@ -628,46 +661,45 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
>  }
>
>  static void
> -StandbyReleaseLocks(TransactionId xid)
> +StandbyReleaseLockList(List *locks)
>  {
> -    ListCell   *cell,
> -               *prev,
> -               *next;
> -
> -    /*
> -     * Release all matching locks and remove them from list
> -     */
> -    prev = NULL;
> -    for (cell = list_head(RecoveryLockList); cell; cell = next)
> +    while (locks)
>      {
> -        xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
> +        xl_standby_lock *lock = (xl_standby_lock *) linitial(locks);
> +        LOCKTAG     locktag;
> +        elog(trace_recovery(DEBUG4),
> +             "releasing recovery lock: xid %u db %u rel %u",
> +             lock->xid, lock->dbOid, lock->relOid);
> +        SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
> +        if (!LockRelease(&locktag, AccessExclusiveLock, true))
> +            elog(LOG,
> +                 "RecoveryLockLists contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
> +                 lock->xid, lock->dbOid, lock->relOid);

This should be a PANIC imo.

Greetings,

Andres Freund
On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
> This should be a PANIC imo.

-1. As a developer, I would prefer PANIC. But as an end-user, I would much rather have replay continue (with possible problems) than have it stopped cold in its tracks with absolutely nothing that I as the administrator can do to fix it. We should be building this product for end users.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
> > This should be a PANIC imo.
>
> -1. As a developer, I would prefer PANIC. But as an end-user, I would much rather have replay continue (with possible problems) than have it stopped cold in its tracks with absolutely nothing that I as the administrator can do to fix it. We should be building this product for end users.

Except that that just means the end-user will have an undebuggable problem at their hand. Which will affect them as well.

And they could just restart with hot_standby=off, and restart again. Or even just restart without the GUC change, because that will rebuild the locking state from a later state / start becoming ready for query at a later stage.

Greetings,

Andres Freund
On Tue, Jun 19, 2018 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
>> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
>> > This should be a PANIC imo.
>>
>> -1. As a developer, I would prefer PANIC. But as an end-user, I would much rather have replay continue (with possible problems) than have it stopped cold in its tracks with absolutely nothing that I as the administrator can do to fix it. We should be building this product for end users.
>
> Except that that just means the end-user will have an undebuggable problem at their hand. Which will affect them as well.

I agree, but my guess is that a PANIC will affect them more.

> And they could just restart with hot_standby=off, and restart again. Or even just restart without the GUC change, because that will rebuild the locking state from a later state / start becoming ready for query at a later stage.

True, but that can still be a sufficient operational problem. I don't expect you to agree with my vote, but I stand by it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-06-19 13:05:48 -0400, Robert Haas wrote:
> On Tue, Jun 19, 2018 at 1:01 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2018-06-19 10:45:04 -0400, Robert Haas wrote:
> >> On Tue, Jun 19, 2018 at 2:30 AM, Andres Freund <andres@anarazel.de> wrote:
> >> > This should be a PANIC imo.
> >>
> >> -1. As a developer, I would prefer PANIC. But as an end-user, I would much rather have replay continue (with possible problems) than have it stopped cold in its tracks with absolutely nothing that I as the administrator can do to fix it. We should be building this product for end users.
> >
> > Except that that just means the end-user will have an undebuggable problem at their hand. Which will affect them as well.
>
> I agree, but my guess is that a PANIC will affect them more.

Sure, in the short term. I think our disagreement is based on a different viewpoint: I think users are served better if a problem is surfaced as close to the origin as possible, because that makes quick diagnosis and a fix possible - even if that means sometimes erroring out in cases where we could have just limped on without causing noticed problems. You think that trying to limp along as long as possible is good, because that removes immediate pain from the user (and then prevents the user from redirecting the pain to you).

> I don't expect you to agree with my vote, but I stand by it.

Yea, I didn't expect (but hoped!) to change your mind... ;)

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2018-06-19 13:05:48 -0400, Robert Haas wrote:
>> I don't expect you to agree with my vote, but I stand by it.

> Yea, I didn't expect (but hoped!) to change your mind... ;)

FWIW, I agree with Robert --- a PANIC here will certainly create problems, and it's much less clear what problems it might prevent.

However, the trouble with LOG is that we would be unlikely to notice such a message in testing, so that a test case triggering an issue of this type would not get recognized as such. Hence, I have a modest proposal: use elog(LOG) followed by Assert(false), so that debug builds report such problems loudly but production builds do not. We've done that before, see e.g. load_relcache_init_file around relcache.c:5718.

regards, tom lane
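Concretely, that proposal applied to the release path quoted upthread would look roughly like the sketch below; this is an illustration of the pattern, not the committed wording.

    if (!LockRelease(&locktag, AccessExclusiveLock, true))
    {
        /*
         * Complain loudly, but only fail hard in assert-enabled builds;
         * production builds keep replaying.
         */
        elog(LOG,
             "RecoveryLockLists contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
             lock->xid, lock->dbOid, lock->relOid);
        Assert(false);
    }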
Adding Simon and David R.

On Wed, Jun 20, 2018 at 5:44 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Hence, I have a modest proposal: use elog(LOG) followed by Assert(false), so that debug builds report such problems loudly but production builds do not. We've done that before, see e.g. load_relcache_init_file around relcache.c:5718.

Done. (This elog(LOG) isn't new BTW, I merely moved it.)

On Tue, Jun 19, 2018 at 6:30 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-19 17:43:42 +1200, Thomas Munro wrote:
>> +    RecoveryLockLists = hash_create("RecoveryLockLists",
>> +                                    4096,
>> +                                    &hash_ctl,
>
> Isn't that initial size needlessly big?

Changed to 64.

>> -    RecoveryLockList = lappend(RecoveryLockList, newlock);
>> +    entry->locks = lappend(entry->locks, newlock);
>
> Gotta be careful with lappend et al - it's really easy to leak memory without explicit context usage. Have you checked that we don't here?

The list and contents are explicitly freed when xids end and the locks are released (because xid committed/aborted, or older than known oldest running transaction, or we release all on promotion). The existing code seems to assume that TopMemoryContext is current (or some context that'll last all the way until our promotion), which seems OK to me. TopMemoryContext is inherited from PostgresMain() and any redo handler that changes it without restoring it would be buggy, and errors are FATAL here.

I had missed one small thing though: I should call hash_destroy() in ShutdownRecoveryTransactionEnvironment() to free the hash table itself on promotion. Fixed.

Why does StandbyReleaseLocks() handle an invalid xid by removing all locks (if (!TransactionIdIsValid(xid) || lock->xid == xid)) in master?

V2 attached.

--
Thomas Munro
http://www.enterprisedb.com
Attachment
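For reference, a sketch of the promotion-time cleanup described above. The existing calls and their ordering in ShutdownRecoveryTransactionEnvironment() are assumptions based on the function's known duties (expiring tracked xids, releasing remaining recovery locks, cleaning up the startup process's virtual xact); only the hash_destroy() step is the new part being discussed, and the committed code may place it differently.

    void
    ShutdownRecoveryTransactionEnvironment(void)
    {
        /* Mark all tracked in-progress transactions as finished. */
        ExpireAllKnownAssignedTransactionIds();

        /* Release any recovery locks still tracked for those transactions. */
        StandbyReleaseAllLocks();

        /* New step: free the per-xid lock-list hash table itself. */
        hash_destroy(RecoveryLockLists);
        RecoveryLockLists = NULL;

        /* Cleanup our VirtualTransaction. */
        VirtualXactLockTableCleanup();
    }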
On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> The problem is that StandbyReleaseLocks() does a linear search of all known AccessExclusiveLocks when a transaction ends. Luckily, since v10 (commit 9b013dc2) that is skipped for transactions that haven't taken any AELs and aren't using 2PC, but that doesn't help all users.

Good to see this getting fixed. My original patch [1] to fix this was more along the lines of yours, only I partitioned the List in an array indexed by the xid mod size of array (see the sketch after this message). I had done this as I thought it would be faster than a hash table and would likely see the locks spread evenly over the table. IIRC Andres complained and said I should use a hashtable, which I see you've done.

> +   97.88%    96.96%  postgres  postgres  [.] StandbyReleaseLocks
>
> The attached patch splits the AEL list into one list per xid and sticks them in a hash table. That makes perf say something like:
>
> +    0.60%     0.00%  postgres  postgres  [.] StandbyReleaseLocks

Yeah, I also saw similar with my patch.

> This seems like something we'd want to back-patch because the problem affects all branches (the older releases more severely because they lack the above-mentioned optimisation).
>
> Thoughts?

I do know this is causing quite a bit of trouble out in the wilds. I'd be keen to not have to bear the brunt of any more of those troubles, but it'll ultimately depend on how risky a patch looks.

[1] https://www.postgresql.org/message-id/CAKJS1f9vJ841HY%3DwonnLVbfkTWGYWdPN72VMxnArcGCjF3SywA%40mail.gmail.com

--
David Rowley
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
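For comparison, the array-partitioning scheme described above amounts to something like the following sketch; the partition count and all names here are illustrative, not taken from that earlier patch.

    /* Illustrative only: neither the partition count nor these names come from the original patch. */
    #define RECOVERY_LOCK_PARTITIONS 128

    static List *RecoveryLockPartitions[RECOVERY_LOCK_PARTITIONS];

    /* All locks taken by a given xid live in the list chosen by this mapping. */
    static inline List **
    RecoveryLockPartitionForXid(TransactionId xid)
    {
        return &RecoveryLockPartitions[xid % RECOVERY_LOCK_PARTITIONS];
    }

Whether such a fixed array actually beats a dynahash table is the question Andres raises two messages down.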
Hi David,

On Thu, Jun 21, 2018 at 12:25 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> The problem is that StandbyReleaseLocks() does a linear search of all known AccessExclusiveLocks when a transaction ends. Luckily, since v10 (commit 9b013dc2) that is skipped for transactions that haven't taken any AELs and aren't using 2PC, but that doesn't help all users.
>
> Good to see this getting fixed. My original patch [1] to fix this was more along the lines of yours, only I partitioned the List in an array indexed by the xid mod size of array. I had done this as I thought it would be faster than a hash table and would likely see the locks spread evenly over the table. IIRC Andres complained and said I should use a hashtable, which I see you've done.

Oh! I hadn't seen that. I studied 9b013dc2 but it didn't have a link to your discussion so I didn't know that your original proposal was completely different to the commit and essentially the same as what I'm proposing. I think this is a more fundamental fix to the problem so I don't understand why it went into the weeds. I checked if there was anything I could salvage from your patch but I think HTAB is the better way to go here. I will update the commit message to credit you for the original analysis and point back to your thread.

Looking at your thread, I see that you were also confused about this:

> Although I'm not all that sure in which case the function will actually be called with an invalid xid.

Anyone? In my patch I keep that behaviour, but I think it'd be great to get a comment there to explain why the case is needed.

>> Thoughts?
>
> I do know this is causing quite a bit of trouble out in the wilds. I'd be keen to not have to bear the brunt of any more of those troubles, but it'll ultimately depend on how risky a patch looks.

Cool. I'm thinking of pushing this early next week after some more testing, unless there are objections. If you have any feedback or would like to try to poke holes in it, I'd much appreciate it.

--
Thomas Munro
http://www.enterprisedb.com
Hi,

On 2018-06-21 00:25:03 +1200, David Rowley wrote:
> On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> > The problem is that StandbyReleaseLocks() does a linear search of all known AccessExclusiveLocks when a transaction ends. Luckily, since v10 (commit 9b013dc2) that is skipped for transactions that haven't taken any AELs and aren't using 2PC, but that doesn't help all users.
>
> Good to see this getting fixed. My original patch [1] to fix this was more along the lines of yours

From that discussion I don't really understand why that wasn't pursued further. The revision committed clearly was just continuing to use the wrong datastructure, and had obvious issues with complexity, just in a somewhat narrower situation?

> only I partitioned the List in an array indexed by the xid mod size of array. I had done this as I thought it would be faster than a hash table and would likely see the locks spread evenly over the table. IIRC Andres complained and said I should use a hashtable, which I see you've done.

It's hard to believe the hashtable is a meaningful bottleneck here. The primary also uses a hashtable, except it's partitioned & shared, and thus protected by locks. Also much larger. So it's hard to believe that we'd need a custom built datastructure to speed up replay...

Greetings,

Andres Freund
On Thu, Jun 21, 2018 at 10:29 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-06-21 00:25:03 +1200, David Rowley wrote:
>> On 19 June 2018 at 17:43, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> > The problem is that StandbyReleaseLocks() does a linear search of all known AccessExclusiveLocks when a transaction ends. Luckily, since v10 (commit 9b013dc2) that is skipped for transactions that haven't taken any AELs and aren't using 2PC, but that doesn't help all users.
>>
>> Good to see this getting fixed. My original patch [1] to fix this was more along the lines of yours
>
> From that discussion I don't really understand why that wasn't pursued further.

I have now pushed this, and back-patched it all the way. There are three variants:

* master: as previously posted
* 9.5-10: StandbyReleaseOldLocks() works slightly differently
* 9.3-9.4: minor dynahash API change

The difference in StandbyReleaseOldLocks() is due to recent commit 15378c1a, which I gather was due to be back-patched to REL_10_STABLE at some stage (so I guess the thing to do will be to copy master's new StandbyReleaseOldLocks() to 10). CC: Simon.

I torture tested this for a while on each branch, but of course any post-commit review or testing would be very welcome.