Thread: [BUGS] BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
[BUGS] BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting@huawei.com
Date:
The following bug has been logged on the website: Bug reference: 14680 Logged by: chuanting wang Email address: wangchuanting@huawei.com PostgreSQL version: 10beta1 Operating system: SuSE Description: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog. the stack is: Thread 6 (Thread 0x7f002e3ff700 (LWP 35305)): #0 0x00007f1106167627 in semop () from /lib64/libc.so.6 #1 0x0000000000ce5622 in PGSemaphoreLock(PGSemaphoreData*, bool) () #2 0x0000000000d73e4d in LWLockAcquire(LWLockId, LWLockMode) () #3 0x0000000000969028 in RemoveGXact(GlobalTransactionData*) () #4 0x000000000096b254 in ProcessTwoPhaseBuffer(unsigned int, XLogRecPtr, bool, bool, bool) () #5 0x000000000096b6b8 in PrescanPreparedTransactions(unsigned int**, int*) () #6 0x000000000097af97 in xlog_redo(XLogReaderState*) () #7 0x00000000009839df in StartupXLOG() () #8 0x0000000000cffd0b in StartupProcessMain() () #9 0x0000000000991539 in AuxiliaryProcessMain(int, char**) () #10 0x0000000000cfb209 in SubPostmasterMain(int, char**) () #11 0x0000000000cfb4f2 in MainStarterThreadFunc(void*) () #12 0x00007f11099a17b6 in start_thread () from /lib64/libpthread.so.0 #13 0x00007f1106165d6d in clone () from /lib64/libc.so.6 #14 0x0000000000000000 in ?? () the reason is: when do PrescanPreparedTransactions, startup process already granted a shared lock on TwoPhaseStateLock, then it called RemoveGXact, but in RemoveGXact it acquired exclusive lock on TwoPhaseStateLock, so it blocked by itself anyone, has some idea on how to fix it? -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] BUG #14680: startup process on standby encounter a deadlock of TwoPhaseStateLock when redo 2PC xlog
From
Tom Lane
Date:
wangchuanting@huawei.com writes: > startup process on standby encounter a deadlock of TwoPhaseStateLock when > redo 2PC xlog. Please provide an example of how to get into this state. regards, tom lane -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > wangchuanting@huawei.com writes: >> startup process on standby encounter a deadlock of TwoPhaseStateLock when >> redo 2PC xlog. > > Please provide an example of how to get into this state. That would help. Are you seeing in the logs something like "removing future two-phase state from memory for XXX" or "removing stale two-phase state from shared memory for XXX"? Even with that, the light-weight lock sequence taken in those code paths look definitely wrong to me, we should not take twice TwoPhaseStateLock in the same code path. I think that we should remove the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then upgrade the locks of PrescanPreparedTransactions() and StandbyRecoverPreparedTransactions() to be exclusive. We still need to keep a lock as CheckPointTwoPhase() may still be triggered by the checkpoint. Tom, what do you think? -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting
Date:
we use Postgres-XC+pg10(4 coordinator + 4 datanode(pg 10 as datanode, and 1 master 1 standby for each datanode)), and benchmark tpcc, there is some cross datanode transactions that use 2pc, during testing, we restart the cluster, then one datanode standby can not recovery done and hangup with TwoPhaseStateLock deadlock. and sorry, we reinstall the cluster, and the log is removed, we will try to reproduce, but anyway, the code is not right like Michael saied. -- View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964241.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Wed, May 31, 2017 at 12:30 PM, Michael Paquier <michael.paquier@gmail.com> wrote: > On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> wangchuanting@huawei.com writes: >>> startup process on standby encounter a deadlock of TwoPhaseStateLock when >>> redo 2PC xlog. >> >> Please provide an example of how to get into this state. > > That would help. Are you seeing in the logs something like "removing > future two-phase state from memory for XXX" or "removing stale > two-phase state from shared memory for XXX"? > > Even with that, the light-weight lock sequence taken in those code > paths look definitely wrong to me, we should not take twice > TwoPhaseStateLock in the same code path. I think that we should remove > the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then > upgrade the locks of PrescanPreparedTransactions() and > StandbyRecoverPreparedTransactions() to be exclusive. We still need to > keep a lock as CheckPointTwoPhase() may still be triggered by the > checkpoint. Tom, what do you think? Attached is what I was thinking about for reference. I just came back from a long flight and I am pretty tired, so my brain may have missed something. I'll take again a look at this issue on Monday, an open item has been added for now. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Thu, Jun 1, 2017 at 12:11 AM, wangchuanting <wangchuanting@huawei.com> wrote: > we use Postgres-XC+pg10(4 coordinator + 4 datanode(pg 10 as datanode, and 1 > master 1 standby for each datanode)), and benchmark tpcc, there is some > cross datanode transactions that use 2pc, during testing, we restart the > cluster, then one datanode standby can not recovery done and hangup with > TwoPhaseStateLock deadlock. (Former Postgres-XC maintainer here) Are you aware of the fact that this is not going to work? Postgres protocol has been extended between coordinators and datanodes to be able to push down transaction ID, snapshot as well as timestamps when running transactions across nodes. So by using a community Postgres as a datanode you break the global consistency of the cluster. There are also a couple of other things proper to datanodes. > and sorry, we reinstall the cluster, and the log is removed, we will try to > reproduce, but anyway, the code is not right like Michael said. On that I agree. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting
Date:
>So by using a community Postgres as >a datanode you break the global consistency of the cluster. There are >also a couple of other things proper to datanodes. sorry, i confuse you, actually we merged "Speedup 2PC recovery by skipping two phase state files in normal path" to pgxc. -- View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964398.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting
Date:
Hi Michael, appreciate for your quick response. i look carefully on the patch, because of we removed TwoPhaseStateLock lwlock acquire in `RemoveGXact' and let caller held lwlock, so i think: 1. xact_redo also need held lwlock before call PrepareRedoRemove 2. RecoverPreparedTransactions also need held lwlock before call ProcessTwoPhaseBuffer -- View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964407.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
Noah Misch
Date:
On Thu, Jun 01, 2017 at 01:07:53AM -0700, Michael Paquier wrote: > On Wed, May 31, 2017 at 12:30 PM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> wangchuanting@huawei.com writes: > >>> startup process on standby encounter a deadlock of TwoPhaseStateLock when > >>> redo 2PC xlog. > >> > >> Please provide an example of how to get into this state. > > > > That would help. Are you seeing in the logs something like "removing > > future two-phase state from memory for XXX" or "removing stale > > two-phase state from shared memory for XXX"? > > > > Even with that, the light-weight lock sequence taken in those code > > paths look definitely wrong to me, we should not take twice > > TwoPhaseStateLock in the same code path. I think that we should remove > > the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then > > upgrade the locks of PrescanPreparedTransactions() and > > StandbyRecoverPreparedTransactions() to be exclusive. We still need to > > keep a lock as CheckPointTwoPhase() may still be triggered by the > > checkpoint. Tom, what do you think? > > Attached is what I was thinking about for reference. I just came back > from a long flight and I am pretty tired, so my brain may have missed > something. I'll take again a look at this issue on Monday, an open > item has been added for now. [Action required within three days. This is a generic notification.] The above-described topic is currently a PostgreSQL 10 open item. Simon, since you committed the patch believed to have created it, you own this open item. If some other commit is more relevant or if this does not belong as a v10 open item, please let us know. Otherwise, please observe the policy on open item ownership[1] and send a status update within three calendar days of this message. Include a date for your subsequent status update. Testers may discover new open items at any time, and I want to plan to get them all fixed well in advance of shipping v10. Consequently, I will appreciate your efforts toward speedy resolution. Thanks. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Fri, Jun 2, 2017 at 10:59 AM, wangchuanting <wangchuanting@huawei.com> wrote: > Appreciate for your quick response. Back into business for this issue. > i look carefully on the patch, because of we removed TwoPhaseStateLock > lwlock acquire in `RemoveGXact' and let caller held lwlock, so i think: > 1. xact_redo also need held lwlock before call PrepareRedoRemove > 2. RecoverPreparedTransactions also need held lwlock before call > ProcessTwoPhaseBuffer Thanks for the input. I have reviewed all the code paths that have been modified, and strengthened the code with assertions using LWLockHeldByMeInMode() to make sure that the correct lock is always hold in those code paths. There is actually no point in holding the lock in restoreTwoPhaseData(), but as this makes the code less consistent with the rest I added one. Also, I have replaced the lock acquisition in PrepareRedoAdd() with acquisitions at higher levels, and added an assertion in this routine. This makes the 2PC state data addition and removal more consistent with each other. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Mon, Jun 5, 2017 at 7:24 AM, Noah Misch <noah@leadboat.com> wrote: > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. I have just provided a patch that takes care of solving this issue and cleans up the lock handling for all the 2PC redo code paths. -- Michael
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting
Date:
Hi, Michael I don't understand why the patch remove TwoPhaseStateLock in MarkAsPrepared. if so: 1. does it need remove lock acquire in MarkAsPreparing also? 2. MarkAsPrepared will call ProcArrayAdd, and in ProcArrayAdd it acquires ProcArrayLock, we should carefully handle the lock sequence about these two locks. -- View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5964796.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Mon, Jun 5, 2017 at 8:32 PM, wangchuanting <wangchuanting@huawei.com> wrote: > I don't understand why the patch remove TwoPhaseStateLock in MarkAsPrepared. This one is on purpose, no low-level routines in this patch acquire TwoPhaseStateLock. Logically I agree that it does not change much but... You wrote something below about that. > if so: > 1. does it need remove lock acquire in MarkAsPreparing also? No, MarkAsPreparing() is used only in the non-redo code path. > 2. MarkAsPrepared will call ProcArrayAdd, and in ProcArrayAdd it acquires > ProcArrayLock, we should carefully handle the lock sequence about these two > locks. And this gives a reason to not complicate our lives. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
Noah Misch
Date:
On Sun, Jun 04, 2017 at 10:24:30PM +0000, Noah Misch wrote: > On Thu, Jun 01, 2017 at 01:07:53AM -0700, Michael Paquier wrote: > > On Wed, May 31, 2017 at 12:30 PM, Michael Paquier > > <michael.paquier@gmail.com> wrote: > > > On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > >> wangchuanting@huawei.com writes: > > >>> startup process on standby encounter a deadlock of TwoPhaseStateLock when > > >>> redo 2PC xlog. > > >> > > >> Please provide an example of how to get into this state. > > > > > > That would help. Are you seeing in the logs something like "removing > > > future two-phase state from memory for XXX" or "removing stale > > > two-phase state from shared memory for XXX"? > > > > > > Even with that, the light-weight lock sequence taken in those code > > > paths look definitely wrong to me, we should not take twice > > > TwoPhaseStateLock in the same code path. I think that we should remove > > > the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then > > > upgrade the locks of PrescanPreparedTransactions() and > > > StandbyRecoverPreparedTransactions() to be exclusive. We still need to > > > keep a lock as CheckPointTwoPhase() may still be triggered by the > > > checkpoint. Tom, what do you think? > > > > Attached is what I was thinking about for reference. I just came back > > from a long flight and I am pretty tired, so my brain may have missed > > something. I'll take again a look at this issue on Monday, an open > > item has been added for now. > > [Action required within three days. This is a generic notification.] > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > since you committed the patch believed to have created it, you own this open > item. If some other commit is more relevant or if this does not belong as a > v10 open item, please let us know. Otherwise, please observe the policy on > open item ownership[1] and send a status update within three calendar days of > this message. Include a date for your subsequent status update. Testers may > discover new open items at any time, and I want to plan to get them all fixed > well in advance of shipping v10. Consequently, I will appreciate your efforts > toward speedy resolution. Thanks. > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com This PostgreSQL 10 open item is past due for your status update. Kindly send a status update within 24 hours, and include a date for your subsequent status update. Refer to the policy on open item ownership: https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Fri, Jun 9, 2017 at 3:17 PM, Noah Misch <noah@leadboat.com> wrote: > This PostgreSQL 10 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. Refer to the policy on open item ownership: > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I have sent an updated patch simplifying the locking here: https://www.postgresql.org/message-id/CAB7nPqQthLP9GvD2242epHKOBkDMd+03tSuFvK3cVZsGarQyWA@mail.gmail.com -- Michael
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
Noah Misch
Date:
On Thu, Jun 08, 2017 at 11:17:38PM -0700, Noah Misch wrote: > On Sun, Jun 04, 2017 at 10:24:30PM +0000, Noah Misch wrote: > > On Thu, Jun 01, 2017 at 01:07:53AM -0700, Michael Paquier wrote: > > > On Wed, May 31, 2017 at 12:30 PM, Michael Paquier > > > <michael.paquier@gmail.com> wrote: > > > > On Wed, May 31, 2017 at 6:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > > >> wangchuanting@huawei.com writes: > > > >>> startup process on standby encounter a deadlock of TwoPhaseStateLock when > > > >>> redo 2PC xlog. > > > >> > > > >> Please provide an example of how to get into this state. > > > > > > > > That would help. Are you seeing in the logs something like "removing > > > > future two-phase state from memory for XXX" or "removing stale > > > > two-phase state from shared memory for XXX"? > > > > > > > > Even with that, the light-weight lock sequence taken in those code > > > > paths look definitely wrong to me, we should not take twice > > > > TwoPhaseStateLock in the same code path. I think that we should remove > > > > the lock acquisitions in RemoveGXact() and PrepareRedoRemove, and then > > > > upgrade the locks of PrescanPreparedTransactions() and > > > > StandbyRecoverPreparedTransactions() to be exclusive. We still need to > > > > keep a lock as CheckPointTwoPhase() may still be triggered by the > > > > checkpoint. Tom, what do you think? > > > > > > Attached is what I was thinking about for reference. I just came back > > > from a long flight and I am pretty tired, so my brain may have missed > > > something. I'll take again a look at this issue on Monday, an open > > > item has been added for now. > > > > [Action required within three days. This is a generic notification.] > > > > The above-described topic is currently a PostgreSQL 10 open item. Simon, > > since you committed the patch believed to have created it, you own this open > > item. If some other commit is more relevant or if this does not belong as a > > v10 open item, please let us know. Otherwise, please observe the policy on > > open item ownership[1] and send a status update within three calendar days of > > this message. Include a date for your subsequent status update. Testers may > > discover new open items at any time, and I want to plan to get them all fixed > > well in advance of shipping v10. Consequently, I will appreciate your efforts > > toward speedy resolution. Thanks. > > > > [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > This PostgreSQL 10 open item is past due for your status update. Kindly send > a status update within 24 hours, and include a date for your subsequent status > update. Refer to the policy on open item ownership: > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due for your status update. Please reacquaint yourself with the policy on open item ownership[1] and then reply immediately. If I do not hear from you by 2017-06-11 07:00 UTC, I will transfer this item to release management team ownership without further notice. [1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
Noah Misch wrote: > IMMEDIATE ATTENTION REQUIRED. This PostgreSQL 10 open item is long past due > for your status update. Please reacquaint yourself with the policy on open > item ownership[1] and then reply immediately. If I do not hear from you by > 2017-06-11 07:00 UTC, I will transfer this item to release management team > ownership without further notice. I volunteer to own this item. I'll report on Wednesday 14th at the latest. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
Michael Paquier wrote: > > i look carefully on the patch, because of we removed TwoPhaseStateLock > > lwlock acquire in `RemoveGXact' and let caller held lwlock, so i think: > > 1. xact_redo also need held lwlock before call PrepareRedoRemove > > 2. RecoverPreparedTransactions also need held lwlock before call > > ProcessTwoPhaseBuffer > > Thanks for the input. I have reviewed all the code paths that have > been modified, and strengthened the code with assertions using > LWLockHeldByMeInMode() to make sure that the correct lock is always > hold in those code paths. There is actually no point in holding the > lock in restoreTwoPhaseData(), but as this makes the code less > consistent with the rest I added one. Also, I have replaced the lock > acquisition in PrepareRedoAdd() with acquisitions at higher levels, > and added an assertion in this routine. This makes the 2PC state data > addition and removal more consistent with each other. Thanks for the patch -- I agree that the new coding looks better. I also agree that restoreTwoPhaseData doesn't need to hold the lock, since we don't expect anything running concurrently, but that it's okay to hold it and makes the whole thing simpler. We could try to apply an equivalent rationale to RecoverPreparedTransactions: even though we have already been running in HOT standby mode for a while, there's no possibility of concurrent activity either: since we exited recovery, master cannot write any more 2PC xacts, and we haven't yet set the flag that lets open sessions write WAL. However, it seems mildly dangerous to run the bottom sections of the loop with the lock held, since it acquires other lwlocks and generally does a lot of crap. Also, let's add an lwlock-is-held assertion to MarkAsPreparingGuts. BTW: I think that saving one redundant line of code is not worth the ugliness that it costs in xact_redo, so let's undo that. The patch is a bit bulky, but the resulting code is simpler. In short, I propose the attached. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
Alvaro Herrera wrote: For some reason I sent a preliminary version of the email I was writing. Some additional thoughts that were supposed to be in there: > I also agree that restoreTwoPhaseData doesn't need to hold the lock, > since we don't expect anything running concurrently, but that it's okay > to hold it and makes the whole thing simpler. It's okay to hold the lock for the whole duration of the function. > We could try to apply an equivalent rationale to > RecoverPreparedTransactions: even though we have already been running in > HOT standby mode for a while, there's no possibility of concurrent > activity either: since we exited recovery, master cannot write any more > 2PC xacts, and we haven't yet set the flag that lets open sessions write > WAL. However, it seems mildly dangerous to run the bottom sections of > the loop with the lock held, since it acquires other lwlocks and > generally does a lot of crap. Therefore, let's adopt your idea of acquiring the lock only for the specific low-level calls that require it. But the comment atop the loop needs a lot of work to explain why it's doing that (i.e. how come it's reading TwoPhaseState without a lock), as in my proposed patch. (At first, I didn't really like this very much and wanted to add more locking to that function, but it turned quite messy, so I back-tracked). Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
wangchuanting wrote: > we use Postgres-XC+pg10(4 coordinator + 4 datanode(pg 10 as datanode, and 1 > master 1 standby for each datanode)), and benchmark tpcc, there is some > cross datanode transactions that use 2pc, during testing, we restart the > cluster, then one datanode standby can not recovery done and hangup with > TwoPhaseStateLock deadlock. > > and sorry, we reinstall the cluster, and the log is removed, we will try to > reproduce, but anyway, the code is not right like Michael saied. Did you have any luck reproducing the original issue? It would be good if you could confirm that the bug is gone with the patch I posted in https://www.postgresql.org/message-id/20170611022536.goef4cdbmgicye5g@alvherre.pgsql Thanks -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Sun, Jun 11, 2017 at 12:24 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Alvaro Herrera wrote: > > For some reason I sent a preliminary version of the email I was writing. > Some additional thoughts that were supposed to be in there: No problem. Thanks for the input. >> I also agree that restoreTwoPhaseData doesn't need to hold the lock, >> since we don't expect anything running concurrently, but that it's okay >> to hold it and makes the whole thing simpler. > > It's okay to hold the lock for the whole duration of the function. Yup. >> We could try to apply an equivalent rationale to >> RecoverPreparedTransactions: even though we have already been running in >> HOT standby mode for a while, there's no possibility of concurrent >> activity either: since we exited recovery, master cannot write any more >> 2PC xacts, and we haven't yet set the flag that lets open sessions write >> WAL. However, it seems mildly dangerous to run the bottom sections of >> the loop with the lock held, since it acquires other lwlocks and >> generally does a lot of crap. > > Therefore, let's adopt your idea of acquiring the lock only for the > specific low-level calls that require it. But the comment atop the loop > needs a lot of work to explain why it's doing that (i.e. how come it's > reading TwoPhaseState without a lock), as in my proposed patch. > > (At first, I didn't really like this very much and wanted to add more > locking to that function, but it turned quite messy, so I back-tracked). Yes, I got the same temptation but give up as well because of the unnecessary complications. GXactLoadSubxactData() does nothing fancy, but my worries are in ProcessRecords() and PostPrepare_Twophase(), particularly if they do more complicated and fancy stuff in the future for whatever reason. The comment you have added is this one: /* - * Don't need a lock in the recovery phase. + * It's okay to access TwoPhaseState without a lock here: recovery is + * finished (so if we were a standby, there's no master that can prepare + * transactions anymore), and we haven't yet set WAL as open for writes, + * so local existing backends, if any, cannot do so either. We could use a + * coding pattern similar to restoreTwoPhaseData, i.e., run the whole loop + * with the lock held; but this loop is far more complex, so instead only + * grab the lock while calling the low-level functions that require it. */ I have reworked the comment as follows: /* - * Don't need a lock in the recovery phase. + * It is fine to access TwoPhaseState without a lock here: recovery is + * finished (so if we were a standby, there's no master that can prepare + * transactions anymore), and we haven't yet set WAL as open for writes, + * so local existing backends, if any, cannot do so either. We could use a + * coding pattern similar to restoreTwoPhaseData, i.e., run the whole loop + * with the lock held; but this loop is far more complex, so instead only + * grab the lock while calling the low-level functions working directly on + * manipulating the two-phase state data. Functions working directly on + * PGPROC entries linked with the two-phase transaction work with other + * types of locks but we don't want to complicate that more than necessary. */ The v5 attached does not have any other changes than what you did in v4 (plus typos noticed). - if (info == XLOG_XACT_COMMIT || info == XLOG_XACT_COMMIT_PREPARED) + if (info == XLOG_XACT_COMMIT) { xl_xact_commit *xlrec = (xl_xact_commit *) XLogRecGetData(record); xl_xact_parsed_commit parsed; No objections to the shuffling done here. + * An exclusive lock on TwoPhaseStateLock is taken for the duration of the + * scan of TwoPhaseState data as its may be updated. Typo noticed here => s/its/it/. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting
Date:
> Did you have any luck reproducing the original issue? sorry, not reproduced -- View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5966050.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Mon, Jun 12, 2017 at 3:25 PM, wangchuanting <wangchuanting@huawei.com> wrote: >> Did you have any luck reproducing the original issue? > > sorry, not reproduced I am a bit confused. Do you mean that you applied the patch and did not see the issue again? Or do you mean that you did not have the time to check it? -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
[BUGS] Re: BUG #14680: startup process on standby encounter a deadlock ofTwoPhaseStateLock when redo 2PC xlog
From
wangchuanting
Date:
>I am a bit confused. Do you mean that you applied the patch and did >not see the issue again? Or do you mean that you did not have the time >to check it? i means: this issue occurs only once in my test, and have not reproduce it. i review the patch, and i think the patch fix the issue(but i have not applied it, i will apply it tomorrow). -- View this message in context: http://www.postgresql-archive.org/BUG-14680-startup-process-on-standby-encounter-a-deadlock-of-TwoPhaseStateLock-when-redo-2PC-xlog-tp5964069p5966106.html Sent from the PostgreSQL - bugs mailing list archive at Nabble.com. -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
Michael Paquier wrote: > On Sun, Jun 11, 2017 at 12:24 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > I have reworked the comment as follows: > /* > - * Don't need a lock in the recovery phase. > + * It is fine to access TwoPhaseState without a lock here: recovery is > + * finished (so if we were a standby, there's no master that can prepare > + * transactions anymore), and we haven't yet set WAL as open for writes, > + * so local existing backends, if any, cannot do so either. We could use a > + * coding pattern similar to restoreTwoPhaseData, i.e., run the whole loop > + * with the lock held; but this loop is far more complex, so instead only > + * grab the lock while calling the low-level functions working directly on > + * manipulating the two-phase state data. Functions working directly on > + * PGPROC entries linked with the two-phase transaction work with other > + * types of locks but we don't want to complicate that more than necessary. > */ Hmm. Honestly I don't like the final sentence you added. I find it more confusing than useful, because it doesn't explain what these "other types of locks" are, or why we care. However, I found out that this rationale is likely not true, because the checkpointer may be running concurrently with this code from startup process, and checkpointer does process 2PC data. Maybe there are other reasons why there's no live bug here, but it looks wrong (I didn't try to reproduce a problem). -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Tue, Jun 13, 2017 at 7:49 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > However, I found out that this rationale is likely not true, because the > checkpointer may be running concurrently with this code from startup > process, and checkpointer does process 2PC data. Maybe there are other > reasons why there's no live bug here, but it looks wrong (I didn't try > to reproduce a problem). The current coding is actually safe because the checkpointer does not remove or add any 2PC entry in the array while holding TwoPhaseStateLock, it just updates some values that need to be read and/or written while holding the lock. Well, to be honest, HEAD is wrong because it can read a flag value while the checkpointer updates it, and the patch is careful to change that to be correct. The wrong part is when calling ProcessTwoPhaseBuffer() in RecoverPreparedTransactions() which accesses gxact->ondisk and prepare_start_lsn without locking things. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
Michael Paquier wrote: > The current coding is actually safe because the checkpointer does not > remove or add any 2PC entry in the array while holding > TwoPhaseStateLock, it just updates some values that need to be read > and/or written while holding the lock. Well, to be honest, HEAD is > wrong because it can read a flag value while the checkpointer updates > it, and the patch is careful to change that to be correct. The wrong > part is when calling ProcessTwoPhaseBuffer() in > RecoverPreparedTransactions() which accesses gxact->ondisk and > prepare_start_lsn without locking things. Honestly I don't like this rationale very much. Even if doing the unlocked access is safe today, it looks like installing a landmine for the future, and for what? I don't think there's a lot to be gained: RecoverPreparedTransactions only runs once in the life of a server, and CheckPointTwoPhase is supposed to have a very short runtime (per explanation in comments therein). It seems better to me to continue our tradition of using the appropriate locks instead of playing a dangerous game. So I propose that RecoverPreparedTransactions grabs exclusive lock at the top, and only the bottom part of the loop is done unlocked, which AFAICS should be safe. (MarkAsPrepared gained a boolean argument indicating that caller already holds lock). Here's a patch along those lines. The full testsuite is running now, but the recovery tests pass fine. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Attachment
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Wed, Jun 14, 2017 at 8:08 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > So I propose that RecoverPreparedTransactions grabs exclusive lock at > the top, and only the bottom part of the loop is done unlocked, which > AFAICS should be safe. (MarkAsPrepared gained a boolean argument > indicating that caller already holds lock). Logically both approaches are really close as with this approach the additional locked area is when scanning the entries of TwoPhaseState. But I agree that your suggestion would be safer in the long run. This looks good to me after a close look. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Alvaro Herrera
Date:
Michael Paquier wrote: > On Wed, Jun 14, 2017 at 8:08 AM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > So I propose that RecoverPreparedTransactions grabs exclusive lock at > > the top, and only the bottom part of the loop is done unlocked, which > > AFAICS should be safe. (MarkAsPrepared gained a boolean argument > > indicating that caller already holds lock). > > Logically both approaches are really close as with this approach the > additional locked area is when scanning the entries of TwoPhaseState. > But I agree that your suggestion would be safer in the long run. This > looks good to me after a close look. Thanks, pushed. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
Re: [BUGS] Re: BUG #14680: startup process on standby encounter adeadlock of TwoPhaseStateLock when redo 2PC xlog
From
Michael Paquier
Date:
On Thu, Jun 15, 2017 at 12:35 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Michael Paquier wrote: >> On Wed, Jun 14, 2017 at 8:08 AM, Alvaro Herrera >> <alvherre@2ndquadrant.com> wrote: >> > So I propose that RecoverPreparedTransactions grabs exclusive lock at >> > the top, and only the bottom part of the loop is done unlocked, which >> > AFAICS should be safe. (MarkAsPrepared gained a boolean argument >> > indicating that caller already holds lock). >> >> Logically both approaches are really close as with this approach the >> additional locked area is when scanning the entries of TwoPhaseState. >> But I agree that your suggestion would be safer in the long run. This >> looks good to me after a close look. > > Thanks, pushed. Thanks for the commit. -- Michael -- Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs