Thread: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
[HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Andreas Seltenreich
Date:
Hi,

low-memory testing with REL_10_STABLE at 1f19550a87 produced the following PANIC:

    stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397

I was about to wrap the pstrdup()s with a PG_TRY block, but I can't find a spinlock being released in a PG_CATCH block anywhere, so maybe that's a bad idea?

regards,
Andreas

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Thomas Munro
Date:
On Tue, Oct 3, 2017 at 9:56 AM, Andreas Seltenreich <seltenreich@gmx.de> wrote:
> low-memory testing with REL_10_STABLE at 1f19550a87 produced the
> following PANIC:
>
> stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397
>
> I was about to wrap the pstrdup()s with a PG_TRY block, but I can't find
> a spinlock being released in a PG_CATCH block anywhere, so maybe that's
> a bad idea?

No comment on what might be holding the spinlock there, but perhaps the spinlock-protected code should strncpy into stack-local buffers instead of calling pstrdup()? The buffers could be statically sized with NAMEDATALEN and MAXCONNINFO.

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Andres Freund
Date:
On 2017-10-02 22:56:49 +0200, Andreas Seltenreich wrote:
> Hi,
>
> low-memory testing with REL_10_STABLE at 1f19550a87 produced the
> following PANIC:
>
> stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397

Ugh.

> I was about to wrap the pstrdup()s with a PG_TRY block, but I can't find
> a spinlock being released in a PG_CATCH block anywhere, so maybe that's
> a bad idea?

Yes, that'd be a bad idea. It's not great to have memcpys in a critical section, but it's way better than pallocs. So we need to use some local buffers that this gets copied to.

This seems to have been introduced as part of b1a9bad9e74 and then 9ed551e0a4f. Authors CCed.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-10-02 22:56:49 +0200, Andreas Seltenreich wrote:
>> low-memory testing with REL_10_STABLE at 1f19550a87 produced the
>> following PANIC:
>> stuck spinlock detected at pg_stat_get_wal_receiver, walreceiver.c:1397

> Ugh.

Egad.

> Yes, that'd be a bad idea. It's not great to have memcpys in a critical
> section, but it's way better than pallocs. So we need to use some local
> buffers that this gets copied to.

Or replace the spinlock with an LWLock?

In any case, I think it would be a good idea to look at every other critical section touching that lock to see if there are any other blatant coding-rule violations.

regards, tom lane
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Andres Freund
Date:
On 2017-10-02 17:30:25 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> Yes, that'd be a bad idea. It's not great to have memcpys in a critical
>> section, but it's way better than pallocs. So we need to use some local
>> buffers that this gets copied to.
>
> Or replace the spinlock with an LWLock?

That'd probably be a good idea, but I'm loath to do so in the back branches. Not at this callsite, but some others, there's some potential for contention.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-10-02 17:30:25 -0400, Tom Lane wrote:
>> Or replace the spinlock with an LWLock?

> That'd probably be a good idea, but I'm loath to do so in the back
> branches. Not at this callsite, but some others, there's some potential
> for contention.

If this is the only problem then I'd agree we should stick to a spinlock (I assume the strings in question can't be very long). I was thinking more about what to do if we find other violations that are harder to fix.

regards, tom lane
I wrote:
> If this is the only problem then I'd agree we should stick to a spinlock
> (I assume the strings in question can't be very long). I was thinking
> more about what to do if we find other violations that are harder to fix.

I took a quick look through walreceiver.c, and there are a couple of obvious problems of the same ilk in WalReceiverMain, which is doing this:

    walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime =
        walrcv->latestWalEndTime = GetCurrentTimestamp();

(ie, a potential kernel call) inside a spinlock. But there seems no real problem with just collecting the timestamp before we enter that critical section.

I also don't especially like the fact that just above there it reaches elog(PANIC) with the lock still held, though at least that's a case that should never happen.

Further down, it's doing a pfree() inside the spinlock, apparently for no other reason than to save one "if (tmp_conninfo)".

I don't especially like the Asserts inside spinlocks, either. Personally, I think if those conditions are worth testing then they're worth testing for real (in production). Variables that are manipulated by multiple processes are way more likely to assume unexpected states than local variables.

I'm also rather befuddled by the fact that this code sets and clears walrcv->latch outside the critical sections. If that field is used by any other process, surely that's completely unsafe. If it isn't, why is it being kept in shared memory?

regards, tom lane
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Michael Paquier
Date:
On Tue, Oct 3, 2017 at 6:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> If this is the only problem then I'd agree we should stick to a spinlock
>> (I assume the strings in question can't be very long). I was thinking
>> more about what to do if we find other violations that are harder to fix.

I don't think that there is any need to switch to an LWLock. Any issues that need to be dealt with here don't require it, if we are fine with the memcpy method of course.

> I took a quick look through walreceiver.c, and there are a couple of
> obvious problems of the same ilk in WalReceiverMain, which is doing this:
>
> walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime =
>     walrcv->latestWalEndTime = GetCurrentTimestamp();
>
> (ie, a potential kernel call) inside a spinlock. But there seems no
> real problem with just collecting the timestamp before we enter that
> critical section.

No problems seen either from here.

> I also don't especially like the fact that just above there it reaches
> elog(PANIC) with the lock still held, though at least that's a case that
> should never happen.

This part has been around since the beginning in 1bb2558. I agree that the lock should be released before doing the logging.

> Further down, it's doing a pfree() inside the spinlock, apparently
> for no other reason than to save one "if (tmp_conninfo)".

Check.

> I don't especially like the Asserts inside spinlocks, either. Personally,
> I think if those conditions are worth testing then they're worth testing
> for real (in production). Variables that are manipulated by multiple
> processes are way more likely to assume unexpected states than local
> variables.

Those could be replaced by similar checks using some PG_USED_FOR_ASSERTS_ONLY out of the spinlock sections, though I am not sure if those are worth worrying about. What do others think?

> I'm also rather befuddled by the fact that this code sets and clears
> walrcv->latch outside the critical sections. If that field is used
> by any other process, surely that's completely unsafe. If it isn't,
> why is it being kept in shared memory?

Do you mean the introduction of WalRcvForceReply by 314cbfc? This is more recent, and was discussed during the review of the remote_apply patch here, to avoid sending SIGUSR1 too much from the startup process to the WAL receiver:
https://www.postgresql.org/message-id/CA+TgmobPsROS-gFk=_KJdW5scQjcKtpiLhsH9Cw=UWH1htFKaw@mail.gmail.com

I am attaching a patch that addresses the bugs for the spinlock sections.

--
Michael
Attachment
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Andreas Seltenreich
Date:
Michael Paquier writes:
> I am attaching a patch that addresses the bugs for the spin lock sections.
> [2. text/x-diff; walreceiver-spin-calls.patch]

I haven't seen a spinlock PANIC since testing with the patch applied. They occurred five times with the same amount of testing done earlier.

regards,
Andreas
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Alvaro Herrera
Date:
Andreas Seltenreich wrote:
> Michael Paquier writes:
>
>> I am attaching a patch that addresses the bugs for the spin lock sections.
>> [2. text/x-diff; walreceiver-spin-calls.patch]
>
> I haven't seen a spinlock PANIC since testing with the patch applied.
> They occurred five times with the same amount of testing done earlier.

Thanks for testing. I'm about to push something.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Alvaro Herrera
Date:
Fixed the pstrdup problem by replacing with strlcpy() to stack-allocated variables (rather than palloc + memcpy as proposed in Michael's patch). About the other problems:

Tom Lane wrote:
> I took a quick look through walreceiver.c, and there are a couple of
> obvious problems of the same ilk in WalReceiverMain, which is doing this:
>
> walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime =
>     walrcv->latestWalEndTime = GetCurrentTimestamp();
>
> (ie, a potential kernel call) inside a spinlock. But there seems no
> real problem with just collecting the timestamp before we enter that
> critical section.

Done that way.

> I also don't especially like the fact that just above there it reaches
> elog(PANIC) with the lock still held, though at least that's a case that
> should never happen.

Fixed by releasing spinlock just before elog.

> Further down, it's doing a pfree() inside the spinlock, apparently
> for no other reason than to save one "if (tmp_conninfo)".

Fixed.

> I don't especially like the Asserts inside spinlocks, either. Personally,
> I think if those conditions are worth testing then they're worth testing
> for real (in production). Variables that are manipulated by multiple
> processes are way more likely to assume unexpected states than local
> variables.

I didn't change these. It doesn't look to me that these asserts are worth very much as production code.

> I'm also rather befuddled by the fact that this code sets and clears
> walrcv->latch outside the critical sections. If that field is used
> by any other process, surely that's completely unsafe. If it isn't,
> why is it being kept in shared memory?

I think the latch is only used locally. Seems that it was only put in shmem to avoid a separate variable ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Fixed the pstrdup problem by replacing with strlcpy() to stack-allocated
> variables (rather than palloc + memcpy as proposed in Michael's patch).

+1

> Tom Lane wrote:
>> I don't especially like the Asserts inside spinlocks, either.

> I didn't change these. It doesn't look to me that these asserts are
> worth very much as production code.

OK. If we ever see these hit in the buildfarm I might argue for reconsidering, but without some evidence of that sort it's not worth much concern.

>> I'm also rather befuddled by the fact that this code sets and clears
>> walrcv->latch outside the critical sections. If that field is used
>> by any other process, surely that's completely unsafe. If it isn't,
>> why is it being kept in shared memory?

> I think the latch is only used locally. Seems that it was only put in
> shmem to avoid a separate variable ...

Hm, I'm strongly tempted to move it to a separate static variable then. That's not a bug fix, so maybe it only belongs in HEAD, but is there value in keeping the branches in sync in this code? It sounded from your commit message like they were pretty different already :-(

regards, tom lane
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Alvaro Herrera
Date:
Tom Lane wrote:
> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>> Tom Lane wrote:
>>> I don't especially like the Asserts inside spinlocks, either.
>
>> I didn't change these. It doesn't look to me that these asserts are
>> worth very much as production code.
>
> OK. If we ever see these hit in the buildfarm I might argue for
> reconsidering, but without some evidence of that sort it's not
> worth much concern.

Sure. I would be very surprised if the buildfarm ever exercises this code.

>> I think the latch is only used locally. Seems that it was only put in
>> shmem to avoid a separate variable ...
>
> Hm, I'm strongly tempted to move it to a separate static variable then.
> That's not a bug fix, so maybe it only belongs in HEAD, but is there
> value in keeping the branches in sync in this code? It sounded from
> your commit message like they were pretty different already :-(

Well, there were conflicts in almost every branch, but they were pretty minor.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>>> I think the latch is only used locally. Seems that it was only put in
>>> shmem to avoid a separate variable ...

>> Hm, I'm strongly tempted to move it to a separate static variable then.

Oh, wait, look at

    /*
     * Wake up the walreceiver main loop.
     *
     * This is called by the startup process whenever interesting xlog records
     * are applied, so that walreceiver can check if it needs to send an apply
     * notification back to the master which may be waiting in a COMMIT with
     * synchronous_commit = remote_apply.
     */
    void
    WalRcvForceReply(void)
    {
        WalRcv->force_reply = true;
        if (WalRcv->latch)
            SetLatch(WalRcv->latch);
    }

So that's trouble waiting to happen, for sure. At the very least we need to do a single fetch of WalRcv->latch, not two. I wonder whether even that is sufficient, though: this coding requires an atomic fetch of a pointer, which is something we generally don't assume to be safe.

I'm inclined to think that it'd be a good idea to move the set and clear of the latch field into the nearby spinlock critical sections, and then change WalRcvForceReply to look like

    void
    WalRcvForceReply(void)
    {
        Latch      *latch;

        WalRcv->force_reply = true;
        /* fetching the latch pointer might not be atomic, so use spinlock */
        SpinLockAcquire(&WalRcv->mutex);
        latch = WalRcv->latch;
        SpinLockRelease(&WalRcv->mutex);
        if (latch)
            SetLatch(latch);
    }

This still leaves us with a hazard of applying SetLatch to the latch of a PGPROC that is no longer the walreceiver's, but I think that's okay (and we have the same problem elsewhere). At worst it leads to an extra wakeup of the next process using that PGPROC.

I'm also thinking that the force_reply flag needs to be sig_atomic_t, not bool, because bool store/fetch isn't necessarily atomic on all platforms.
regards, tom lane
I wrote: > So that's trouble waiting to happen, for sure. At the very least we > need to do a single fetch of WalRcv->latch, not two. I wonder whether > even that is sufficient, though: this coding requires an atomic fetch of > a pointer, which is something we generally don't assume to be safe. > > I'm inclined to think that it'd be a good idea to move the set and > clear of the latch field into the nearby spinlock critical sections, > and then change WalRcvForceReply to look like ... Concretely, as per the attached. I reordered the WalRcvData fields to show that the "latch" field is now treated as protected by the spinlock. In the back branches, we shouldn't do that, just in case some external code is touching the mutex field. regards, tom lane diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 57c305d..1bf9be6 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -256,13 +256,14 @@ WalReceiverMain(void) walrcv->lastMsgSendTime = walrcv->lastMsgReceiptTime = walrcv->latestWalEndTime = now; + /* Report the latch to use to awaken this process */ + walrcv->latch = &MyProc->procLatch; + SpinLockRelease(&walrcv->mutex); /* Arrange to clean up at walreceiver exit */ on_shmem_exit(WalRcvDie, 0); - walrcv->latch = &MyProc->procLatch; - /* Properly accept or ignore signals the postmaster might send us */ pqsignal(SIGHUP, WalRcvSigHupHandler); /* set flag to read config file */ pqsignal(SIGINT, SIG_IGN); @@ -777,8 +778,7 @@ WalRcvDie(int code, Datum arg) /* Ensure that all WAL records received are flushed to disk */ XLogWalRcvFlush(true); - walrcv->latch = NULL; - + /* Mark ourselves inactive in shared memory */ SpinLockAcquire(&walrcv->mutex); Assert(walrcv->walRcvState == WALRCV_STREAMING || walrcv->walRcvState == WALRCV_RESTARTING || @@ -789,6 +789,7 @@ WalRcvDie(int code, Datum arg) walrcv->walRcvState = WALRCV_STOPPED; walrcv->pid = 0; walrcv->ready_to_display = false; + 
walrcv->latch = NULL; SpinLockRelease(&walrcv->mutex); /* Terminate the connection gracefully. */ @@ -1344,9 +1345,15 @@ ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime) void WalRcvForceReply(void) { + Latch *latch; + WalRcv->force_reply = true; - if (WalRcv->latch) - SetLatch(WalRcv->latch); + /* fetching the latch pointer might not be atomic, so use spinlock */ + SpinLockAcquire(&WalRcv->mutex); + latch = WalRcv->latch; + SpinLockRelease(&WalRcv->mutex); + if (latch) + SetLatch(latch); } /* diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c index 78f8693..b1f28d0 100644 --- a/src/backend/replication/walreceiverfuncs.c +++ b/src/backend/replication/walreceiverfuncs.c @@ -226,6 +226,7 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo, WalRcvData *walrcv = WalRcv; bool launch = false; pg_time_t now = (pg_time_t) time(NULL); + Latch *latch; /* * We always start at the beginning of the segment. That prevents a broken @@ -274,12 +275,14 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo, walrcv->receiveStart = recptr; walrcv->receiveStartTLI = tli; + latch = walrcv->latch; + SpinLockRelease(&walrcv->mutex); if (launch) SendPostmasterSignal(PMSIGNAL_START_WALRECEIVER); - else if (walrcv->latch) - SetLatch(walrcv->latch); + else if (latch) + SetLatch(latch); } /* diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h index 9a8b2e2..e58fc49 100644 --- a/src/include/replication/walreceiver.h +++ b/src/include/replication/walreceiver.h @@ -117,14 +117,6 @@ typedef struct /* set true once conninfo is ready to display (obfuscated pwds etc) */ bool ready_to_display; - slock_t mutex; /* locks shared variables shown above */ - - /* - * force walreceiver reply? This doesn't need to be locked; memory - * barriers for ordering are sufficient. 
- */ - bool force_reply; - /* * Latch used by startup process to wake up walreceiver after telling it * where to start streaming (after setting receiveStart and @@ -133,6 +125,15 @@ typedef struct * normally mapped to procLatch when walreceiver is running. */ Latch *latch; + + slock_t mutex; /* locks shared variables shown above */ + + /* + * force walreceiver reply? This doesn't need to be locked; memory + * barriers for ordering are sufficient. But we do need atomic fetch and + * store semantics, so use sig_atomic_t. + */ + sig_atomic_t force_reply; /* used as a bool */ } WalRcvData; extern WalRcvData *WalRcv; -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
I wrote:
>> So that's trouble waiting to happen, for sure. At the very least we
>> need to do a single fetch of WalRcv->latch, not two. I wonder whether
>> even that is sufficient, though: this coding requires an atomic fetch of
>> a pointer, which is something we generally don't assume to be safe.

BTW, I had supposed that this bug was of long standing, but actually it's new in v10, dating to 597a87ccc9a6fa8af7f3cf280b1e24e41807d555. Before that, walreceiver start/stop just changed the owner of a long-lived shared latch, and there was no question of stale pointers.

I considered reverting that decision, but the reason for it seems to have been to let libpqwalreceiver.c manipulate MyProc->procLatch rather than having to know about a custom latch. That's probably a sufficient reason to justify some squishiness in the wakeup logic. Still, we might want to revisit it if we find any other problems here.

regards, tom lane
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Petr Jelinek
Date:
On 03/10/17 19:44, Tom Lane wrote:
> I wrote:
>>> So that's trouble waiting to happen, for sure. At the very least we
>>> need to do a single fetch of WalRcv->latch, not two. I wonder whether
>>> even that is sufficient, though: this coding requires an atomic fetch of
>>> a pointer, which is something we generally don't assume to be safe.
>
> BTW, I had supposed that this bug was of long standing, but actually it's
> new in v10, dating to 597a87ccc9a6fa8af7f3cf280b1e24e41807d555. Before
> that walreceiver start/stop just changed the owner of a long-lived shared
> latch, and there was no question of stale pointers.
>
> I considered reverting that decision, but the reason for it seems to have
> been to let libpqwalreceiver.c manipulate MyProc->procLatch rather than
> having to know about a custom latch. That's probably a sufficient reason
> to justify some squishiness in the wakeup logic. Still, we might want to
> revisit it if we find any other problems here.

That's correct, and because the other users of that library don't have a special latch it seemed feasible to standardize on the procLatch. If we indeed wanted to change the decision about this, I think we can simply give the latch as a parameter to libpqrcv_connect and store it in the WalReceiverConn struct. The connection does not need to live past the latch (although it currently does, but that's just a matter of reordering the code in WalRcvDie() a little AFAICS).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Thomas Munro
Date:
On Wed, Oct 4, 2017 at 7:32 AM, Petr Jelinek <petr.jelinek@2ndquadrant.com> wrote:
> On 03/10/17 19:44, Tom Lane wrote:
>> I wrote:
>>>> So that's trouble waiting to happen, for sure. At the very least we
>>>> need to do a single fetch of WalRcv->latch, not two. I wonder whether
>>>> even that is sufficient, though: this coding requires an atomic fetch of
>>>> a pointer, which is something we generally don't assume to be safe.
>>
>> BTW, I had supposed that this bug was of long standing, but actually it's
>> new in v10, dating to 597a87ccc9a6fa8af7f3cf280b1e24e41807d555. Before
>> that walreceiver start/stop just changed the owner of a long-lived shared
>> latch, and there was no question of stale pointers.
>>
>> I considered reverting that decision, but the reason for it seems to have
>> been to let libpqwalreceiver.c manipulate MyProc->procLatch rather than
>> having to know about a custom latch. That's probably a sufficient reason
>> to justify some squishiness in the wakeup logic. Still, we might want to
>> revisit it if we find any other problems here.
>
> That's correct, and because the other users of that library don't have
> special latch it seemed feasible to standardize on the procLatch. If we
> indeed wanted to change the decision about this I think we can simply
> give latch as a parameter to libpqrcv_connect and store it in the
> WalReceiverConn struct. The connection does not need to live past the
> latch (although it currently does, but that's just a matter of
> reordering the code in WalRcvDie() a little AFAICS).

I wonder if the new ConditionVariable mechanism would be worth considering in future cases like this, where the signalling process doesn't know the identity of the process to be woken up (or even how many waiting processes there are), but instead any interested waiters block on a particular ConditionVariable that models a specific interesting condition. In the end it's just latches anyway, but it may be a better abstraction. On the other hand I'm not sure how waits on a ConditionVariable would be multiplexed with IO (a distinct wait event, or a leaky abstraction where the caller relies on the fact that it's built on MyProc->latch, something else?).

--
Thomas Munro
http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
From
Petr Jelinek
Date:
On 03/10/17 22:01, Thomas Munro wrote:
> On Wed, Oct 4, 2017 at 7:32 AM, Petr Jelinek
> <petr.jelinek@2ndquadrant.com> wrote:
>> On 03/10/17 19:44, Tom Lane wrote:
>>> I wrote:
>>>>> So that's trouble waiting to happen, for sure. At the very least we
>>>>> need to do a single fetch of WalRcv->latch, not two. I wonder whether
>>>>> even that is sufficient, though: this coding requires an atomic fetch of
>>>>> a pointer, which is something we generally don't assume to be safe.
>>>
>>> BTW, I had supposed that this bug was of long standing, but actually it's
>>> new in v10, dating to 597a87ccc9a6fa8af7f3cf280b1e24e41807d555. Before
>>> that walreceiver start/stop just changed the owner of a long-lived shared
>>> latch, and there was no question of stale pointers.
>>>
>>> I considered reverting that decision, but the reason for it seems to have
>>> been to let libpqwalreceiver.c manipulate MyProc->procLatch rather than
>>> having to know about a custom latch. That's probably a sufficient reason
>>> to justify some squishiness in the wakeup logic. Still, we might want to
>>> revisit it if we find any other problems here.
>> That's correct, and because the other users of that library don't have
>> special latch it seemed feasible to standardize on the procLatch. If we
>> indeed wanted to change the decision about this I think we can simply
>> give latch as a parameter to libpqrcv_connect and store it in the
>> WalReceiverConn struct. The connection does not need to live past the
>> latch (although it currently does, but that's just a matter of
>> reordering the code in WalRcvDie() a little AFAICS).
>
> I wonder if the new ConditionVariable mechanism would be worth
> considering in future cases like this, where the signalling process
> doesn't know the identity of the process to be woken up (or even how
> many waiting processes there are), but instead any interested waiters
> block on a particular ConditionVariable that models a specific
> interesting condition. In the end it's just latches anyway, but it
> may be a better abstraction. On the other hand I'm not sure how waits
> on a ConditionVariable would be multiplexed with IO (a distinct wait
> event, or leaky abstraction where the caller relies on the fact that
> it's built on MyProc->latch, something else?).

In this specific case the problem is that what we are waiting for is indeed the Latch, because we want the libpqwalreceiver calls to be interruptible when waiting for I/O. Otherwise, yes, ConditionVariable is useful for generic signaling; we use it in the slot and origin code for "lock" waits (they don't have a normal catalog, so normal locking does not work).

--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services