Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM - Mailing list pgsql-hackers

From Tom Lane
Subject Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
Date
Msg-id 22735.1507048202@sss.pgh.pa.us
Whole thread Raw
In response to Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiverafter OOM  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: [HACKERS] [sqlsmith] stuck spinlock in pg_stat_get_wal_receiver after OOM
List pgsql-hackers
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> Tom Lane wrote:
>> Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
>>> I think the latch is only used locally.  Seems that it was only put in
>>> shmem to avoid a separate variable ...

>> Hm, I'm strongly tempted to move it to a separate static variable then.

Oh, wait, look at

/** Wake up the walreceiver main loop.** This is called by the startup process whenever interesting xlog records* are
applied,so that walreceiver can check if it needs to send an apply* notification back to the master which may be
waitingin a COMMIT with* synchronous_commit = remote_apply.*/
 
void
WalRcvForceReply(void)
{WalRcv->force_reply = true;if (WalRcv->latch)    SetLatch(WalRcv->latch);
}

So that's trouble waiting to happen, for sure.  At the very least we
need to do a single fetch of WalRcv->latch, not two.  I wonder whether
even that is sufficient, though: this coding requires an atomic fetch of
a pointer, which is something we generally don't assume to be safe.

I'm inclined to think that it'd be a good idea to move the set and
clear of the latch field into the nearby spinlock critical sections,
and then change WalRcvForceReply to look like

void
WalRcvForceReply(void)
{Latch       *latch;
WalRcv->force_reply = true;/* fetching the latch pointer might not be atomic, so use spinlock
*/SpinLockAcquire(&WalRcv->mutex);latch= WalRcv->latch;SpinLockRelease(&WalRcv->mutex);if (latch)    SetLatch(latch);
 
}

This still leaves us with a hazard of applying SetLatch to the latch
of a PGPROC that is no longer the walreceiver's, but I think that's
okay (and we have the same problem elsewhere).  At worst it leads to
an extra wakeup of the next process using that PGPROC.

I'm also thinking that the force_reply flag needs to be sig_atomic_t,
not bool, because bool store/fetch isn't necessarily atomic on all
platforms.
        regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with alot of columns
Next
From: Dang Minh Huong
Date:
Subject: Re: [HACKERS] list of credits for release notes