Re: Problem while setting the fpw with SIGHUP - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Problem while setting the fpw with SIGHUP
Date
Msg-id CAA4eK1K7dVgKU4zrNTSCW=EoqALG38XmNT0HK9Wdkr935iwTQg@mail.gmail.com
Whole thread Raw
In response to Re: Problem while setting the fpw with SIGHUP  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Problem while setting the fpw with SIGHUP
List pgsql-hackers
On Wed, Aug 1, 2018 at 12:56 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Thank you, Amit, Michael.
>

Can you verify the first patch that I have posted above [1]?  We can
commit it separately.

>
> It's a long time ago.. Let me have a bit of time to blow dust off..
>
> https://www.postgresql.org/message-id/20180420.173354.43303926.horiguchi.kyotaro@lab.ntt.co.jp
>
> ...done..i working..
>
> The original problem here was UpdateFullPageWrites() can call
> RecoveryInProgress(), which does palloc in a critical
> section. This happens when standy is commanded to reload with
> change of full_pages_writes during recovery.
>

AFAIU from the original problem reported by Dilip, it can happen
during checkpoint without standby as well.

> While exploring it, I found that update of fullPageWrite flag is
> updated concurrently against its design. A race condition happens
> in the following steps in my diagnosis.
>
> 1. While the startup process is running recovery, we didn't
>  consider that checkpointer may be running concurrently, but it
>  happens for standby.
>
> 2. Checkpointer considers itself (or designed) as the *only*
>  updator of shared config including fillPageWrites. In reality
>  the startup is another concurent updator on standby. Addition to
>  that, checkpointer assumes that it is started under WAL-emitting
>  state, that is, InitXLogInsert() has been called elsewhere, but
>  it is not the case on standby.
>
>  Note that checkpointer mustn't update FPW flag before recovery
>  ends because startup will overrides the flag.
>
> 3. As the result, when standby gets full_page_writes changed and
>  SIGHUP during recovery, checkpointer tries to update FPW flag
>  and calls RecoveryInProgress() on the way. bang!
>
>
> With the 0002-step1.patch, checkpointer really becomes the only
> updator of the FPW flag after recovery ends. StartupXLOG()
> updates it just once before checkpointer starts to update it.
>

- * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
- * record before resource manager writes cleanup WAL records or checkpoint
- * record is written.
+ * Update full_page_writes in shared memory with the lastest value before
+ * resource manager writes cleanup WAL records or checkpoint record is
+ * written. We don't need to write XLOG_FPW_CHANGE since this just
+ * reflects the status at the last redo'ed record. No lock is required
+ * since startup is the only updator of the flag at this
+ * point. Checkpointer will take over after SharedRecoveryInProgress is
+ * turned to false.
  */
  Insert->fullPageWrites = lastFullPageWrites;
- LocalSetXLogInsertAllowed();
- UpdateFullPageWrites();
- LocalXLogInsertAllowed = -1;

lastFullPageWrites will contain the latest value among the replayed
WAL records.  It can still be different fullPageWrites which is
updated by UpdateFullPageWrites().  So, I don't think it is advisable
to remove it and rely on checkpointer to update it.  I think when it
is called from this code path, it will anyway not write
XLOG_FPW_CHANGE because of the !RecoveryInProgress() check.

> Under the normal(slow?) promotion mode, checkpointer is woken up
> explicitly to make the config setting effective as soon as
> possible. (This might be unnecessary.)
>

I am not sure this is the right approach.  If we are worried about
concurrency issues, then we can use lock to protect it.

> In checkpointer, RecoveryInProgress() is called preior to
> UpdateFPW() to disable update FPW during recovery, so the crash
> that is the issue here is fixed.
>

It seems to me that you are trying to resolve two different problems
in the same patch.  I think we can have a patch to deal with
concurrency issue if any and a separate patch to call
RecoveryInProgress outside critical section.

[1] - https://www.postgresql.org/message-id/CAA4eK1JvKxsHfM6GCHoKNas-7vDSniLgaAm%3DcG_OaQaOYRNc3w%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: Compile failures today
Next
From: Bruce Momjian
Date:
Subject: Re: Reference link for applicable_roles view is missing in thedocumentation of enabled_roles