Re: Problem while setting the fpw with SIGHUP - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Problem while setting the fpw with SIGHUP
Date
Msg-id CAA4eK1K7Pg6HDMCHZ61XdppSt4nt5HJ0qv-uaTMMAZsTQo4fLg@mail.gmail.com
Whole thread Raw
In response to Re: Problem while setting the fpw with SIGHUP  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List pgsql-hackers
On Tue, Sep 18, 2018 at 12:46 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> At Fri, 14 Sep 2018 16:30:37 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+Xfx5jD2CHmLPNqXeOzqRLKG9TNr8wfs3-cPf9Ln9RVg@mail.gmail.com>
> > On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> > > > /*
> > > >  * Properly accept or ignore signals the postmaster might send us.
> > > >  */
> > > > -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> > > > +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> > > >
> > > > I am finally coming back to this patch set, and that's one of the first
> > > > things I am going to help moving on for this CF.  And this bit from the
> > > > last patch series is not acceptable as there are some parameters which
> > > > are used by the startup process which can be reloaded.  One of them is
> > > > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> > >
> > > So, I have been working on this problem again and I have reviewed the
> > > thread, and there have been many things discussed in the last couple of
> > > months:
> > > 1) We do not want to initialize XLogInsert stuff unconditionally for all
> > > processes at the moment recovery begins, but we just want to initialize
> > > it once WAL write is open for business.
> > > 2) Both the checkpointer and the startup process can call
> > > UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
> > > incorrect values.
> >
> > Can you share the steps to reproduce this problem?
>
> The window for the issue is small.
>
> It happens if checkpointer first looks SharedRecoveryInProgress
> is turned off at the beginning of the CheckPointerMain loop.
> The window is needed be widen to make sure the issue happens.
>
> Applying the first patch attched, the following steps will cause
> the issue with high reliability.
>
> 1. initdb  (full_page_writes = on by default)
>
> 2. put recovery.conf so that server can accept promote signal
>
> 3. touch /tmp/hoge
>    change full_page_write to off in postgresql.conf
>
> 4. pg_ctl_promote
>
> The attached second file is a script do the above steps.
> Server writes the following log message and die.
>
> | 2018-09-18 13:55:49.928 JST [16405] LOG:  database system is ready to accept connections
> | TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
> | 2018-09-18 13:55:50.546 JST [16405] LOG:  checkpointer process (PID 16407) was terminated by signal 6: Aborted
>
> We can preallocating the XLogInsert buffer just to prevent the
> crash. This is done by calling RecoveryInProgress() before
> UpdateFullPageWrites() finds it turned to false. This is an
> isolated problem.
>

Yes, till this point the problem is quite clear and can be reproduced,
however, my question was for the next part for which there doesn't
seem to be a concrete test case.

> But it has another issue that FPW_CHANGE record
> can be missing or wrong FPW state after recovery end.
>
> It comes from the fact that responsibility to update the flag is
> not atomically passed from startup to checkpointer. (The window
> is after UpdateFullPageWrites() call and until setting
> SharedRecoveryInProgress to false.)
>

I understand your concern about missing FPW_CHANGE WAL record during
the above window but is that really a problem because till the
promotion is complete, it is not expected that we write any WAL
record.  Now, the other part of the problem you mentioned is "wrong
FPW state" which can happen because we don't read
Insert->fullPageWrites under lock.  If you are concerned about that
then I think we can solve it with a much less invasive change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Alexander Kukushkin
Date:
Subject: Re: Connection slots reserved for replication
Next
From: Pavel Stehule
Date:
Subject: Re: Proposal for disk quota feature