Home > mailing lists

Re: Problem while setting the fpw with SIGHUP - Mailing list pgsql-hackers

From	Amit Kapila
Subject	Re: Problem while setting the fpw with SIGHUP
Date	September 21, 2018 14:55:50
Msg-id	CAA4eK1K7Pg6HDMCHZ61XdppSt4nt5HJ0qv-uaTMMAZsTQo4fLg@mail.gmail.com Whole thread Raw
In response to	Re: Problem while setting the fpw with SIGHUP (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
List	pgsql-hackers

Tree view

On Tue, Sep 18, 2018 at 12:46 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> At Fri, 14 Sep 2018 16:30:37 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
<CAA4eK1+Xfx5jD2CHmLPNqXeOzqRLKG9TNr8wfs3-cPf9Ln9RVg@mail.gmail.com>
> > On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote:
> > >
> > > On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote:
> > > > /*
> > > >  * Properly accept or ignore signals the postmaster might send us.
> > > >  */
> > > > -       pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */
> > > > +       pqsignal(SIGHUP, SIG_IGN);  /* ignore reload config */
> > > >
> > > > I am finally coming back to this patch set, and that's one of the first
> > > > things I am going to help moving on for this CF.  And this bit from the
> > > > last patch series is not acceptable as there are some parameters which
> > > > are used by the startup process which can be reloaded.  One of them is
> > > > wal_retrieve_retry_interval for tuning when fetching WAL at recovery.
> > >
> > > So, I have been working on this problem again and I have reviewed the
> > > thread, and there have been many things discussed in the last couple of
> > > months:
> > > 1) We do not want to initialize XLogInsert stuff unconditionally for all
> > > processes at the moment recovery begins, but we just want to initialize
> > > it once WAL write is open for business.
> > > 2) Both the checkpointer and the startup process can call
> > > UpdateFullPageWrites() which can cause Insert->fullPageWrites to get
> > > incorrect values.
> >
> > Can you share the steps to reproduce this problem?
>
> The window for the issue is small.
>
> It happens if checkpointer first looks SharedRecoveryInProgress
> is turned off at the beginning of the CheckPointerMain loop.
> The window is needed be widen to make sure the issue happens.
>
> Applying the first patch attched, the following steps will cause
> the issue with high reliability.
>
> 1. initdb  (full_page_writes = on by default)
>
> 2. put recovery.conf so that server can accept promote signal
>
> 3. touch /tmp/hoge
>    change full_page_write to off in postgresql.conf
>
> 4. pg_ctl_promote
>
> The attached second file is a script do the above steps.
> Server writes the following log message and die.
>
> | 2018-09-18 13:55:49.928 JST [16405] LOG:  database system is ready to accept connections
> | TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731)
> | 2018-09-18 13:55:50.546 JST [16405] LOG:  checkpointer process (PID 16407) was terminated by signal 6: Aborted
>
> We can preallocating the XLogInsert buffer just to prevent the
> crash. This is done by calling RecoveryInProgress() before
> UpdateFullPageWrites() finds it turned to false. This is an
> isolated problem.
>

Yes, till this point the problem is quite clear and can be reproduced,
however, my question was for the next part for which there doesn't
seem to be a concrete test case.

> But it has another issue that FPW_CHANGE record
> can be missing or wrong FPW state after recovery end.
>
> It comes from the fact that responsibility to update the flag is
> not atomically passed from startup to checkpointer. (The window
> is after UpdateFullPageWrites() call and until setting
> SharedRecoveryInProgress to false.)
>

I understand your concern about missing FPW_CHANGE WAL record during
the above window but is that really a problem because till the
promotion is complete, it is not expected that we write any WAL
record.  Now, the other part of the problem you mentioned is "wrong
FPW state" which can happen because we don't read
Insert->fullPageWrites under lock.  If you are concerned about that
then I think we can solve it with a much less invasive change.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pgsql-hackers by date:

From: Alexander Kukushkin
Date: 21 September 2018, 14:43:28
Subject: Re: Connection slots reserved for replication

From: Pavel Stehule
Date: 21 September 2018, 15:01:15
Subject: Re: Proposal for disk quota feature

Re: Problem while setting the fpw with SIGHUP - Mailing list pgsql-hackers

Previous

Next