Re: Problem while setting the fpw with SIGHUP - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Problem while setting the fpw with SIGHUP |
Date | |
Msg-id | CAA4eK1K7Pg6HDMCHZ61XdppSt4nt5HJ0qv-uaTMMAZsTQo4fLg@mail.gmail.com Whole thread Raw |
In response to | Re: Problem while setting the fpw with SIGHUP (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
List | pgsql-hackers |
On Tue, Sep 18, 2018 at 12:46 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > > At Fri, 14 Sep 2018 16:30:37 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in <CAA4eK1+Xfx5jD2CHmLPNqXeOzqRLKG9TNr8wfs3-cPf9Ln9RVg@mail.gmail.com> > > On Fri, Sep 14, 2018 at 12:57 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > On Thu, Sep 06, 2018 at 04:37:28PM -0700, Michael Paquier wrote: > > > > /* > > > > * Properly accept or ignore signals the postmaster might send us. > > > > */ > > > > - pqsignal(SIGHUP, StartupProcSigHupHandler); /* reload config file */ > > > > + pqsignal(SIGHUP, SIG_IGN); /* ignore reload config */ > > > > > > > > I am finally coming back to this patch set, and that's one of the first > > > > things I am going to help moving on for this CF. And this bit from the > > > > last patch series is not acceptable as there are some parameters which > > > > are used by the startup process which can be reloaded. One of them is > > > > wal_retrieve_retry_interval for tuning when fetching WAL at recovery. > > > > > > So, I have been working on this problem again and I have reviewed the > > > thread, and there have been many things discussed in the last couple of > > > months: > > > 1) We do not want to initialize XLogInsert stuff unconditionally for all > > > processes at the moment recovery begins, but we just want to initialize > > > it once WAL write is open for business. > > > 2) Both the checkpointer and the startup process can call > > > UpdateFullPageWrites() which can cause Insert->fullPageWrites to get > > > incorrect values. > > > > Can you share the steps to reproduce this problem? > > The window for the issue is small. > > It happens if checkpointer first looks SharedRecoveryInProgress > is turned off at the beginning of the CheckPointerMain loop. > The window is needed be widen to make sure the issue happens. > > Applying the first patch attched, the following steps will cause > the issue with high reliability. > > 1. initdb (full_page_writes = on by default) > > 2. put recovery.conf so that server can accept promote signal > > 3. touch /tmp/hoge > change full_page_write to off in postgresql.conf > > 4. pg_ctl_promote > > The attached second file is a script do the above steps. > Server writes the following log message and die. > > | 2018-09-18 13:55:49.928 JST [16405] LOG: database system is ready to accept connections > | TRAP: FailedAssertion("!(CritSectionCount == 0)", File: "mcxt.c", Line: 731) > | 2018-09-18 13:55:50.546 JST [16405] LOG: checkpointer process (PID 16407) was terminated by signal 6: Aborted > > We can preallocating the XLogInsert buffer just to prevent the > crash. This is done by calling RecoveryInProgress() before > UpdateFullPageWrites() finds it turned to false. This is an > isolated problem. > Yes, till this point the problem is quite clear and can be reproduced, however, my question was for the next part for which there doesn't seem to be a concrete test case. > But it has another issue that FPW_CHANGE record > can be missing or wrong FPW state after recovery end. > > It comes from the fact that responsibility to update the flag is > not atomically passed from startup to checkpointer. (The window > is after UpdateFullPageWrites() call and until setting > SharedRecoveryInProgress to false.) > I understand your concern about missing FPW_CHANGE WAL record during the above window but is that really a problem because till the promotion is complete, it is not expected that we write any WAL record. Now, the other part of the problem you mentioned is "wrong FPW state" which can happen because we don't read Insert->fullPageWrites under lock. If you are concerned about that then I think we can solve it with a much less invasive change. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: