Re: Hot standby doesn't come up on some situation. - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Hot standby doesn't come up on some situation.
Date
Msg-id 53171DD8.9000006@vmware.com
Whole thread Raw
In response to Re: Hot standby doesn't come up on some situation.  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: Hot standby doesn't come up on some situation.
List pgsql-hackers
On 03/05/2014 10:51 AM, Kyotaro HORIGUCHI wrote:
> Hello,
>
> After all, I have confirmed that this fixes the problem on crash
> recovery of hot-standby botfor 9.3 and HEAD and no problem was
> found except unreadability :(

Ok, committed. Thanks!

Any concrete suggestions about the readability? Is there some particular 
spot that needs clarifying?

> By the way, I moderately want to fix an assertion message to a
> ordinary one. Details are below.
>
> ====
> The server stops with following message during restarting after
> crash requesting archive recovery when the WAL has been produced
> with the wal_level below WAL_LEVEL_HOT_STANDBY.
>
> | TRAP: FailedAssertion("!(((oldestActiveXID) != ((TransactionId) 0)))", File: "xlog.c", Line: 6799)
> | LOG:  startup process (PID 7270) was terminated by signal 6: Aborted
>
> Surely this is the consequence of illegal operation but I think
> it is also not a issue of assertion - which fires on something
> wrong in design or quite rare cases(this case ?).

Ah, I see. Yes, that's definitely a bug. If you don't hit the assertion, 
because the oldestActiveXID is set in the checkpoint record even though 
wal_level is 'archive', or if you simply have assertions disabled, the 
system will start up in hot standby mode even though it's not safe.

> So it might be
> better to show message as below on the case.
>
> | FATAL:  Checkpoint doesn't have valid oldest active transaction id
> | HINT:  Reading WAL might have been written under insufficient wal_level.
>
> This could do in this way,
> ======
> diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
> index e3d5e10..bb6922a 100644
> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -6789,7 +6789,13 @@ StartupXLOG(void)
>               if (wasShutdown)
>                   oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
>               else
> +            {
>                   oldestActiveXID = checkPoint.oldestActiveXid;
> +                if (!TransactionIdIsValid(oldestActiveXID))
> +                    ereport(FATAL,
> +                            (errmsg("Checkpoint doesn't have valid oldest active transaction id"),
> +                             errhint("Reading WAL might have been written under insufficient wal_level.")));
> +            }
>               Assert(TransactionIdIsValid(oldestActiveXID));
>
>               /* Tell procarray about the range of xids it has to deal with */
> =====
>
>
> What do you think about this? Feel free dumping this if you feel
> negative on this.

Hmm. When I test that with 9.2, oldestActiveXID is not 0, even though 
wal_level is 'archive'. So the above patch doesn't fix the whole problem.

The real bug here is that CheckRequiredParameterValues() tests for 
InArchiveRecovery, when it should be testing for 
ArchiveRecoveryRequested. Otherwise, the checks are not performed when 
going through the crash recovery followed by archive recovery. I 
should've changed that as part of the commit that added the crash 
recovery then archive recovery behavior.

Fixed, thanks for pointing it out!

- Heikki



pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Review: Patch FORCE_NULL option for copy COPY in CSV mode
Next
From: Michael Paquier
Date:
Subject: Re: Review: Patch FORCE_NULL option for copy COPY in CSV mode