Re: BUG #15346: Replica fails to start after the crash - Mailing list pgsql-hackers

From Alexander Kukushkin
Subject Re: BUG #15346: Replica fails to start after the crash
Date
Msg-id CAFh8B=n-NSo2Ktz_DG3W4FAFr2xHYr7FyRNdOF6g7T2o1-CD4w@mail.gmail.com
List pgsql-hackers
Hello hackers!

It seems that commit 8d68ee6 broke the bgwriter running on replicas:
as a result, bgwriter never updates minRecoveryPoint in pg_control.
Please see the detailed explanation below.

2018-08-29 22:54 GMT+02:00 Michael Paquier <michael@paquier.xyz>:

> This is not a solution in my opinion, as you could invalidate activities
> of backends connected to the database when the incorrect consistent
> point allows connections to come in too early.

That's true, but I still think it is better than aborting the startup process...

> What happens if you replay with hot_standby = on up to the latest point,
> without any concurrent connections, then issue a checkpoint on the
> standby once you got to a point newer than the complain, and finally
> restart the standby with the bgworker?
>
> Another idea I have would be to make the standby promote, issue a
> checkpoint on it, and then use pg_rewind as a trick to update the
> control file to a point newer than the inconsistency.  As PG < 9.6.10
> could make the minimum recovery point go backwards, applying the upgrade
> after the consistent point got to an incorrect state would trigger the
> failure.

Well, all these actions would probably help to relieve the symptoms and
replay WAL up to the point where the replica becomes really consistent.

I spent a long time trying to figure out how some of the pages could
have been written to disk without pg_control being updated, and I think
I have managed to put all the pieces of the puzzle into a nice picture:

static void
UpdateMinRecoveryPoint(XLogRecPtr lsn, bool force)
{
        /* Quick check using our local copy of the variable */
        if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
                return;

        /*
         * An invalid minRecoveryPoint means that we need to recover all the WAL,
         * i.e., we're doing crash recovery.  We never modify the control file's
         * value in that case, so we can short-circuit future checks here too. The
         * local values of minRecoveryPoint and minRecoveryPointTLI should not be
         * updated until crash recovery finishes.
         */
        if (XLogRecPtrIsInvalid(minRecoveryPoint))
        {
                updateMinRecoveryPoint = false;
                return;
        }

This code was changed in commit 8d68ee6, and the change broke bgwriter:
it now never updates pg_control when it flushes dirty pages to disk.
Here is how it happens:
When bgwriter starts, its local copy of minRecoveryPoint is not
initialized; if I attach with gdb, it shows minRecoveryPoint = 0, i.e.
the value is invalid.
As a result, updateMinRecoveryPoint is set to false, and every
subsequent call of UpdateMinRecoveryPoint from bgwriter returns at the
very first if.
Bgwriter itself never sets updateMinRecoveryPoint back to true, and
boom: we can get a lot of pages written to disk while minRecoveryPoint
in pg_control is never updated!

If the replica crashes in this state, then on restart it reaches
consistency much earlier than it should!
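
To make the sequence above easier to follow, here is a minimal
standalone sketch of it. This is not PostgreSQL code: the variable
controlFileMinRecoveryPoint and the functions
UpdateMinRecoveryPointModel and SimulateBgwriterFlushes are
hypothetical stand-ins, the LSN values are made up, and only the two
early-exit checks quoted above are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;            /* simplified stand-in for the real type */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical stand-in for the value stored in pg_control. */
static XLogRecPtr controlFileMinRecoveryPoint = 100;

/*
 * Process-local state as observed in a freshly started bgwriter: the
 * local copy was never initialized, so it is still 0 (invalid).
 */
static XLogRecPtr minRecoveryPoint = InvalidXLogRecPtr;
static bool updateMinRecoveryPoint = true;

/* Simplified model of the early-exit logic in UpdateMinRecoveryPoint(). */
void
UpdateMinRecoveryPointModel(XLogRecPtr lsn, bool force)
{
    /* Quick check using the local copy of the variable */
    if (!updateMinRecoveryPoint || (!force && lsn <= minRecoveryPoint))
        return;

    /* The local copy is invalid, so updates are switched off for good. */
    if (minRecoveryPoint == InvalidXLogRecPtr)
    {
        updateMinRecoveryPoint = false;
        return;
    }

    /* Not reached with the initial state above: advance the stored value. */
    if (controlFileMinRecoveryPoint < lsn)
        controlFileMinRecoveryPoint = lsn;
    minRecoveryPoint = controlFileMinRecoveryPoint;
}

/*
 * Pretend bgwriter flushes npages pages with LSNs 200, 300, ... and
 * return what the control file stand-in holds afterwards.
 */
XLogRecPtr
SimulateBgwriterFlushes(int npages)
{
    assert(npages >= 0);
    for (int i = 0; i < npages; i++)
        UpdateMinRecoveryPointModel((XLogRecPtr) (200 + 100 * i), false);
    return controlFileMinRecoveryPoint;
}
```

In this model, SimulateBgwriterFlushes(5) "flushes" pages up to LSN 600,
yet the stored value stays at 100: the first call flips
updateMinRecoveryPoint to false, and every later call returns at the
first if.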

Regards,
--
Alexander Kukushkin

