Re: [HACKERS] Broken hint bits (freeze) - Mailing list pgsql-hackers

From Vladimir Borodin
Subject Re: [HACKERS] Broken hint bits (freeze)
Date
Msg-id 9FD5B557-25D5-4199-92CE-C278AEBF97BB@simply.name
Whole thread Raw
In response to Re: [HACKERS] Broken hint bits (freeze)  (Michael Paquier <michael.paquier@gmail.com>)
List pgsql-hackers

26 мая 2017 г., в 23:04, Michael Paquier <michael.paquier@gmail.com> написал(а):

On Fri, May 26, 2017 at 2:39 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
Yeah, I think this is quite suspicious.  This seems to indicate that
not all WAL records are replicated before the switchover.  What is the
value of "synchronous_commit" you are using?  I think you somehow need
to ensure before switchover that all the WAL is replicated to ensure
this is not a setup problem.

It is so easy to corrupt a server because of an incorrect base backup
flow or an incorrect switchover that it would be good first to
understand how you are doing your switchover. Any corruption happening
after a promotion, a failover or a switchover may be the top of the
iceberg of what's on the data pages, and you may just see one symptom
among other problems. Particularly, did you kill the master in any
violent way after promoting the standby? Has the former master been
able to perform at least once a clean shutdown checkpoint and has it
been rewound?

At first we cleanly stop the former master and then we extra check that all data has been received by the standby to be promoted. Old master is returned by simply generating recovery.conf and starting it, all other standbys are done in the same way. WAL history on all hosts and in archive remains linear, no pg_rewind is needed. This procedure has been well tested on 9.3 (when the ability to do such switchover without restoring standbys from backup appeared), automated and has not changed for all these years.

Actually we have already found that LSNs of all corrupted tuples are somewhere near the time of upgrade from 9.5 to 9.6. There is still no evidence but it seems that it is mostly related to upgrade procedure. We now extract backups of 9.5 and 9.6 for a database where we now have corrupted pages to check this version. But it is still not obvious if it could be a pg_upgrade bug or our mistake in a way we did upgrade.

--
Michael


--
May the force be with you…

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: [HACKERS] Broken hint bits (freeze)
Next
From: Peter Eisentraut
Date:
Subject: Re: [HACKERS] logical replication and PANIC during shutdowncheckpoint in publisher