Hi Michael,
> That's annoying. Because that means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages. How was the base backup of this node created? Please
> remember that when taking a base backup from a standby, you should
> back up the control file last, as there is no end-of-backup record
> available to mark a consistency point. So it seems to me that the
> origin of your problem comes from an incorrect base backup procedure?
We are running a cluster of 3 nodes (m4.large + EBS volume for
PGDATA) on AWS. The replicas were initialized about a year ago with
pg_basebackup and have been working absolutely fine. In the past year
I did a few minor-version upgrades with a switchover (first upgrade
the replicas, then switch over, and finally upgrade the former
primary). The last switchover was done on August 19th. This instance
had been working as a replica for about three days until the EC2
instance suddenly crashed. On the new instance we attached the
existing EBS volume with the existing PGDATA and tried to start
Postgres. You can see the consequences in the very first email.
> One idea I have would be to copy all the WAL segments up to the point
> where the pages to-be-updated are, and let Postgres replay all the local
> WALs first. However it is hard to say if that would be enough, as you
> could have more references to pages even newer than the btree one you
> just found.
Well, I did some experiments, among them the approach you suggested,
i.e. I commented out restore_command in recovery.conf and copied
quite a few WAL segments into pg_xlog. The results are the same: it
still aborts as long as there are open connections :(
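
For reference, the experiment looked roughly like this (the archive
path and segment names below are only illustrative, not my real
ones):

    # recovery.conf: restore_command disabled, so nothing is fetched
    # from the archive during replay
    #restore_command = 'cp /wal_archive/%f "%p"'

    # copy the relevant segments straight into pg_xlog, so that
    # Postgres replays them as local WAL on startup
    cp /wal_archive/0000000A00000025000000?? $PGDATA/pg_xlog/

As said above, the result was the same as with the restore_command in
place.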
Regards,
--
Alexander Kukushkin