Re: pg_rewind rewinded too much... - Mailing list pgsql-general

From Michael Paquier
Subject Re: pg_rewind rewinded too much...
Date
Msg-id 20161129115959.GA6438@paquier.xyz
Whole thread Raw
In response to pg_rewind rewinded too much...  (marin@kset.org)
List pgsql-general
On Tue, Nov 29, 2016 at 09:30:26AM +0100, marin@kset.org wrote:
> I am doing some testing with pg_rewind and I encountered a very strange
> situation. So, let me describe the scenario:
>
> 1. Create a master (server1) node (max_wal_size = 2GB, checkpoint_timeout =
> 10min)
> 2. Create a test database on it (server1)
> 3. Create a test table (server1)
> 4. Insert a couple of million rows ( >> 5GB) but cancel the query at some
> point leaving a few gigs of useless WAL files (server1)
> 5. Insert a couple of thousand rows ( < 1GB) (server1)
> 6. Build a slave (server2) and connect to the master (server1)
> 7. Promote the slave (server2 now master)
> 8. Insert one row on the old master (server1)
> 9. Shutdown the old master (server1)
> 10. Issue a pg_rewind on the old master (server1)
> 11. Configure recovery.conf and start the new slave (server1)

This is a correct use flow with pg_rewind. This will force the old master
to replay from the last checkpoint before WAL forked.

> All of this under 10 minutes :)
>
> The timeline switch is at:
> 1       1/B2A64F80      no recovery target specified
>
> The pg_controldata on the slave (server1) gives:
> Latest checkpoint location:           1/B3000108
> Prior checkpoint location:            1/B3000060
> Latest checkpoint's REDO location:    1/B30000D0
> Latest checkpoint's REDO WAL file:    0000000200000001000000B3

That looks normal to me. Did you enforce a checkpoint on the promoted
slave? When using a live source server, pg_rewind has to look at what is
the current timeline on the source by looking at the control data file
and that is not updated at promotion but at the next checkpoint. You
should be careful about that.

> But when starting the slave i get the following errors in the log file:
>
> 2016-11-28 15:38:44 UTC: WAL 00000002.history successfully restored
> 2016-11-28 15:38:43 UTC [6]: [14-1] user=,db= LOG:  restored log file
> "00000002.history" from archive
> 2016-11-28 15:38:44 UTC: Error copying WAL 00000002000000010000009D!
> 2016-11-28 15:38:44 UTC: Error copying WAL 00000001000000010000009D!

So that's where the last checkpoint before WAL forked is. That's before B3 so
there is nothing weird.

> So the problem is that the slave (server1) went back in time, to a location
> that is not on the new master (first WAL available on the new master is
> 0000000100000001000000AE). The old archive on the slave (server1) is gone
> (deleted) so there is no way to fix this. And this is not a problem per se,
> because you can use pg_basebackup or rsync instead of pg_rewind. The real
> problem is that I can't diagnose this problem until I start the server
> process and read the logs. Shouldn't pg_controldata report the first WAL
> needed to start the instance?

It could, but honestly that's not necessary to complicate what pg_rewind
does. Once a rewind completes, pg_rewind creates a backup_label file that
includes the real data used by the startup process to determine from which
segment replay will start. I think that you had better rely on its content
to check what is the first segment the new slave needs to replay and if it
is still present on the new master.
--
Michael

Attachment

pgsql-general by date:

Previous
From: Berend Tober
Date:
Subject: Re: Invoice Table Design
Next
From: Stuart Bishop
Date:
Subject: Re: Storing files: 2.3TBytes, 17M file count