Thread: pg_rewind rewinded too much...

pg_rewind rewinded too much...

From: marin@kset.org
Date: Tue, 29 Nov 2016 09:30:26 +0100
Hi,

I am doing some testing with pg_rewind and I encountered a very strange
situation. So, let me describe the scenario:

1. Create a master (server1) node (max_wal_size = 2GB,
checkpoint_timeout = 10min)
2. Create a test database on it (server1)
3. Create a test table (server1)
4. Insert a couple of million rows ( >> 5GB) but cancel the query at
some point leaving a few gigs of useless WAL files (server1)
5. Insert a couple of thousand rows ( < 1GB) (server1)
6. Build a slave (server2) and connect to the master (server1)
7. Promote the slave (server2 now master)
8. Insert one row on the old master (server1)
9. Shutdown the old master (server1)
10. Issue a pg_rewind on the old master (server1)
11. Configure recovery.conf and start the new slave (server1)

All of this under 10 minutes :)
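For reference, a rough shell sketch of steps 6-11 (host names, user,
database/table names and data directory paths below are made up;
PostgreSQL 9.5-style commands):

# 6. Build the slave (server2) from the master (server1);
#    -R writes a recovery.conf pointing back at server1
pg_basebackup -h server1 -U replicator -D /data/pg -R -X stream

# 7. Promote the slave (run on server2)
pg_ctl promote -D /data/pg

# 8. Insert one row on the old master (run on server1)
psql -d testdb -c "INSERT INTO test_table VALUES (1)"

# 9. Shut down the old master cleanly (run on server1)
pg_ctl stop -m fast -D /data/pg

# 10. Rewind the old master against the new master (run on server1)
pg_rewind --target-pgdata=/data/pg \
          --source-server='host=server2 port=5432 user=postgres dbname=postgres'

# 11. Point recovery.conf at server2 and start server1 as a slave
cat > /data/pg/recovery.conf <<EOF
standby_mode = 'on'
primary_conninfo = 'host=server2 port=5432 user=replicator'
recovery_target_timeline = 'latest'
EOF
pg_ctl start -D /data/pg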

The timeline switch is at:
1       1/B2A64F80      no recovery target specified

The pg_controldata on the slave (server1) gives:
Latest checkpoint location:           1/B3000108
Prior checkpoint location:            1/B3000060
Latest checkpoint's REDO location:    1/B30000D0
Latest checkpoint's REDO WAL file:    0000000200000001000000B3
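(These fields come straight from running pg_controldata against the data
directory; the path here is just an example:

pg_controldata /data/pg | grep -E 'checkpoint|REDO'
)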

But when starting the slave I get the following errors in the log file:

2016-11-28 15:38:44 UTC: WAL 00000002.history successfully restored
2016-11-28 15:38:43 UTC [6]: [14-1] user=,db= LOG:  restored log file
"00000002.history" from archive
2016-11-28 15:38:44 UTC: Error copying WAL 00000002000000010000009D!
2016-11-28 15:38:44 UTC: Error copying WAL 00000001000000010000009D!
2016-11-28 15:38:44 UTC [494]: [1-1] user=,db= LOG:  started streaming
WAL from primary at 1/9D000000 on timeline 2
2016-11-28 15:38:44 UTC [494]: [2-1] user=,db= FATAL:  could not receive
data from WAL stream: ERROR:  requested WAL segment
00000002000000010000009D has already been removed

So the problem is that the slave (server1) went back in time, to a
location that is not on the new master (first WAL available on the new
master is 0000000100000001000000AE). The old archive on the slave
(server1) is gone (deleted) so there is no way to fix this. And this is
not a problem per se, because you can use pg_basebackup or rsync instead
of pg_rewind. The real problem is that I can't diagnose this
until I start the server process and read the logs. Shouldn't
pg_controldata report the first WAL needed to start the instance?

Regards,
Mladen Marinović



Re: pg_rewind rewinded too much...

From: Michael Paquier
On Tue, Nov 29, 2016 at 09:30:26AM +0100, marin@kset.org wrote:
> I am doing some testing with pg_rewind and I encountered a very strange
> situation. So, let me describe the scenario:
>
> 1. Create a master (server1) node (max_wal_size = 2GB, checkpoint_timeout =
> 10min)
> 2. Create a test database on it (server1)
> 3. Create a test table (server1)
> 4. Insert a couple of million rows ( >> 5GB) but cancel the query at some
> point leaving a few gigs of useless WAL files (server1)
> 5. Insert a couple of thousand rows ( < 1GB) (server1)
> 6. Build a slave (server2) and connect to the master (server1)
> 7. Promote the slave (server2 now master)
> 8. Insert one row on the old master (server1)
> 9. Shutdown the old master (server1)
> 10. Issue a pg_rewind on the old master (server1)
> 11. Configure recovery.conf and start the new slave (server1)

This is a correct flow for using pg_rewind. It will force the old master
to replay from the last checkpoint before the point where WAL forked.
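(As a side note, the fork point itself can be read from the new
timeline's history file; the path below is an example, and the content
is the same line as posted above:

$ cat $PGDATA/pg_xlog/00000002.history
1       1/B2A64F80      no recovery target specified
)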

> All of this under 10 minutes :)
>
> The timeline switch is at:
> 1       1/B2A64F80      no recovery target specified
>
> The pg_controldata on the slave (server1) gives:
> Latest checkpoint location:           1/B3000108
> Prior checkpoint location:            1/B3000060
> Latest checkpoint's REDO location:    1/B30000D0
> Latest checkpoint's REDO WAL file:    0000000200000001000000B3

That looks normal to me. Did you force a checkpoint on the promoted
slave? When using a live source server, pg_rewind determines the current
timeline of the source by looking at its control file, and that file is
not updated at promotion but only at the next checkpoint. You should be
careful about that.
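Something like this on the promoted node, before running pg_rewind,
makes sure the control file already reflects the new timeline (the data
directory path is just an example):

# On server2, right after promotion:
$ psql -c "CHECKPOINT"
$ pg_controldata /data/pg | grep "Latest checkpoint's TimeLineID"
Latest checkpoint's TimeLineID:       2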

> But when starting the slave I get the following errors in the log file:
>
> 2016-11-28 15:38:44 UTC: WAL 00000002.history successfully restored
> 2016-11-28 15:38:43 UTC [6]: [14-1] user=,db= LOG:  restored log file
> "00000002.history" from archive
> 2016-11-28 15:38:44 UTC: Error copying WAL 00000002000000010000009D!
> 2016-11-28 15:38:44 UTC: Error copying WAL 00000001000000010000009D!

So that's where the last checkpoint before the WAL forked is. That's
before B3, so there is nothing weird here.
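(To map an LSN to its segment name, pg_xlogfile_name can be run on the
new master. Assuming default 16MB segments and timeline 2, the fork
point at 1/B2A64F80 sits in segment ...B2, so a checkpoint in segment
...9D is indeed earlier than both the fork and B3:

postgres=# SELECT pg_xlogfile_name('1/B2A64F80');
     pg_xlogfile_name
--------------------------
 0000000200000001000000B2
(1 row)
)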

> So the problem is that the slave (server1) went back in time, to a location
> that is not on the new master (first WAL available on the new master is
> 0000000100000001000000AE). The old archive on the slave (server1) is gone
> (deleted) so there is no way to fix this. And this is not a problem per se,
> because you can use pg_basebackup or rsync instead of pg_rewind. The real
> problem is that I can't diagnose this until I start the server
> process and read the logs. Shouldn't pg_controldata report the first WAL
> needed to start the instance?

It could, but honestly there is no need to complicate what pg_rewind
does. Once a rewind completes, pg_rewind creates a backup_label file that
contains the real data used by the startup process to determine the
segment from which replay will start. I think you had better rely on its
content to check which segment the new slave needs to replay first and
whether it is still present on the new master.
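For what it's worth, a quick way to do that check right after pg_rewind
finishes and before restarting the node (paths are just examples):

# On the rewound node (server1): the first line of backup_label gives
# the LSN and the WAL segment from which replay will start.
$ grep 'START WAL LOCATION' /data/pg/backup_label

# Then verify that this segment is still available on the new master
# (server2), either in pg_xlog or in its WAL archive:
$ ls /data/pg/pg_xlog/00000001000000010000009D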
--
Michael
