Thread: Missing WAL file after running pg_rewind
Hi
We had a failover situation where our monitoring watchdog processes promoted the slave to become the new master.
I restarted the old master database to ensure a clean stop/start and performed pg_rewind on the old master to resync with the new master. However, after successful rewind, there was an error restarting the new slave.
The steps I took were:
1. Stop all watchdogs
2. Start/stop the old master
3. Run ‘checkpoint’ on new master
4. Run the pg_rewind on old master to resync with new master
5. Start the old master (as new slave)
Step 4 pg_rewind was successful with the new slave rewind to the same new timeline of the new master, however during the restart of the new slave it failed to start with the following errors:
80) FATAL: the database system is starting up
cp: cannot stat ‘/pg_backup/backup/archive_sync/0000000400000383000000BF’: No such file or directory
cp: cannot stat ‘/pg_backup/backup/archive_sync/0000000300000383000000BF’: No such file or directory
cp: cannot stat ‘/pg_backup/backup/archive_sync/0000000200000383000000BF’: No such file or directory
cp: cannot stat ‘/pg_backup/backup/archive_sync/0000000100000383000000BF’: No such file or directory
2018-01-11 23:21:59 ACDT [112235]: [1-1] db=,user= app=,host= LOG: started streaming WAL from primary at
383/BE000000 on timeline 6
2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL: could not receive data from WAL stre
am: ERROR: requested WAL segment 0000000600000383000000BE has already been removed
I checked the both the archive and pg_xlog directories on the new master and cannot locate missing file.
Has anyone experience this before with pg_rewind?
The earliest wall files in the archive directory was around just after the failover occurred.
Eg, in the archive directory on the new Master:
$ ls -l
total 15745032
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000500000383000000C0.partial
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C0
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C1
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C2
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C
And on the pg_xlog directory on the new Master:
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000080
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000081
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000082
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000083
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000084
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000085
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000086
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000087
Thanks
Dylan
On Thu, Jan 11, 2018 at 04:58:02PM +0000, Dylan Luong wrote: > The steps I took were: > > 1. Stop all watchdogs > > 2. Start/stop the old master > > 3. Run 'checkpoint' on new master > > 4. Run the pg_rewind on old master to resync with new master > > 5. Start the old master (as new slave) That's a sane flow to me. > 2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL: could not receive data from WAL stre > am: ERROR: requested WAL segment 0000000600000383000000BE has already been removed > > Has anyone experience this before with pg_rewind? When restarting a standby after a rewind has been done to it, note that, in order to recover to a consistent point, it needs to replay WAL from the previous checkpoint checkpoint where WAL has forked during the promotion up to the point where the rewind has finished. Per your logs, I am getting that the previous checkpoint before the timeline jump is located in segment 0000000X00000383000000BE, but this did not get archived. > The earliest wall files in the archive directory was around just after the failover occurred. > > Eg, in the archive directory on the new Master: > $ ls -l > total 15745032 > -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000500000383000000C0.partial > -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C0 > -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C1 > -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C2 Yeah, you are looking for the WAL segment just before the last, partial WAL segment of the previous timeline. Depending on your archiving strategy, I guess that you should have set archive_mode = 'always' so as the server which was the standby before the promotion is also able to store them. -- Michael
Attachment
The file exist in the archive directory of the old master but it is for the previous timeline, ie 5 and not 6, ie 0000000500000383000000BE. Can I just rename the file to 6 timeline? Ie 0000000600000383000000BE -----Original Message----- From: Michael Paquier [mailto:michael.paquier@gmail.com] Sent: Friday, 12 January 2018 12:08 PM To: Dylan Luong <Dylan.Luong@unisa.edu.au> Cc: pgsql-general@lists.postgresql.org Subject: Re: Missing WAL file after running pg_rewind On Thu, Jan 11, 2018 at 04:58:02PM +0000, Dylan Luong wrote: > The steps I took were: > > 1. Stop all watchdogs > > 2. Start/stop the old master > > 3. Run 'checkpoint' on new master > > 4. Run the pg_rewind on old master to resync with new master > > 5. Start the old master (as new slave) That's a sane flow to me. > 2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL: > could not receive data from WAL stre > am: ERROR: requested WAL segment 0000000600000383000000BE has already > been removed > > Has anyone experience this before with pg_rewind? When restarting a standby after a rewind has been done to it, note that, in order to recover to a consistent point, it needsto replay WAL from the previous checkpoint checkpoint where WAL has forked during the promotion up to the point wherethe rewind has finished. Per your logs, I am getting that the previous checkpoint before the timeline jump is locatedin segment 0000000X00000383000000BE, but this did not get archived. > The earliest wall files in the archive directory was around just after the failover occurred. > > Eg, in the archive directory on the new Master: > $ ls -l > total 15745032 > -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 > 0000000500000383000000C0.partial -rw-------. 1 postgres postgres > 16777216 Jan 11 17:52 0000000600000383000000C0 -rw-------. 1 postgres > postgres 16777216 Jan 11 17:52 0000000600000383000000C1 -rw-------. 1 > postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C2 Yeah, you are looking for the WAL segment just before the last, partial WAL segment of the previous timeline. Depending onyour archiving strategy, I guess that you should have set archive_mode = 'always' so as the server which was the standbybefore the promotion is also able to store them. -- Michael
On Fri, Jan 12, 2018 at 09:44:25PM +0000, Dylan Luong wrote: > The file exist in the archive directory of the old master but it is > for the previous timeline, ie 5 and not 6, ie > 0000000500000383000000BE. Can I just rename the file to 6 timeline? Ie > 0000000600000383000000BE What are the contents of the history file for this new timeline? You are looking at 00000006.history which should be archived as well. You could do that assuming that WAL has forked on this segment at promotion as both segments would have the same contents up to the point where WAL has forked. -- Michael
Attachment
The content of the history file 00000006.history is: $ more 00000006.history 1 2CE/8A000098 no recovery target specified 2 2CE/FF974EF0 no recovery target specified 3 2CF/51000098 no recovery target specified 4 2D1/C90ACD88 no recovery target specified 5 383/C0790E50 no recovery target specified Here are the last few archive file of the old timeline (old master): 0000000500000383000000BD 0000000500000383000000BE 0000000500000383000000BF 0000000500000383000000C0 0000000500000383000000C1 Here are the first few archive files of the new timeline (new master): 0000000500000383000000C0.partial 0000000600000383000000C0 0000000600000383000000C1 0000000600000383000000C2 0000000600000383000000C3 Looks like it has folked at C0. But why is the new slave asking for 0000000600000383000000BE on timeline during the restore after the pg_rewind? And notfrom C0? Dylan -----Original Message----- From: Michael Paquier [mailto:michael.paquier@gmail.com] Sent: Saturday, 13 January 2018 9:04 PM To: Dylan Luong <Dylan.Luong@unisa.edu.au> Cc: pgsql-general@lists.postgresql.org Subject: Re: Missing WAL file after running pg_rewind On Fri, Jan 12, 2018 at 09:44:25PM +0000, Dylan Luong wrote: > The file exist in the archive directory of the old master but it is > for the previous timeline, ie 5 and not 6, ie > 0000000500000383000000BE. Can I just rename the file to 6 timeline? Ie > 0000000600000383000000BE What are the contents of the history file for this new timeline? You are looking at 00000006.history which should be archivedas well. You could do that assuming that WAL has forked on this segment at promotion as both segments would havethe same contents up to the point where WAL has forked. -- Michael
On Mon, Jan 15, 2018 at 12:13:46AM +0000, Dylan Luong wrote: > The content of the history file 00000006.history is: Please do not top-post. This breaks the logic of the thread. > $ more 00000006.history > 1 2CE/8A000098 no recovery target specified > 2 2CE/FF974EF0 no recovery target specified > 3 2CF/51000098 no recovery target specified > 4 2D1/C90ACD88 no recovery target specified > 5 383/C0790E50 no recovery target specified So indeed the timeline has forked at 0000000500000383000000C0, which is what the .partial file also means. > Here are the last few archive file of the old timeline (old master): > > 0000000500000383000000BD > 0000000500000383000000BE > 0000000500000383000000BF > 0000000500000383000000C0 > 0000000500000383000000C1 > > Here are the first few archive files of the new timeline (new master): > 0000000500000383000000C0.partial > 0000000600000383000000C0 > 0000000600000383000000C1 > 0000000600000383000000C2 > 0000000600000383000000C3 > > Looks like it has folked at C0. > But why is the new slave asking for 0000000600000383000000BE on > timeline during the restore after the pg_rewind? And not from C0? The rewound standby needs to recover from the point of the last checkpoint where timeline forked. So I can guess that this record is located in segment 0000000500000383000000BE. Now I find strange that the standby requests for a segment on the new timeline 5, while it should start at TLI 5. Are you sure that the standby had the means been able to fetch segment 0000000500000383000000BE? Something looks weird from your operational point of view with your archives.. -- Michael