Thread: BUG #15589: Due to missing wal,restore ends prematurely and opens database for read/write
BUG #15589: Due to missing wal,restore ends prematurely and opens database for read/write
From
PG Bug reporting form
Date:
The following bug has been logged on the website: Bug reference: 15589 Logged by: Leif Gunnar Erlandsen Email address: leif@lako.no PostgreSQL version: 11.1 Operating system: Red Hat Enterprise Linux Server release 7.6 Description: recovery.conf was specified as restore_command = 'cp /db/disk1/restore/archivelogs/archivelogs/%f %p' recovery_target_time = '2019-01-03 13:00:00' Due to some missing wal-files restore ended. 2019-01-10 12:05:50 CET [68417]: [67-1] user=,db=,app=,client= LOG: restored log file "0000000500000158000000FF" from archive cp: cannot stat '/db/disk1/restore/archivelogs/archivelogs/000000050000015900000000': No such file or directory 2019-01-10 12:05:50 CET [68417]: [68-1] user=,db=,app=,client= LOG: redo done at 158/FFFFFFB8 2019-01-10 12:05:50 CET [68417]: [69-1] user=,db=,app=,client= LOG: last completed transaction was at log time 2019-01-03 06:34:45.935752+01 2019-01-10 12:05:50 CET [68417]: [70-1] user=,db=,app=,client= LOG: restored log file "0000000500000158000000FF" from archive cp: cannot stat '/db/disk1/restore/archivelogs/archivelogs/00000006.history': No such file or directory 2019-01-10 12:05:50 CET [68417]: [71-1] user=,db=,app=,client= LOG: selected new timeline ID: 6 2019-01-10 12:05:50 CET [68417]: [72-1] user=,db=,app=,client= LOG: archive recovery complete cp: cannot stat '/db/disk1/restore/archivelogs/archivelogs/00000005.history': No such file or directory 2019-01-10 12:05:51 CET [68420]: [2-1] user=,db=,app=,client= LOG: restartpoint complete: wrote 44395 buffers (5.4%); 1 WAL file(s) added, 0 removed, 0 recycled; write=6.310 s, sync=0.268 s, total=6.631 s; sync files=178, longest=0.072 s, average=0.001 s; distance=64019 kB, estimate=64019 kB 2019-01-10 12:05:51 CET [68420]: [3-1] user=,db=,app=,client= LOG: recovery restart point at 158/C4E84F98 2019-01-10 12:05:51 CET [68420]: [4-1] user=,db=,app=,client= DETAIL: Last completed transaction was at log time 2019-01-03 06:34:45.935752+01. 2019-01-10 12:05:51 CET [68420]: [5-1] user=,db=,app=,client= LOG: checkpoints are occurring too frequently (7 seconds apart) 2019-01-10 12:05:51 CET [68420]: [6-1] user=,db=,app=,client= HINT: Consider increasing the configuration parameter "max_wal_size". 2019-01-10 12:05:51 CET [68420]: [7-1] user=,db=,app=,client= LOG: checkpoint starting: end-of-recovery immediate wait xlog 2019-01-10 12:05:51 CET [68420]: [8-1] user=,db=,app=,client= LOG: checkpoint complete: wrote 18678 buffers (2.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.251 s, sync=0.006 s, total=0.312 s; sync files=149, longest=0.006 s, average=0.000 s; distance=968172 kB, estimate=968172 kB 2019-01-10 12:05:51 CET [68415]: [8-1] user=,db=,app=,client= LOG: database system is ready to accept connections I found the missing wal-files and performed restore again from the start. Then recovery paused when it was at correct time. 2019-01-10 13:46:28 CET [77004]: [3334-1] user=,db=,app=,client= LOG: restored log file "0000000500000165000000C2" from archive 2019-01-10 13:46:28 CET [77007]: [318-1] user=,db=,app=,client= LOG: restartpoint complete: wrote 87591 buffers (10.7%); 0 WAL file(s) added, 22 removed, 20 recycled; write=3.049 s, sync=0.001 s, total=3.192 s; sync files=143, longest=0.001 s, average=0.000 s; distance=688531 kB, estimate=689818 kB 2019-01-10 13:46:28 CET [77007]: [319-1] user=,db=,app=,client= LOG: recovery restart point at 165/9706C358 2019-01-10 13:46:28 CET [77007]: [320-1] user=,db=,app=,client= DETAIL: Last completed transaction was at log time 2019-01-03 12:13:22.014815+01. 2019-01-10 13:46:28 CET [77007]: [321-1] user=,db=,app=,client= LOG: restartpoint starting: xlog 2019-01-10 13:46:29 CET [77004]: [3335-1] user=,db=,app=,client= LOG: restored log file "0000000500000165000000C3" from archive 2019-01-10 13:46:29 CET [77004]: [3336-1] user=,db=,app=,client= LOG: recovery stopping before commit of transaction 3316604, time 2019-01-03 13:00:01.563953+01 2019-01-10 13:46:29 CET [77004]: [3337-1] user=,db=,app=,client= LOG: recovery has paused 2019-01-10 13:46:29 CET [77004]: [3338-1] user=,db=,app=,client= HINT: Execute pg_wal_replay_resume() to continue. PostgreSQL should have paused recovery also on the first scenario. Then I could have added missing wal and continued with restore.
Re: BUG #15589: Due to missing wal, restore ends prematurely andopens database for read/write
From
Jeff Janes
Date:
On Fri, Jan 11, 2019 at 4:08 AM PG Bug reporting form <noreply@postgresql.org> wrote:
PostgreSQL should have paused recovery also on the first scenario. Then I
could have added missing wal and continued with restore.
I agree with you that something here is not very user friendly. But the counter argument is that you should not be hiding WAL files from the system in the first place. Once it is told that the file doesn't exist, it doesn't make sense to pause because non-existence is usually permanent, so there is nothing to be done after a pause. Or in other words, the pause feature is to let you change your mind about what time point you want to recover to (moving it only forward), not to let you change your mind about what WAL files exist in the first place. So I don't think this is a bug, but perhaps there is room for improvement.
At the least, I think we should log an explicit WARNING if the WAL stream ends before the specified PIT is reached.
Cheers,
Jeff
Re: BUG #15589: Due to missing wal, restore ends prematurely andopens database for read/write
From
leif@lako.no
Date:
"Jeff Janes" <jeff.janes@gmail.com> skrev 11. januar 2019 kl. 15:03:
No one hides WAL files from the system deliberately. A few days of WAL files could take up a lot of space. And they could come from different backup sets.
If you have a gap in your restored WAL-files and have specified date and time further ahead, an warning and a pause should be issued.
Without the pause in recover the warning is of little use as the database is already opened for read and write.
--
Leif
On Fri, Jan 11, 2019 at 4:08 AM PG Bug reporting form <noreply@postgresql.org> wrote:
PostgreSQL should have paused recovery also on the first scenario. Then I
could have added missing wal and continued with restore.I agree with you that something here is not very user friendly. But the counter argument is that you should not be hiding WAL files from the system in the first place. Once it is told that the file doesn't exist, it doesn't make sense to pause because non-existence is usually permanent, so there is nothing to be done after a pause. Or in other words, the pause feature is to let you change your mind about what time point you want to recover to (moving it only forward), not to let you change your mind about what WAL files exist in the first place. So I don't think this is a bug, but perhaps there is room for improvement.At the least, I think we should log an explicit WARNING if the WAL stream ends before the specified PIT is reached.
No one hides WAL files from the system deliberately. A few days of WAL files could take up a lot of space. And they could come from different backup sets.
If you have a gap in your restored WAL-files and have specified date and time further ahead, an warning and a pause should be issued.
Without the pause in recover the warning is of little use as the database is already opened for read and write.
--
Leif
Re: BUG #15589: Due to missing wal, restore ends prematurely andopens database for read/write
From
leif@lako.no
Date:
"Jeff Janes" <jeff.janes@gmail.com> skrev 11. januar 2019 kl. 15:03:
The documentation for recovery.conf states that with recovery_target_time setOn Fri, Jan 11, 2019 at 4:08 AM PG Bug reporting form <noreply@postgresql.org> wrote:
PostgreSQL should have paused recovery also on the first scenario. Then I
could have added missing wal and continued with restore.At the least, I think we should log an explicit WARNING if the WAL stream ends before the specified PIT is reached.
recovery_target_action
defaults to pause.Even if stop point is not reached, pause should be activated.
After checking the source this might be solved with a small addition in StartupXLOG.c
Someone with more experience with the source code should check this out.
if (reachedStopPoint)
{
if (!reachedConsistency)
ereport(FATAL,
(errmsg("requested recovery stop point is before consistent recovery point")));
/*
* This is the last point where we can restart recovery with a
* new recovery target, if we shutdown and begin again. After
* this, Resource Managers may choose to do permanent
* corrective actions at end of recovery.
*/
switch (recoveryTargetAction)
{
case RECOVERY_TARGET_ACTION_SHUTDOWN:
/*
* exit with special return code to request shutdown
* of postmaster. Log messages issued from
* postmaster.
*/
proc_exit(3);
case RECOVERY_TARGET_ACTION_PAUSE:
SetRecoveryPause(true);
recoveryPausesHere();
/* drop into promote */
case RECOVERY_TARGET_ACTION_PROMOTE:
break;
}
/* Add these lines .... */
} else if (recoveryTarget == RECOVERY_TARGET_TIME)
{
/*
* Stop point not reached but next WAL could not be read
* Some explanation and warning should be logged
*/
switch (recoveryTargetAction)
{
case RECOVERY_TARGET_ACTION_PAUSE:
SetRecoveryPause(true);
recoveryPausesHere();
break;
}
}
/* .... until here */
--
Leif