Thread: BUG #15589: Due to missing WAL, restore ends prematurely and opens database for read/write

BUG #15589: Due to missing WAL, restore ends prematurely and opens database for read/write

From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      15589
Logged by:          Leif Gunnar Erlandsen
Email address:      leif@lako.no
PostgreSQL version: 11.1
Operating system:   Red Hat Enterprise Linux Server release 7.6
Description:

recovery.conf was specified as
restore_command = 'cp /db/disk1/restore/archivelogs/archivelogs/%f %p'
recovery_target_time = '2019-01-03 13:00:00'

Due to some missing WAL files, the restore ended prematurely.

2019-01-10 12:05:50 CET [68417]: [67-1] user=,db=,app=,client= LOG:
restored log file "0000000500000158000000FF" from archive
cp: cannot stat
'/db/disk1/restore/archivelogs/archivelogs/000000050000015900000000': No
such file or directory
2019-01-10 12:05:50 CET [68417]: [68-1] user=,db=,app=,client= LOG:  redo
done at 158/FFFFFFB8
2019-01-10 12:05:50 CET [68417]: [69-1] user=,db=,app=,client= LOG:  last
completed transaction was at log time 2019-01-03 06:34:45.935752+01
2019-01-10 12:05:50 CET [68417]: [70-1] user=,db=,app=,client= LOG:
restored log file "0000000500000158000000FF" from archive
cp: cannot stat
'/db/disk1/restore/archivelogs/archivelogs/00000006.history': No such file
or directory
2019-01-10 12:05:50 CET [68417]: [71-1] user=,db=,app=,client= LOG:
selected new timeline ID: 6
2019-01-10 12:05:50 CET [68417]: [72-1] user=,db=,app=,client= LOG:  archive
recovery complete
cp: cannot stat
'/db/disk1/restore/archivelogs/archivelogs/00000005.history': No such file
or directory
2019-01-10 12:05:51 CET [68420]: [2-1] user=,db=,app=,client= LOG:
restartpoint complete: wrote 44395 buffers (5.4%); 1 WAL file(s) added, 0
removed, 0 recycled; write=6.310 s, sync=0.268 s, total=6.631 s; sync
files=178, longest=0.072 s, average=0.001 s; distance=64019 kB,
estimate=64019 kB
2019-01-10 12:05:51 CET [68420]: [3-1] user=,db=,app=,client= LOG:  recovery
restart point at 158/C4E84F98
2019-01-10 12:05:51 CET [68420]: [4-1] user=,db=,app=,client= DETAIL:  Last
completed transaction was at log time 2019-01-03 06:34:45.935752+01.
2019-01-10 12:05:51 CET [68420]: [5-1] user=,db=,app=,client= LOG:
checkpoints are occurring too frequently (7 seconds apart)
2019-01-10 12:05:51 CET [68420]: [6-1] user=,db=,app=,client= HINT:
Consider increasing the configuration parameter "max_wal_size".
2019-01-10 12:05:51 CET [68420]: [7-1] user=,db=,app=,client= LOG:
checkpoint starting: end-of-recovery immediate wait xlog
2019-01-10 12:05:51 CET [68420]: [8-1] user=,db=,app=,client= LOG:
checkpoint complete: wrote 18678 buffers (2.3%); 0 WAL file(s) added, 0
removed, 0 recycled; write=0.251 s, sync=0.006 s, total=0.312 s; sync
files=149, longest=0.006 s, average=0.000 s; distance=968172 kB,
estimate=968172 kB
2019-01-10 12:05:51 CET [68415]: [8-1] user=,db=,app=,client= LOG:  database
system is ready to accept connections


I found the missing WAL files and performed the restore again from the start.
This time recovery paused when it reached the correct time.

2019-01-10 13:46:28 CET [77004]: [3334-1] user=,db=,app=,client= LOG:
restored log file "0000000500000165000000C2" from archive
2019-01-10 13:46:28 CET [77007]: [318-1] user=,db=,app=,client= LOG:
restartpoint complete: wrote 87591 buffers (10.7%); 0 WAL file(s) added, 22
removed, 20 recycled; write=3.049 s, sync=0.001 s, total=3.192 s; sync
files=143, longest=0.001 s, average=0.000 s; distance=688531 kB,
estimate=689818 kB
2019-01-10 13:46:28 CET [77007]: [319-1] user=,db=,app=,client= LOG:
recovery restart point at 165/9706C358
2019-01-10 13:46:28 CET [77007]: [320-1] user=,db=,app=,client= DETAIL:
Last completed transaction was at log time 2019-01-03 12:13:22.014815+01.
2019-01-10 13:46:28 CET [77007]: [321-1] user=,db=,app=,client= LOG:
restartpoint starting: xlog
2019-01-10 13:46:29 CET [77004]: [3335-1] user=,db=,app=,client= LOG:
restored log file "0000000500000165000000C3" from archive
2019-01-10 13:46:29 CET [77004]: [3336-1] user=,db=,app=,client= LOG:
recovery stopping before commit of transaction 3316604, time 2019-01-03
13:00:01.563953+01
2019-01-10 13:46:29 CET [77004]: [3337-1] user=,db=,app=,client= LOG:
recovery has paused
2019-01-10 13:46:29 CET [77004]: [3338-1] user=,db=,app=,client= HINT:
Execute pg_wal_replay_resume() to continue.


PostgreSQL should have paused recovery in the first scenario as well. Then I
could have added the missing WAL and continued the restore.


On Fri, Jan 11, 2019 at 4:08 AM PG Bug reporting form <noreply@postgresql.org> wrote:

PostgreSQL should have paused recovery in the first scenario as well. Then I
could have added the missing WAL and continued the restore.

I agree with you that something here is not very user friendly.  But the counter argument is that you should not be hiding WAL files from the system in the first place.  Once it is told that the file doesn't exist, it doesn't make sense to pause because non-existence is usually permanent, so there is nothing to be done after a pause.  Or in other words, the pause feature is to let you change your mind about what time point you want to recover to (moving it only forward), not to let you change your mind about what WAL files exist in the first place.  So I don't think this is a bug, but perhaps there is room for improvement.

At the least, I think we should log an explicit WARNING if the WAL stream ends before the specified PIT is reached.
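
Something along these lines, perhaps, at the point in StartupXLOG() where end of
WAL is detected (just a sketch to illustrate the idea, not tested; the variable
names are the ones used in that area of xlog.c, the message wording is made up):

	/*
	 * Sketch only: complain loudly if replay stopped short of the
	 * requested target time instead of silently promoting.
	 */
	if (recoveryTarget == RECOVERY_TARGET_TIME && !reachedStopPoint)
		ereport(WARNING,
				(errmsg("recovery ended before the requested recovery target was reached"),
				 errhint("Check the WAL archive for missing segments, then restart recovery.")));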

Cheers,

Jeff
"Jeff Janes" <jeff.janes@gmail.com> skrev 11. januar 2019 kl. 15:03:
On Fri, Jan 11, 2019 at 4:08 AM PG Bug reporting form <noreply@postgresql.org> wrote:

PostgreSQL should have paused recovery in the first scenario as well. Then I
could have added the missing WAL and continued the restore.
I agree with you that something here is not very user friendly. But the counter argument is that you should not be hiding WAL files from the system in the first place. Once it is told that the file doesn't exist, it doesn't make sense to pause because non-existence is usually permanent, so there is nothing to be done after a pause. Or in other words, the pause feature is to let you change your mind about what time point you want to recover to (moving it only forward), not to let you change your mind about what WAL files exist in the first place. So I don't think this is a bug, but perhaps there is room for improvement.
At the least, I think we should log an explicit WARNING if the WAL stream ends before the specified PIT is reached.

No one hides WAL files from the system deliberately. A few days' worth of WAL files can take up a lot of space, and they may come from different backup sets.
If there is a gap in the restored WAL files and the recovery target time lies beyond that gap, a warning and a pause should be issued.
Without the pause in recovery, the warning is of little use, since the database has already been opened for read/write.

--
Leif


"Jeff Janes" <jeff.janes@gmail.com> skrev 11. januar 2019 kl. 15:03:
On Fri, Jan 11, 2019 at 4:08 AM PG Bug reporting form <noreply@postgresql.org> wrote:

PostgreSQL should have paused recovery in the first scenario as well. Then I
could have added the missing WAL and continued the restore.
At the least, I think we should log an explicit WARNING if the WAL stream ends before the specified PIT is reached.

The documentation for recovery.conf states that when recovery_target_time is set, recovery_target_action defaults to 'pause'.
Even if the stop point is never reached, the pause should be activated.

After checking the source, this might be solved with a small addition to the StartupXLOG() function in xlog.c.
Someone with more experience with the source code should check this out.

	if (reachedStopPoint)
	{
		if (!reachedConsistency)
			ereport(FATAL,
					(errmsg("requested recovery stop point is before consistent recovery point")));

		/*
		 * This is the last point where we can restart recovery with a
		 * new recovery target, if we shutdown and begin again. After
		 * this, Resource Managers may choose to do permanent
		 * corrective actions at end of recovery.
		 */
		switch (recoveryTargetAction)
		{
			case RECOVERY_TARGET_ACTION_SHUTDOWN:

				/*
				 * exit with special return code to request shutdown
				 * of postmaster.  Log messages issued from
				 * postmaster.
				 */
				proc_exit(3);

			case RECOVERY_TARGET_ACTION_PAUSE:
				SetRecoveryPause(true);
				recoveryPausesHere();

				/* drop into promote */

			case RECOVERY_TARGET_ACTION_PROMOTE:
				break;
		}
	}
	/* Add these lines .... */
	else if (recoveryTarget == RECOVERY_TARGET_TIME)
	{
		/*
		 * The stop point was not reached, but no further WAL could be
		 * read.  Some explanation and a warning should be logged here.
		 */
		switch (recoveryTargetAction)
		{
			case RECOVERY_TARGET_ACTION_PAUSE:
				SetRecoveryPause(true);
				recoveryPausesHere();
				break;

				/* nothing extra for the other actions, recovery ends as today */
			case RECOVERY_TARGET_ACTION_SHUTDOWN:
			case RECOVERY_TARGET_ACTION_PROMOTE:
				break;
		}
	}
	/* .... until here */
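
One thing I am not sure about: recoveryPausesHere() appears to return immediately
unless hot standby is active, so with hot_standby = off the pause above would be
skipped and the server would still come up. If I read that correctly, the pause
case could perhaps check for it, something like this (rough sketch only, untested;
LocalHotStandbyActive is the flag recoveryPausesHere() itself checks, and falling
back to proc_exit(3) just mirrors the shutdown case above):

			case RECOVERY_TARGET_ACTION_PAUSE:
				if (LocalHotStandbyActive)
				{
					SetRecoveryPause(true);
					recoveryPausesHere();
				}
				else
				{
					/*
					 * Pausing would be a no-op without hot standby, so ask
					 * the postmaster to shut down instead, like the shutdown
					 * action does.  (Sketch only.)
					 */
					proc_exit(3);
				}
				break;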

--

Leif