Thread: Fwd: Re: BUG #15589: Due to missing wal, restore ends prematurely and opens database for read/write

Hi
I reported a bug via the PostgreSQL bug report form, but haven't received any response so far.
This might not be a bug, but rather a feature that is not implemented yet.
I suggest making a small addition to StartupXLOG to solve the issue.



git diff src/backend/access/transam/xlog.c
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d804f0..d0e5bb3f84 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7277,6 +7277,19 @@ StartupXLOG(void)

                                        case RECOVERY_TARGET_ACTION_PROMOTE:
                                                break;
+                               }
+                       } else if (recoveryTarget == RECOVERY_TARGET_TIME)
+                       {
+                               /*
+                                * Stop point not reached but next WAL could not be read
+                                * Some explanation and warning should be logged
+                               */
+                               switch (recoveryTargetAction)
+                               {
+                                       case RECOVERY_TARGET_ACTION_PAUSE:
+                                       SetRecoveryPause(true);
+                                       recoveryPausesHere();
+                                       break;
                                }
                        }





The scenario I want to solve is:
I need to restore a backup to another server.
 Restore the pg_basebackup files
 Restore some WAL files
 Extract the pg_basebackup files
 Create recovery.conf with a point-in-time recovery target
 Start PostgreSQL
 Recovery ends before the target because of missing WAL files
 The database opens read/write

I think the database should have paused recovery; then I could restore
additional WAL files and restart PostgreSQL to continue the recovery.

With large databases and many WAL files, it is time-consuming to repeat parts of the process.

Best regards
Leif Gunnar Erlandsen


At Wed, 30 Jan 2019 15:53:51 +0000, leif@lako.no wrote in <a3bf3b8910cd5adb8a5fbc8113eac0ab@lako.no>
> Hi
> I reported a bug via the PostgreSQL bug report form, but haven't received any response so far.
> This might not be a bug, but rather a feature that is not implemented yet.
> I suggest making a small addition to StartupXLOG to solve the issue.

I understand what you want, but the patch doesn't seem acceptable,
since it introduces an inconsistency among target kinds.

> The scenario I want to solve is:
> I need to restore a backup to another server.
>  Restore the pg_basebackup files
>  Restore some WAL files
>  Extract the pg_basebackup files
>  Create recovery.conf with a point-in-time recovery target
>  Start PostgreSQL
>  Recovery ends before the target because of missing WAL files
>  The database opens read/write
>
> I think the database should have paused recovery; then I could restore
> additional WAL files and restart PostgreSQL to continue the recovery.

I don't think anyone expects the server to follow
recovery_target_action when no target is set, so we can limit the
change in behavior to the case where some kind of target is specified.
So I propose to follow recovery_target_action even if the target has
not been reached, whenever any recovery target is specified.

With the attached PoC (for master), recovery stops as follows:

LOG:  consistent recovery state reached at 0/2F000000
LOG:  database system is ready to accept read only connections
rc_work/00000001000000000000002F’: No such file or directory
WARNING:  not reached specfied recovery target, take specified action anyway
DETAIL:  This means a wrong target or missing of expected WAL files.
LOG:  recovery has paused
HINT:  Execute pg_wal_replay_resume() to continue.

If no target is specified, it promotes immediately, ignoring
recovery_target_action.

If this is acceptable, I'll post a complete version (including
documentation). I don't think this is back-patchable.

> With large databases and many WAL files, it is time-consuming to repeat parts of the process.

I understand your concern.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d804f0..081bdd86ec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7246,12 +7246,25 @@ StartupXLOG(void)
              * end of main redo apply loop
              */
 
-            if (reachedStopPoint)
+            /*
+             * If recovery target is specified, specified action is expected
+             * to be taken regardless whether the target is reached or not .
+             */
+            if (recoveryTarget != RECOVERY_TARGET_UNSET)
             {
+                /*
+                 * At this point we don't consider the case where we are
+                 * before consistent point even if not reached stop point.
+                 */
                 if (!reachedConsistency)
                     ereport(FATAL,
                             (errmsg("requested recovery stop point is before consistent recovery point")));
 
+                if (!reachedStopPoint)
+                    ereport(WARNING,
+                            (errmsg ("not yet reached specfied recovery target, take specified action anyway"),
+                             errdetail("This means a wrong target or missing WAL files.")));
+
                 /*
                  * This is the last point where we can restart recovery with a
                  * new recovery target, if we shutdown and begin again. After

"Kyotaro HORIGUCHI" <horiguchi.kyotaro@lab.ntt.co.jp> wrote on 31 January 2019 at 13:28:

> If this is acceptable, I'll post a complete version (including
> documentation). I don't think this is back-patchable.
>

If you are asking me, then yes: this is exactly what I wanted. Thank you for your effort.


>> With large databases and many WAL files, it is time-consuming to repeat parts of the process.
>
> I understand your concern.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center


regards
Leif Gunnar Erlandsen


On Thu, Jan 31, 2019 at 09:26:48PM +0900, Kyotaro HORIGUCHI wrote:
> I don't think anyone expects the server to follow
> recovery_target_action when no target is set, so we can limit the
> change in behavior to the case where some kind of target is specified.
> So I propose to follow recovery_target_action even if the target has
> not been reached, whenever any recovery target is specified.

Quoting the docs:
https://www.postgresql.org/docs/current/recovery-target-settings.html
recovery_target_action (enum)
"Specifies what action the server should take once the recovery target
is *reached*."

So what we have now is that an action would be taken iff a stop point
is defined and reached.  What this patch changes is that the action
would be taken even if the stop point has *not* been reached once the
end of a WAL stream is found.
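
The difference between the two behaviors can be sketched as a small,
self-contained C model. This is only an illustration under simplifying
assumptions: the type and function names below (TargetKind,
end_of_wal_current, end_of_wal_patched, and so on) are hypothetical
stand-ins, not the real xlog.c definitions, and the consistency check
and the WARNING emitted by the PoC are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not the real xlog.c types. */
typedef enum { TARGET_UNSET, TARGET_TIME } TargetKind;
typedef enum { ACTION_PAUSE, ACTION_PROMOTE, ACTION_SHUTDOWN } TargetAction;
typedef enum { OUTCOME_PAUSE, OUTCOME_PROMOTE, OUTCOME_SHUTDOWN } Outcome;

/* Map the configured action to what the server ends up doing. */
static Outcome
apply_action(TargetAction action)
{
    switch (action)
    {
        case ACTION_PAUSE:
            return OUTCOME_PAUSE;
        case ACTION_SHUTDOWN:
            return OUTCOME_SHUTDOWN;
        case ACTION_PROMOTE:
            return OUTCOME_PROMOTE;
    }
    assert(!"unknown target action");
    return OUTCOME_PROMOTE;
}

/* Current behavior: the action is taken only if the stop point was reached. */
static Outcome
end_of_wal_current(TargetKind target, bool reached_stop_point,
                   TargetAction action)
{
    (void) target;              /* the target kind plays no role here */
    if (reached_stop_point)
        return apply_action(action);
    return OUTCOME_PROMOTE;     /* WAL ran out: recovery ends, server opens */
}

/* Proposed behavior: the action is taken whenever any target is set. */
static Outcome
end_of_wal_patched(TargetKind target, bool reached_stop_point,
                   TargetAction action)
{
    (void) reached_stop_point;  /* in the PoC this only triggers a WARNING */
    if (target != TARGET_UNSET)
        return apply_action(action);
    return OUTCOME_PROMOTE;     /* no target: promote, ignoring the action */
}
```

For the case reported in this thread (a time target, WAL running out
before the target, action "pause"), end_of_wal_current(TARGET_TIME,
false, ACTION_PAUSE) yields OUTCOME_PROMOTE in this model, while
end_of_wal_patched(TARGET_TIME, false, ACTION_PAUSE) yields
OUTCOME_PAUSE.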

+       * to be taken regardless whether the target is reached or not .
Nit 1: Dot at the end has an extra space.

Nit 2: s/specfied/specified/

Please do not get me wrong: I can see that there could be use cases
where it is possible to take an action at the end of a WAL stream if
there is less WAL than what was planned.  But if, perhaps, the OP has
set an incorrect stop position too far in the future, too much WAL
would still have been replayed, which would make the base backup
unusable for future uses.  Also, it looks incorrect to me to change an
existing behavior and to use the same semantics as for triggering an
action when a stop point is defined and reached.
--
Michael

"Michael Paquier" <michael@paquier.xyz> wrote on 26 February 2019 at 09:13:

> On Thu, Jan 31, 2019 at 09:26:48PM +0900, Kyotaro HORIGUCHI wrote:
>
>> I don't think anyone expects the server to follow
>> recovery_target_action when no target is set, so we can limit the
>> change in behavior to the case where some kind of target is specified.
>> So I propose to follow recovery_target_action even if the target has
>> not been reached, whenever any recovery target is specified.
>
> Quoting the docs:
> https://www.postgresql.org/docs/current/recovery-target-settings.html
> recovery_target_action (enum)
> "Specifies what action the server should take once the recovery target
> is *reached*."

I know this, and recovery_target_action in my case was "pause".
The recovery target was specified as a date and time.

> So what we have now is that an action would be taken iff a stop point
> is defined and reached. What this patch changes is that the action
> would be taken even if the stop point has *not* been reached once the
> end of a WAL stream is found.

Yes, and this is the expected behaviour in my use case. This was a PITR scenario, to a new server, and not crash recovery.
I restored a backup and placed the WAL files in a separate directory, then I created a recovery.conf with the correct
recovery_target_time.
After PostgreSQL started, it stopped after a short while and opened the database read/write.
Checks showed that the target had not been reached. The log showed that no more WAL could be found.
If PostgreSQL had followed recovery_target_action, then I could have restored the missing WAL files and continued
replay of WAL.
As this was not the case, I had to restart the process from the beginning, which took many hours.
Another thing to consider is that in cases such as this one, where a lot of WAL is needed for replay, it is not
always a given that there is enough disk space available to store it all at the same time.


> Please do not get me wrong: I can see that there could be use cases
> where it is possible to take an action at the end of a WAL stream if
> there is less WAL than what was planned. But if, perhaps, the OP has
> set an incorrect stop position too far in the future, too much WAL
> would still have been replayed, which would make the base backup
> unusable for future uses. Also, it looks incorrect to me to change an
> existing behavior and to use the same semantics as for triggering an
> action when a stop point is defined and reached.

I did not set an incorrect stop position. I see this change as something most people in a similar situation would expect from
their database system.

AFAIK the docs do not specify what happens if recovery_target_time is specified but not reached. But as the default
recovery_target_action is "pause", I would have assumed "pause" to be the action.

regards
Leif Gunnar Erlandsen