Re: patch proposal - Mailing list pgsql-hackers

From Venkata B Nagothi
Subject Re: patch proposal
Date
Msg-id CAEyp7J_2KhY5QzJWAWq-VBBiU6_R+rX8n+-pbYiQxyC6JmAaFQ@mail.gmail.com
In response to Re: patch proposal  (David Steele <david@pgmasters.net>)
Responses Re: patch proposal  (David Steele <david@pgmasters.net>)
Re: patch proposal  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers

On Tue, Aug 16, 2016 at 2:50 AM, David Steele <david@pgmasters.net> wrote:
On 8/15/16 2:33 AM, Venkata B Nagothi wrote:

> During the recovery process, It would be nice if PostgreSQL generates an
> error by aborting the recovery process (instead of starting-up the
> cluster) if the intended recovery target point is not reached and give
> an option to DBA to resume the recovery process from where it exactly
> stopped.

Thom wrote a patch [1] recently that gives warnings in this case.  You
might want to have a look at that first.

That is good to know. Yes, that patch is about generating more meaningful output messages for the recovery process, which makes sense.
 
> The issue here is, if by any chance, the required WALs are not available
> or if there is any WAL missing or corrupted at the restore_command
> location, then PostgreSQL recovers until the end of the last available
> WAL and starts-up the cluster.

You can use pause_at_recovery_target/recovery_target_action (depending
on your version) to prevent promotion.  That would work for your stated
scenario but not for the scenario where replay starts (or the database
reaches consistency) after the recovery target.

The parameters mentioned above can be configured to pause, shut down, or prevent promotion only after the recovery target point is reached.
To clarify, I am referring to a scenario where the recovery target point is not reached at all (that is, a partial or incomplete recovery) and many WAL files are still waiting to be replayed. In this situation, PostgreSQL simply completes archive recovery up to the end of the last available WAL file ("00000001000000000000001E" in my case), reports that "00000001000000000000001F" was not found, and starts up the cluster.

Note: I am testing on PostgreSQL 9.5.

LOG:  restored log file "00000001000000000000001E" from archive
cp: cannot stat ‘/data/pgrestore9531/00000001000000000000001F’: No such file or directory
LOG:  redo done at 0/1EFFDBB8
LOG:  last completed transaction was at log time 2016-08-15 11:04:26.795902+10
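
To relate the "redo done at" position to the file names in the log above, here is a minimal sketch of how an LSN maps to a WAL segment file name. It assumes the default 16 MB segment size and the file naming used by PostgreSQL 9.3 and later; the helper names are made up for illustration and are not part of PostgreSQL.

```python
# Sketch only: assumes 16 MB WAL segments and PostgreSQL >= 9.3 file naming.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
SEGMENTS_PER_XLOGID = 0x100000000 // WAL_SEGMENT_SIZE  # 256 segments per "xlogid"


def lsn_to_segment_name(lsn, timeline=1):
    """Map an LSN such as '0/1EFFDBB8' to its 24-hex-digit WAL file name."""
    high, low = (int(part, 16) for part in lsn.split("/"))
    return "%08X%08X%08X" % (timeline, high, low // WAL_SEGMENT_SIZE)


def next_segment_name(name):
    """Return the name of the segment that follows `name` on the same timeline."""
    timeline = int(name[0:8], 16)
    segno = int(name[8:16], 16) * SEGMENTS_PER_XLOGID + int(name[16:24], 16) + 1
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


if __name__ == "__main__":
    last = lsn_to_segment_name("0/1EFFDBB8")
    print(last)                     # segment containing the last replayed record
    print(next_segment_name(last))  # segment the restore_command looked for next
```

Under these assumptions, 0/1EFFDBB8 falls inside segment 00000001000000000000001E, and the next file requested is 00000001000000000000001F, which matches the failed cp in the log.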

I used the following recovery_* parameters in the recovery.conf file and intentionally did not supply all the WAL archives needed for the recovery process to reach the target xid.

recovery_target_xid = xxxx
recovery_target_inclusive = true
recovery_target_action = pause

It would be nice if PostgreSQL paused recovery when it is incomplete (because of missing or corrupt WAL), shut down the cluster, and allowed the DBA to restart replay of the remaining WAL archive files, continuing recovery from where it previously stopped until the recovery target point is reached.
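
Until something like this exists in the server, a DBA could check the archive for missing segments before starting recovery. The following is a rough sketch of such a check, not an existing PostgreSQL tool. It assumes 16 MB segments, PostgreSQL 9.3 or later file naming, a single timeline, and that the DBA supplies the first and last segment names expected in the archive; the function names are hypothetical.

```python
import re

# Sketch only: assumes 16 MB WAL segments (256 per "xlogid") and one timeline.
SEGMENTS_PER_XLOGID = 256
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name):
    """Convert a 24-hex-digit WAL file name to a linear segment number."""
    return int(name[8:16], 16) * SEGMENTS_PER_XLOGID + int(name[16:24], 16)


def segment_name(timeline, segno):
    """Convert a timeline and linear segment number back to a file name."""
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOGID,
        segno % SEGMENTS_PER_XLOGID,
    )


def missing_segments(available, first, last):
    """List segment names in [first, last] that are absent from `available`.

    `available` is any iterable of file names, for example os.listdir() of
    the archive directory; names that are not plain WAL segments (such as
    .history or .backup files) are ignored.
    """
    timeline = int(first[0:8], 16)
    present = {name for name in available if WAL_NAME.match(name)}
    return [
        segment_name(timeline, segno)
        for segno in range(segment_number(first), segment_number(last) + 1)
        if segment_name(timeline, segno) not in present
    ]


if __name__ == "__main__":
    archive = [
        "00000001000000000000001D",
        "00000001000000000000001E",
        "000000010000000000000020",
    ]
    print(missing_segments(archive, archive[0], archive[-1]))
```

A check like this only shows that files are present. It does not detect a corrupt segment, and it cannot tell whether the target xid lies within the range that was checked.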

Regards,
Venkata B N

Fujitsu Australia
