On 2/26/19 6:51 AM, Michael Paquier wrote:
> On Mon, Feb 25, 2019 at 08:17:27PM +0200, David Steele wrote:
>> Here's the really obvious bad thing: if users do not update their procedures
>> and we ignore backup_label.pending on startup then they will end up with a
>> corrupt database because it will not replay from the correct checkpoint. If
>> we error on the presence of backup_label.pending then we are right back to
>> where we started.
>
> Not really. If we error on backup_label.pending, we can make the
> difference between a backend which has crashed in the middle of an
> exclusive backup without replaying anything and a backend which is
> started based on a base backup, so an operator can take some action to
> see what's wrong with the server. If you issue an error, users can
> also see that their custom backup script is wrong because they forgot
> to rename the flag after taking a backup of the data folder(s).
The operator still has a decision to make, manually, just as they do
now. The wrong decision may mean a corrupt database.

Here's the scenario:
1) They do a restore, forget to rename backup_label.pending.
2) Postgres won't start, which is the same action we take now.
3) The user is not sure whether to rename the file or delete it. They
delete it, and the cluster is corrupted.
Worse, suppose they have scripted the deletion of backup_label so that
the cluster will restart after a crash. This is the recommendation from
our documentation, after all. If that script runs after a restore instead
of a crash, then the cluster will be corrupt -- silently.
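
To make the failure mode concrete, here is a rough Python sketch of such a
helper script. Everything in it is an assumption for illustration:
restart_helper() and expected_result() are hypothetical names, the script
is assumed to target the backup_label.pending name proposed in this thread,
and the outcomes are simply the ones described above, not anything the
server reports. The point is that nothing on disk tells the script whether
it is looking at a crashed cluster or a freshly restored backup.

```python
import tempfile
from pathlib import Path

# File name proposed in this thread; not an existing PostgreSQL file.
PENDING = "backup_label.pending"


def make_data_dir(label_name: str = PENDING) -> Path:
    """Create a stand-in data directory that contains only the label file.

    A crashed cluster and a restored backup look the same at this level,
    so the function deliberately takes no "origin" argument.
    """
    data_dir = Path(tempfile.mkdtemp(prefix="pgdata_"))
    (data_dir / label_name).write_text("(placeholder label contents)\n")
    return data_dir


def restart_helper(data_dir: Path, label_name: str = PENDING) -> str:
    """Hypothetical operator script: delete the label so the server starts."""
    label = data_dir / label_name
    if label.exists():
        label.unlink()
        return "deleted"
    return "nothing to do"


def expected_result(origin: str, action: str) -> str:
    """Outcome per the scenario in this mail, keyed on what really happened."""
    if action != "deleted":
        return "unchanged"
    if origin == "crash":
        return "normal crash recovery"
    return "corrupt: replay does not start from the correct checkpoint"


if __name__ == "__main__":
    # The script takes the same action in both cases; only the (invisible)
    # origin decides whether that action was safe.
    for origin in ("crash", "restore"):
        action = restart_helper(make_data_dir())
        print(f"{origin}: {action}: {expected_result(origin, action)}")
```

The script has no way to compute the origin itself, which is why the
deletion is safe in one case and silently destructive in the other.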
--
-David
david@pgmasters.net