RE: Stronger safeguard for archive recovery not to miss data - Mailing list pgsql-hackers

From osumi.takamichi@fujitsu.com
Subject RE: Stronger safeguard for archive recovery not to miss data
Date
Msg-id OSBPR01MB488879E8A1807F595AD82B27EDA10@OSBPR01MB4888.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Stronger safeguard for archive recovery not to miss data  (Laurenz Albe <laurenz.albe@cybertec.at>)
Responses Re: Stronger safeguard for archive recovery not to miss data  (Laurenz Albe <laurenz.albe@cybertec.at>)
List pgsql-hackers
Hi


On Wed, Jan 20, 2021 9:03 PM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
> 
> On Wed, 2021-01-20 at 13:10 +0900, Fujii Masao wrote:
> > +                                errhint("Run recovery again from a
> > + new base backup taken after setting wal_level higher than
> > + minimal")));
> >
> > Isn't it impossible to do this in normal archive recovery case? In
> > that case, since the server crashed and the database got corrupted,
> > probably we cannot take a new base backup.
> >
> > Originally even when users accidentally set wal_level to minimal, they
> > could start the server from the backup by disabling hot_standby and salvage
> the data.
> > But with the patch, we lost the way to do that. Right? I was wondering
> > that WARNING was used intentionally there for that case.
OK. I understand that this WARNING is necessary to give user a chance
to start up the server again and salvage his/her data,
when hot_standby=off and the server goes through a period of wal_level=minimal
during an archive recovery.


> I would argue that if you set your "wal_level" to minimal, do some work, reset it
> to "replica" and recover past that, two things could happen:
> 
> 1. Since "archive_mode" was off, you are missing some WAL segments and
>    cannot recover past that point anyway.
> 
> 2. You are missing some relevant WAL records, and your recovered
>    database is corrupted.  This is very likely, because you probably
>    switched to "minimal" with the intention to generate less WAL.
> 
> Now I see your point that a corrupted database may be better than no database
> at all, but wouldn't you agree that a warning in the log (that nobody reads) is
> too little information?
> 
> Normally we don't take such a relaxed attitude towards data corruption.
Yeah, we thought we needed stronger protection for that reason.


> Perhaps there could be a GUC "recovery_allow_data_corruption" to override this
> check, but I'd say that a warning is too little.
> 
> >                 if (ControlFile->wal_level < WAL_LEVEL_REPLICA)
> >                         ereport(ERROR,
> >                                         (errmsg("hot standby is not
> possible because wal_level was not set to \"replica\" or higher on the primary
> server"),
> >                                          errhint("Either set wal_level
> > to \"replica\" on the primary, or turn off hot_standby here.")));
> >
> > With the patch, we never reach the above code?
> 
> Right, that would have to go.  I didn't notice that.
Adding a condition to check if "recovery_allow_data_corruption" is 'on' around the end of
CheckRequiredParameterValues() sounds safer for me too, although
implementing a new GUC parameter sounds bigger than what I expected at first.
The default of the value should be 'off' to protect users from getting the corrupted server.
Does everyone agree with this direction ?


Best Regards,
    Takamichi Osumi

pgsql-hackers by date:

Previous
From: Dean Rasheed
Date:
Subject: Re: PoC/WIP: Extended statistics on expressions
Next
From: Dilip Kumar
Date:
Subject: Re: Is Recovery actually paused?