Re: Stronger safeguard for archive recovery not to miss data - Mailing list pgsql-hackers
From | Fujii Masao |
---|---|
Subject | Re: Stronger safeguard for archive recovery not to miss data |
Date | |
Msg-id | d9eaa61f-1854-b259-1957-c9bf94f1ab22@oss.nttdata.com |
In response to | Re: Stronger safeguard for archive recovery not to miss data (Kyotaro Horiguchi <horikyota.ntt@gmail.com>) |
Responses | RE: Stronger safeguard for archive recovery not to miss data ("osumi.takamichi@fujitsu.com" <osumi.takamichi@fujitsu.com>) |
List | pgsql-hackers |

On 2021/04/05 16:13, Kyotaro Horiguchi wrote:
> At Mon, 5 Apr 2021 12:34:53 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in
>>
>> On 2021/04/04 11:58, osumi.takamichi@fujitsu.com wrote:
>>>> IMO it's better to comment why this server restart is necessary.
>>>> As far as I understand correctly, this is necessary to ensure the WAL file
>>>> containing the record about the change of wal_level (to minimal) is archived,
>>>> so that the subsequent archive recovery will be able to replay it.
>>> OK, added some comments. Further, I felt the way I wrote this part was
>>> not good at all and self-evident
>>> and developers who read this test would feel uneasy about that point.
>>> So, a little bit fixed that test so that we can get clearer conviction
>>> for wal archive.
>>
>> LGTM. Thanks for updating the patch!
>>
>> Attached is the updated version of the patch. I applied the following
>> changes.
>
> + errhint("Use a backup taken after setting wal_level to higher than minimal "
> + "or recover to the point in time before wal_level was changed to minimal even though it may cause data loss.")));
>
> Looking the HINT message, I thought that it's hard to find where up to
> I should recover.

Yes. And, what's worse, when archive recovery finds WAL generated with wal_level=minimal and fails, "minimal" is saved in pg_control's wal_level. This means that, in that case, subsequent archive recovery always fails at the beginning of recovery (before entering the main WAL replay loop). So even if recovery_target_lsn is specified, archive recovery fails before checking it. No recovery target setting has any effect in that case.

Maybe we can avoid this, for example, by changing xlog_redo() so that it calls CheckRequiredParameterValues() before UpdateControlFile(). But I'm not sure if this change is safe. We probably need more time to consider this, but there is not much time left at this stage.
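
The ordering problem described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not PostgreSQL code: `ToyControlFile` and every `toy_*` function are hypothetical stand-ins named after the functions mentioned in the message, and the real check reports a FATAL error rather than returning a flag. It shows only why persisting the new wal_level before checking it leaves "minimal" in the control file; it says nothing about whether reordering the real calls is safe.

```c
#include <stdbool.h>

/* Toy stand-ins only; these are NOT PostgreSQL's real definitions. */
typedef enum { WAL_LEVEL_MINIMAL, WAL_LEVEL_REPLICA } WalLevel;

typedef struct {
    WalLevel in_memory; /* models the in-memory control data */
    WalLevel on_disk;   /* models the value persisted in pg_control */
} ToyControlFile;

/* Models UpdateControlFile(): persist the in-memory value. */
void toy_update_control_file(ToyControlFile *cf)
{
    cf->on_disk = cf->in_memory;
}

/* Models CheckRequiredParameterValues(): archive recovery rejects
 * wal_level=minimal. Returns false where the real code would fail. */
bool toy_check_required(WalLevel level)
{
    return level != WAL_LEVEL_MINIMAL;
}

/* Order implied by the message: persist first, check afterwards.
 * On failure, "minimal" has already reached the on-disk copy. */
bool toy_redo_update_then_check(ToyControlFile *cf, WalLevel from_record)
{
    cf->in_memory = from_record;
    toy_update_control_file(cf);
    return toy_check_required(cf->in_memory);
}

/* Order floated in the message: check first, persist only on success.
 * On failure, the on-disk copy keeps its previous value. */
bool toy_redo_check_then_update(ToyControlFile *cf, WalLevel from_record)
{
    if (!toy_check_required(from_record))
        return false;
    cf->in_memory = from_record;
    toy_update_control_file(cf);
    return true;
}

/* Models the check made at the start of the next archive recovery,
 * which sees only the persisted value. */
bool toy_next_recovery_can_start(const ToyControlFile *cf)
{
    return toy_check_required(cf->on_disk);
}
```

Starting from a control file at `WAL_LEVEL_REPLICA` and feeding it a record carrying `WAL_LEVEL_MINIMAL`, both orderings report failure, but only the second leaves the persisted value unchanged, so only there can a later recovery attempt get past the startup check and reach its recovery target.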

At least the HINT message "or recover to the point in time before wal_level was changed to minimal even though it may cause data loss." should be removed because it's not helpful at all...

OK, so if archive recovery finds WAL generated with wal_level=minimal and fails, and there is also no backup taken after wal_level was set to higher than minimal, basically [1] we lose the whole database. I think that those who set wal_level to minimal understand that this setting can cause data loss; for example, any data loaded with wal_level=minimal may be lost later. But I'm afraid that they might not understand the risk of losing the whole database. Even if they take a new backup just after they set wal_level to higher than minimal, there is still a risk of losing the whole database until the backup is completed. This makes me think that we should document this risk.... Thoughts?

[1]
BTW, one very tricky way to recover from this situation seems to be to copy all required WAL files from the archive to pg_wal and forcibly run crash recovery from the backup. Since crash recovery doesn't check wal_level, we can avoid the issue by doing that. But this is very tricky.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION