Re: Requiring recovery.signal or standby.signal when recovering with a backup_label - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Requiring recovery.signal or standby.signal when recovering with a backup_label |
Date | |
Msg-id | CA+TgmoaWtRYdtQOwp=DeZW0-OJ_UF9juqaFXEmEbq8QPF1JNvg@mail.gmail.com Whole thread Raw |
In response to | Re: Requiring recovery.signal or standby.signal when recovering with a backup_label (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
|
List | pgsql-hackers |
On Tue, Oct 31, 2023 at 7:39 PM Michael Paquier <michael@paquier.xyz> wrote: > Point 7. of what you quote says to use one? True that this needs a > refresh, and perhaps a bit fat warning about the fact that these are > required if you want to fetch WAL from other sources than the local > pg_wal/. Perhaps there may be a point of revisiting the default > behavior of recovery_target_timeline in this case, I don't know. I don't really know what to say to this -- sure, point 7 of "Recovering Using a Continuous Archive Backup" says to use recovery.signal. But as I said in the preceding paragraph, it doesn't say either "use recovery.signal or standby.signal". Nor does it or anything else in the documentation explain under what circumstances you're allowed to have neither. So the whole thing is very unclear. > >> As you're telling me, and I've considered that as an option as well, > >> perhaps we should just consider the presence of a backup_label file > >> with no .signal files as a synonym of crash recovery? In the recovery > >> path, currently the essence of the problem is when we do > >> InArchiveRecovery=true, but ArchiveRecoveryRequested=false, meaning > >> that it should do archive recovery but we don't want it, and that does > >> not really make sense. The rest of the code sort of implies that this > >> is not a suported combination. So basically, my suggestion here, is > >> to just replay WAL up to the end of what's in your local pg_wal/ and > >> hope for the best, without TLI jumps, except that we'd do nothing. > > > > This sentence seems to be incomplete. > > I've re-read it, and it looks OK to me. Well, the sentence ends with "except that we'd do nothing" and I don't know what that means. It would make sense to me if it said "except that we'd do nothing about <whatever>" or "except that we'd do nothing instead of <something>" but as you've written it basically seems to boil down to "my suggestion is to replay WAL except do nothing" which makes no sense. If you replay WAL, you're not doing nothing. > > But I was not saying we should treat the case where we have a > > backup_label file like crash recovery. The real question here is why > > we don't treat it fully like archive recovery. > > Timeline jump at the end of recovery? Archive recovery forces a TLI > jump by default at the end of redo if there's a signal file, and some > users may not want a TLI jump by default? Uggh. I don't know what to think about that. I bet some people do want that, but that makes it pretty easy to end up with multiple copies of the same cluster running on the same TLI, too, which is not a thing that you really want to have happen. At the end of the day, I'm coming around to the view that the biggest problem here is the documentation. Nobody can really know what's supposed to work right now because the documentation doesn't say which things you are and are not allowed to do and what results you should expect in each case. If it did, it would be easier to discuss possible behavior changes. Right now, it's hard to change any code at all, because there's no list of supported scenarios, so you can't tell whether a potential change affects a scenario that somebody thinks should work, or only cases that nobody can possibly care about. It's sort of possible to reason your way through that, to an extent, but it's pretty hard. The fact that I didn't know that starting from a backup with neither recovery.signal nor standby.signal was a thing that anybody did or cared about is good evidence of that. I'm coming to the understanding that we have four supported scenarios. One, no backup_label, no recovery.signal, and no standby.signal. Hence, replay WAL until the end, then start up. Two, backup_label exists but neither recovery.signal nor standby.signal does. As before, but if I understand correctly, now we can check that we reached the backup end location. Three, recovery.signal exists, with or without backup_label. Now we create a new TLI at the end of recovery, and also, now can fetch WAL that is not present in pg_wal using primary_conninfo or restore_command. In fact, I think we may prefer to do that over using WAL we have locally, but I'm not quite sure about that. Fourth, standby.signal exists, with or without backup_label. As the previous scenario, but now when we reach the end of WAL we wait for more to appear instead of ending recovery. I have a feeling this is not quite an exhaustive list of differences between the various modes, and I'm not even sure that it lists all of the things someone might try to do. Thoughts? I also feel like the terminology here sometimes obscures more than it illuminates. For instance, it seems like ArchiveRecoveryRequested really means "are any signal files present?" while InArchiveRecovery means "are we fetching WAL from outside pg_wal rather than using what's in pg_wal?". But these are not obvious from the names, and sometimes we have additional variables with overlapping meanings, like readSource, which indicates whether we're reading from pg_wal, the archive, or the walreceiver, and yet is probably not redundant with InArchiveRecovery. In any event, I think that we need to start with the question of what behavior(s) we want to expose to users, and then back into the question of what internal variables and states need to exist in order to support that behavior. We cannot start by deciding what variables we'd like to get rid of and then trying to justify the resulting behavior changes on the grounds that they simplify the code. Users aren't going to like that, hackers aren't going to like that, and the resulting behavior probably won't be anything great. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: