needless complexity in StartupXLOG - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | needless complexity in StartupXLOG |
Date | |
Msg-id | CA+TgmoYmw==TOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA@mail.gmail.com Whole thread Raw |
Responses |
Re: needless complexity in StartupXLOG
Re: needless complexity in StartupXLOG Re: needless complexity in StartupXLOG |
List | pgsql-hackers |
StartupXLOG() has code beginning around line 7900 of xlog.c that decides, at the end of recovery, between four possible courses of action. It either writes an end-of-recovery record, or requests a checkpoint, or creates a checkpoint, or does nothing, depending on the value of 3 flag variables, and also on whether we're still able to read the last checkpoint record: checkPointLoc = ControlFile->checkPoint; /* * Confirm the last checkpoint is available for us to recover * from if we fail. */ record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false); if (record != NULL) { promoted = true; It seems to me that ReadCheckpointRecord() should never fail here. It should always be true, even when we're not in recovery, that the last checkpoint record is readable. If there's ever a situation where that is not true, even for an instant, then a crash at that point will be unrecoverable. Now, one way that it could happen is if we've got a bug in the code someplace that removes WAL segments too soon. However, if we have such a bug, we should fix it. What this code does is says "oh, I guess we removed the old checkpoint too soon, no big deal, let's just be more aggressive about getting the next one done," which I do not think is the right response. Aside from a bug, the only other way I can see it happening is if someone is manually removing WAL segments as the server is running through recovery, perhaps as some last-ditch play to avoid running out of disk space. I don't think the server needs to have - or should have - code to cater to such crazy scenarios. Therefore I think that this check, at least in its current form, is not sensible. My first thought was that we should do the check unconditionally, rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and ERROR if it fails. But then I wondered what the point of that would be exactly. If we have such a bug -- and to the best of my knowledge there's no evidence that we do -- there's no particular reason it should only happen at the end of recovery. It could happen any time the system -- or the user, or malicious space aliens -- remove files from pg_wal, and we have no real idea about the timing of malicious space alien activity, so doing the test here rather than anywhere else just seems like a shot in the dark. Perhaps the most logical place would be to move it to CreateCheckPoint() just after we remove old xlog files, but we don't have the xlogreader there, and anyway I don't see how it's really going to help. What bug do we expect to catch by removing files we think we don't need and then checking that we didn't remove the files we think we do need? That seems more like grasping at straws than a serious attempt to make things work better. So at the moment I am leaning toward the view that we should just remove this check entirely, as in the attached, proposed patch. Really, I think we should consider going further. If it's safe to write an end-of-recovery record rather than a checkpoint, why not do so all the time? Right now we fail to do that in the above-described "impossible" scenario where the previous checkpoint record can't be read, or if we're exiting archive recovery for some reason other than a promotion request, or if we're in single-user mode, or if we're in crash recovery. Presumably, people would like to start up the server quickly in all of those scenarios, so the only reason not to use this technology all the time is if we think it's safe in some scenarios and not others. I can't think of a reason why it matters why we're exiting archive recovery, nor can I think of a reason why it matters whether we're in single user mode. The distinction between crash recovery and archive recovery does seem to matter, but if anything the crash recovery scenario seems simpler, because then there's only one timeline involved. I realize that conservatism may have played a role in this code ending up looking the way that it does; someone seems to have thought it would be better not to rely on a new idea in all cases. From my point of view, though, it's scary to have so many cases, especially cases that don't seem like they should ever be reached. I think that simplifying the logic here and trying to do the same things in as many cases as we can will lead to better robustness. Imagine if instead of all the hairy logic we have now we just replaced this whole if (IsInRecovery) stanza with this: if (InRecovery) CreateEndOfRecoveryRecord(); That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
pgsql-hackers by date: