Thread: needless complexity in StartupXLOG
StartupXLOG() has code beginning around line 7900 of xlog.c that decides, at the end of recovery, between four possible courses of action. It either writes an end-of-recovery record, or requests a checkpoint, or creates a checkpoint, or does nothing, depending on the value of 3 flag variables, and also on whether we're still able to read the last checkpoint record:

    checkPointLoc = ControlFile->checkPoint;

    /*
     * Confirm the last checkpoint is available for us to recover
     * from if we fail.
     */
    record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
    if (record != NULL)
    {
        promoted = true;

It seems to me that ReadCheckpointRecord() should never fail here. It should always be true, even when we're not in recovery, that the last checkpoint record is readable. If there's ever a situation where that is not true, even for an instant, then a crash at that point will be unrecoverable. Now, one way that it could happen is if we've got a bug in the code someplace that removes WAL segments too soon. However, if we have such a bug, we should fix it. What this code does is say "oh, I guess we removed the old checkpoint too soon, no big deal, let's just be more aggressive about getting the next one done," which I do not think is the right response. Aside from a bug, the only other way I can see it happening is if someone is manually removing WAL segments as the server is running through recovery, perhaps as some last-ditch play to avoid running out of disk space. I don't think the server needs to have - or should have - code to cater to such crazy scenarios. Therefore I think that this check, at least in its current form, is not sensible.

My first thought was that we should do the check unconditionally, rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and ERROR if it fails. But then I wondered what the point of that would be exactly. If we have such a bug -- and to the best of my knowledge there's no evidence that we do -- there's no particular reason it should only happen at the end of recovery. It could happen any time the system -- or the user, or malicious space aliens -- remove files from pg_wal, and we have no real idea about the timing of malicious space alien activity, so doing the test here rather than anywhere else just seems like a shot in the dark. Perhaps the most logical place would be to move it to CreateCheckPoint() just after we remove old xlog files, but we don't have the xlogreader there, and anyway I don't see how it's really going to help. What bug do we expect to catch by removing files we think we don't need and then checking that we didn't remove the files we think we do need? That seems more like grasping at straws than a serious attempt to make things work better.

So at the moment I am leaning toward the view that we should just remove this check entirely, as in the attached, proposed patch.

Really, I think we should consider going further. If it's safe to write an end-of-recovery record rather than a checkpoint, why not do so all the time? Right now we fail to do that in the above-described "impossible" scenario where the previous checkpoint record can't be read, or if we're exiting archive recovery for some reason other than a promotion request, or if we're in single-user mode, or if we're in crash recovery. Presumably, people would like to start up the server quickly in all of those scenarios, so the only reason not to use this technology all the time is if we think it's safe in some scenarios and not others.

I can't think of a reason why it matters why we're exiting archive recovery, nor can I think of a reason why it matters whether we're in single user mode. The distinction between crash recovery and archive recovery does seem to matter, but if anything the crash recovery scenario seems simpler, because then there's only one timeline involved.

I realize that conservatism may have played a role in this code ending up looking the way that it does; someone seems to have thought it would be better not to rely on a new idea in all cases. From my point of view, though, it's scary to have so many cases, especially cases that don't seem like they should ever be reached. I think that simplifying the logic here and trying to do the same things in as many cases as we can will lead to better robustness. Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:

    if (InRecovery)
        CreateEndOfRecoveryRecord();

That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading.

--
Robert Haas
EDB: http://www.enterprisedb.com
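For comparison, the "hairy logic" being discussed looks roughly like the sketch below. It is reconstructed from the fragments quoted in this thread rather than copied verbatim from xlog.c, so treat the exact structure as approximate; it is only meant to show where the four possible outcomes (end-of-recovery record, requested checkpoint, directly created checkpoint, or nothing at all when we aren't in recovery) come from.

    if (InRecovery)
    {
        if (bgwriterLaunched)
        {
            if (LocalPromoteIsTriggered)
            {
                checkPointLoc = ControlFile->checkPoint;

                /*
                 * Confirm the last checkpoint is available for us to
                 * recover from if we fail.
                 */
                record = ReadCheckpointRecord(xlogreader, checkPointLoc,
                                              1, false);
                if (record != NULL)
                {
                    promoted = true;

                    /*
                     * Insert a special WAL record to mark the end of
                     * recovery, since we aren't doing a checkpoint.
                     */
                    CreateEndOfRecoveryRecord();
                }
            }

            if (!promoted)
                RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                                  CHECKPOINT_IMMEDIATE |
                                  CHECKPOINT_WAIT);
        }
        else
            CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY |
                             CHECKPOINT_IMMEDIATE);
    }

    /* ... and much later in StartupXLOG() ... */

    if (promoted)
        RequestCheckpoint(CHECKPOINT_FORCE);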
Attachment
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> So at the moment I am leaning toward the view that we should just remove this check entirely, as in the attached, proposed patch.

Haven't dug in deeply but at least following your explanation and reading over the patch and the code a bit, I tend to agree.

> Really, I think we should consider going further. If it's safe to write an end-of-recovery record rather than a checkpoint, why not do so all the time? Right now we fail to do that in the above-described "impossible" scenario where the previous checkpoint record can't be read, or if we're exiting archive recovery for some reason other than a promotion request, or if we're in single-user mode, or if we're in crash recovery. Presumably, people would like to start up the server quickly in all of those scenarios, so the only reason not to use this technology all the time is if we think it's safe in some scenarios and not others. I can't think of a reason why it matters why we're exiting archive recovery, nor can I think of a reason why it matters whether we're in single user mode. The distinction between crash recovery and archive recovery does seem to matter, but if anything the crash recovery scenario seems simpler, because then there's only one timeline involved.

Yeah, tend to agree with this too ... but something I find a bit curious is the comment:

    * Insert a special WAL record to mark the end of
    * recovery, since we aren't doing a checkpoint.

... immediately after setting promoted = true, and then at the end of StartupXLOG() having:

    if (promoted)
        RequestCheckpoint(CHECKPOINT_FORCE);

maybe I'm missing something, but seems like that comment isn't being terribly clear. Perhaps we aren't doing a full checkpoint *there*, but sure looks like we're going to do one moments later regardless of anything else since we've set promoted to true..?

> I realize that conservatism may have played a role in this code ending up looking the way that it does; someone seems to have thought it would be better not to rely on a new idea in all cases. From my point of view, though, it's scary to have so many cases, especially cases that don't seem like they should ever be reached. I think that simplifying the logic here and trying to do the same things in as many cases as we can will lead to better robustness. Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
>
>     if (InRecovery)
>         CreateEndOfRecoveryRecord();
>
> That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading.

Agreed that simpler logic is better, provided it's correct logic, of course. Finding better ways to test all of this would be really nice.

Thanks!

Stephen
Attachment
On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> Yeah, tend to agree with this too ... but something I find a bit curious is the comment:
>
>     * Insert a special WAL record to mark the end of
>     * recovery, since we aren't doing a checkpoint.
>
> ... immediately after setting promoted = true, and then at the end of StartupXLOG() having:
>
>     if (promoted)
>         RequestCheckpoint(CHECKPOINT_FORCE);
>
> maybe I'm missing something, but seems like that comment isn't being terribly clear. Perhaps we aren't doing a full checkpoint *there*, but sure looks like we're going to do one moments later regardless of anything else since we've set promoted to true..?

Yep. So it's a question of whether we allow operations that might write WAL in the meantime. When we write the checkpoint record right here, there can't be any WAL from the new server lifetime until the checkpoint completes. When we write an end-of-recovery record, there can. And there could actually be quite a bit, because if we do the checkpoint right in this section of code, it will be a fast checkpoint, whereas in the code you quoted above, it's a spread checkpoint, which takes a lot longer. So the question is whether it's reasonable to give the checkpoint some time to complete or whether it needs to be completed right now.

> Agreed that simpler logic is better, provided it's correct logic, of course. Finding better ways to test all of this would be really nice.

Yeah, and there again, it's a lot easier to test if we don't have as many cases. Now no single change is going to fix that, but the number of flag variables in xlog.c is simply bonkers. This particular stretch of code uses 3 of them to even decide whether to attempt the test in question, and all of those are set in complex ways depending on the values of still more flag variables. The comments where bgwriterLaunched is set claim that we only do this during archive recovery, not crash recovery, but the code depends on the value of ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we can't get the bgwriter to run even during crash recovery in the scenario described by:

    * It's possible that archive recovery was requested, but we don't
    * know how far we need to replay the WAL before we reach consistency.
    * This can happen for example if a base backup is taken from a
    * running server using an atomic filesystem snapshot, without calling
    * pg_start/stop_backup. Or if you just kill a running primary server
    * and put it into archive recovery by creating a recovery signal
    * file.

If we ran the bgwriter all the time during crash recovery, we'd know for sure whether that causes any problems. If we only do it like this in certain corner cases, it's much more likely that we have bugs. Grumble, grumble.

--
Robert Haas
EDB: http://www.enterprisedb.com
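To make the fast-versus-spread distinction concrete, the two paths being compared look roughly like this; a simplified sketch built from the checkpoint-request flags involved, not the verbatim xlog.c code.

    /*
     * Path 1: do the checkpoint at the end of recovery.  Nothing else can
     * write WAL until it finishes, and CHECKPOINT_IMMEDIATE makes it a
     * fast (unthrottled) checkpoint.
     */
    RequestCheckpoint(CHECKPOINT_END_OF_RECOVERY |
                      CHECKPOINT_IMMEDIATE |
                      CHECKPOINT_WAIT);

    /*
     * Path 2: write an end-of-recovery record now and checkpoint later.
     * The later request omits CHECKPOINT_IMMEDIATE, so it is a spread
     * checkpoint, and new WAL can be written while it runs.
     */
    CreateEndOfRecoveryRecord();
    /* ... connections are allowed while the requested checkpoint runs ... */
    if (promoted)
        RequestCheckpoint(CHECKPOINT_FORCE);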
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> > Yeah, tend to agree with this too ... but something I find a bit curious is the comment:
> >
> >     * Insert a special WAL record to mark the end of
> >     * recovery, since we aren't doing a checkpoint.
> >
> > ... immediately after setting promoted = true, and then at the end of StartupXLOG() having:
> >
> >     if (promoted)
> >         RequestCheckpoint(CHECKPOINT_FORCE);
> >
> > maybe I'm missing something, but seems like that comment isn't being terribly clear. Perhaps we aren't doing a full checkpoint *there*, but sure looks like we're going to do one moments later regardless of anything else since we've set promoted to true..?
>
> Yep. So it's a question of whether we allow operations that might write WAL in the meantime. When we write the checkpoint record right here, there can't be any WAL from the new server lifetime until the checkpoint completes. When we write an end-of-recovery record, there can. And there could actually be quite a bit, because if we do the checkpoint right in this section of code, it will be a fast checkpoint, whereas in the code you quoted above, it's a spread checkpoint, which takes a lot longer. So the question is whether it's reasonable to give the checkpoint some time to complete or whether it needs to be completed right now.

All I was really trying to point out above was that the comment might be improved upon, just so someone understands that we aren't doing a checkpoint at this particular place, but one will be done later due to the promotion. Maybe I'm being a bit extra with that, but that was my thought when reading the code and the use of the promoted flag variable.

> > Agreed that simpler logic is better, provided it's correct logic, of course. Finding better ways to test all of this would be really nice.
>
> Yeah, and there again, it's a lot easier to test if we don't have as many cases. Now no single change is going to fix that, but the number of flag variables in xlog.c is simply bonkers. This particular stretch of code uses 3 of them to even decide whether to attempt the test in question, and all of those are set in complex ways depending on the values of still more flag variables. The comments where bgwriterLaunched is set claim that we only do this during archive recovery, not crash recovery, but the code depends on the value of ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we can't get the bgwriter to run even during crash recovery in the scenario described by:
>
>     * It's possible that archive recovery was requested, but we don't
>     * know how far we need to replay the WAL before we reach consistency.
>     * This can happen for example if a base backup is taken from a
>     * running server using an atomic filesystem snapshot, without calling
>     * pg_start/stop_backup. Or if you just kill a running primary server
>     * and put it into archive recovery by creating a recovery signal
>     * file.
>
> If we ran the bgwriter all the time during crash recovery, we'd know for sure whether that causes any problems. If we only do it like this in certain corner cases, it's much more likely that we have bugs. Grumble, grumble.

Yeah ... not to mention that it really is just incredibly dangerous to use such an approach for PITR. For my 2c, I'd rather we figure out a way to prevent this than to imply that we support it when we have no way of knowing if we actually have replayed far enough to be consistent. That isn't to say that using snapshots for database backups isn't possible, but it should be done in-between pg_start/stop_backup calls which properly grab the returned info from those and store the backup label with the snapshot, etc.

Thanks,

Stephen
Attachment
On Mon, Jul 26, 2021 at 03:53:20PM -0400, Robert Haas wrote:
> Yeah, and there again, it's a lot easier to test if we don't have as many cases. Now no single change is going to fix that, but the number of flag variables in xlog.c is simply bonkers. This particular stretch of code uses 3 of them to even decide whether to attempt the test in question, and all of those are set in complex ways depending on the values of still more flag variables. The comments where bgwriterLaunched is set claim that we only do this during archive recovery, not crash recovery, but the code depends on the value of ArchiveRecoveryRequested, not InArchiveRecovery. So I wonder if we can't get the bgwriter to run even during crash recovery in the scenario described by:

I'm not following along closely and maybe you're already aware of this one?

https://commitfest.postgresql.org/33/2706/
Background writer and checkpointer in crash recovery
@Thomas: https://www.postgresql.org/message-id/CA%2BTgmoYmw%3D%3DTOJ6EzYb_vcjyS09NkzrVKSyBKUUyo1zBEaJASA%40mail.gmail.com

>     * It's possible that archive recovery was requested, but we don't
>     * know how far we need to replay the WAL before we reach consistency.
>     * This can happen for example if a base backup is taken from a
>     * running server using an atomic filesystem snapshot, without calling
>     * pg_start/stop_backup. Or if you just kill a running primary server
>     * and put it into archive recovery by creating a recovery signal
>     * file.
>
> If we ran the bgwriter all the time during crash recovery, we'd know for sure whether that causes any problems. If we only do it like this in certain corner cases, it's much more likely that we have bugs. Grumble, grumble.

--
Justin
On Mon, Jul 26, 2021 at 4:15 PM Stephen Frost <sfrost@snowman.net> wrote:
> All I was really trying to point out above was that the comment might be improved upon, just so someone understands that we aren't doing a checkpoint at this particular place, but one will be done later due to the promotion. Maybe I'm being a bit extra with that, but that was my thought when reading the code and the use of the promoted flag variable.

Yeah, I agree, it confused me too, at first.

> Yeah ... not to mention that it really is just incredibly dangerous to use such an approach for PITR. For my 2c, I'd rather we figure out a way to prevent this than to imply that we support it when we have no way of knowing if we actually have replayed far enough to be consistent. That isn't to say that using snapshots for database backups isn't possible, but it should be done in-between pg_start/stop_backup calls which properly grab the returned info from those and store the backup label with the snapshot, etc.

My position on that is that I would not particularly recommend the technique described here, but I would not choose to try to block it either. That's an argument for another thread, though.

--
Robert Haas
EDB: http://www.enterprisedb.com
At Mon, 26 Jul 2021 16:15:23 -0400, Stephen Frost <sfrost@snowman.net> wrote in
> Greetings,
>
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Mon, Jul 26, 2021 at 1:32 PM Stephen Frost <sfrost@snowman.net> wrote:
> > > Yeah, tend to agree with this too ... but something I find a bit curious is the comment:
> > >
> > >     * Insert a special WAL record to mark the end of
> > >     * recovery, since we aren't doing a checkpoint.
> > >
> > > ... immediately after setting promoted = true, and then at the end of StartupXLOG() having:
> > >
> > >     if (promoted)
> > >         RequestCheckpoint(CHECKPOINT_FORCE);
> > >
> > > maybe I'm missing something, but seems like that comment isn't being terribly clear. Perhaps we aren't doing a full checkpoint *there*, but sure looks like we're going to do one moments later regardless of anything else since we've set promoted to true..?
> >
> > Yep. So it's a question of whether we allow operations that might write WAL in the meantime. When we write the checkpoint record right here, there can't be any WAL from the new server lifetime until the checkpoint completes. When we write an end-of-recovery record, there can. And there could actually be quite a bit, because if we do the checkpoint right in this section of code, it will be a fast checkpoint, whereas in the code you quoted above, it's a spread checkpoint, which takes a lot longer. So the question is whether it's reasonable to give the checkpoint some time to complete or whether it needs to be completed right now.
>
> All I was really trying to point out above was that the comment might be improved upon, just so someone understands that we aren't doing a checkpoint at this particular place, but one will be done later due to the promotion. Maybe I'm being a bit extra with that, but that was my thought when reading the code and the use of the promoted flag variable.

(I feel we don't need the check for the last checkpoint, either.)

FWIW, by the way, I complained that the variable name "promoted" is a bit confusing. Its old name was fast_promoted, which means that fast promotion is being *requested* and ongoing. On the other hand, the current name "promoted" still means "(fast=non-fallback) promotion is ongoing", so there was a conversation as follows.

https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com

>> How about changing it to fallback_promotion, or some names with more
>> behavior-specific name like immediate_checkpoint_needed?
>
> I like doEndOfRecoveryCkpt or something, but I have no strong opinion
> about that flag naming. So I'm ok with immediate_checkpoint_needed
> if others also like that, too.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> FWIW, by the way, I complained that the variable name "promoted" is a bit confusing. Its old name was fast_promoted, which means that fast promotion is being *requested* and ongoing. On the other hand, the current name "promoted" still means "(fast=non-fallback) promotion is ongoing", so there was a conversation as follows.
>
> https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com

I agree - that variable name is also not great. I am open to making improvements in that area and in others that have been suggested on this thread, but my immediate goal is to figure out whether anyone objects to me committing the posted patch. If nobody comes up with a reason why it's a bad idea in the next few days, I'll plan to move ahead with it.

--
Robert Haas
EDB: http://www.enterprisedb.com
At Tue, 27 Jul 2021 11:03:14 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
> On Tue, Jul 27, 2021 at 9:18 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> > FWIW, by the way, I complained that the variable name "promoted" is a bit confusing. Its old name was fast_promoted, which means that fast promotion is being *requested* and ongoing. On the other hand, the current name "promoted" still means "(fast=non-fallback) promotion is ongoing", so there was a conversation as follows.
> >
> > https://www.postgresql.org/message-id/9fdd994d-a531-a52b-7906-e1cc22701310%40oss.nttdata.com
>
> I agree - that variable name is also not great. I am open to making improvements in that area and in others that have been suggested on this thread, but my immediate goal is to figure out whether anyone objects to me committing the posted patch. If nobody comes up with a reason why it's a bad idea in the next few days, I'll plan to move ahead with it.

That's fine with me. I still haven't found a way to lose the last checkpoint due to software failure. Repeated promotion without having new checkpoints is safe as expected. We don't remove WAL files unless a checkpoint completes, and a checkpoint preserves segments back to the one containing its redo point. In short, I'm for it.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
Hi,

On 2021-07-26 12:12:53 -0400, Robert Haas wrote:
> My first thought was that we should do the check unconditionally, rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and ERROR if it fails. But then I wondered what the point of that would be exactly. If we have such a bug -- and to the best of my knowledge there's no evidence that we do -- there's no particular reason it should only happen at the end of recovery. It could happen any time the system -- or the user, or malicious space aliens -- remove files from pg_wal, and we have no real idea about the timing of malicious space alien activity, so doing the test here rather than anywhere else just seems like a shot in the dark.

Yea. The history of that code being added doesn't suggest that there was a concrete issue being addressed, from what I can tell.

> So at the moment I am leaning toward the view that we should just remove this check entirely, as in the attached, proposed patch.

+1

> Really, I think we should consider going further. If it's safe to write an end-of-recovery record rather than a checkpoint, why not do so all the time?

+many. The current split doesn't make much sense. For one, it often is a huge issue if crash recovery takes a long time - why should we incur the cost that we are OK avoiding during promotions? For another, end-of-recovery is a crucial path for correctness, and reducing the number of non-trivial paths is good.

> Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
>
>     if (InRecovery)
>         CreateEndOfRecoveryRecord();
>
> That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading.

What are we going to do in the single user ([1]) case in this awesome future? I guess we could just not create a checkpoint until single user mode is shut down / creates a checkpoint for other reasons?

Greetings,

Andres Freund

[1] I really wish somebody had the energy to just remove single user and bootstrap modes. The degree to which they increase complexity in the rest of the system is entirely unreasonable. There's not actually any reason bootstrapping can't happen with checkpointer et al running, it's just autovacuum that'd need to be blocked.
On Thu, Jul 29, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:
>
> [1] I really wish somebody had the energy to just remove single user and bootstrap modes. The degree to which they increase complexity in the rest of the system is entirely unreasonable. There's not actually any reason bootstrapping can't happen with checkpointer et al running, it's just autovacuum that'd need to be blocked.

Any objection to adding an entry for that in the wiki TODO?
On Mon, Jul 26, 2021 at 9:43 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> StartupXLOG() has code beginning around line 7900 of xlog.c that decides, at the end of recovery, between four possible courses of action. It either writes an end-of-recovery record, or requests a checkpoint, or creates a checkpoint, or does nothing, depending on the value of 3 flag variables, and also on whether we're still able to read the last checkpoint record:
>
>     checkPointLoc = ControlFile->checkPoint;
>
>     /*
>      * Confirm the last checkpoint is available for us to recover
>      * from if we fail.
>      */
>     record = ReadCheckpointRecord(xlogreader, checkPointLoc, 1, false);
>     if (record != NULL)
>     {
>         promoted = true;
>
> It seems to me that ReadCheckpointRecord() should never fail here. It should always be true, even when we're not in recovery, that the last checkpoint record is readable. If there's ever a situation where that is not true, even for an instant, then a crash at that point will be unrecoverable. Now, one way that it could happen is if we've got a bug in the code someplace that removes WAL segments too soon. However, if we have such a bug, we should fix it. What this code does is say "oh, I guess we removed the old checkpoint too soon, no big deal, let's just be more aggressive about getting the next one done," which I do not think is the right response. Aside from a bug, the only other way I can see it happening is if someone is manually removing WAL segments as the server is running through recovery, perhaps as some last-ditch play to avoid running out of disk space. I don't think the server needs to have - or should have - code to cater to such crazy scenarios. Therefore I think that this check, at least in its current form, is not sensible.
>
> My first thought was that we should do the check unconditionally, rather than just when bgwriterLaunched && LocalPromoteIsTriggered, and ERROR if it fails. But then I wondered what the point of that would be exactly. If we have such a bug -- and to the best of my knowledge there's no evidence that we do -- there's no particular reason it should only happen at the end of recovery. It could happen any time the system -- or the user, or malicious space aliens -- remove files from pg_wal, and we have no real idea about the timing of malicious space alien activity, so doing the test here rather than anywhere else just seems like a shot in the dark. Perhaps the most logical place would be to move it to CreateCheckPoint() just after we remove old xlog files, but we don't have the xlogreader there, and anyway I don't see how it's really going to help. What bug do we expect to catch by removing files we think we don't need and then checking that we didn't remove the files we think we do need? That seems more like grasping at straws than a serious attempt to make things work better.
>
> So at the moment I am leaning toward the view that we should just remove this check entirely, as in the attached, proposed patch.

Can we have an elog() fatal error or warning to make sure that the last checkpoint is still readable? The case where the user (knowingly or unknowingly) or some buggy code has removed the WAL file containing the last checkpoint is still possible, and if that happens we would have a hard time tracking down the resulting unexpected behavior. Thoughts?

> Really, I think we should consider going further. If it's safe to write an end-of-recovery record rather than a checkpoint, why not do so all the time? Right now we fail to do that in the above-described "impossible" scenario where the previous checkpoint record can't be read, or if we're exiting archive recovery for some reason other than a promotion request, or if we're in single-user mode, or if we're in crash recovery. Presumably, people would like to start up the server quickly in all of those scenarios, so the only reason not to use this technology all the time is if we think it's safe in some scenarios and not others. I can't think of a reason why it matters why we're exiting archive recovery, nor can I think of a reason why it matters whether we're in single user mode. The distinction between crash recovery and archive recovery does seem to matter, but if anything the crash recovery scenario seems simpler, because then there's only one timeline involved.
>
> I realize that conservatism may have played a role in this code ending up looking the way that it does; someone seems to have thought it would be better not to rely on a new idea in all cases. From my point of view, though, it's scary to have so many cases, especially cases that don't seem like they should ever be reached. I think that simplifying the logic here and trying to do the same things in as many cases as we can will lead to better robustness. Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
>
>     if (InRecovery)
>         CreateEndOfRecoveryRecord();

+1, and do the checkpoint at the end unconditionally as we are doing for the promotion.

Regards,
Amul
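A minimal sketch of the kind of check being suggested here, under the assumption that a WARNING (rather than silently changing behavior) is the desired reaction; the placement, message wording, and use of LSN_FORMAT_ARGS() are illustrative only:

    record = ReadCheckpointRecord(xlogreader, ControlFile->checkPoint,
                                  1, false);
    if (record == NULL)
        ereport(WARNING,
                (errmsg("last checkpoint record at %X/%X is no longer readable",
                        LSN_FORMAT_ARGS(ControlFile->checkPoint))));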
On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> > Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
> >
> >     if (InRecovery)
> >         CreateEndOfRecoveryRecord();
> >
> > That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading.
>
> What are we going to do in the single user ([1]) case in this awesome future? I guess we could just not create a checkpoint until single user mode is shut down / creates a checkpoint for other reasons?

It probably depends on who writes this thus-far-hypothetical patch, but my thought is that we'd unconditionally request a checkpoint after writing the end-of-recovery record, same as we do now if (promoted). If we happened to be in single-user mode, then that checkpoint request would turn into performing a checkpoint on the spot in the one and only process we've got, but StartupXLOG() wouldn't really need to care what happens under the hood after it called RequestCheckpoint().

> [1] I really wish somebody had the energy to just remove single user and bootstrap modes. The degree to which they increase complexity in the rest of the system is entirely unreasonable. There's not actually any reason bootstrapping can't happen with checkpointer et al running, it's just autovacuum that'd need to be blocked.

I don't know whether this is the way forward or not. I think a lot of the complexity of the current system is incidental rather than intrinsic. If I were going to work on this, I'd probably work on trying to tidy up the code and reduce the number of places that need to care about IsUnderPostmaster and IsPostmasterEnvironment, rather than trying to get rid of them. I suspect that's a necessary prerequisite step anyway, and not a trivial effort either.

--
Robert Haas
EDB: http://www.enterprisedb.com
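Sketched out, the arrangement described above might look something like the following (hypothetical, not a worked-out patch). The point is that RequestCheckpoint() is assumed to fall back to running the checkpoint in the current process when there is no checkpointer, which is what covers single-user mode.

    if (InRecovery)
        CreateEndOfRecoveryRecord();

    /* ... later, once the system is otherwise up ... */

    /*
     * Always request a spread checkpoint.  With a checkpointer running
     * this just queues the request; in single-user mode the checkpoint is
     * simply performed here, in the only process we have.
     */
    RequestCheckpoint(CHECKPOINT_FORCE);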
On Thu, Jul 29, 2021 at 1:47 AM Amul Sul <sulamul@gmail.com> wrote:
> Can we have an elog() fatal error or warning to make sure that the last checkpoint is still readable? The case where the user (knowingly or unknowingly) or some buggy code has removed the WAL file containing the last checkpoint is still possible, and if that happens we would have a hard time tracking down the resulting unexpected behavior. Thoughts?

Sure, we could, but I don't think we should. Such crazy things can happen any time, not just at the point where this check is happening. It's not particularly more likely to happen here vs. any other place where we could insert a check. Should we check everywhere, all the time, just in case?

> > I realize that conservatism may have played a role in this code ending up looking the way that it does; someone seems to have thought it would be better not to rely on a new idea in all cases. From my point of view, though, it's scary to have so many cases, especially cases that don't seem like they should ever be reached. I think that simplifying the logic here and trying to do the same things in as many cases as we can will lead to better robustness. Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
> >
> >     if (InRecovery)
> >         CreateEndOfRecoveryRecord();
>
> +1, and do the checkpoint at the end unconditionally as we are doing for the promotion.

Yeah, that was my thought, too.

--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,

On 2021-07-29 12:49:19 +0800, Julien Rouhaud wrote:
> On Thu, Jul 29, 2021 at 3:28 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > [1] I really wish somebody had the energy to just remove single user and bootstrap modes. The degree to which they increase complexity in the rest of the system is entirely unreasonable. There's not actually any reason bootstrapping can't happen with checkpointer et al running, it's just autovacuum that'd need to be blocked.
>
> Any objection to adding an entry for that in the wiki TODO?

Not sure there's enough consensus on the idea for that. I personally think that's a good approach at reducing relevant complexity, but I don't know if anybody agrees...

Greetings,

Andres Freund
On Thu, Jul 29, 2021 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
> Not sure there's enough consensus on the idea for that. I personally think that's a good approach at reducing relevant complexity, but I don't know if anybody agrees...

There does seem to be agreement on the proposed patch, so I have committed it. Thanks to all for the discussion and the links to other relevant threads.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
> > >
> > >     if (InRecovery)
> > >         CreateEndOfRecoveryRecord();
> > >
> > > That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading.
> >
> > What are we going to do in the single user ([1]) case in this awesome future? I guess we could just not create a checkpoint until single user mode is shut down / creates a checkpoint for other reasons?
>
> It probably depends on who writes this thus-far-hypothetical patch, but my thought is that we'd unconditionally request a checkpoint after writing the end-of-recovery record, same as we do now if (promoted). If we happened to be in single-user mode, then that checkpoint request would turn into performing a checkpoint on the spot in the one and only process we've got, but StartupXLOG() wouldn't really need to care what happens under the hood after it called RequestCheckpoint().

I decided to try writing a patch to use an end-of-recovery record rather than a checkpoint record in all cases. The patch itself was pretty simple but didn't work. There are two different reasons why it didn't work, which I'll explain in a minute. I'm not sure whether there are any other problems; these are the only two things that cause problems with 'make check-world', but that's hardly a guarantee of anything. Anyway, I thought it would be useful to report these issues first and hopefully get some feedback.

The first problem I hit was that GetRunningTransactionData() does Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)). While that does seem like a superficially reasonable thing to assert, StartupXLOG() initializes latestCompletedXid by calling TransactionIdRetreat on the value extracted from checkPoint.nextXid, and the first-ever checkpoint that is written at the beginning of WAL has a nextXid of 3, so when starting up from that checkpoint latestCompletedXid is 2, which is not a normal XID. When we try to create the next checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls GetRunningTransactionData() and the assertion fails. In the current code, we avoid this more or less accidentally because LogStandbySnapshot() is skipped when starting from a shutdown checkpoint. If we write an end-of-recovery record and then trigger a checkpoint afterwards, then we don't avoid the problem. Although I'm impishly tempted to suggest that we #define SecondNormalTransactionId 4 and then use that to create the first checkpoint instead of FirstNormalTransactionId -- naturally with no comments explaining why we're doing it -- I think the real problem is that the assertion is just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be InvalidTransactionId or BootstrapTransactionId, but FrozenTransactionId is a legal, if corner-case, value, at least as the code stands today.

Unfortunately we can't just relax the assertion, because the XLOG_RUNNING_XACTS record will eventually be handed to ProcArrayApplyRecoveryInfo() for processing ... and that function contains a matching assertion which would in turn fail. It in turn passes the value to MaintainLatestCompletedXidRecovery() which contains yet another matching assertion, so the restriction to normal XIDs here looks pretty deliberate. There are no comments, though, so the reader is left to guess why. I see one problem: MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which expects a normal XID. Perhaps it's best to just dodge the entire issue by skipping LogStandbySnapshot() if latestCompletedXid happens to be 2, but that feels like a hack, because AFAICS the real problem is that StartupXLog() doesn't agree with the rest of the code on whether 2 is a legal case, and maybe we ought to be storing a value that doesn't need to be computed via TransactionIdRetreat().

The second problem I hit was a preexisting bug where, under wal_level=minimal, redo of a "create tablespace" record can blow away data of which we have no other copy. See http://postgr.es/m/CA+TgmoaLO9ncuwvr2nN-J4VEP5XyAcy=zKiHxQzBbFRxxGxm0w@mail.gmail.com for details. That bug happens to make src/test/recovery's 018_wal_optimize.pl fail one of the tests, because the test starts and stops the server repeatedly, with the result that with the attached patch, we just keep writing end-of-recovery records and never get time to finish a checkpoint before the next shutdown, so every test replays the CREATE TABLESPACE record and everything that previous tests did. The "wal_level = minimal, SET TABLESPACE commit subtransaction" test fails because it's the only one that (1) uses the tablespace for a new table, (2) commits, and (3) runs before a checkpoint is manually forced.

It's also worth noting that if we go this way, CHECKPOINT_END_OF_RECOVERY should be ripped out entirely. We'd still be triggering a checkpoint at the end of recovery, but because it could be running concurrently with WAL-generating activity, it wouldn't be an end-of-recovery checkpoint in the sense that we now use that term. In particular, you couldn't assume that no other write transactions are running at the point when this checkpoint is performed. I haven't yet tried ripping that out and doing so might reveal other problems.

--
Robert Haas
EDB: http://www.enterprisedb.com
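To spell out the arithmetic behind the first problem above, using the XID constants from access/transam.h (a self-contained illustration, not code from the patch):

    /*
     * Standard special XIDs, per access/transam.h:
     *   InvalidTransactionId      = 0
     *   BootstrapTransactionId    = 1
     *   FrozenTransactionId       = 2
     *   FirstNormalTransactionId  = 3
     *
     * The first-ever checkpoint carries nextXid = FirstNormalTransactionId,
     * so backing it up by one lands exactly on FrozenTransactionId:
     */
    TransactionId nextXid = FirstNormalTransactionId;      /* 3 */
    TransactionId latestCompletedXid = nextXid - 1;        /* 2 = FrozenTransactionId */

    /* TransactionIdIsNormal() is simply (xid >= FirstNormalTransactionId) */
    Assert(TransactionIdIsNormal(latestCompletedXid));     /* fails: 2 < 3 */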
Attachment
On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I decided to try writing a patch to use an end-of-recovery record rather than a checkpoint record in all cases.
>
> The first problem I hit was that GetRunningTransactionData() does Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
>
> Unfortunately we can't just relax the assertion, because the XLOG_RUNNING_XACTS record will eventually be handed to ProcArrayApplyRecoveryInfo() for processing ... and that function contains a matching assertion which would in turn fail. It in turn passes the value to MaintainLatestCompletedXidRecovery() which contains yet another matching assertion, so the restriction to normal XIDs here looks pretty deliberate. There are no comments, though, so the reader is left to guess why. I see one problem: MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which expects a normal XID. Perhaps it's best to just dodge the entire issue by skipping LogStandbySnapshot() if latestCompletedXid happens to be 2, but that feels like a hack, because AFAICS the real problem is that StartupXLog() doesn't agree with the rest of the code on whether 2 is a legal case, and maybe we ought to be storing a value that doesn't need to be computed via TransactionIdRetreat().

Anyone have any thoughts about this?

--
Robert Haas
EDB: http://www.enterprisedb.com
At Thu, 2 Sep 2021 11:30:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
> On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > I decided to try writing a patch to use an end-of-recovery record rather than a checkpoint record in all cases.
> >
> > The first problem I hit was that GetRunningTransactionData() does Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
> >
> > Unfortunately we can't just relax the assertion, because the XLOG_RUNNING_XACTS record will eventually be handed to ProcArrayApplyRecoveryInfo() for processing ... and that function contains a matching assertion which would in turn fail. It in turn passes the value to MaintainLatestCompletedXidRecovery() which contains yet another matching assertion, so the restriction to normal XIDs here looks pretty deliberate. There are no comments, though, so the reader is left to guess why. I see one problem: MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which expects a normal XID. Perhaps it's best to just dodge the entire issue by skipping LogStandbySnapshot() if latestCompletedXid happens to be 2, but that feels like a hack, because AFAICS the real problem is that StartupXLog() doesn't agree with the rest of the code on whether 2 is a legal case, and maybe we ought to be storing a value that doesn't need to be computed via TransactionIdRetreat().
>
> Anyone have any thoughts about this?

I tried to reproduce this but just replacing the end-of-recovery checkpoint request with issuing an end-of-recovery record didn't cause make check-world to fail for me. Do you have an idea of any other requirement to cause that?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Thu, 2 Sep 2021 11:30:59 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
> > On Mon, Aug 9, 2021 at 3:00 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > > I decided to try writing a patch to use an end-of-recovery record rather than a checkpoint record in all cases.
> > >
> > > The first problem I hit was that GetRunningTransactionData() does Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)).
> > >
> > > Unfortunately we can't just relax the assertion, because the XLOG_RUNNING_XACTS record will eventually be handed to ProcArrayApplyRecoveryInfo() for processing ... and that function contains a matching assertion which would in turn fail. It in turn passes the value to MaintainLatestCompletedXidRecovery() which contains yet another matching assertion, so the restriction to normal XIDs here looks pretty deliberate. There are no comments, though, so the reader is left to guess why. I see one problem: MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which expects a normal XID. Perhaps it's best to just dodge the entire issue by skipping LogStandbySnapshot() if latestCompletedXid happens to be 2, but that feels like a hack, because AFAICS the real problem is that StartupXLog() doesn't agree with the rest of the code on whether 2 is a legal case, and maybe we ought to be storing a value that doesn't need to be computed via TransactionIdRetreat().
> >
> > Anyone have any thoughts about this?
>
> I tried to reproduce this but just replacing the end-of-recovery checkpoint request with issuing an end-of-recovery record didn't cause make check-world to fail for me. Do you have an idea of any other requirement to cause that?

You might need the following change at the end of StartupXLOG():

    -   if (promoted)
    -       RequestCheckpoint(CHECKPOINT_FORCE);
    +   RequestCheckpoint(CHECKPOINT_FORCE);

Regards,
Amul
On Fri, Sep 3, 2021 at 12:52 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> I tried to reproduce this but just replacing the end-of-recovery checkpoint request with issuing an end-of-recovery record didn't cause make check-world to fail for me. Do you have an idea of any other requirement to cause that?

Did you use the patch I posted, or a different one?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Aug 10, 2021 at 12:31 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 29, 2021 at 11:49 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Jul 28, 2021 at 3:28 PM Andres Freund <andres@anarazel.de> wrote:
> > > > Imagine if instead of all the hairy logic we have now we just replaced this whole if (InRecovery) stanza with this:
> > > >
> > > >     if (InRecovery)
> > > >         CreateEndOfRecoveryRecord();
> > > >
> > > > That would be WAY easier to reason about than the rat's nest we have here today. Now, I am not sure what it would take to get there, but I think that is the direction we ought to be heading.
> > >
> > > What are we going to do in the single user ([1]) case in this awesome future? I guess we could just not create a checkpoint until single user mode is shut down / creates a checkpoint for other reasons?
> >
> > It probably depends on who writes this thus-far-hypothetical patch, but my thought is that we'd unconditionally request a checkpoint after writing the end-of-recovery record, same as we do now if (promoted). If we happened to be in single-user mode, then that checkpoint request would turn into performing a checkpoint on the spot in the one and only process we've got, but StartupXLOG() wouldn't really need to care what happens under the hood after it called RequestCheckpoint().
>
> I decided to try writing a patch to use an end-of-recovery record rather than a checkpoint record in all cases. The patch itself was pretty simple but didn't work. There are two different reasons why it didn't work, which I'll explain in a minute. I'm not sure whether there are any other problems; these are the only two things that cause problems with 'make check-world', but that's hardly a guarantee of anything. Anyway, I thought it would be useful to report these issues first and hopefully get some feedback.
>
> The first problem I hit was that GetRunningTransactionData() does Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)). While that does seem like a superficially reasonable thing to assert, StartupXLOG() initializes latestCompletedXid by calling TransactionIdRetreat on the value extracted from checkPoint.nextXid, and the first-ever checkpoint that is written at the beginning of WAL has a nextXid of 3, so when starting up from that checkpoint latestCompletedXid is 2, which is not a normal XID. When we try to create the next checkpoint, CreateCheckPoint() does LogStandbySnapshot() which calls GetRunningTransactionData() and the assertion fails. In the current code, we avoid this more or less accidentally because LogStandbySnapshot() is skipped when starting from a shutdown checkpoint. If we write an end-of-recovery record and then trigger a checkpoint afterwards, then we don't avoid the problem. Although I'm impishly tempted to suggest that we #define SecondNormalTransactionId 4 and then use that to create the first checkpoint instead of FirstNormalTransactionId -- naturally with no comments explaining why we're doing it -- I think the real problem is that the assertion is just wrong. CurrentRunningXacts->latestCompletedXid shouldn't be InvalidTransactionId or BootstrapTransactionId, but FrozenTransactionId is a legal, if corner-case, value, at least as the code stands today.
>
> Unfortunately we can't just relax the assertion, because the XLOG_RUNNING_XACTS record will eventually be handed to ProcArrayApplyRecoveryInfo() for processing ... and that function contains a matching assertion which would in turn fail. It in turn passes the value to MaintainLatestCompletedXidRecovery() which contains yet another matching assertion, so the restriction to normal XIDs here looks pretty deliberate. There are no comments, though, so the reader is left to guess why. I see one problem: MaintainLatestCompletedXidRecovery uses FullXidRelativeTo, which expects a normal XID. Perhaps it's best to just dodge the entire issue by skipping LogStandbySnapshot() if latestCompletedXid happens to be 2,

From the above explanation, it seems like it would be better to skip LogStandbySnapshot() for this proposal. Can't we consider skipping LogStandbySnapshot() by passing some explicit flag, like 'recovery_checkpoint' or something like that? I think there is some prior discussion in another thread related to this work, but it would be easier to follow if you could summarize it here.

With Regards,
Amit Kapila.
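For what it's worth, the shape of that suggestion might be something like the following inside CreateCheckPoint(); the flag name comes from the message above and is purely hypothetical, nothing like it exists in the tree today.

    /*
     * Hypothetical: skip the standby snapshot for the checkpoint that
     * immediately follows recovery, where latestCompletedXid may still be
     * FrozenTransactionId.
     */
    if (!recovery_checkpoint && XLogStandbyInfoActive())
        LogStandbySnapshot();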