Thread: Re: using an end-of-recovery record in all cases
At Fri, 3 Sep 2021 15:56:35 +0530, Amul Sul <sulamul@gmail.com> wrote in
> On Fri, Sep 3, 2021 at 10:23 AM Kyotaro Horiguchi
> <horikyota.ntt@gmail.com> wrote:
> You might need the following change at the end of StartupXLOG():
>
> -	if (promoted)
> -		RequestCheckpoint(CHECKPOINT_FORCE);
> +	RequestCheckpoint(CHECKPOINT_FORCE);

At Fri, 3 Sep 2021 10:13:53 -0400, Robert Haas <robertmhaas@gmail.com> wrote in
> Did you use the patch I posted, or a different one?

Thanks to both of you. That was my mistake. But even with the patch
(with the removal of the 018_..pl test file), I couldn't get check-world
to fail. I'll retry later.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
I was trying to understand the v1 patch and found that at the end
RequestCheckpoint() is called unconditionally. I think it should be
called only if REDO was performed. Here is the snippet from the v1 patch:

 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * Request an (online) checkpoint now. Note that, until this is complete,
+	 * a crash would start replay from the same WAL location we did, or from
+	 * the last restartpoint that completed. We don't want to let that
+	 * situation persist for longer than necessary, since users don't like
+	 * long recovery times. On the other hand, they also want to be able to
+	 * start doing useful work again as quickly as possible. Therefore, we
+	 * don't pass CHECKPOINT_IMMEDIATE, to avoid bogging down the system.
+	 *
+	 * Note that the consequence of requesting a checkpoint here only after
+	 * we've allowed WAL writes is that a single checkpoint cycle can span
+	 * multiple server lifetimes. So, for example, if you want something to
+	 * happen at least once per checkpoint cycle or at most once per
+	 * checkpoint cycle, you have to consider what happens if the server
+	 * is restarted somewhere in the middle.
 	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	RequestCheckpoint(CHECKPOINT_FORCE);

When I call it conditionally, as in the attached patch, I don't see any
regression failures; correct me if I am missing something here.

Regards,
Amul
Attachment
On Tue, Oct 5, 2021 at 7:44 AM Amul Sul <sulamul@gmail.com> wrote:
> I was trying to understand the v1 patch and found that at the end
> RequestCheckpoint() is called unconditionally. I think it should be
> called only if REDO was performed.

You're right. But I don't think we need an extra variable like this,
right? We can just test InRecovery?

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, 5 Oct 2021 at 9:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Oct 5, 2021 at 7:44 AM Amul Sul <sulamul@gmail.com> wrote:
> > I was trying to understand the v1 patch and found that at the end
> > RequestCheckpoint() is called unconditionally. I think it should be
> > called only if REDO was performed.
>
> You're right. But I don't think we need an extra variable like this,
> right? We can just test InRecovery?
No, the InRecovery flag gets cleared before this point. I think we can
use lastReplayedEndRecPtr, which you suggested in the other thread.
Regards,
Amul
On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:
> No, the InRecovery flag gets cleared before this point. I think we can
> use lastReplayedEndRecPtr, which you suggested in the other thread.

Hmm, right, that makes sense. Perhaps I should start remembering what
I said in my own emails.

--
Robert Haas
EDB: http://www.enterprisedb.com
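For illustration, a minimal sketch of the conditional request being
discussed, assuming the end of StartupXLOG() can read the last-replayed
record pointer from the usual XLogCtl shared state (this is a sketch of
the idea, not the attached patch):

	XLogRecPtr	lastReplayedEndRecPtr;

	/*
	 * InRecovery has already been cleared by this point, so instead
	 * check whether any WAL record was actually replayed; the pointer
	 * is assumed to remain invalid (0) when no redo was performed.
	 */
	SpinLockAcquire(&XLogCtl->info_lck);
	lastReplayedEndRecPtr = XLogCtl->lastReplayedEndRecPtr;
	SpinLockRelease(&XLogCtl->info_lck);

	if (!XLogRecPtrIsInvalid(lastReplayedEndRecPtr))
		RequestCheckpoint(CHECKPOINT_FORCE);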
On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:
> > No, the InRecovery flag gets cleared before this point. I think we can
> > use lastReplayedEndRecPtr, which you suggested in the other thread.
>
> Hmm, right, that makes sense. Perhaps I should start remembering what
> I said in my own emails.

Here I ended up with the attached version, in which I have dropped the
changes for the standby.c and 018_wal_optimize.pl files. Also, I am not
sure that we should have the changes for bgwriter.c and slot.c in this
patch, but I haven't touched those.

Regards,
Amul
Attachment
On Wed, Oct 6, 2021 at 7:24 PM Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, Oct 5, 2021 at 10:42 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Tue, Oct 5, 2021 at 12:41 PM Amul Sul <sulamul@gmail.com> wrote:
> > > No, the InRecovery flag gets cleared before this point. I think we can
> > > use lastReplayedEndRecPtr, which you suggested in the other thread.
> >
> > Hmm, right, that makes sense. Perhaps I should start remembering what
> > I said in my own emails.
>
> Here I ended up with the attached version, in which I have dropped the
> changes for the standby.c and 018_wal_optimize.pl files. Also, I am not
> sure that we should have the changes for bgwriter.c and slot.c in this
> patch, but I haven't touched those.

Attached is the rebased and updated version. The patch removes the
newly introduced PerformRecoveryXLogAction() function. In addition, it
removes the CHECKPOINT_END_OF_RECOVERY flag and its related code, and
drops the changes for bgwriter.c and slot.c, which seem unrelated to
this work.

Regards,
Amul
Attachment
Hi,

On Mon, Oct 18, 2021 at 10:56:53AM +0530, Amul Sul wrote:
>
> Attached is the rebased and updated version. The patch removes the
> newly introduced PerformRecoveryXLogAction() function. In addition, it
> removes the CHECKPOINT_END_OF_RECOVERY flag and its related code, and
> drops the changes for bgwriter.c and slot.c, which seem unrelated to
> this work.

The cfbot reports that this version of the patch doesn't apply anymore:
http://cfbot.cputube.org/patch_36_3365.log

=== Applying patches on top of PostgreSQL commit ID 0c53a6658e47217ad3dd416a5543fc87c3ecfd58 ===
=== applying patch ./v3-0001-Always-use-an-end-of-recovery-record-rather-than-.patch
patching file src/backend/access/transam/xlog.c
[...]
Hunk #14 FAILED at 9061.
Hunk #15 FAILED at 9241.
2 out of 15 hunks FAILED -- saving rejects to file src/backend/access/transam/xlog.c.rej

Can you send a rebased version? In the meantime I will switch the cf
entry to Waiting on Author.
On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> The cfbot reports that this version of the patch doesn't apply anymore:

Here is a new version of the patch which, unlike v1, I think is
something we could seriously consider applying (not before v16, of
course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
attach a second patch as well which nukes checkPoint.PrevTimeLineID as
well.

I mentioned two problems with $SUBJECT in the first email with this
subject line. One was a bug, which Noah has since fixed (thanks, Noah).
The other problem is that LogStandbySnapshot() and a bunch of its
friends expect latestCompletedXid to always be a normal XID, which is a
problem because (1) we currently set nextXid to 3 and (2) at startup,
we compute latestCompletedXid = nextXid - 1. The current code dodges
this kind of accidentally: the checkpoint that happens at startup is a
"shutdown checkpoint" and thus skips logging a standby snapshot, since
a shutdown checkpoint is a sure indicator that there are no running
transactions. With the changes, the checkpoint at startup happens after
we've started allowing write transactions, and thus a standby snapshot
needs to be logged as well. In the cases where the end-of-recovery
record was already being used, the problem could have happened already,
except for the fact that those cases involve a standby promotion, which
doesn't happen during initdb. I explored a few possible ways of solving
this problem.

The first thing I considered was replacing latestCompletedXid with a
name like incompleteXidHorizon. The idea is that, where
latestCompletedXid is the highest XID that is known to have committed
or aborted, incompleteXidHorizon would be the lowest XID such that all
known commits or aborts are for lower XIDs. In effect,
incompleteXidHorizon would be latestCompletedXid + 1. Since
latestCompletedXid is always normal except when it's 2,
incompleteXidHorizon would be normal in all cases. Initially this
seemed fairly promising, but it fell down when I realized that we copy
latestCompletedXid into ComputeXidHorizonsResult.latest_completed. That
seemed to me to make the consequences of the change a bit more
far-reaching than I liked. Also, it wasn't entirely clear to me that I
wouldn't be introducing any off-by-one errors into various wraparound
calculations with this approach.

The second thing I considered was skipping LogStandbySnapshot() during
initdb. There are two ways of doing this, and neither of them seems
that great to me. Something that does work is to skip
LogStandbySnapshot() when in single-user mode, but there is no
particular reason why WAL generated in single-user mode couldn't be
replayed on a standby, so this doesn't seem great. It's too big a
hammer for what we really want, which is just to suppress this during
initdb. Another way of approaching it is to skip it in bootstrap mode,
but that actually doesn't work: initdb then fails during the
post-bootstrapping step rather than during bootstrapping. I thought
about patching around that by forcing the code that generates
checkpoint records to forcibly advance an XID of 3 to 4, but that
seemed like working around the problem from the wrong end.

So ... I decided that the best approach here seems to be to just skip
FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
first write transaction that the cluster ever processes. That's very
simple and doesn't seem likely to break anything else. On the downside
it seems a bit grotty, but I don't see anything better, and on the
whole, I think with this approach we come out substantially ahead. 0001
removes 3 times as many lines as it inserts, and 0002 saves a few more
lines of code.

Now, I still don't really know that there isn't some theoretical
difficulty here that makes this whole approach a non-starter, but I
also can't think of what it might be. If the promotion case, which has
used the end-of-recovery record for many years, is basically safe,
despite the fact that it switches TLIs, then it seems to me that the
crash recovery case, which doesn't have that complication, ought to be
safe too. But I might well be missing something, so if you see a
problem, please speak up!

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
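For concreteness, a sketch of what that XID skip could look like,
assuming it is applied at the point where the first normal XID would
otherwise be handed out (the names follow PostgreSQL's transam code,
but this is an illustration of the idea, not the actual patch):

	/*
	 * Never hand out FirstNormalTransactionId (3) itself.  If the first
	 * real XID is 4, then latestCompletedXid = nextXid - 1 is always a
	 * normal XID, so GetRunningTransactionData()'s assertion holds even
	 * for the online checkpoint requested right after initdb's
	 * bootstrap run.
	 */
	if (TransactionIdEquals(XidFromFullTransactionId(ShmemVariableCache->nextXid),
							FirstNormalTransactionId))
		FullTransactionIdAdvance(&ShmemVariableCache->nextXid);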
On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sat, Jan 15, 2022 at 11:52 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > The cfbot reports that this version of the patch doesn't apply anymore:
>
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.
> [...]
> Now, I still don't really know that there isn't some theoretical
> difficulty here that makes this whole approach a non-starter, but I
> also can't think of what it might be. If the promotion case, which has
> used the end-of-recovery record for many years, is basically safe,
> despite the fact that it switches TLIs, then it seems to me that the
> crash recovery case, which doesn't have that complication, ought to be
> safe too. But I might well be missing something, so if you see a
> problem, please speak up!

 	/*
-	 * If this was a promotion, request an (online) checkpoint now. This isn't
-	 * required for consistency, but the last restartpoint might be far back,
-	 * and in case of a crash, recovering from it might take a longer than is
-	 * appropriate now that we're not in standby mode anymore.
+	 * Request an (online) checkpoint now. This isn't required for consistency,
+	 * but the last restartpoint might be far back, and in case of a crash,
+	 * recovering from it might take a longer than is appropriate now that
+	 * we're not in standby mode anymore.
 	 */
-	if (promoted)
-		RequestCheckpoint(CHECKPOINT_FORCE);
+	RequestCheckpoint(CHECKPOINT_FORCE);
 }

I think RequestCheckpoint() should be called conditionally. What is the
need for the checkpoint if we haven't been through recovery, in other
words, when starting up from a clean shutdown?

Regards,
Amul
On Mon, Apr 18, 2022 at 11:56 PM Amul Sul <sulamul@gmail.com> wrote:
> I think RequestCheckpoint() should be called conditionally. What is the
> need for the checkpoint if we haven't been through recovery, in other
> words, when starting up from a clean shutdown?

Good point. v5 attached.

--
Robert Haas
EDB: http://www.enterprisedb.com
Attachment
On Mon, Apr 18, 2022 at 04:44:03PM -0400, Robert Haas wrote:
> Here is a new version of the patch which, unlike v1, I think is
> something we could seriously consider applying (not before v16, of
> course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> well.

I'd like to add a big +1 for this change. IIUC this should help with
some of the problems I've noted elsewhere [0].

> I mentioned two problems with $SUBJECT in the first email with this
> subject line. One was a bug, which Noah has since fixed (thanks,
> Noah). The other problem is that LogStandbySnapshot() and a bunch of
> its friends expect latestCompletedXid to always be a normal XID, which
> is a problem because (1) we currently set nextXid to 3 and (2) at
> startup, we compute latestCompletedXid = nextXid - 1.
> [...]
> I explored a few possible ways of solving this problem.

Shouldn't latestCompletedXid be set to MaxTransactionId in this case?
Or is this related to the logic in FullTransactionIdRetreat() that
avoids skipping over the "actual" special transaction IDs?

> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.
> 0001 removes 3 times as many lines as it inserts, and 0002 saves a few
> more lines of code.

This doesn't seem all that bad to me. It's a little hacky, but it's
very easy to understand and only happens once per initdb. I don't think
it's worth any extra complexity.

> Now, I still don't really know that there isn't some theoretical
> difficulty here that makes this whole approach a non-starter, but I
> also can't think of what it might be. If the promotion case, which has
> used the end-of-recovery record for many years, is basically safe,
> despite the fact that it switches TLIs, then it seems to me that the
> crash recovery case, which doesn't have that complication, ought to be
> safe too. But I might well be missing something, so if you see a
> problem, please speak up!

Your reasoning seems sound to me.

[0] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
At Tue, 19 Apr 2022 13:37:59 -0700, Nathan Bossart <nathandbossart@gmail.com> wrote in
> On Mon, Apr 18, 2022 at 04:44:03PM -0400, Robert Haas wrote:
> > Here is a new version of the patch which, unlike v1, I think is
> > something we could seriously consider applying (not before v16, of
> > course). It now removes CHECKPOINT_END_OF_RECOVERY completely, and I
> > attach a second patch as well which nukes checkPoint.PrevTimeLineID as
> > well.
>
> I'd like to add a big +1 for this change. IIUC this should help with
> some of the problems I've noted elsewhere [0].

Agreed.

> Shouldn't latestCompletedXid be set to MaxTransactionId in this case?
> Or is this related to the logic in FullTransactionIdRetreat() that
> avoids skipping over the "actual" special transaction IDs?

As a result, FullTransactionIdRetreat(FirstNormalFullTransactionId)
yields FrozenTransactionId, which looks odd. It seems to me the result
should rather be InvalidFullTransactionId, or that case should simply
assert out. But incrementing the very first XID avoids all that
complexity. It is somewhat hacky but very simple and understandable.

> > So ... I decided that the best approach here seems to be to just skip
> > FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> > first write transaction that the cluster ever processes. That's very
> > simple and doesn't seem likely to break anything else. On the downside
> > it seems a bit grotty, but I don't see anything better, and on the
> > whole, I think with this approach we come out substantially ahead.
>
> This doesn't seem all that bad to me. It's a little hacky, but it's
> very easy to understand and only happens once per initdb. I don't think
> it's worth any extra complexity.

+1.

> Your reasoning seems sound to me.

FWIW, I don't find a flaw in the reasoning either.

> [0] https://postgr.es/m/C1EE64B0-D4DB-40F3-98C8-0CED324D34CB%40amazon.com

By the way, do we need to keep CheckPoint.PrevTimeLineID? It always has
the same value as ThisTimeLineID. The most dubious part is
ApplyWalRecord, but XLOG_CHECKPOINT_SHUTDOWN no longer induces a
timeline switch.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Apr 19, 2022 at 2:14 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> [...]
> So ... I decided that the best approach here seems to be to just skip
> FirstNormalTransactionId and use FirstNormalTransactionId + 1 for the
> first write transaction that the cluster ever processes. That's very
> simple and doesn't seem likely to break anything else. On the downside
> it seems a bit grotty, but I don't see anything better, and on the
> whole, I think with this approach we come out substantially ahead.

IIUC, the failure was something like this on initdb:

running bootstrap script ... TRAP: FailedAssertion("TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid)", File: "procarray.c", Line: 2892, PID: 60363)
/bin/postgres(ExceptionalCondition+0xb9)[0xb3917d]
/bin/postgres(GetRunningTransactionData+0x36c)[0x96aa26]
/bin/postgres(LogStandbySnapshot+0x64)[0x974393]
/bin/postgres(CreateCheckPoint+0x67f)[0x5928bf]
/bin/postgres(RequestCheckpoint+0x26)[0x8ca649]
/bin/postgres(StartupXLOG+0xf51)[0x591126]
/bin/postgres(InitPostgres+0x188)[0xb4f2ac]
/bin/postgres(BootstrapModeMain+0x4d3)[0x5ac6de]
/bin/postgres(main+0x275)[0x7ca72e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f71af82d445]
/bin/postgres[0x48aae9]
child process was terminated by signal 6: Aborted
initdb: removing data directory "/inst/data"

That was happening because RequestCheckpoint() was called from
StartupXLOG() unconditionally, but with the v5 patch that is no longer
true. If my understanding is correct, then we don't need any special
handling for latestCompletedXid, at least in this patch.

Regards,
Amul
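To make the arithmetic behind that assertion failure concrete, here is
a tiny standalone illustration; the constants mirror PostgreSQL's
transam.h, but this is an explanatory toy, not PostgreSQL code:

	#include <assert.h>
	#include <stdint.h>

	typedef uint32_t TransactionId;

	#define FrozenTransactionId			((TransactionId) 2)
	#define FirstNormalTransactionId	((TransactionId) 3)
	#define TransactionIdIsNormal(xid)	((xid) >= FirstNormalTransactionId)

	int
	main(void)
	{
		/* At bootstrap, nextXid starts at FirstNormalTransactionId (3). */
		TransactionId nextXid = FirstNormalTransactionId;

		/* Startup computes latestCompletedXid = nextXid - 1 == 2 ... */
		TransactionId latestCompletedXid = nextXid - 1;

		/*
		 * ... which is FrozenTransactionId, not a normal XID, so the
		 * Assert(TransactionIdIsNormal(...)) in GetRunningTransactionData()
		 * fires once a standby snapshot gets logged during initdb.
		 */
		assert(latestCompletedXid == FrozenTransactionId);
		assert(!TransactionIdIsNormal(latestCompletedXid));
		return 0;
	}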
On Tue, Apr 19, 2022 at 4:38 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
> Shouldn't latestCompletedXid be set to MaxTransactionId in this case?
> Or is this related to the logic in FullTransactionIdRetreat() that
> avoids skipping over the "actual" special transaction IDs?

The problem here is this code:

	/* also initialize latestCompletedXid, to nextXid - 1 */
	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
	LWLockRelease(ProcArrayLock);

If nextXid is 3, then latestCompletedXid gets 2. But in
GetRunningTransactionData:

	Assert(TransactionIdIsNormal(CurrentRunningXacts->latestCompletedXid));

> Your reasoning seems sound to me.

I was talking with Thomas Munro yesterday and he thinks there is a
problem with relfilenode reuse here. In normal running, when a relation
is dropped, we leave behind a 0-length file until the next checkpoint;
this keeps that relfilenode from being used even if the OID counter
wraps around. If we didn't do that, then imagine that while running
with wal_level=minimal, we drop an existing relation, create a new
relation with the same OID, load some data into it, and crash, all
within the same checkpoint cycle. Then we will be able to replay the
drop, but we will not be able to restore the relation contents
afterward, because at wal_level=minimal they are not logged.
Apparently, we don't create tombstone files during recovery because we
know that there will be a checkpoint at the end.

With the existing use of the end-of-recovery record, we always know
that wal_level > minimal, because we're only using it on standbys. But
with this use that wouldn't be true any more. So I guess we need to
start creating tombstone files even during recovery, or else do
something like what Dilip coded up in
http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
which I think would be a better solution, at least in the long term.

--
Robert Haas
EDB: http://www.enterprisedb.com
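For readers following along, a simplified sketch of the tombstone
behavior being described, loosely modeled on the unlink logic in
src/backend/storage/smgr/md.c (error handling and fork/segment details
omitted; this is an illustration, not the actual function):

	#include <fcntl.h>
	#include <unistd.h>

	static void
	unlink_relation_file(const char *path, bool in_recovery)
	{
		if (!in_recovery)
		{
			/*
			 * Normal running: truncate the first segment to zero length
			 * instead of removing it.  The empty "tombstone" keeps the
			 * relfilenode from being reused until the next checkpoint,
			 * which is when the file is actually unlinked.
			 */
			int		fd = open(path, O_RDWR);

			if (fd >= 0)
			{
				(void) ftruncate(fd, 0);
				close(fd);
				return;			/* real unlink is deferred */
			}
		}

		/*
		 * During recovery (today): remove the file immediately, relying
		 * on the checkpoint at the end of recovery, which is the very
		 * assumption the end-of-recovery-record approach would remove.
		 */
		unlink(path);
	}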
On Wed, Apr 20, 2022 at 09:26:07AM -0400, Robert Haas wrote:
> I was talking with Thomas Munro yesterday and he thinks there is a
> problem with relfilenode reuse here. In normal running, when a relation
> is dropped, we leave behind a 0-length file until the next checkpoint;
> this keeps that relfilenode from being used even if the OID counter
> wraps around. If we didn't do that, then imagine that while running
> with wal_level=minimal, we drop an existing relation, create a new
> relation with the same OID, load some data into it, and crash, all
> within the same checkpoint cycle. Then we will be able to replay the
> drop, but we will not be able to restore the relation contents
> afterward, because at wal_level=minimal they are not logged.
> Apparently, we don't create tombstone files during recovery because we
> know that there will be a checkpoint at the end.

In the example you provided, won't the tombstone file already be
present before the crash? During recovery, the tombstone file will be
removed, and the new relation wouldn't use the same relfilenode anyway.
I'm probably missing something obvious here.

I do see the problem if we drop an existing relation, crash, reuse the
relfilenode, and then crash again (all within the same checkpoint
cycle). The first recovery would remove the tombstone file, and the
second recovery would wipe out the new relation's files.

> With the existing use of the end-of-recovery record, we always know
> that wal_level > minimal, because we're only using it on standbys. But
> with this use that wouldn't be true any more. So I guess we need to
> start creating tombstone files even during recovery, or else do
> something like what Dilip coded up in
> http://postgr.es/m/CAFiTN-u=r8UTCSzu6_pnihYAtwR1=esq5sRegTEZ2tLa92fovA@mail.gmail.com
> which I think would be a better solution, at least in the long term.

IMO this would be good just to reduce the branching a bit. I suppose
removing the files immediately during recovery might be an optimization
in some cases, but I am skeptical that it really makes that much of a
difference in practice.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
On Thu, Apr 21, 2022 at 5:02 AM Nathan Bossart <nathandbossart@gmail.com> wrote:
> I do see the problem if we drop an existing relation, crash, reuse the
> relfilenode, and then crash again (all within the same checkpoint
> cycle). The first recovery would remove the tombstone file, and the
> second recovery would wipe out the new relation's files.

Right, the double-crash case is what I was worrying about. I'm not
sure, but it might even be more likely than usual that you'll reuse the
same relfilenode after the first crash, because the OID allocator will
start from the same value.
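To spell out that hazard sequence (a hypothetical timeline, with
wal_level = minimal and everything inside one checkpoint cycle):

1. DROP TABLE removes relation R; a 0-length tombstone file is left
   behind for R's relfilenode.
2. Crash. Recovery replays the drop and, because tombstones are not
   created during recovery, removes the file entirely.
3. After restart, the OID allocator resumes from the same value, so a
   new relation may be assigned R's old relfilenode; data loaded into
   it is not WAL-logged at wal_level = minimal.
4. Crash again before any checkpoint. The second recovery replays the
   original drop once more and wipes out the new relation's files,
   whose contents cannot be restored from WAL.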