Re: WAL segments removed from primary despite the fact that logical replication slot needs it. - Mailing list pgsql-bugs
From | Amit Kapila |
---|---|
Subject | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
Date | |
Msg-id | CAA4eK1Ls_UPvmvCRApxWVfR-dr7m1G5JoWTCH+Zp=Z6HZNWehw@mail.gmail.com |
In response to | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: WAL segments removed from primary despite the fact that logical replication slot needs it. |
List | pgsql-bugs |
On Wed, Nov 16, 2022 at 12:08 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 16, 2022 at 11:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > > > > Found one in the time frame you mentioned:
> > > > > 2022-11-10 21:03:24.612 UTC,"upgrayedd","canvas",21748,"10.1.238.101:35640",636d671b.54f4,39,"idle",2022-11-10 21:03:23 UTC,7/0,0,DEBUG,00000,"failed to increase restart lsn: proposed 1039D/8B5773D8, after 1039D/9170B010, current candidate 1039D/83825958, current after 1039D/8B5773D8, flushed up to 1039D/91F41B50",,,,,,,,,"focal14"
> > > > >
> > > >
> > > > Thanks!
> > > >
> > > > LSN 1039D/8B5773D8 seems to be related to this issue. If we advance
> > > > slot's restart_lsn to this LSN, we remove WAL files older than
> > > > 000000000001039D0000008A.
> > > >
> > > > In LogicalIncreaseRestartDecodingForSlot(), since
> > > > "current_lsn(1039D/9170B010) <
> > > > slot->data.confirmed_flush(1039D/91F41B50)", we executed the following
> > > > part and called LogicalConfirmReceivedLocation():
> > > >
> > > > else if (current_lsn <= slot->data.confirmed_flush)
> > > > {
> > > >     slot->candidate_restart_valid = current_lsn;
> > > >     slot->candidate_restart_lsn = restart_lsn;
> > > >
> > > >     /* our candidate can directly be used */
> > > >     updated_lsn = true;
> > > > }
> > > >
> > >
> > > If this would have been executed in
> > > LogicalIncreaseRestartDecodingForSlot(), then the values displayed in
> > > the above DEBUG messages "current candidate 1039D/83825958, current
> > > after 1039D/8B5773D8" should be the same as proposed and after
> > > "proposed 1039D/8B5773D8, after 1039D/9170B010". Am, I missing
> > > something?
> >
> > Oh, you're right.
> >
> > Given restart_lsn was 1039D/8B5773D8, slot->data.restart_lsn was equal
> > to or greater than 1039D/8B5773D8 at that time but
> > slot->candidate_restart_lsn was 1039D/83825958, right? Which is weird.
> >
>
> Yes, that is weird but it had been a bit obvious if the same LOG would
> have printed slot->data.restart_lsn. This means that somehow slot's
> 'candidate_restart_lsn' somehow went behind its 'restart_lsn'. I can't
> figure out yet how that can happen but if that happens then the slot's
> restart_lsn can retreat in LogicalConfirmReceivedLocation() because we
> don't check if slot's candidate_restart_lsn is lesser than its
> restart_lsn before assigning the same in line
> MyReplicationSlot->data.restart_lsn =
> MyReplicationSlot->candidate_restart_lsn;. I think that should be
> checked. Note that we call LogicalConfirmReceivedLocation() can be
> called from ProcessStandbyReplyMessage(), so once the wrong
> candidate_restart_lsn is set, it can be assigned to restart_lsn from
> other code paths as well.
>
> I am not able to think how 'candidate_restart_lsn' can be set to an
> LSN value prior to 'restart_lsn'.
>

In the below part of the code, we use the value of
'last_serialized_snapshot' for restart_lsn.

    else if (txn == NULL &&
             builder->reorder->current_restart_decoding_lsn != InvalidXLogRecPtr &&
             builder->last_serialized_snapshot != InvalidXLogRecPtr)
        LogicalIncreaseRestartDecodingForSlot(lsn,
                                              builder->last_serialized_snapshot);

Now, say, after a restart, we start reading from the slot's restart_lsn,
which is 1039D/8B5773D8. At this LSN, we restore a snapshot that has
last_serialized_snapshot set to 1039D/83825958. If that happens, then
in LogicalIncreaseRestartDecodingForSlot() we can set these values in
the slot's candidate_*_lsn variables. Once that has happened, the next
time LogicalConfirmReceivedLocation() is called, the slot's restart_lsn
will be moved back. Once it is moved back, yet another restart will
lead to this problem.

Does this theory make sense?

--
With Regards,
Amit Kapila.