Thread: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Hackers,
I'd like to discuss a problem with replication slots' restart LSN. Physical slots are saved to disk at the beginning of a checkpoint. At the end of the checkpoint, old WAL segments are recycled or removed from disk if they are not retained by slots' restart_lsn values.
If an existing physical slot is advanced in the middle of checkpoint execution, the WAL segments required by the restart LSN that was saved to disk may be removed. This is because the minimal LSN across replication slots is calculated at the end of the checkpoint, just prior to old WAL segment removal. If the postgres instance is hard-stopped (pg_ctl -m immediate) right after the checkpoint and then restarted, the slot's restart_lsn may point to a removed WAL segment. I believe such behaviour is not good.
The documentation [0] describes that restart_lsn may move back to some past value after a restart. There is a discussion [1] on pgsql-hackers where this behaviour is covered. The main reason for not flushing physical slots on advance is performance. I'm OK with that behaviour, except that the corresponding WAL segments should not be removed.
I propose to retain WAL segments based on the saved-on-disk (flushed) restart_lsn of slots: add a new field restart_lsn_flushed to the ReplicationSlot structure and copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It doesn't change the on-disk format of the slot contents. I attached a patch. It is not yet complete, but it demonstrates a way to solve the problem.
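To illustrate the idea, here is a minimal sketch (not the actual patch; field placement, locking and the exact save path are simplified):

    typedef struct ReplicationSlot
    {
        /* ... existing fields ... */

        /* restart_lsn as it was last written to disk (in-memory only) */
        XLogRecPtr  restart_lsn_flushed;
    } ReplicationSlot;

    /* In SaveSlotToPath(), once the slot state file has been fsync'ed: */
    SpinLockAcquire(&slot->mutex);
    slot->restart_lsn_flushed = slot->data.restart_lsn;
    SpinLockRelease(&slot->mutex);

The WAL removal horizon would then be computed from restart_lsn_flushed rather than from the in-memory restart_lsn.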
I reproduced the problem in the following way:
- Add some delay in CheckPointBuffers (pg_usleep) to emulate a long checkpoint execution.
- Execute a checkpoint, and run pg_replication_slot_advance from another connection right after the checkpoint starts.
- Hard-restart the server right after the checkpoint completes.
- After the restart, the slot's restart_lsn may point to a removed WAL segment.
The proposed patch fixes it.
[0] https://www.postgresql.org/docs/current/logicaldecoding-explanation.html
[1] https://www.postgresql.org/message-id/flat/059cc53a-8b14-653a-a24d-5f867503b0ee%40postgrespro.ru
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Hackers,
To ping the topic, I'd like to clarify what may be wrong with the idea described here, because I do not see any interest from the community. The topic is related to physical replication. The primary idea is to define the WAL segment (file) removal horizon based on the restart LSN values saved on disk. Currently, the removal horizon is calculated from the slots' current in-memory restart LSN values, which may not have been saved to disk at the time of the calculation. This happens when a slot is advanced during a checkpoint, as described earlier in the topic.
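For reference, a simplified sketch of how that horizon is derived today, modeled on ReplicationSlotsComputeRequiredLSN() (error handling and some details omitted) - note that it reads the in-memory restart_lsn, not the value on disk:

    XLogRecPtr  min_required = InvalidXLogRecPtr;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
    for (int i = 0; i < max_replication_slots; i++)
    {
        ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
        XLogRecPtr  restart_lsn;

        if (!s->in_use)
            continue;

        SpinLockAcquire(&s->mutex);
        restart_lsn = s->data.restart_lsn;      /* may not be flushed yet */
        SpinLockRelease(&s->mutex);

        if (restart_lsn != InvalidXLogRecPtr &&
            (min_required == InvalidXLogRecPtr || restart_lsn < min_required))
            min_required = restart_lsn;
    }
    LWLockRelease(ReplicationSlotControlLock);

    /* later read by KeepLogSeg() at the end of the checkpoint */
    XLogSetReplicationSlotMinimumLSN(min_required);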
Such behaviour is not a problem when slots are used only for physical replication in the conventional way. But it may be a problem when a physical slot is used for other purposes. For example, I have an extension which retains WAL using physical replication slots. It creates a new physical slot and advances it as needed. After a restart, it uses the slot's restart_lsn to read WAL from that LSN. In this case, there is no guarantee that restart_lsn will point to an existing WAL segment.
The advantage of the current behaviour is that a little less WAL has to be kept. The disadvantage is that, in general, physical slots do not guarantee that WAL is retained starting from their restart_lsn.
I would be happy to get some advice on whether I am on the right track. Thank you in advance.
With best regards,
Vitaly
On Thursday, November 07, 2024 16:30 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:
Dear Hackers,
I'd like to introduce an improved version of my patch (see the attached file). My original idea was to take into account the saved-on-disk restart_lsn (slot->restart_lsn_flushed) for persistent slots when removing WAL segment files. It helps to avoid errors like: ERROR: requested WAL segment 000...0AA has already been removed.
Improvements:
- restart_lsn_flushed is used only for RS_PERSISTENT slots (see the sketch after this list).
- Save the physical slot to disk on advance only once - when restart_lsn_flushed is still invalid. This is needed because slots with an invalid restart LSN are not taken into account when calculating the oldest LSN for WAL truncation; once restart_lsn becomes valid, it should be saved to disk immediately to update restart_lsn_flushed.
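A rough illustration of the first point, using a hypothetical helper (the real patch inlines this logic in the horizon computation):

    /* Which restart_lsn should hold back WAL removal for this slot? */
    static XLogRecPtr
    slot_lsn_for_wal_removal(ReplicationSlot *s)
    {
        /* persistent slots survive a restart with their on-disk value ... */
        if (s->data.persistency == RS_PERSISTENT)
            return s->restart_lsn_flushed;

        /* ... ephemeral/temporary slots don't, so the in-memory value is fine */
        return s->data.restart_lsn;
    }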
Regression tests seem to be OK, except:
- recovery/t/001_stream_rep.pl (a checkpoint is needed)
- recovery/t/019_replslot_limit.pl (it seems the slot was invalidated; some adjustments are needed)
- pg_basebackup/t/020_pg_receivewal.pl (not sure about it)
There are some problems:
- More WAL segments may be kept. This may lead to slot invalidations in some tests (recovery/t/019_replslot_limit.pl); a couple of tests should be adjusted.
With best regards,
Vitaly Davydov
On Thursday, October 31, 2024 13:32 MSK, "Vitaly Davydov" <v.davydov@postgrespro.ru> wrote:
Sorry, I attached the missing patch.
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On 10/31/24 11:18, Vitaly Davydov wrote:
> Dear Hackers,
>
> I'd like to discuss a problem with replication slots's restart LSN.
> Physical slots are saved to disk at the beginning of checkpoint. At the
> end of checkpoint, old WAL segments are recycled or removed from disk,
> if they are not kept by slot's restart_lsn values.

I agree that if we can lose WAL still needed for a replication slot, that is a bug. Retaining the WAL is the primary purpose of slots, and we just fixed a similar issue for logical replication.

> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> miminal LSN is occured at the end of checkpoint, prior to old WAL
> segments removal. If to hard stop (pg_stl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.

Not sure I 100% follow, but let me rephrase, just so that we're on the same page. CreateCheckPoint() does this:

    ... something ...
    CheckPointGuts(checkPoint.redo, flags);
    ... something ...
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
                       checkPoint.ThisTimeLineID);

The slots get synced in CheckPointGuts(), so IIUC you're saying if the slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may remove still-needed WAL, because we happen to consider a fresh restart_lsn when calculating the logSegNo. Is that right?

> The doc [0] describes that restart_lsn may be set to the some past value
> after reload. There is a discussion [1] on pghackers where such
> behaviour is discussed. The main reason of not flushing physical slots
> on advancing is a performance reason. I'm ok with such behaviour, except
> of that the corresponding WAL segments should not be removed.

I don't know which part of [0] you refer to, but I guess you're referring to this:

    The current position of each slot is persisted only at checkpoint, so in
    the case of a crash the slot may return to an earlier LSN, which will
    then cause recent changes to be sent again when the server restarts.
    Logical decoding clients are responsible for avoiding ill effects from
    handling the same message more than once.

Yes, it's fine if we discard the new in-memory restart_lsn value, and we do this for performance reasons - flushing the slot on every advance would be very expensive. I haven't read [1] as it's quite long, but I guess that's what it says.

But we must not make any "permanent" actions based on the unflushed value, I think. Like, we should not remove WAL segments, for example.

> I propose to keep WAL segments by saved on disk (flushed) restart_lsn of
> slots. Add a new field restart_lsn_flushed into ReplicationSlot
> structure. Copy restart_lsn to restart_lsn_flushed in SaveSlotToPath. It
> doesn't change the format of storing the slot contents on disk. I
> attached a patch. It is not yet complete, but demonstate a way to solve
> the problem.

That seems like a possible fix for this, yes. And maybe it's the right one.

> I reproduced the problem by the following way:
>
> * Add some delay in CheckPointBuffers (pg_usleep) to emulate long
>   checkpoint execution.
> * Execute checkpoint and pg_replication_slot_advance right after
>   starting of the checkpoint from another connection.
> * Hard restart the server right after checkpoint completion.
> * After restart slot's restart_lsn may point to removed WAL segment.
>
> The proposed patch fixes it.

I tried to reproduce the issue using a stress test (checkpoint+restart in a loop), but so far without success :-(

Can you clarify where exactly you added the pg_usleep(), and how long are the waits you added? I wonder if the sleep is really needed, considering the checkpoints are spread anyway. Also, what do you mean by "hard reset"?

What confuses me a bit is that we update the restart_lsn (and call ReplicationSlotsComputeRequiredLSN() to recalculate the global value) all the time. Walsender does that in PhysicalConfirmReceivedLocation for example. So we actually see the required LSN to move during checkpoint very often. So how come we don't see the issues much more often? Surely I miss something important.

Another option might be that pg_replication_slot_advance() doesn't do something it should be doing. For example, shouldn't it be marking the slot as dirty?

regards

--
Tomas Vondra
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On 11/20/24 14:40, Vitaly Davydov wrote:
> Dear Hackers,
>
> To ping the topic, I'd like to clarify what may be wrong with the idea
> described here, because I do not see any interest from the community.
> The topic is related to physical replication. The primary idea is to
> define the horizon of WAL segments (files) removal based on saved on
> disk restart LSN values. Now, the WAL segment removal horizon is
> calculated based on the current restart LSN values of slots, that can
> not be saved on disk at the time of the horizon calculation. The case
> take place when a slot is advancing during checkpoint as described
> earlier in the topic.

Yeah, a simple way to fix this might be to make sure we don't use the required LSN value set after CheckPointReplicationSlots() to remove WAL. AFAICS the problem is KeepLogSeg() gets the new LSN value by:

    keep = XLogGetReplicationSlotMinimumLSN();

Let's say we get the LSN before calling CheckPointGuts(), and then pass it to KeepLogSeg, so that it doesn't need to get the fresh value. Wouldn't that fix the issue?

> Such behaviour is not a problem when slots are used only for physical
> replication in a conventional way. But it may be a problem when physical
> slot is used for some other goals. For example, I have an extension
> which keeps the WAL using physical replication slots. It creates a new
> physical slot and advances it as needed. After restart, it can use
> restart lsn of the slot to read WAL from this LSN. In this case, there
> is no guarantee that restart lsn will point to an existing WAL segment.

Yeah.

> The advantage of the current behaviour is that it requires a little bit
> less WAL to keep. The disadvantage is that physical slots do not
> guarantee WAL keeping starting from its' restart lsns in general.

If it's wrong, it doesn't really matter that it has some advantages.

regards

--
Tomas Vondra
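A rough sketch of what that reordering could look like in CreateCheckPoint(), assuming KeepLogSeg() grows an extra parameter (simplified; the real function has many more steps, and the extra argument is hypothetical):

    XLogRecPtr  slot_min_lsn;

    /* capture the slots' minimum required LSN before the slots are synced */
    slot_min_lsn = XLogGetReplicationSlotMinimumLSN();

    CheckPointGuts(checkPoint.redo, flags);     /* slots are written to disk here */

    /*
     * Decide how much WAL to keep using the value captured above, instead of
     * re-reading the (possibly advanced, not yet flushed) in-memory minimum.
     */
    KeepLogSeg(recptr, &_logSegNo, slot_min_lsn);   /* hypothetical third argument */
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr, checkPoint.ThisTimeLineID);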
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On 11/20/24 18:24, Tomas Vondra wrote:
>
> ...
>
> What confuses me a bit is that we update the restart_lsn (and call
> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
> example. So we actually see the required LSN to move during checkpoint
> very often. So how come we don't see the issues much more often? Surely
> I miss something important.
>

This question "How come we don't see this more often?" kept bugging me, and the answer is actually pretty simple.

The restart_lsn can move backwards after a hard restart (for the reasons explained), but physical replication does not actually rely on that. The replica keeps track of the LSN it received (well, it uses the same LSN), and on reconnect it sends the startpoint to the primary. And the primary just proceeds to use that instead of the (stale) restart LSN for the slot. And the startpoint is guaranteed (I think) to be at least restart_lsn.

AFAICS this would work for pg_replication_slot_advance() too, that is if you remember the last LSN the slot advanced to, it should be possible to advance to it just fine. Of course, it requires a way to remember that LSN, which for a replica is not an issue. But this just highlights we can't rely on restart_lsn for this purpose.

(Apologies if this repeats something obvious, or something you already said, Vitaly.)

regards

--
Tomas Vondra
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On 11/20/24 23:19, Tomas Vondra wrote:
> On 11/20/24 18:24, Tomas Vondra wrote:
>>
>> ...
>>
>> What confuses me a bit is that we update the restart_lsn (and call
>> ReplicationSlotsComputeRequiredLSN() to recalculate the global value)
>> all the time. Walsender does that in PhysicalConfirmReceivedLocation for
>> example. So we actually see the required LSN to move during checkpoint
>> very often. So how come we don't see the issues much more often? Surely
>> I miss something important.
>>
>
> This question "How come we don't see this more often?" kept bugging me,
> and the answer is actually pretty simple.
>
> The restart_lsn can move backwards after a hard restart (for the reasons
> explained), but physical replication does not actually rely on that. The
> replica keeps track of the LSN it received (well, it uses the same LSN),
> and on reconnect it sends the startpoint to the primary. And the primary
> just proceeds use that instead of the (stale) restart LSN for the slot.
> And the startpoint is guaranteed (I think) to be at least restart_lsn.
>
> AFAICS this would work for pg_replication_slot_advance() too, that is if
> you remember the last LSN the slot advanced to, it should be possible to
> advance to it just fine. Of course, it requires a way to remember that
> LSN, which for a replica is not an issue. But this just highlights we
> can't rely on restart_lsn for this purpose.
>

I kept thinking about this (sorry it's this incremental), particularly if this applies to logical replication too. And AFAICS it does not, or at least not to this extent.

For streaming, the subscriber sends the startpoint (just like physical replication), so it should be protected too.

But then there's the SQL API - pg_logical_slot_get_changes(). And it turns out it ends up syncing the slot to disk pretty often, because for RUNNING_XACTS we call LogicalDecodingProcessRecord() + standby_decode(), which ends up calling SaveSlotToDisk(). And at the end we call LogicalConfirmReceivedLocation() for good measure, which saves the slot too, just to be sure.

FWIW I suspect this still is not perfectly safe, because we may still crash / restart before the updated data.restart_lsn makes it to disk, but after it was already used to remove old WAL, although that's probably harder to hit. With streaming the subscriber will still send us the new startpoint, so that should not fail I think. But with the SQL API we probably can get into the "segment already removed" issues.

I haven't tried reproducing this yet, I guess it should be possible using the injection points. Not sure when I get to this, though.

In any case, doesn't this suggest SaveSlotToDisk() really is not that expensive, if we do it pretty often for logical replication? Which was presented as the main reason why pg_replication_slot_advance() doesn't do that. Maybe it should?

If the advance is substantial I don't think it really matters, because there simply can't be that many large advances. It amortizes, in a way. But even with smaller advances it should be fine, I think - if the goal is to not remove WAL prematurely, it's enough to flush when we move to the next segment.

regards

--
Tomas Vondra
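A sketch of that last idea - flushing the slot from pg_replication_slot_advance() only when the advance crosses a WAL segment boundary (hypothetical; the function does not do this today, and the variable names just follow the surrounding code):

    XLogSegNo   old_seg,
                new_seg;

    /* startlsn is the slot's restart_lsn before the advance, moveto the target */
    XLByteToSeg(startlsn, old_seg, wal_segment_size);
    XLByteToSeg(moveto, new_seg, wal_segment_size);

    ReplicationSlotMarkDirty();
    if (new_seg != old_seg)
        ReplicationSlotSave();      /* at most one flush per WAL segment */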
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi Tomas,
Thank you for the reply and your interest in the investigation.
On Wednesday, November 20, 2024 20:24 MSK, Tomas Vondra <tomas@vondra.me> wrote:
> If an existing physical slot is advanced in the middle of checkpoint
> execution, WAL segments, which are related to saved on disk restart LSN
> may be removed. It is because the calculation of the replication slot
> miminal LSN is occured at the end of checkpoint, prior to old WAL
> segments removal. If to hard stop (pg_stl -m immediate) the postgres
> instance right after checkpoint and to restart it, the slot's
> restart_lsn may point to the removed WAL segment. I believe, such
> behaviour is not good.
>
Not sure I 100% follow, but let me rephrase, just so that we're on the
same page. CreateCheckPoint() does this:
... something ...
CheckPointGuts(checkPoint.redo, flags);
... something ...
RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr,
checkPoint.ThisTimeLineID);
The slots get synced in CheckPointGuts(), so IIUC you're saying if the
slot gets advanced shortly after the sync, the RemoveOldXlogFiles() may
remove still-needed WAL, because we happen to consider a fresh
restart_lsn when calculating the logSegNo. Is that right?
The key action here is to restart the instance with -m immediate (or kill it and start it again) right after the checkpoint. After the restart, the slot's restart_lsn will be read from disk and may point to a removed WAL segment, if its LSN was advanced enough to switch to a new WAL segment.
I tried to reproduce the issue using a stress test (checkpoint+restart
in a loop), but so far without success :-(
Can you clarify where exactly you added the pg_usleep(), and how long
are the waits you added? I wonder if the sleep is really needed,
considering the checkpoints are spread anyway. Also, what you mean by
"hard reset"?
I added the pg_usleep as shown below (in the CheckPointBuffers function):

 void
 CheckPointBuffers(int flags)
 {
     BufferSync(flags);
+    pg_usleep(10000000);    /* 10 s delay to widen the race window */
 }
Below are the instructions for how I run the test (the pg_usleep change above should be applied first):
CONSOLE> initdb -D pgdata
CONSOLE> pg_ctl -D pgdata -l logfile start
... open two psql terminals and connect to the database (lets call them as PSQL-1, PSQL-2)
PSQL-1> select pg_create_physical_replication_slot('myslot', true, false);
CONSOLE> pgbench -i -s 10 postgres # create some WAL records
PSQL-1> checkpoint; -- press ENTER, then switch to the PSQL-2 console and execute the next line within 1 second
PSQL-2> select pg_replication_slot_advance('myslot', pg_current_wal_lsn()); -- advance repslot during checkpoint
... wait for checkpoint to complete
CONSOLE> pg_ctl -D pgdata -m immediate stop
CONSOLE> pg_ctl -D pgdata start
PSQL-1> \c
PSQL-1> create extension pg_walinspect;
PSQL-1> select pg_get_wal_record_info(restart_lsn) from pg_replication_slots where slot_name = 'myslot';
ERROR: requested WAL segment pg_wal/000000010000000000000001 has already been removed
I'm trying to create a Perl test to reproduce it. Please give me some time to create the test script.
I kept thinking about this (sorry it's this incremental), particularly
if this applies to logical replication too. And AFAICS it does not, or
at least not to this extent.
Yes, it does not apply to logical replication, because a logical slot is synced to disk when it is advanced.
Yeah, a simple way to fix this might be to make sure we don't use the
required LSN value set after CheckPointReplicationSlots() to remove WAL.
AFAICS the problem is KeepLogSeg() gets the new LSN value by:
keep = XLogGetReplicationSlotMinimumLSN();
Let's say we get the LSN before calling CheckPointGuts(), and then pass
it to KeepLogSeg, so that it doesn't need to get the fresh value.
Yes, it is another solution and it can fix the problem. The question is which solution to choose. Well, I prefer to add a new in-memory state variable to the slot structure. Such a variable may also be useful if we want to check whether the slot data is synced or not. Calculating the keep value before CheckPointGuts(), IMHO, requires changing the signatures of a number of functions. I can prepare a new patch where your solution is implemented.
I'm sorry if I missed answering some other questions. I will answer later.
With best regards,
Vitaly
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Hackers,

Please let me introduce a new version of the patch. Patch description:

Slot data is flushed to disk at the beginning of a checkpoint. If an existing slot is advanced in the middle of checkpoint execution, its advanced restart LSN is used to calculate the oldest LSN for WAL segment removal at the end of the checkpoint. If the node is restarted just after the checkpoint, the slot data will be read from disk at recovery with the stale restart LSN, which can refer to removed WAL segments.

The patch introduces a new in-memory state for slots - flushed_restart_lsn - which is used to calculate the oldest LSN for WAL segment removal. This state is updated with the current restart_lsn every time the slot is saved to disk.

With best regards,
Vitaly
Attachment
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Vitaly!

On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> The slot data is flushed to the disk at the beginning of checkpoint. If
> an existing slot is advanced in the middle of checkpoint execution, its
> advanced restart LSN is taken to calculate the oldest LSN for WAL
> segments removal at the end of checkpoint. If the node is restarted just
> after the checkpoint, the slots data will be read from the disk at
> recovery with the oldest restart LSN which can refer to removed WAL
> segments.
>
> The patch introduces a new in-memory state for slots -
> flushed_restart_lsn which is used to calculate the oldest LSN for WAL
> segments removal. This state is updated every time with the current
> restart_lsn at the moment, when the slot is saving to disk.

Thank you for your work on this subject. I think generally your approach is correct. When we're truncating the WAL log, we need to rely on the position that would be used in the case of a server crash. That is the position flushed to the disk.

While your patch generally looks good, I'd like to make the following notes:

1) As ReplicationSlotsComputeRequiredLSN() is called each time we need to advance the position of WAL needed by replication slots, the usage pattern probably could be changed. Thus, we probably need to call ReplicationSlotsComputeRequiredLSN() somewhere after a change of restart_lsn_flushed while restart_lsn is not changed. And we probably can skip ReplicationSlotsComputeRequiredLSN() in some cases when only restart_lsn is changed.

2) I think it's essential to include in the patch test cases which fail without the patch. You could start from integrating the [1] test into your patch, and then add more similar tests for different situations.

Links.
1. https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me

------
Regards,
Alexander Korotkov
Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi Alexander,

Thank you for the review. I apologize for the late reply. I missed your email.

> 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
> to advance the position of WAL needed by replication slots, the usage
> pattern probably could be changed. Thus, we probably need to call
> ReplicationSlotsComputeRequiredLSN() somewhere after change of
> restart_lsn_flushed while restart_lsn is not changed. And probably
> can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
> restart_lsn is changed.

Yes, it is a good idea for investigation, thank you! I guess it may work for persistent slots, but I'm not sure about the other types of slots (ephemeral and temporary). I have no clear understanding of the consequences at the moment. I propose to postpone it for the future, because the proposed changes would be more invasive.

> 2) I think it's essential to include into the patch test caches which
> fail without patch. You could start from integrating [1] test into
> your patch, and then add more similar tests for different situations.

The problem with TAP tests is that the issue is hard to reproduce without injection points. Tomas Vondra's tests create two new injection points, and I would have to add more injection points for new tests as well. Injection points help to test the code but make the code unreadable. I'm not sure what the policy for creating injection points is. Personally, I would not like to add new injection points only to check this particular rare case. I'm trying to create a test without injection points that should fail occasionally, but I haven't succeeded at the moment.

I have a question - is there any interest in backporting the solution into existing major releases? I can prepare a patch where restart_lsn_flushed is stored outside of the ReplicationSlot structure and doesn't affect the existing API.

With best regards,
Vitaly

On Friday, April 04, 2025 06:22 MSK, Alexander Korotkov <aekorotkov@gmail.com> wrote:

> Hi, Vitaly!
>
> On Mon, Mar 3, 2025 at 5:12 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > The slot data is flushed to the disk at the beginning of checkpoint. If
> > an existing slot is advanced in the middle of checkpoint execution, its
> > advanced restart LSN is taken to calculate the oldest LSN for WAL
> > segments removal at the end of checkpoint. If the node is restarted just
> > after the checkpoint, the slots data will be read from the disk at
> > recovery with the oldest restart LSN which can refer to removed WAL
> > segments.
> >
> > The patch introduces a new in-memory state for slots -
> > flushed_restart_lsn which is used to calculate the oldest LSN for WAL
> > segments removal. This state is updated every time with the current
> > restart_lsn at the moment, when the slot is saving to disk.
>
> Thank you for your work on this subject. I think generally your
> approach is correct. When we're truncating the WAL log, we need to
> reply on the position that would be used in the case of server crush.
> That is the position flushed to the disk.
>
> While your patch is generality looks good, I'd like make following notes:
>
> 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
> to advance the position of WAL needed by replication slots, the usage
> pattern probably could be changed. Thus, we probably need to call
> ReplicationSlotsComputeRequiredLSN() somewhere after change of
> restart_lsn_flushed while restart_lsn is not changed. And probably
> can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
> restart_lsn is changed.
>
> 2) I think it's essential to include into the patch test caches which
> fail without patch. You could start from integrating [1] test into
> your patch, and then add more similar tests for different situations.
>
> Links.
> 1. https://www.postgresql.org/message-id/e3ac0535-e7a2-4a96-9b36-9f765e9cfec5%40vondra.me
>
> ------
> Regards,
> Alexander Korotkov
> Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Thu, Apr 24, 2025 at 5:32 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> Thank you for the review. I apologize for a late reply. I missed your email.
>
> > 1) As ReplicationSlotsComputeRequiredLSN() is called each time we need
> > to advance the position of WAL needed by replication slots, the usage
> > pattern probably could be changed. Thus, we probably need to call
> > ReplicationSlotsComputeRequiredLSN() somewhere after change of
> > restart_lsn_flushed while restart_lsn is not changed. And probably
> > can skip ReplicationSlotsComputeRequiredLSN() in some cases when only
> > restart_lsn is changed.
>
> Yes, it is a good idea for investigation, thank you! I guess, It may work for
> persistent slots but I'm not sure about other types of slots (ephemeral and
> temporary). I have no clear understanding of consequences at the moment. I
> propose to postpone it for future, because the proposed changes will me more
> invasive.

Yes, that's different for different types of slots. So, removing ReplicationSlotsComputeRequiredLSN() doesn't look safe. But at least, we need to analyze if we need to add extra calls.

> > 2) I think it's essential to include into the patch test caches which
> > fail without patch. You could start from integrating [1] test into
> > your patch, and then add more similar tests for different situations.
>
> The problem with TAP tests - it is hard to reproduce without injection points.
> The Tomas Vondra's tests create two new injection points. I have to add more
> injection points for new tests as well. Injection points help to test the code
> but make the code unreadable. I'm not sure, what is the policy of creating
> injection points? Personally, I would not like to add new injection points
> only to check this particular rare case. I'm trying to create a test without
> injection points that should fail occasionally, but I haven't succeeded at
> the moment.

I don't know if there is an explicit policy. I think we just add them as needed to reproduce important situations in TAP tests. So, feel free to add as many as you want to reproduce all the problematic situations. During review we can find if they seem too many, but don't bother about this at the present stage.

> I have a question - is there any interest to backport the solution into
> existing major releases?

As long as this is a bug, it should be backpatched to all supported affected releases.

> I can prepare a patch where restart_lsn_flushed stored
> outside of ReplicationSlot structure and doesn't affect the existing API.

Yes, please!

------
Regards,
Alexander Korotkov
Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Mon, Apr 28, 2025 at 8:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> > I have a question - is there any interest to backport the solution into
> > existing major releases?
>
> As long as this is the bug, it should be backpatched to all supported
> affected releases.

Yes, but I think we cannot back-patch the proposed fix to back branches as it changes the ReplicationSlot struct defined in slot.h, which breaks ABI compatibility.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Tue, Apr 29, 2025 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Mon, Apr 28, 2025 at 8:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > > I have a question - is there any interest to backport the solution into
> > > existing major releases?
> >
> > As long as this is the bug, it should be backpatched to all supported
> > affected releases.
>
> Yes, but I think we cannot back-patch the proposed fix to back
> branches as it changes the ReplicationSlot struct defined in slot.h,
> which breaks ABI compatibility.

Yes, and I think Vitaly already proposed to address this issue. This aspect also needs to be carefully reviewed for sure.

On Thu, Apr 24, 2025 at 5:32 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> I can prepare a patch where restart_lsn_flushed stored
> outside of ReplicationSlot structure and doesn't affect the existing API.

------
Regards,
Alexander Korotkov
Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Mon, Apr 28, 2025 at 6:39 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, Apr 29, 2025 at 4:03 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Mon, Apr 28, 2025 at 8:17 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > > I have a question - is there any interest to backport the solution into
> > > > existing major releases?
> > >
> > > As long as this is the bug, it should be backpatched to all supported
> > > affected releases.
> >
> > Yes, but I think we cannot back-patch the proposed fix to back
> > branches as it changes the ReplicationSlot struct defined in slot.h,
> > which breaks ABI compatibility.
>
> Yes, and I think Vitaly already proposed to address this issue. This
> aspect also needs to be carefully reviewed for sure.
>
> On Thu, Apr 24, 2025 at 5:32 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> > I can prepare a patch where restart_lsn_flushed stored
> > outside of ReplicationSlot structure and doesn't affect the existing API.

Oh, I totally missed this part. Sorry for making noise. I'll review the patch once submitted.

Regarding the proposed patch, I think we can somewhat follow last_saved_confirmed_flush field of ReplicationSlot. For example, we can set restart_lsn_flushed when restoring the slot from the disk, etc.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear All,

Thank you for the attention to the patch. I updated the patch with a better solution for the master branch, which can be easily backported to the other branches once we agree on the final solution. Two tests are introduced which are based on Tomas Vondra's test for logical slots with injection points from the discussion (patches [1], [2], [3]). The tests are implemented as module tests in the src/test/modules/test_replslot_required_lsn directory. I slightly modified the original test for logical slots and created a new, simpler test for physical slots without any additional injection points.

I prepared a new solution (patch [4]) which is also based on Tomas Vondra's proposal. With a fresh eye, I realized that it can fix the issue as well. It is easy and less invasive to implement. The new solution differs from my original solution: it is backward compatible (it doesn't require any changes in the ReplicationSlot structure). My original solution could be made backward compatible as well, by allocating flushed_restart_lsn in a separate array in shmem rather than in the ReplicationSlot structure, but I believe the new solution is the better one. If you still think that my previous solution is better (I don't think so), I will prepare a backward-compatible patch with my previous solution.

I also proposed one more commit (patch [5]) which removes unnecessary calls of the ReplicationSlotsComputeRequiredLSN function, which seem to be redundant. This function updates the oldest required LSN for slots, and it is called every time a slot's restart_lsn is changed. Since we use the oldest required LSN only in CreateCheckPoint/CreateRestartPoint to remove old WAL segments, I believe there is no need to calculate the oldest value immediately when the slot is advanced, or in the other cases where restart_lsn is changed. It may affect the GetWALAvailability function because the oldest required LSN will not be up to date, but this function seems to be used only for the pg_get_replication_slots system view and doesn't affect the logic of old WAL segment removal. I also have some doubts concerning the advancing of logical replication slots: the call of ReplicationSlotsComputeRequiredLSN was removed, and I'm not sure how it may affect ReplicationSlotsComputeRequiredXmin. This commit is not necessary for the fix, but I think it is worth considering. It may be dropped or applied only to the master branch.

This patch can be easily backported to the major release branches. I will quickly prepare the patches for the major releases once we agree on the final solution. I apologize for such a late change to the patch when it is already in the commitfest. I'm not well experienced yet in the internals of PostgreSQL; sometimes the better solution needs some time to grow. In doing we learn :)

[1] 0001-Add-injection-points-to-test-replication-slot-advanc.v2.patch
[2] 0002-Add-TAP-test-to-check-logical-repl-slot-advance-duri.v2.patch
[3] 0003-Add-TAP-test-to-check-physical-repl-slot-advance-dur.v2.patch
[4] 0004-Keep-WAL-segments-by-slot-s-flushed-restart-LSN.v2.patch
[5] 0005-Remove-redundant-ReplicationSlotsComputeRequiredLSN-.v2.patch

With best regards,
Vitaly
Attachment
- 0003-Add-TAP-test-to-check-physical-repl-slot-advance-dur.v2.patch
- 0004-Keep-WAL-segments-by-slot-s-flushed-restart-LSN.v2.patch
- 0001-Add-injection-points-to-test-replication-slot-advanc.v2.patch
- 0005-Remove-redundant-ReplicationSlotsComputeRequiredLSN-.v2.patch
- 0002-Add-TAP-test-to-check-logical-repl-slot-advance-duri.v2.patch
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi Alexander,

Thank you very much for the review!

> The patchset doesn't seem to build after 371f2db8b0, which adjusted
> the signature of the INJECTION_POINT() macro. Could you, please,
> update the patchset accordingly.

I've updated the patch (see attached). Thanks.

> I see in 0004 patch we're calling XLogGetReplicationSlotMinimumLSN()
> before slots synchronization then use it for WAL truncation.
> Generally looks good. But what about the "if
> (InvalidateObsoleteReplicationSlots(...))" branch? It calls
> XLogGetReplicationSlotMinimumLSN() again. Why would the value
> obtained from the latter call reflect slots as they are synchronized
> to the disk?

In patch 0004 I call XLogGetReplicationSlotMinimumLSN() again to keep the old behaviour - this function was called in KeepLogSeg prior to my change. I also call CheckPointReplicationSlots on the next line to save the invalidated and other dirty slots to disk again, to make sure the new oldest LSN is in sync. The problem I tried to solve in this if-branch is to fix the test src/test/recovery/t/019_replslot_limit.pl, which failed because the WAL was not truncated enough for the test to pass. In general, this branch is not necessary, and we may fix the test by calling checkpoint twice (please see the alternative.rej patch for this case). If you think we should incorporate this new change, I'm OK to do it, but the WAL will be truncated more lazily. Furthermore, I think we can save slots to disk right after invalidation, not in CheckPointGuts, to avoid saving invalidated slots twice.

With best regards,
Vitaly
Attachment
- 0002-Add-TAP-test-to-check-logical-repl-slot-advance-duri.v3.patch
- 0004-Keep-WAL-segments-by-slot-s-flushed-restart-LSN.v3.patch
- alternative.rej
- 0003-Add-TAP-test-to-check-physical-repl-slot-advance-dur.v3.patch
- 0005-Remove-redundant-ReplicationSlotsComputeRequiredLSN-.v3.patch
- 0001-Add-injection-points-to-test-replication-slot-advanc.v3.patch
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Amit!

Thank you for your attention to this patchset!

On Sat, May 24, 2025 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> >
> > I spend more time on this. The next revision is attached. It
> > contains revised comments and other cosmetic changes. I'm going to
> > backpatch 0001 to all supported branches,
> >
>
> Is my understanding correct that we need 0001 because
> PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> changing the slot's restart_lsn?

Yes. Also, even if it did save the slot to disk, there is still a race condition: a concurrent checkpoint could use the updated value from shared memory to clean old WAL segments, and then a crash happens before we manage to write the slot to disk.

> If so, shouldn't the comments (One
> could argue that the slot should be saved to disk now, but that'd be
> energy wasted - the worst thing lost information could cause here is
> to give wrong information in a statistics view) in
> PhysicalConfirmReceivedLocation() be changed to mention the risk of
> not saving the slot?
>
> Also, after 0001, even the same solution will be true for logical
> slots as well, where we are already careful to save the slot to disk
> if its restart_lsn is changed, see LogicalConfirmReceivedLocation().
> So, won't that effort be wasted? Even if we don't want to do anything
> about it (which doesn't sound like a good idea), we should note that
> in comments somewhere.

I have added the comments about both points in the attached revision of the patchset.

------
Regards,
Alexander Korotkov
Supabase
Attachment
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Sat, May 24, 2025 at 6:59 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> Hi, Amit!
>
> Thank you for your attention to this patchset!
>
> On Sat, May 24, 2025 at 2:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Sat, May 24, 2025 at 4:08 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
> > >
> > > I spend more time on this. The next revision is attached. It
> > > contains revised comments and other cosmetic changes. I'm going to
> > > backpatch 0001 to all supported branches,
> > >
> >
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes. Also, even if it would save slot to the disk, there is still
> race condition that concurrent checkpoint could use updated value from
> the shared memory to clean old WAL segments, and then crash happens
> before we managed to write the slot to the disk.
>

How can that happen, if we first write the updated value to disk and then update the shared memory as we do in LogicalConfirmReceivedLocation?

BTW, won't there be a similar problem with the physical slot's xmin computation as well? In PhysicalReplicationSlotNewXmin(), after updating the slot's xmin computation, we mark it as dirty and update shared memory values. Now, say after the checkpointer writes these xmin values to disk, the walsender receives another feedback message and updates the slot's xmin values. Now, using these updated shared memory values, vacuum removes the rows; however, a restart will show the older xmin values in the slot, which means vacuum would have removed the required rows before the restart.

As per my understanding, neither the xmin nor the LSN problem exists for logical slots. I am pointing this out to indicate we may need to think of a different solution for physical slots, if these are problems only for physical slots.

--
With Regards,
Amit Kapila.
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Alexander, Amit, All,

> Amit wrote:
> > Is my understanding correct that we need 0001 because
> > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > changing the slot's restart_lsn?
>
> Yes. Also, even if it would save slot to the disk, there is still
> race condition that concurrent checkpoint could use updated value from
> the shared memory to clean old WAL segments, and then crash happens
> before we managed to write the slot to the disk.
>
> How can that happen, if we first write the updated value to disk and
> then update the shared memory as we do in
> LogicalConfirmReceivedLocation?

I guess that the problem with logical slots still exists. Please see the TAP test src/test/recovery/t/046_logical_slot.pl from the v6 version of the patch. A race condition may happen when a logical slot's restart_lsn was changed but not yet written to disk. Imagine there is another physical slot which is advanced at this moment. It recomputes the oldest min LSN and takes into account the changed but not yet saved-to-disk restart_lsn of the logical slot. We come to a situation where the WAL segment for the logical slot's restart_lsn may be truncated after an immediate restart.

I'm not sure what may happen with two checkpoints which execute in parallel, but I would say that patch 0001 guarantees that every checkpoint run will trim the WAL segments based on the restart LSNs of slots that are already saved on disk. The rule to trim the WAL by the saved slot's restart_lsn will not be violated.

> Amit wrote:
> As per my understanding, neither the xmin nor the LSN problem exists
> for logical slots. I am pointing this out to indicate we may need to
> think of a different solution for physical slots, if these are
> problems only for physical slots.

As I've already said, it indirectly affects the logical slots as well.

> Alexander wrote:
> I spend more time on this. The next revision is attached. It
> contains revised comments and other cosmetic changes. I'm going to
> backpatch 0001 to all supported branches, and 0002 to 17 where
> injection points were introduced.

Alexander, thank you for polishing the patch. Just my opinion, I would prefer to put the tests before the fix, because then you can reproduce the problem by simply checking out the commit with the tests. If the tests come after the fix, you are not able to do it that way. Anyway, I'm OK with your changes. Thank you!

I made some changes in the patch (v7 is attached):
* Removed the modules/test_replslot_required_lsn directory. It is not needed anymore, once you've moved the test files to another directory.
* Renamed the tests to 046_checkpoint_logical_slot.pl and 047_checkpoint_physical_slot.pl. I believe such names are more descriptive.

Please consider these changes.

With best regards,
Vitaly
Attachment
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Mon, May 26, 2025 at 3:52 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
>
> Dear Alexander, Amit, All
>
> > Amit wrote:
> > > Is my understanding correct that we need 0001 because
> > > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > > changing the slot's restart_lsn?
> >
> > Yes. Also, even if it would save slot to the disk, there is still
> > race condition that concurrent checkpoint could use updated value from
> > the shared memory to clean old WAL segments, and then crash happens
> > before we managed to write the slot to the disk.
> >
> > How can that happen, if we first write the updated value to disk and
> > then update the shared memory as we do in
> > LogicalConfirmReceivedLocation?
>
> I guess, that the problem with logical slots still exist. Please, see the tap
> test: src/test/recovery/t/046_logical_slot.pl from the v6 version of the patch.
> A race condition may happen when logical slot's restart_lsn was changed but not
> yet written to the disk. Imagine, there is another physical slot which is
> advanced at this moment. It recomputes oldest min LSN and takes into account
> changed but not saved to disk restart_lsn of the logical slot. We come to the
> situation when the WAL segment for the logical slot's restart_lsn may be
> truncated after immediate restart.
>

Okay, so I was missing the point that the physical slots can consider the updated value of the logical slot's restart_lsn. The point I was advocating for logical slots sanctity was when no physical slots are involved. When updating replicationSlotMinLSN value in shared memory, the logical slot machinery took care that the value we use should be flushed to disk. One can argue that we should improve physical slots machinery so that it also takes care to write the slot to disk before updating the replicationSlotMinLSN, which is used to remove WAL. I understand that the downside is physical slots will be written to disk with a greater frequency, which will not be good from the performance point of view, but can we think of doing it for the period when a checkpoint is in progress?

OTOH, if we don't want to adjust physical slot machinery, it seems saving the logical slots to disk immediately when its restart_lsn is updated is a waste of effort after your patch, no? If so, why are we okay with that?

I understand that your proposed patch fixes the reported problem but I am slightly afraid that the proposed solution is not a good idea w.r.t logical slots so I am trying to see if there are any other alternative ideas to fix this issue.

--
With Regards,
Amit Kapila.
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Amit,

> OTOH, if we don't want to adjust physical
> slot machinery, it seems saving the logical slots to disk immediately
> when its restart_lsn is updated is a waste of effort after your patch,
> no? If so, why are we okay with that?

I agree that saving logical slots on advance is a possible waste of effort, but I don't understand the original ideas behind it. I haven't touched it, to keep the patch minimal and not break the existing functionality.

We trim WAL only in the checkpoint (or restart point) operation. The slots' restart_lsn is used to keep the WAL from truncation. I believe we need to compute the slots' oldest LSN, as the minimal value of the restart_lsn values, only when executing a checkpoint (or restart point). I guess it doesn't depend on the slot's type (logical or physical). We have the 0003 patch to fix it.

I haven't deeply investigated the slots' xmin values yet, but I guess the xmin values are a different story than restart_lsn. They are used to prevent tuple deletion by vacuum, and they are updated in a different way. I can't say that LogicalConfirmReceivedLocation is the right place to update the saved-on-disk xmin values. I would propose to update these values in SaveSlotToPath under some lock, to avoid concurrent reads of unsaved values, or to do it in the checkpoint, as for restart_lsn. We may investigate and improve it in another patch.

With best regards,
Vitaly
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Mon, May 26, 2025 at 2:43 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, May 26, 2025 at 3:52 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote:
> >
> > Dear Alexander, Amit, All
> >
> > > Amit wrote:
> > > > Is my understanding correct that we need 0001 because
> > > > PhysicalConfirmReceivedLocation() doesn't save the slot to disk after
> > > > changing the slot's restart_lsn?
> > >
> > > Yes. Also, even if it would save slot to the disk, there is still
> > > race condition that concurrent checkpoint could use updated value from
> > > the shared memory to clean old WAL segments, and then crash happens
> > > before we managed to write the slot to the disk.
> > >
> > > How can that happen, if we first write the updated value to disk and
> > > then update the shared memory as we do in
> > > LogicalConfirmReceivedLocation?
> >
> > I guess, that the problem with logical slots still exist. Please, see the tap
> > test: src/test/recovery/t/046_logical_slot.pl from the v6 version of the patch.
> > A race condition may happen when logical slot's restart_lsn was changed but not
> > yet written to the disk. Imagine, there is another physical slot which is
> > advanced at this moment. It recomputes oldest min LSN and takes into account
> > changed but not saved to disk restart_lsn of the logical slot. We come to the
> > situation when the WAL segment for the logical slot's restart_lsn may be
> > truncated after immediate restart.
> >
>
> Okay, so I was missing the point that the physical slots can consider
> the updated value of the logical slot's restart_lsn. The point I was
> advocating for logical slots sanctity was when no physical slots are
> involved. When updating replicationSlotMinLSN value in shared memory,
> the logical slot machinery took care that the value we use should be
> flushed to disk. One can argue that we should improve physical slots
> machinery so that it also takes care to write the slot to disk before
> updating the replicationSlotMinLSN, which is used to remove WAL. I
> understand that the downside is physical slots will be written to disk
> with a greater frequency, which will not be good from the performance
> point of view, but can we think of doing it for the period when a
> checkpoint is in progress?

That could cause a replication slowdown while checkpointing is in progress. This is certainly better than slowing down the replication permanently, but it still doesn't look good.

> OTOH, if we don't want to adjust physical
> slot machinery, it seems saving the logical slots to disk immediately
> when its restart_lsn is updated is a waste of effort after your patch,
> no? If so, why are we okay with that?

I don't think so. I think the reason why logical slots are synced to disk immediately after update is that logical changes are not idempotent (you can't safely apply the same change twice), unlike physical block-level changes. This is why logical slots need to be synced: to prevent double replication of the same changes, which could lead, for example, to double insertion.

> I understand that your proposed patch fixes the reported problem but I
> am slightly afraid that the proposed solution is not a good idea w.r.t
> logical slots so I am trying to see if there are any other alternative
> ideas to fix this issue.

I don't understand the exact concerns about this fix. For sure, we can try to implement a fix hacking LogicalConfirmReceivedLocation() and PhysicalConfirmReceivedLocation(). But that would be way more cumbersome, especially if we have to keep ABI compatibility. Also, it doesn't seem to me that either LogicalConfirmReceivedLocation() or PhysicalConfirmReceivedLocation() currently tries to address this issue: LogicalConfirmReceivedLocation() implements immediate sync for different reasons.

------
Regards,
Alexander Korotkov
Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Thu, May 29, 2025 at 5:29 PM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Tue, May 27, 2025 at 2:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Yeah, we should be able to change ABI during beta, but I can't comment
> > on the idea of effective_restart_lsn without seeing the patch or a
> > detailed explanation of this idea.
>
> Could you, please, check the patch [1]. It implements this idea
> except it names new field restart_lsn_flushed instead of
> effective_restart_lsn.

This appears to be a better direction than the other patch, at least for HEAD. I noticed a few points while looking at the patch.

1. restart_lsn_flushed: Can we name it as last_saved_restart_lsn based on existing variable last_saved_confirmed_flush?
2. There are no comments as to why this is considered only for persistent slots when CheckPointReplicationSlots doesn't have any such check.
3. Please see if it makes sense to copy it in the copy_replication_slot.

Apart from these, I am not sure if there are still any pending comments in the thread to be handled for this patch, so please see to avoid missing anything.

> > Now, you see my point related to restart_lsn computation for logical
> > slots, it is better to also do some analysis of the problem related to
> > xmin I have highlighted in one of my previous emails [1]. I see your
> > response to it, but I feel someone needs to give it a try by writing a
> > test and see the behavior. I am saying because logical slots took
> > precaution of flushing to disk before updating shared values of xmin
> > for a reason, whereas similar precautions are not taken for physical
> > slots, so there could be a problem with that computation as well.
>
> I see LogicalConfirmReceivedLocation() performs correctly while
> updating effective_catalog_xmin only after syncing the slot to the
> disk. I don't see how effective_xmin gets updates with the logical
> replication progress though. Could you get me some clue on this,
> please?

As per my understanding, for logical slots, effective_xmin is only set during the initial copy phase (or say if one has to export a snapshot), after that, its value won't change. Please read the comments in CreateInitDecodingContext() where we set its value. If you still have questions about it, we can discuss further.

--
With Regards,
Amit Kapila.
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Alexander, Amit

Alexander Korotkov wrote: > Also, I've changed ReplicationSlotsComputeRequiredLSN() call to > CheckPointReplicationSlots() to update required LSN after > SaveSlotToPath() updated last_saved_restart_lsn. This helps to pass > checks in 001_stream_rep.pl without additional hacks.

Thank you for the improvement and patch preparation. I confirm the test passes without additional hacks now.

I still do not understand why this solution is favored. It is, in my opinion, a non-backward-compatible solution. In any case, I'm OK to go with this patch. If needed, I may prepare a backward-compatible solution where the last_saved_restart_lsn values would live in another place in shmem, rather than in the ReplicationSlot struct.

I would still like to add my 5 cents to the discussion.

The purpose of the xmin value is to prevent tuples from being vacuumed. Slots' restart_lsn values are used to calculate the oldest LSN that keeps WAL segments from being removed at checkpoint. These processes are pretty independent.

Logical slots are advanced in two steps. In the first step, the logical decoding machinery periodically sets consistent candidate values for catalog_xmin and restart_lsn. In the second step, when LogicalConfirmReceivedLocation is called, the candidate values are assigned to catalog_xmin and restart_lsn based on the confirmed LSN value, and the slot is saved with these consistent values. It is important that the candidate values are consistent; decoding guarantees it. In case of a crash, we should guarantee that the catalog_xmin and restart_lsn values loaded from disk are consistent and valid for logical slots. The LogicalConfirmReceivedLocation function keeps this consistency by updating them from consistent candidate values in a single operation.

We have to guarantee that we use the values saved to disk to calculate the xmin horizon and the slots' oldest LSN. For this purpose, effective_catalog_xmin is used. We update effective_catalog_xmin in LogicalConfirmReceivedLocation just after saving the slot to disk. Another place where we update effective_catalog_xmin is when the walsender receives a hot standby feedback message.

Since we have two independent processes (vacuuming, checkpoint), we can calculate the xmin horizon and the oldest WAL LSN independently (at different times) from the values saved to disk. Note that these values are updated in a non-atomic way.

The xmin value is set when the node receives hot standby feedback, and it is used to keep tuples from being vacuumed, just as catalog_xmin is for the decoding machinery. I am not sure xmin is applicable to logical replication.

The confirmed flush LSN is used as a startpoint when a peer node doesn't provide the start LSN, and to check that the start LSN is not older than the latest confirmed flush LSN. Saving the slot to disk at each call of LogicalConfirmReceivedLocation doesn't help to avoid conflicts completely, but it helps to decrease the probability of conflicts. So, I'm still not sure we need to save logical slots on each advance to avoid conflicts, because it doesn't help in general. The conflicts should be resolved by other means.

Since we truncate old WAL segments only at checkpoint, I believe it is OK to calculate the oldest LSN only at the beginning of the checkpoint, as was done in the alternative solution. I think we could also update the xmin horizon only at checkpoint, but horizon advancement would be lazier in that case.
Taking these thoughts into account, I can't see any problems with the alternative patch, where the oldest WAL LSN is calculated only at checkpoint. With best regards, Vitaly
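For concreteness, the alternative ordering described above could look roughly like the following sketch. RemoveWalSegmentsBefore() and the function name itself are hypothetical; XLogGetReplicationSlotMinimumLSN(), CheckPointReplicationSlots(), XLByteToSeg() and wal_segment_size are existing PostgreSQL symbols. This is an illustration of the idea, not the actual checkpoint code.

```
#include "postgres.h"
#include "access/xlog.h"
#include "access/xlog_internal.h"
#include "replication/slot.h"

/* Hypothetical helper: removes WAL segments older than the given segment. */
extern void RemoveWalSegmentsBefore(XLogSegNo cutoff_seg);

static void
CheckpointWalRemovalSketch(XLogRecPtr checkpoint_redo)
{
	XLogRecPtr	slot_horizon;
	XLogRecPtr	cutoff_lsn;
	XLogSegNo	cutoff_seg;

	/* 1. The slot horizon is taken at the beginning of the checkpoint ... */
	slot_horizon = XLogGetReplicationSlotMinimumLSN();

	/* 2. ... then dirty slots are flushed to disk ... */
	CheckPointReplicationSlots(false);

	/*
	 * 3. ... and old segments are removed using the horizon captured in
	 * step 1, so a restart_lsn advanced mid-checkpoint cannot move the
	 * cutoff past what is already saved on disk.
	 */
	cutoff_lsn = checkpoint_redo;
	if (!XLogRecPtrIsInvalid(slot_horizon) && slot_horizon < cutoff_lsn)
		cutoff_lsn = slot_horizon;

	XLByteToSeg(cutoff_lsn, cutoff_seg, wal_segment_size);
	RemoveWalSegmentsBefore(cutoff_seg);
}
```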
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Thu, Jun 5, 2025 at 8:51 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote: > > Dear Alexander, Amit > > Alexander Korotkov wrote: > > Also, I've changed ReplicationSlotsComputeRequiredLSN() call to > > CheckPointReplicationSlots() to update required LSN after > > SaveSlotToPath() updated last_saved_restart_lsn. This helps to pass > > checks in 001_stream_rep.pl without additional hacks. > > Thank you for the improvement and patch preparation. I confirm the test is > passed without additional hacks now. > > I sill do not understand why this solution is favored. It is, in my opinion, > a non backward-compatible solution. In any case, I'm ok to go with this patch. > If needed, I may prepare a backward-compatible solution where > last_saved_restart_lsn values will be in an another place of the shmem, rather > than in ReplicationSlot struct. > I think we can use this approach for HEAD and probably keep the previous idea for backbranches. Keeping some value in shared_memory per slot sounds risky to me in terms of introducing new bugs. > I still would like to add my 5 cents to the discussion. > > The purpose of the xmin value is to prevent tuples from vacuuming. Slots' > restart_lsn values are used to calculate the oldest lsn to keep WAL segments > from removal in checkpoint. These processes are pretty independent. > > The logical slots are advanced in 2 steps. At the first step, the logical > decoding stuff periodically sets consistent candidate values for catalog_xmin and > restart_lsn. At the second step, when LogicalConfirmReceivedLocation is called, > the candidate values are assigned on catalog_xmin and restart_lsn values based > on the confirmed lsn value. The slot is saved with these consistent values. > It is important, that the candidate values are consistent, decoding guarantees > it. In case of a crash, we should guarantee that the loaded from the disk > catalog_xmin and restart_lsn values are consistent and valid for logical slots. > LogicalConfirmReceivedLocation function keeps this consistency by updating them > from consistent candidate values in a single operation. > > We have to guarantee that we use saved to disk values to calculate xmin horizon > and slots' oldest lsn. For this purpose, effective_catalog_xmin is used. We > update effective_catalog_xmin in LogicalConfirmReceivedLocation just > after saving slot to disk. Another place where we update effective_catalog_xmin > is when walsender receives hot standby feedback message. > > Once, we have two independent processes (vacuuming, checkpoint), we can calculate > xmin horizon and oldest WAL lsn values independently (at different times) from > the saved to disk values. Note, these values are updated in a non atomic way. > > The xmin value is set when the node receives hot standby feedback and it is used > to keep tuples from vacuuming as well as catalog_xmin for decoding stuff. > Yeah, but with physical slots, it is possible that the slot's xmin value is pointing to some value, say 700 (after restart), but vacuum would have removed tuples from transaction IDs greater than 700 as explained in email [1]. Not > sure, xmin is applicable for logical replication. > > The confirmed flush lsn is used as a startpoint when a peer node doesn't provide > the start lsn and to check that the start lsn is not older than the latest > confirmed flush lsn. The saving of the slot on disk at each call of > LogicalConfirmReceivedLocation doesn't help to avoid conflicts completely, but > it helps to decrease the probability of conflicts. 
> We don't save slots at each call of LogicalConfirmReceivedLocation() and when we save also, it is not to avoid conflicts but to avoid removing required WAL segments and tuples. > So, i'm still not sure, we > need to save logical slots on each advance to avoid conflicts, because it > doesn't help in general. The conflicts should be resolved by other means. > > Once, we truncate old wal segments in checkpoint only. I believe, it is ok if we > calculate the oldest lsn only at the beginning of the checkpoint, as it was in > the alternative solution. I think, we can update xmin horizon in checkpoint only > but the horizon advancing will be more lazy in this case. > > Taking into account these thoughts, I can't see any problems with the alternative > patch where oldest wal lsn is calculated only in checkpoint. > The alternative will needlessly prevent removing WAL segments in some cases when logical slots are in use. [1] - https://www.postgresql.org/message-id/CAA4eK1KMaPA5jir_SFu%2Bqr3qu55OOdFWVZpuUkqTSGZ9fyPpHA%40mail.gmail.com -- With Regards, Amit Kapila.
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Tue, Jun 3, 2025 at 6:51 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > > > As per my understanding, for logical slots, effective_xmin is only set > > during the initial copy phase (or say if one has to export a > > snapshot), after that, its value won't change. Please read the > > comments in CreateInitDecodingContext() where we set its value. If you > > still have questions about it, we can discuss further. > > OK, thank you for the clarification. I've read the comments in > CreateInitDecodingContext() as you suggested. All of above makes me > think *_xmin fields are handled properly. >

Yes, they are handled properly for logical slots, but there is no similar safety mechanism for physical slots.

One minor comment:

+
+ /* Latest restart_lsn that has been flushed to disk. For persistent slots
+ * the flushed LSN should be taken into account when calculating the oldest

This doesn't follow our practice for multi-line comments.

-- With Regards, Amit Kapila.
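For illustration, the same hunk in the usual multi-line comment style, using the last_saved_restart_lsn name suggested earlier in the thread; the continuation of the truncated sentence is assumed here rather than taken from the actual patch, and the struct wrapper is only a placeholder.

```
typedef struct ReplicationSlot
{
	/* ... existing fields ... */

	/*
	 * Latest restart_lsn that has been flushed to disk.  For persistent
	 * slots, the flushed LSN should be taken into account when calculating
	 * the oldest WAL that still has to be kept (continuation assumed).
	 */
	XLogRecPtr	last_saved_restart_lsn;
} ReplicationSlot;
```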
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi Amit,

> I think we can use this approach for HEAD and probably keep the > previous idea for backbranches. Keeping some value in shared_memory > per slot sounds risky to me in terms of introducing new bugs.

I am not sure what kind of problems may occur. I propose to allocate in shmem an array of last_saved_restart_lsn values, like below, which is not a part of the public API. It would be allocated and deallocated in shmem the same way as ReplicationSlotCtlData. I can prepare a patch, if needed.

typedef struct ReplicationSlotCtlDataExt
{
    XLogRecPtr last_saved_restart_lsn[1];
} ReplicationSlotCtlDataExt;

> Yeah, but with physical slots, it is possible that the slot's xmin > value is pointing to some value, say 700 (after restart), but vacuum > would have removed tuples from transaction IDs greater than 700 as > explained in email [1].

I think we have no xmin problem for physical slots. The xmin values of physical slots are used to process hot standby feedback (HSF) messages. If I understood you correctly, you are referring to the problem that is solved by hot standby feedback messages. Such a message is used to disable vacuuming of tuples on the primary in order to avoid conflicts with queries on the replica (some queries may select tuples which were vacuumed on the primary, with the deletions then replicated to the standby). If the primary receives an HSF message after the slot was saved, I believe it is acceptable for autovacuum to clean tuples with xmin later than the last saved value. If the primary restarts, the older value will be loaded, but the replica has already confirmed the newer value. Concerning the replica, it is the replica's obligation to send an HSF xmin that will survive its own immediate restart.

>> Taking into account these thoughts, I can't see any problems with the alternative >> patch where oldest wal lsn is calculated only in checkpoint. >> >The alternative will needlessly prevent removing WAL segments in some >cases when logical slots are in use.

IMHO, I'm not sure it will significantly impact WAL removal. We remove WAL segments only at checkpoint. The alternate solution obtains the oldest WAL segment at the beginning of the checkpoint, then saves dirty slots to disk, and removes old WAL segments at the end of the checkpoint using the oldest WAL segment obtained at the beginning. The alternate solution may not be as effective in terms of WAL segment removal if a logical slot is advanced during a checkpoint, but I do not think that is a significant issue. On the other hand, the alternate solution simplifies the logic of WAL removal, is backward compatible (avoids adding new in-memory state), and decreases the number of locks taken in ReplicationSlotsComputeRequiredLSN - there is no need to recalculate the oldest slots' restart LSN every time a slot is advanced.

With best regards, Vitaly
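A minimal sketch of how such an array could be attached to shared memory, assuming a hypothetical ReplicationSlotsExtShmemInit() called next to the existing ReplicationSlotsShmemInit(); ShmemInitStruct(), add_size(), mul_size(), and max_replication_slots are real PostgreSQL symbols, everything else is illustrative.

```
#include "postgres.h"
#include "access/xlogdefs.h"
#include "replication/slot.h"
#include "storage/shmem.h"

typedef struct ReplicationSlotCtlDataExt
{
	XLogRecPtr	last_saved_restart_lsn[FLEXIBLE_ARRAY_MEMBER];
} ReplicationSlotCtlDataExt;

static ReplicationSlotCtlDataExt *ReplicationSlotCtlExt = NULL;

/* Hypothetical: called alongside ReplicationSlotsShmemInit(). */
void
ReplicationSlotsExtShmemInit(void)
{
	bool		found;
	Size		size;

	if (max_replication_slots == 0)
		return;

	size = offsetof(ReplicationSlotCtlDataExt, last_saved_restart_lsn);
	size = add_size(size, mul_size(max_replication_slots, sizeof(XLogRecPtr)));

	ReplicationSlotCtlExt = (ReplicationSlotCtlDataExt *)
		ShmemInitStruct("Replication Slot Ctl Ext", size, &found);

	if (!found)
	{
		/* First time through: mark every per-slot saved LSN as invalid. */
		for (int i = 0; i < max_replication_slots; i++)
			ReplicationSlotCtlExt->last_saved_restart_lsn[i] = InvalidXLogRecPtr;
	}
}
```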
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
10.06.2025 23:14, Alexander Korotkov wrote:
So, my proposal is to commit the attached patchset to the HEAD, and commit [1] to the back branches. Any objections?
As the buildfarm animal prion shows [1], the 046_checkpoint_logical_slot
test fails with "-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE":
# poll_query_until timed out executing this query:
#
# SELECT count(*) > 0 FROM pg_stat_activity
# WHERE backend_type = 'client backend' AND wait_event = 'logical-replication-slot-advance-segment'
#
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
[04:16:27] t/046_checkpoint_logical_slot.pl ......
Dubious, test returned 29 (wstat 7424, 0x1d00)
No subtests run
[04:20:58] t/047_checkpoint_physical_slot.pl ..... ok 271294 ms ( 0.00 usr 0.00 sys + 0.37 cusr 0.26 csys = 0.63 CPU)
I'm able to reproduce this locally as well. Though the test passes for me
with the increased timeout, that is, it's not stuck:
PG_TEST_TIMEOUT_DEFAULT=360 PROVE_TESTS="t/046*" make -s check -C src/test/recovery/
# +++ tap check in src/test/recovery +++
t/046_checkpoint_logical_slot.pl .. ok
All tests successful.
Files=1, Tests=1, 533 wallclock secs ( 0.01 usr 0.00 sys + 4.70 cusr 9.61 csys = 14.32 CPU)
Result: PASS
Could you have a look?
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-06-14%2001%3A58%3A06
Best regards,
Alexander
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Alexander! On Sun, Jun 15, 2025 at 12:00 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Alexander, > > 10.06.2025 23:14, Alexander Korotkov wrote: > > So, my proposal is to commit the attached patchset to the HEAD, and > commit [1] to the back branches. Any objections? > > > As the buildfarm animal prion shows [1], the 046_checkpoint_logical_slot > test fails with "-DRELCACHE_FORCE_RELEASE -DCATCACHE_FORCE_RELEASE": > # poll_query_until timed out executing this query: > # > # SELECT count(*) > 0 FROM pg_stat_activity > # WHERE backend_type = 'client backend' AND wait_event = 'logical-replication-slot-advance-segment' > # > # expecting this output: > # t > # last actual query output: > # f > # with stderr: > [04:16:27] t/046_checkpoint_logical_slot.pl ...... > Dubious, test returned 29 (wstat 7424, 0x1d00) > No subtests run > [04:20:58] t/047_checkpoint_physical_slot.pl ..... ok 271294 ms ( 0.00 usr 0.00 sys + 0.37 cusr 0.26 csys = 0.63CPU) > > I'm able to reproduce this locally as well. Though the test passes for me > with the increased timeout, that is it's not stuck: > PG_TEST_TIMEOUT_DEFAULT=360 PROVE_TESTS="t/046*" make -s check -C src/test/recovery/ > # +++ tap check in src/test/recovery +++ > t/046_checkpoint_logical_slot.pl .. ok > All tests successful. > Files=1, Tests=1, 533 wallclock secs ( 0.01 usr 0.00 sys + 4.70 cusr 9.61 csys = 14.32 CPU) > Result: PASS > > Could you have a look? > > [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prion&dt=2025-06-14%2001%3A58%3A06 Hmm... It seems to take too long to advance the segment with these options on. Sure, I'll check this! ------ Regards, Alexander Korotkov Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
BTW, while you're cleaning up this commit, could you remove the excess newlines in some of the "note" commands in 046 and 047, like note('starting checkpoint\n'); This produces bizarre output, as shown in the buildfarm logs: [04:04:38.953](603.550s) # starting checkpoint\\n regards, tom lane
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Tom! On Sun, Jun 15, 2025 at 7:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > BTW, while you're cleaning up this commit, could you remove the > excess newlines in some of the "note" commands in 046 and 047, like > > note('starting checkpoint\n'); > > This produces bizarre output, as shown in the buildfarm logs: Thank you for reporting this. The revised patch is attached. In addition to reducing the tests' runtime, it removes excess newlines from some note() calls. The commit message is here. I'm going to push this if there are no objections. ------ Regards, Alexander Korotkov Supabase
RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Alexander,

Thanks for pushing the fix patch! BTW, I have a few comments for your commits. Can you check and include them if needed?

01.
```
$node->append_conf('postgresql.conf',
    "shared_preload_libraries = 'injection_points'");
```
There is no need to set shared_preload_libraries in 046/047. ISTM it must be set only when we enable the statistics.

02.
We should also check whether injection_points can be installed or not. You can check check_extension() and its callers.

Best regards, Hayato Kuroda FUJITSU LIMITED
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Kuroda-san, On Mon, Jun 16, 2025 at 12:11 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > Thanks for pushing the fix patch! BTW, I have few comments for your commits. > Can you check and include them if needed? > > 01. > ``` > $node->append_conf('postgresql.conf', > "shared_preload_libraries = 'injection_points'"); > ``` > > No need to set shared_preload_libraries in 046/047. ISTM it must be set when we > enable the statistics. > > 02. > We should also check whether the injection_points can be installed or not. > You can check check_extension() and callers. Thank you! All of these totally make sense. The updated patch is attached. ------ Regards, Alexander Korotkov Supabase
RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Alexander,

> Thank you! All of these totally make sense. The updated patch is attached.

Thanks for the update. I found another point.

```
-# Another 2M rows; that's about 260MB (~20 segments) worth of WAL.
+# Another 50K rows; that's about 86MB (~5 segments) worth of WAL.
 $node->safe_psql('postgres',
-	q{insert into t (b) select md5(i::text) from generate_series(1,1000000) s(i)}
+	q{insert into t (b) select repeat(md5(i::text),50) from generate_series(1,50000) s(i)}
 );
```

I think the Perl function advance_wal() can be used instead of doing an actual INSERT command, because no one refers to the replicated result. The same applies to 046/047.

Best regards, Hayato Kuroda FUJITSU LIMITED
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Vitaly! On Tue, Jun 17, 2025 at 6:02 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote: > Thank you for reporting the issue. > > >While tracking buildfarm for one of other commits, I noticed this failure: > >TRAP: failed Assert("s->data.restart_lsn >= > >s->last_saved_restart_lsn"), File: > >"../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797 > >postgres: standby: checkpointer (ExceptionalCondition+0x83) [0x55fa69b79f5e] > >postgres: standby: checkpointer > >(InvalidateObsoleteReplicationSlots+0x53c) [0x55fa69982171] > >postgres: standby: checkpointer (CreateCheckPoint+0x9ad) [0x55fa6971feb2] > > This assert was introduced in the patch. Now, I think, it is a wrong one. Let me > please explain one of the possible scenarios when it can be triggered. In case > of physical replication, when walsender receives a standby reply message, it > calls PhysicalConfirmReceivedLocation function which updates slots' restart_lsn > from received flush_lsn value. This value may be older than the saved value. If > it happens during checkpoint, after slot saving to disk, this assert will be > triggered, because last_saved_restart_lsn value may be lesser than the new > restart_lsn value, updated from walsender. > > I propose to remove this assert. Yes, but is it OK for restart_lsn to move backward? That might mean that if checkpoint happen faster than PhysicalConfirmReceivedLocation(), then crash could cause this WAL location to be unavailable. Is that true? Also, what do you think about proposed changes in [1]? I wonder if it could somehow decrease the coverage. Links. 1. https://www.postgresql.org/message-id/OSCPR01MB149665B3F0629D10731B18E5AF570A%40OSCPR01MB14966.jpnprd01.prod.outlook.com ------ Regards, Alexander Korotkov Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
vignesh C <vignesh21@gmail.com> writes: > While tracking buildfarm for one of other commits, I noticed this failure: > TRAP: failed Assert("s->data.restart_lsn >= > s->last_saved_restart_lsn"), File: > "../pgsql/src/backend/replication/slot.c", Line: 1813, PID: 3945797 My animal mamba is also showing this assertion failure, but in a different test (recovery/t/040_standby_failover_slots_sync.pl). It's failed in two out of its three runs since ca307d5ce went in, so it's more reproducible than scorpion's report, though still not perfectly so. I suspect that mamba is prone to this simply because it's slow, although perhaps there's a different reason. Anyway, happy to investigate manually if there's something you'd like me to check for. regards, tom lane
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Wed, 18 Jun 2025 at 14:35, Vitaly Davydov <v.davydov@postgrespro.ru> wrote: > > Dear Hayato, > > > To confirm, can you tell me the theory why the walsender received old LSN? > > It is sent by the walreceiver, so is there a case that LogstreamResult.Flush can go backward? > > Not sure we can accept the situation. > > I can't say anything about the origin of the issue, but it can be easily reproduced > on the master branch: > > 1. Add an assert in PhysicalConfirmReceivedLocation (apply the attached patch) > 2. Compile & install with tap tests and assertions enabled > 3. cd src/bin/pg_basebackup/ > 3. PROVE_TESTS=t/020_pg_receivewal.pl gmake check Thanks for the steps, I was able to reproduce the issue with the suggested steps. > The test will fail because of the assertion. I plan to investigate the issue > but I need some more time for it. Once, it happens on the original master > branch, I think, this problem already exists. The proposed patch seems > to be not guilty. This issue occurs even prior to this commit, I was able to reproduce it on a version just before it. I’ll also look into analyzing the root cause further. > It may be the same problem as discussed in: > https://www.postgresql.org/message-id/CALDaNm2uQbhEVJzvnja6rw7Q9AYu9FpVmET%3DTbwLjV3DcPRhLw%40mail.gmail.com This issue was related to confirmed_flush and was addressed in commit d1ffcc7fa3c54de8b2a677a3e503fc808c7b419c. It is not related to restart_lsn. Regards, Vignesh
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Wed, Jun 18, 2025 at 10:17 PM Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Wed, Jun 18, 2025 at 6:50 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote: > > > I think, it is a good idea. Once we do not use the generated data, it is ok > > > just to generate WAL segments using the proposed function. I've tested this > > > function. The tests worked as expected with and without the fix. The attached > > > patch does the change. > > > > Sorry, forgot to attach the patch. It is created on the current master branch. > > It may conflict with your corrections. I hope, it could be useful. > > Thank you. I've integrated this into a patch to improve these tests. > > Regarding assertion failure, I've found that assert in > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > previously set by ReplicationSlotReserveWal(). As I can see, > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > So, it doesn't seems there is a guarantee that restart_lsn never goes > backward. The commit in ReplicationSlotReserveWal() even states there > is a "chance that we have to retry". > I don't see how this theory can lead to a restart_lsn of a slot going backwards. The retry mentioned there is just a retry to reserve the slot's position again if the required WAL is already removed. Such a retry can only get the position later than the previous restart_lsn. > Thus, I propose to remove the > assertion introduced by ca307d5cec90. > If what I said above is correct, then the following part of the commit message will be incorrect: "As stated in the ReplicationSlotReserveWal() comment, this is not always true. Additionally, this issue has been spotted by some buildfarm members." -- With Regards, Amit Kapila.
RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Amit, Alexander,

> > Regarding assertion failure, I've found that assert in > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > > previously set by ReplicationSlotReserveWal(). As I can see, > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > > So, it doesn't seems there is a guarantee that restart_lsn never goes > > backward. The commit in ReplicationSlotReserveWal() even states there > > is a "chance that we have to retry". > > > > I don't see how this theory can lead to a restart_lsn of a slot going > backwards. The retry mentioned there is just a retry to reserve the > slot's position again if the required WAL is already removed. Such a > retry can only get the position later than the previous restart_lsn.

We analyzed the assertion failure that happened in pg_basebackup/020_pg_receivewal, and confirmed that restart_lsn can go backward. This means that the Assert() added by ca307d5 can cause a crash.

Background
==========
When pg_receivewal starts replication and uses a replication slot, it sets the beginning of the segment containing restart_lsn as the startpoint. E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WAL from 0/B00000. For more detail on this behavior, see f61e1dd2 and d9bae531.

What happened here
==================
Based on the above, the walsender sent WAL from the beginning of the segment (0/B00000). When the walreceiver received it, it tried to send a reply. At that time the flushed location of WAL would be 0/B00000. The walsender set the received LSN as restart_lsn in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0 -> 0/B00000).

The assertion failure could happen if a CHECKPOINT happened at that time. The last_saved_restart_lsn attribute of the slot was 0/B000D0, but data.restart_lsn was 0/B00000. That could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot().

Note
====
1. In this case, starting from the beginning of the segment is not a problem, because the checkpoint process only removes WAL files from segments that precede the restart_lsn's WAL segment. The current segment (0/B00000) will not be removed, so there is no risk of data loss or inconsistency.

2. A similar pattern applies to pg_basebackup. Both use logic that adjusts the requested streaming position to the start of the segment, and both reply with the received LSN as flushed.

3. I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync because it is a timing issue. Has someone else reproduced it?

We are still investigating the failure in 040_standby_failover_slots_sync.

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=scorpion&dt=2025-06-17%2000%3A40%3A46&stg=pg_basebackup-check

Best regards, Hayato Kuroda FUJITSU LIMITED
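To make the segment alignment concrete, here is a small sketch of the rounding that pg_receivewal and pg_basebackup effectively perform before issuing START_REPLICATION; the helper name is illustrative, while XLogSegmentOffset() is the real macro from xlog_internal.h.

```
#include "postgres.h"
#include "access/xlog_internal.h"

/*
 * Illustrative helper (not the actual client code): round a slot's
 * restart_lsn down to the start of its WAL segment. With 16MB segments,
 * 0/B000D0 becomes 0/B00000, which the standby later reports back as its
 * flush position.
 */
static XLogRecPtr
align_to_segment_start(XLogRecPtr restart_lsn, int wal_segsz_bytes)
{
	return restart_lsn - XLogSegmentOffset(restart_lsn, wal_segsz_bytes);
}
```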
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Thu, Jun 19, 2025 at 1:29 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > On Wed, Jun 18, 2025 at 10:17 PM Alexander Korotkov > <aekorotkov@gmail.com> wrote: > > > > On Wed, Jun 18, 2025 at 6:50 PM Vitaly Davydov <v.davydov@postgrespro.ru> wrote: > > > > I think, it is a good idea. Once we do not use the generated data, it is ok > > > > just to generate WAL segments using the proposed function. I've tested this > > > > function. The tests worked as expected with and without the fix. The attached > > > > patch does the change. > > > > > > Sorry, forgot to attach the patch. It is created on the current master branch. > > > It may conflict with your corrections. I hope, it could be useful. > > > > Thank you. I've integrated this into a patch to improve these tests. > > > > Regarding assertion failure, I've found that assert in > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > > previously set by ReplicationSlotReserveWal(). As I can see, > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > > So, it doesn't seems there is a guarantee that restart_lsn never goes > > backward. The commit in ReplicationSlotReserveWal() even states there > > is a "chance that we have to retry". > > > > I don't see how this theory can lead to a restart_lsn of a slot going > backwards. The retry mentioned there is just a retry to reserve the > slot's position again if the required WAL is already removed. Such a > retry can only get the position later than the previous restart_lsn. Yes, if retry is needed, then the new position must be later for sure. What I mean is that ReplicationSlotReserveWal() can reserve something later than what standby is going to read (and correspondingly report with PhysicalConfirmReceivedLocation()). > > Thus, I propose to remove the > > assertion introduced by ca307d5cec90. > > > > If what I said above is correct, then the following part of the commit > message will be incorrect: > "As stated in the ReplicationSlotReserveWal() comment, this is not > always true. Additionally, this issue has been spotted by some > buildfarm > members." I agree, this comment needs improvement in terms of clarity. Meanwhile I've pushed the patch for TAP tests, which I think didn't get any objections. ------ Regards, Alexander Korotkov Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Kuroda-san, On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu) <kuroda.hayato@fujitsu.com> wrote: > > > Regarding assertion failure, I've found that assert in > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > > > previously set by ReplicationSlotReserveWal(). As I can see, > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > > > So, it doesn't seems there is a guarantee that restart_lsn never goes > > > backward. The commit in ReplicationSlotReserveWal() even states there > > > is a "chance that we have to retry". > > > > > > > I don't see how this theory can lead to a restart_lsn of a slot going > > backwards. The retry mentioned there is just a retry to reserve the > > slot's position again if the required WAL is already removed. Such a > > retry can only get the position later than the previous restart_lsn. > > We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal, > and confirmed that restart_lsn can go backward. This meant that Assert() added > by the ca307d5 can cause crash. > > Background > =========== > When pg_receivewal starts the replication and it uses the replication slot, it > sets as the beginning of the segment where restart_lsn exists, as the startpoint. > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs > from 0/B00000. > More detail of this behavior, see f61e1dd2 and d9bae531. > > What happened here > ================== > Based on above theory, walsender sent from the beginning of segment (0/B00000). > When walreceiver receives, it tried to send reply. At that time the flushed > location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000). > > The assertion failure could happen if CHECKPOINT happened at that time. > Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn > was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot(). Thank you for your detailed explanation! > Note > ==== > 1. > In this case, starting from the beginning of the segment is not a problem, because > the checkpoint process only removes WAL files from segments that precede the > restart_lsn's wal segment. The current segment (0/B00000) will not be removed, > so there is no risk of data loss or inconsistency. > > 2. > A similar pattern applies to pg_basebackup. Both use logic that adjusts the > requested streaming position to the start of the segment, and it replies the > received LSN as flushed. > > 3. > I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync > because it is a timing issue. Have someone else reproduced? > > We are still investigating failure caused at 040_standby_failover_slots_sync. I didn't manage to reproduce this. But as I see from the logs [1] on mamba that START_REPLICATION command was issued just before assert trap. Could it be something similar to what I described in [2]. Namely: 1. ReplicationSlotReserveWal() sets restart_lsn for the slot. 2. Concurrent checkpoint flushes that restart_lsn to the disk. 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to the beginning of the segment. 
[1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=mamba&dt=2025-06-17%2005%3A10%3A36&stg=recovery-check [2] https://www.postgresql.org/message-id/CAPpHfdv3UEUBjsLhB_CwJT0xX9LmN6U%2B__myYopq4KcgvCSbTg%40mail.gmail.com ------ Regards, Alexander Korotkov Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > Dear Kuroda-san, > > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Regarding assertion failure, I've found that assert in > > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > > > > previously set by ReplicationSlotReserveWal(). As I can see, > > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > > > > So, it doesn't seems there is a guarantee that restart_lsn never goes > > > > backward. The commit in ReplicationSlotReserveWal() even states there > > > > is a "chance that we have to retry". > > > > > > > > > > I don't see how this theory can lead to a restart_lsn of a slot going > > > backwards. The retry mentioned there is just a retry to reserve the > > > slot's position again if the required WAL is already removed. Such a > > > retry can only get the position later than the previous restart_lsn. > > > > We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal, > > and confirmed that restart_lsn can go backward. This meant that Assert() added > > by the ca307d5 can cause crash. > > > > Background > > =========== > > When pg_receivewal starts the replication and it uses the replication slot, it > > sets as the beginning of the segment where restart_lsn exists, as the startpoint. > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs > > from 0/B00000. > > More detail of this behavior, see f61e1dd2 and d9bae531. > > > > What happened here > > ================== > > Based on above theory, walsender sent from the beginning of segment (0/B00000). > > When walreceiver receives, it tried to send reply. At that time the flushed > > location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn > > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000). > > > > The assertion failure could happen if CHECKPOINT happened at that time. > > Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn > > was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot(). > > Thank you for your detailed explanation! > > > Note > > ==== > > 1. > > In this case, starting from the beginning of the segment is not a problem, because > > the checkpoint process only removes WAL files from segments that precede the > > restart_lsn's wal segment. The current segment (0/B00000) will not be removed, > > so there is no risk of data loss or inconsistency. > > > > 2. > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the > > requested streaming position to the start of the segment, and it replies the > > received LSN as flushed. > > > > 3. > > I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync > > because it is a timing issue. Have someone else reproduced? > > > > We are still investigating failure caused at 040_standby_failover_slots_sync. > > I didn't manage to reproduce this. But as I see from the logs [1] on > mamba that START_REPLICATION command was issued just before assert > trap. Could it be something similar to what I described in [2]. > Namely: > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot. > 2. Concurrent checkpoint flushes that restart_lsn to the disk. > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to > the beginning of the segment. 
Here is my analysis of the 040_standby_failover_slots_sync test failure, where the physical replication slot can point to a backward LSN: in certain rare cases the restart LSN can go backwards. This scenario can be reproduced by performing checkpoints continuously and slowing down WAL apply. I have a patch with changes to handle this.

Here's a summary of the sequence of events:

1) The standby confirms a new LSN (0/40369C8) when the primary sends some WAL contents:
After the standby writes the received WAL contents in XLogWalRcvWrite, it sends this LSN 0/40369C8 as the confirmed flush location. The primary acknowledges this and updates the replication slot's restart_lsn accordingly:
2025-06-20 14:33:21.777 IST [134998] standby1 LOG: PhysicalConfirmReceivedLocation replication slot "sb1_slot" set restart_lsn to 0/40369C8
2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT: START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1

2) Shortly after receiving the new LSN, a checkpoint occurs which persists this restart_lsn:
2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB; lsn=0/4036A20, redo lsn=0/40369C8

3) Streaming replication is restarted because of a primary_conninfo change and reload. The WAL receiver process is restarted:
2025-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver process due to administrator command

4) The standby sends an older flush pointer after restart:
Upon restart, the WAL receiver sends a flush location (0/401D448) derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older than the previously confirmed restart_lsn. It is important to note that we are sending lastReplayedEndRecPtr, which is the last successfully replayed LSN in this case:
2025-06-20 14:33:21.796 IST [135135] LOG: WalReceiverMain LogstreamResult.Flush initialized to 0/401D448
2025-06-20 14:33:21.796 IST [135135] LOG: sending write 0/401D448 flush 0/401D448 apply 0/401D448

This is done from here:
....
/* Initialize LogstreamResult and buffers for processing messages */
LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL);

initStringInfo(&reply_message);

/* Initialize nap wakeup times. */
now = GetCurrentTimestamp();
for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i)
    WalRcvComputeNextWakeup(i, now);

/* Send initial reply/feedback messages. */
XLogWalRcvSendReply(true, false);
...

In step 1 we send the LSN of the WAL that has been written, whereas here, because WAL apply has been slowed down, the replay location lags behind and that is the location being sent.

5) I have added logs to detect this inconsistency:
This leads to a scenario where the standby tries to confirm a restart_lsn older than the one currently held by the primary:
2025-06-20 14:33:21.797 IST [135136] standby1 LOG: crash scenario - slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older than the current one (0/40369C8)

If a checkpoint happens concurrently during this condition, it may trigger an assertion failure on the primary due to the restart_lsn being less than the last_saved_restart_lsn. Currently this does not break physical replication, but I'm not sure how it will behave if the gap grows to many WAL files and those WAL files get deleted.

Attached is the patch with which you can reproduce this. Grep for "crash scenario" in the logs; for me it occurs on every run. The reproduced logs are attached.

This proves that restart_lsn can go backward in cases where the standby is applying slowly. But as this has nothing to do with this thread, I feel you can commit the assert removal patch. I will continue the analysis further, see whether there is any impact, and we can add comments accordingly later.

Regards, Vignesh
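As an illustration of the detection mentioned in step 5, the following hypothetical sketch logs when a standby reply moves a slot's restart_lsn backwards; it is not the actual PhysicalConfirmReceivedLocation() code nor the attached patch, only a rough picture of what such logging might look like.

```
#include "postgres.h"
#include "replication/slot.h"
#include "storage/spin.h"

/*
 * Hypothetical sketch: the restart_lsn update is left as-is (it may move
 * backwards, as observed), but a LOG message is emitted when that happens.
 */
static void
physical_confirm_sketch(XLogRecPtr lsn)
{
	ReplicationSlot *slot = MyReplicationSlot;
	XLogRecPtr	prev_restart_lsn;

	SpinLockAcquire(&slot->mutex);
	prev_restart_lsn = slot->data.restart_lsn;
	slot->data.restart_lsn = lsn;
	SpinLockRelease(&slot->mutex);

	if (lsn < prev_restart_lsn)
		elog(LOG, "slot \"%s\": confirmed restart LSN %X/%X is older than the current one %X/%X",
			 NameStr(slot->data.name),
			 LSN_FORMAT_ARGS(lsn),
			 LSN_FORMAT_ARGS(prev_restart_lsn));
}
```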
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi, Vignesh! On Fri, Jun 20, 2025 at 3:42 PM vignesh C <vignesh21@gmail.com> wrote: > On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu) > > <kuroda.hayato@fujitsu.com> wrote: > > > > > Regarding assertion failure, I've found that assert in > > > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > > > > > previously set by ReplicationSlotReserveWal(). As I can see, > > > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > > > > > So, it doesn't seems there is a guarantee that restart_lsn never goes > > > > > backward. The commit in ReplicationSlotReserveWal() even states there > > > > > is a "chance that we have to retry". > > > > > > > > > > > > > I don't see how this theory can lead to a restart_lsn of a slot going > > > > backwards. The retry mentioned there is just a retry to reserve the > > > > slot's position again if the required WAL is already removed. Such a > > > > retry can only get the position later than the previous restart_lsn. > > > > > > We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal, > > > and confirmed that restart_lsn can go backward. This meant that Assert() added > > > by the ca307d5 can cause crash. > > > > > > Background > > > =========== > > > When pg_receivewal starts the replication and it uses the replication slot, it > > > sets as the beginning of the segment where restart_lsn exists, as the startpoint. > > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs > > > from 0/B00000. > > > More detail of this behavior, see f61e1dd2 and d9bae531. > > > > > > What happened here > > > ================== > > > Based on above theory, walsender sent from the beginning of segment (0/B00000). > > > When walreceiver receives, it tried to send reply. At that time the flushed > > > location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn > > > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000). > > > > > > The assertion failure could happen if CHECKPOINT happened at that time. > > > Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn > > > was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot(). > > > > Thank you for your detailed explanation! > > > > > Note > > > ==== > > > 1. > > > In this case, starting from the beginning of the segment is not a problem, because > > > the checkpoint process only removes WAL files from segments that precede the > > > restart_lsn's wal segment. The current segment (0/B00000) will not be removed, > > > so there is no risk of data loss or inconsistency. > > > > > > 2. > > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the > > > requested streaming position to the start of the segment, and it replies the > > > received LSN as flushed. > > > > > > 3. > > > I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync > > > because it is a timing issue. Have someone else reproduced? > > > > > > We are still investigating failure caused at 040_standby_failover_slots_sync. > > > > I didn't manage to reproduce this. But as I see from the logs [1] on > > mamba that START_REPLICATION command was issued just before assert > > trap. Could it be something similar to what I described in [2]. > > Namely: > > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot. > > 2. 
Concurrent checkpoint flushes that restart_lsn to the disk. > > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to > > the beginning of the segment. > > Here is my analysis for the 040_standby_failover_slots_sync test > failure where in physical replication slot can point to backward lsn: > In certain rare cases the restart LSN can go backwards. This scenario > can be reproduced by performing checkpoint continuously and slowing > the WAL applying. I have a patch with changes to handle this. > > Here's a summary of the sequence of events: > 1) Standby confirms a new LSN (0/40369C8) when primary sends some WAL contents: > After standby writes the received WAL contents in XLogWalRcvWrite, the > standby sends this lsn 0/40369C8 as the confirmed flush location. The > primary acknowledges this and updates the replication slot's > restart_lsn accordingly: > 2025-06-20 14:33:21.777 IST [134998] standby1 LOG: > PhysicalConfirmReceivedLocation replication slot "sb1_slot" set > restart_lsn to 0/40369C8 > 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT: > START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1 > Checkpoint persists the new restart_lsn: > > 2) Shortly after receiving the new LSN, a checkpoint occurs which > saves this restart_lsn: > 2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote > 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 > removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync > files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 > kB; lsn=0/4036A20, redo lsn=0/40369C8 > > 3)Streaming replication is restarted because of primary_conninfo > change and reload > The WAL receiver process is restarted: > 25-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver > process due to administrator command > > 4) Standby sends an older flush pointer after restart: > Upon restart, the WAL receiver sends a flush location (0/401D448) > derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older > than the previously confirmed restart_lsn. It is important to note > that we are sending the lastReplayedEndRecPtr which is the last > successfully replayed lsn in this case: > 5-06-20 14:33:21.796 IST [135135] LOG: WalReceiverMain > LogstreamResult.Flush initialized to 0/401D448 > 2025-06-20 14:33:21.796 IST [135135] LOG: sending write 0/401D448 > flush 0/401D448 apply 0/401D448 > > This is done from here: > .... > /* Initialize LogstreamResult and buffers for processing messages */ > LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL); > > initStringInfo(&reply_message); > > /* Initialize nap wakeup times. */ > now = GetCurrentTimestamp(); > for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > WalRcvComputeNextWakeup(i, now); > > /* Send initial reply/feedback messages. */ > XLogWalRcvSendReply(true, false); > ... > > In case of step 1, we are sending the lsn of the WAL that is written, > since we have slowed down the WAL applying the replay location is > lesser and the replay location is being sent here. 
> > 5) I have added logs to detect this inconsistency: > This leads to a scenario where the standby tries to confirm a > restart_lsn older than the one currently held by the primary: > 2025-06-20 14:33:21.797 IST [135136] standby1 LOG: crash scenario - > slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older > than the current one (0/40369C8) > > If a checkpoint happens concurrently during this condition, it may > trigger an assertion failure on the primary due to the restart_lsn > being less than the last_saved_restart_lsn. > Currently this does not break physical replication, but I'm not sure > if the gap increases to many WAL files and if the WAL files get > deleted, how it will behave. > Attached the patch changes with which you can reproduce. Grep for > "crash scenario" in the logs. For me it occurs with every run. The > reproduced logs are attached. > > This proves that the restart_lsn can go backward in cases where the > standby is slowly applying. But this has nothing to do with this > thread, I felt you can commit the assert removal patch. I will > continue the analysis further and see if there is any impact or not > and we can later add comments accordingly. Thank you and other thread participants for the analysis! I'd like to also ask for help with [1]. It seems that new test triggers another bug, which I suppose was there before. Links. 1. https://www.postgresql.org/message-id/CAAKRu_ZCOzQpEumLFgG_%2Biw3FTa%2BhJ4SRpxzaQBYxxM_ZAzWcA%40mail.gmail.com ------ Regards, Alexander Korotkov Supabase
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Mon, 23 Jun 2025 at 04:36, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > On Fri, Jun 20, 2025 at 2:24 PM vignesh C <vignesh21@gmail.com> wrote: > > On Fri, 20 Jun 2025 at 05:54, Alexander Korotkov <aekorotkov@gmail.com> wrote: > > > Dear Kuroda-san, > > > > > > On Thu, Jun 19, 2025 at 2:05 PM Hayato Kuroda (Fujitsu) > > > <kuroda.hayato@fujitsu.com> wrote: > > > > > > Regarding assertion failure, I've found that assert in > > > > > > PhysicalConfirmReceivedLocation() conflicts with restart_lsn > > > > > > previously set by ReplicationSlotReserveWal(). As I can see, > > > > > > ReplicationSlotReserveWal() just picks fresh XLogCtl->RedoRecPtr lsn. > > > > > > So, it doesn't seems there is a guarantee that restart_lsn never goes > > > > > > backward. The commit in ReplicationSlotReserveWal() even states there > > > > > > is a "chance that we have to retry". > > > > > > > > > > > > > > > > I don't see how this theory can lead to a restart_lsn of a slot going > > > > > backwards. The retry mentioned there is just a retry to reserve the > > > > > slot's position again if the required WAL is already removed. Such a > > > > > retry can only get the position later than the previous restart_lsn. > > > > > > > > We analyzed the assertion failure happened at pg_basebackup/020_pg_receivewal, > > > > and confirmed that restart_lsn can go backward. This meant that Assert() added > > > > by the ca307d5 can cause crash. > > > > > > > > Background > > > > =========== > > > > When pg_receivewal starts the replication and it uses the replication slot, it > > > > sets as the beginning of the segment where restart_lsn exists, as the startpoint. > > > > E.g., if the restart_lsn of the slot is 0/B000D0, pg_receivewal requests WALs > > > > from 0/B00000. > > > > More detail of this behavior, see f61e1dd2 and d9bae531. > > > > > > > > What happened here > > > > ================== > > > > Based on above theory, walsender sent from the beginning of segment (0/B00000). > > > > When walreceiver receives, it tried to send reply. At that time the flushed > > > > location of WAL would be 0/B00000. walsender sets the received lsn as restart_lsn > > > > in PhysicalConfirmReceivedLocation(). Here the restart_lsn went backward (0/B000D0->0/B00000). > > > > > > > > The assertion failure could happen if CHECKPOINT happened at that time. > > > > Attribute last_saved_restart_lsn of the slot was 0/B000D0, but the data.restart_lsn > > > > was 0/B00000. It could not satisfy the assertion added in InvalidatePossiblyObsoleteSlot(). > > > > > > Thank you for your detailed explanation! > > > > > > > Note > > > > ==== > > > > 1. > > > > In this case, starting from the beginning of the segment is not a problem, because > > > > the checkpoint process only removes WAL files from segments that precede the > > > > restart_lsn's wal segment. The current segment (0/B00000) will not be removed, > > > > so there is no risk of data loss or inconsistency. > > > > > > > > 2. > > > > A similar pattern applies to pg_basebackup. Both use logic that adjusts the > > > > requested streaming position to the start of the segment, and it replies the > > > > received LSN as flushed. > > > > > > > > 3. > > > > I considered the theory above, but I could not reproduce 040_standby_failover_slots_sync > > > > because it is a timing issue. Have someone else reproduced? > > > > > > > > We are still investigating failure caused at 040_standby_failover_slots_sync. > > > > > > I didn't manage to reproduce this. 
But as I see from the logs [1] on > > > mamba that START_REPLICATION command was issued just before assert > > > trap. Could it be something similar to what I described in [2]. > > > Namely: > > > 1. ReplicationSlotReserveWal() sets restart_lsn for the slot. > > > 2. Concurrent checkpoint flushes that restart_lsn to the disk. > > > 3. PhysicalConfirmReceivedLocation() sets restart_lsn of the slot to > > > the beginning of the segment. > > > > Here is my analysis for the 040_standby_failover_slots_sync test > > failure where the physical replication slot can point to backward lsn: > > In certain rare cases the restart LSN can go backwards. This scenario > > can be reproduced by performing checkpoint continuously and slowing > > the WAL applying. I have a patch with changes to handle this. > > Here's a summary of the sequence of events: > > 1) Standby confirms a new LSN (0/40369C8) when primary sends some WAL contents: > > After the standby writes the received WAL contents in XLogWalRcvWrite, > > the standby sends this lsn 0/40369C8 as the confirmed flush location. > > The primary acknowledges this and updates the replication slot's > > restart_lsn accordingly: > > 2025-06-20 14:33:21.777 IST [134998] standby1 LOG: > > PhysicalConfirmReceivedLocation replication slot "sb1_slot" set > > restart_lsn to 0/40369C8 > > 2025-06-20 14:33:21.777 IST [134998] standby1 STATEMENT: > > START_REPLICATION SLOT "sb1_slot" 0/3000000 TIMELINE 1 > > > > 2) Shortly after receiving the new LSN, a checkpoint occurs which > > saves this restart_lsn: > > 2025-06-20 14:33:21.780 IST [134913] LOG: checkpoint complete: wrote > > 0 buffers (0.0%), wrote 0 SLRU buffers; 0 WAL file(s) added, 0 > > removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.002 s; sync > > files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 > > kB; lsn=0/4036A20, redo lsn=0/40369C8 > > > > 3) Streaming replication is restarted because of primary_conninfo guc > > change and reload > > The WAL receiver process is restarted: > > 25-06-20 14:33:21.779 IST [134997] FATAL: terminating walreceiver > > process due to administrator command > > > > 4) Standby sends an older flush pointer after restart: > > Upon restart, the WAL receiver sends a flush location (0/401D448) > > derived from XLogRecoveryCtl->lastReplayedEndRecPtr, which is older > > than the previously confirmed restart_lsn. It is important to note > > that we are sending the lastReplayedEndRecPtr which is the last > > successfully replayed lsn in this case: > > 5-06-20 14:33:21.796 IST [135135] LOG: WalReceiverMain > > LogstreamResult.Flush initialized to 0/401D448 > > 2025-06-20 14:33:21.796 IST [135135] LOG: sending write 0/401D448 > > flush 0/401D448 apply 0/401D448 > > > > This is done from here: > > .... > > /* Initialize LogstreamResult and buffers for processing messages */ > > LogstreamResult.Write = LogstreamResult.Flush = GetXLogReplayRecPtr(NULL); > > > > initStringInfo(&reply_message); > > > > /* Initialize nap wakeup times. */ > > now = GetCurrentTimestamp(); > > for (int i = 0; i < NUM_WALRCV_WAKEUPS; ++i) > > WalRcvComputeNextWakeup(i, now); > > > > /* Send initial reply/feedback messages. */ > > XLogWalRcvSendReply(true, false); > > ... > > > > In case of step 1, we are sending the lsn of the WAL that is written, > > since we have slowed down the WAL applying the replay location is > > lesser. 
> > > > 5) I have added logs to detect this inconsistency: > > This leads to a scenario where the standby tries to confirm a > > restart_lsn older than the one currently held by the primary: > > 2025-06-20 14:33:21.797 IST [135136] standby1 LOG: crash scenario - > > slot sb1_slot, cannot confirm a restart LSN (0/401D448) that is older > > than the current one (0/40369C8) > > > > If a checkpoint happens concurrently during this condition, it may > > trigger an assertion failure on the primary due to the restart_lsn > > being less than the last_saved_restart_lsn. > > Currently this does not break physical replication, but I'm not sure > > if the gap increases to many WAL files and if the WAL files get > > deleted, how it will behave. > > > > Attached the restart_lsn_backup_repro_v1 patch changes with which you > > can reproduce the 040_standby_failover_slots_sync failure. grep for > > "crash scenario - slot sb1_slot" in the logs. For me it occurs with > > every run. The reproduced logs > > 040_standby_failover_slots_sync_publisher and > > 040_standby_failover_slots_sync_standby1.log are attached. > > > > This proves that the restart_lsn can go backward in cases where the > > standby is slowly applying. But this has nothing to do with this > > thread, I felt you can commit the assert removal patch. I will > > continue the analysis further and see if there is any impact or not on > > physical replication, and we can later add comments accordingly. > > I think this thread now have well-documented cases when restart_lsn > goes backward. I suppose the first thing we should do is to remove > the assert to make buildfarm stop failing on this. I've prepared a > patch with revised commit message. Then we can focus on documenting > that correctly in the code comments. +1 to proceed this way, one minor suggestion we can include pg_basebackup also in the commit message as I noticed this can happen with pg_basebackup too. Regards, Vignesh
RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Hi,

After commit ca307d5, I noticed another crash when testing
some other logical replication features.

The server with max_replication_slots set to 0 would crash when executing CHECKPOINT.

TRAP: failed Assert("ReplicationSlotCtl != NULL"), File: "slot.c", Line: 1162, PID: 577315
postgres: checkpointer (ExceptionalCondition+0x9e)[0xc046cb]
postgres: checkpointer (ReplicationSlotsComputeRequiredLSN+0x30)[0x99697f]
postgres: checkpointer (CheckPointReplicationSlots+0x191)[0x997dc1]
postgres: checkpointer [0x597b1b]
postgres: checkpointer (CreateCheckPoint+0x6d1)[0x59729e]
postgres: checkpointer (CheckpointerMain+0x559)[0x93ee79]
postgres: checkpointer (postmaster_child_launch+0x15f)[0x940311]
postgres: checkpointer [0x9468b0]
postgres: checkpointer (PostmasterMain+0x1258)[0x9434f8]
postgres: checkpointer (main+0x2fe)[0x7f5f9c]
/lib64/libc.so.6(__libc_start_main+0xe5)[0x7f7585f81d85]
postgres: checkpointer (_start+0x2e)[0x4958ee]

I think it is trying to access the replication slots when the shared memory
for them was not allocated.

Best Regards,
Hou zj
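The reason ReplicationSlotCtl can be NULL here is that the slots' shared-memory area is only created when max_replication_slots > 0. Roughly paraphrasing ReplicationSlotsShmemInit() in slot.c (from memory, not a verbatim copy):

	void
	ReplicationSlotsShmemInit(void)
	{
		bool		found;

		if (max_replication_slots == 0)
			return;				/* nothing allocated, ReplicationSlotCtl stays NULL */

		ReplicationSlotCtl = (ReplicationSlotCtlData *)
			ShmemInitStruct("ReplicationSlot Ctl",
							ReplicationSlotsShmemSize(), &found);

		/* ... first-time initialization of the slots array ... */
	}

So on a server with max_replication_slots = 0, the unconditional ReplicationSlotsComputeRequiredLSN() call at the end of CheckPointReplicationSlots() reaches Assert(ReplicationSlotCtl != NULL) and the checkpointer traps during CHECKPOINT, as shown in the trace above.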
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Wed, Jun 25, 2025 at 10:57 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
>
> Hi,
>
> After commit ca307d5, I noticed another crash when testing
> some other logical replication features.
>
> The server with max_replication_slots set to 0 would crash when executing CHECKPOINT.
>
> TRAP: failed Assert("ReplicationSlotCtl != NULL"), File: "slot.c", Line: 1162, PID: 577315
> postgres: checkpointer (ExceptionalCondition+0x9e)[0xc046cb]
> postgres: checkpointer (ReplicationSlotsComputeRequiredLSN+0x30)[0x99697f]
> postgres: checkpointer (CheckPointReplicationSlots+0x191)[0x997dc1]
> postgres: checkpointer [0x597b1b]
> postgres: checkpointer (CreateCheckPoint+0x6d1)[0x59729e]
> postgres: checkpointer (CheckpointerMain+0x559)[0x93ee79]
> postgres: checkpointer (postmaster_child_launch+0x15f)[0x940311]
> postgres: checkpointer [0x9468b0]
> postgres: checkpointer (PostmasterMain+0x1258)[0x9434f8]
> postgres: checkpointer (main+0x2fe)[0x7f5f9c]
> /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f7585f81d85]
> postgres: checkpointer (_start+0x2e)[0x4958ee]
>
> I think it is trying to access the replication slots when the shared memory
> for them was not allocated.

I do not understand why CheckPointReplicationSlots() calls
ReplicationSlotsComputeRequiredLSN() unconditionally; shouldn't it be
called under the check [1]? If not, then instead of asserting
Assert("ReplicationSlotCtl != NULL"), the function should just return if
ReplicationSlotCtl is NULL, because ReplicationSlotCtl is not allocated
when max_replication_slots is 0.

[1]
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -2131,7 +2131,8 @@ CheckPointReplicationSlots(bool is_shutdown)
 	 * Recompute the required LSN as SaveSlotToPath() updated
 	 * last_saved_restart_lsn for slots.
 	 */
-	ReplicationSlotsComputeRequiredLSN();
+	if (max_replication_slots > 0)
+		ReplicationSlotsComputeRequiredLSN();
 }

--
Regards,
Dilip Kumar
Google
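The other option mentioned above, making ReplicationSlotsComputeRequiredLSN() itself tolerate the missing shared memory instead of guarding the caller, would look roughly like this (an illustrative sketch, not a committed change):

 void
 ReplicationSlotsComputeRequiredLSN(void)
 {
 	...
-	Assert(ReplicationSlotCtl != NULL);
+	/* Nothing to do if slot shared memory was never allocated. */
+	if (ReplicationSlotCtl == NULL)
+		return;
 	...
 }

Either way the effect is the same: a CHECKPOINT on a server with max_replication_slots = 0 simply skips the recomputation.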
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Wed, Jun 25, 2025 at 11:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> On Wed, Jun 25, 2025 at 1:18 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> > Another idea is to call ReplicationSlotsComputeRequiredLSN() when at least one
> > of the restart_lsn values is updated, like attached. I feel this could reduce
> > the computation a bit more.
>
> Right, that makes sense; if nothing was updated on disk, then we
> can avoid computing this.

Good idea. But I think we should associate the "updated" flag
directly with the fact that a slot (no matter whether logical or physical)
changed its last_saved_restart_lsn. See the attached patch. I'm
going to push it if there are no objections.

------
Regards,
Alexander Korotkov
Supabase
Attachment
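For readers without the attachment, the idea is to remember, while the slots are being saved during a checkpoint, whether any slot's on-disk restart_lsn actually changes, and to recompute the required LSN only in that case. Below is a simplified sketch of the relevant part of CheckPointReplicationSlots(); the names follow slot.c, but this is not the attached patch verbatim.

	bool		last_saved_restart_lsn_updated = false;

	LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		char		path[MAXPGPATH];

		if (!s->in_use)
			continue;

		/* Remember whether this checkpoint moves the slot's on-disk restart_lsn. */
		if (s->last_saved_restart_lsn != s->data.restart_lsn)
			last_saved_restart_lsn_updated = true;

		sprintf(path, "%s/%s", "pg_replslot", NameStr(s->data.name));
		SaveSlotToPath(s, path, LOG);
	}
	LWLockRelease(ReplicationSlotAllocationLock);

	/*
	 * Recompute the required LSN only if some slot's last_saved_restart_lsn
	 * was actually updated by SaveSlotToPath() above.
	 */
	if (last_saved_restart_lsn_updated)
		ReplicationSlotsComputeRequiredLSN();

Note that with max_replication_slots = 0 the loop body never runs, so the flag stays false and the recomputation (together with the assert reported earlier in the thread) is skipped as well.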
RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Alexander,

> Good idea. But I think we should associate the "updated" flag
> directly with the fact that a slot (no matter whether logical or physical)
> changed its last_saved_restart_lsn. See the attached patch. I'm
> going to push it if there are no objections.

+		/*
+		 * Track if we're going to update slot's last_saved_restart_lsn.
+		 * We need this to know if we need to recompute the required LSN.
+		 */
+		if (s->last_saved_restart_lsn != s->data.restart_lsn)
+			last_saved_restart_lsn_updated = true;

I feel there is no need to set it to true if last_saved_restart_lsn_updated
is already true. Other than that, it's OK for me.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
Dear Alexander,

> Regarding last_saved_restart_lsn_updated, I think the opposite. I
> think we should check whether last_saved_restart_lsn_updated is already set
> only if it could promise us some savings of resources. In our case
> the main check only compares two fields of the slot, and those fields are
> accessed anyway, so we are not going to save any RAM accesses.
> Therefore, checking last_saved_restart_lsn_updated seems like an
> unnecessary code complication (and I don't see us doing that in
> other places). So, I'm going to push this patch "as is".

To clarify: I have no objections. Thanks for sharing the knowledge.

Best regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Slot's restart_lsn may point to removed WAL segment after hard restart unexpectedly
On Thu, Jun 26, 2025 at 3:20 AM Alexander Korotkov <aekorotkov@gmail.com> wrote:
>
> On Wed, Jun 25, 2025 at 11:25 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > On Wed, Jun 25, 2025 at 1:18 PM Hayato Kuroda (Fujitsu)
> > <kuroda.hayato@fujitsu.com> wrote:
> > > Another idea is to call ReplicationSlotsComputeRequiredLSN() when at least one
> > > of the restart_lsn values is updated, like attached. I feel this could reduce
> > > the computation a bit more.
> >
> > Right, that makes sense; if nothing was updated on disk, then we
> > can avoid computing this.
>
> Good idea. But I think we should associate the "updated" flag
> directly with the fact that a slot (no matter whether logical or physical)
> changed its last_saved_restart_lsn. See the attached patch. I'm
> going to push it if there are no objections.

Looks good to me.

--
Regards,
Dilip Kumar
Google